Speech recognition system for Home automation.

Soren 18 years ago

Hi,

I'm looking at possibilities of constructing a relatively cheap home speech recognition system to turn appliances on and off. As I see it, I have several options:

- Use a computer: Pros: Good software is easy to get; probably the fastest way to get a system. Devices could be run from the serial/parallel port. Cons: Expensive solution for such a "simple task". Computer must be dedicated and run all the time = high power consumption = expensive.

- Build a system from speech recognition ICs or microcontrollers: Pros: Components are relatively cheap, low power consumption, good recognition rate. Cons: Components are often SMDs (I haven't found any that are not), which means specialized equipment is needed. Demo/development boards cost too much to justify their use. Possibly a long development time.

- Buy Specialized solution: Pros: Plug'n'play Cons: Extremely expensive, removes the fun of DIY :)

What I really need is a robust recognition system that works with microphones not placed directly in front of the speaker.. like in the corner of the living room.

Does anyone here know of such a system, or tried to develop one? It only needs to recognize 5-10 words. It's really a "toy" project.

Thanks, Soren

RickH 18 years ago

Ideally I'd like to see Insteon keypad/paddle switches with a mic and local voice recognition that allow you to enter a few dozen words that simply map to the hard switches, for dual voice/finger control. All self-contained in a voice chip, no computer. If you can make one of those, I'd buy it. It should also of course have a way to disable/enable voice control for the switch, and it should at least be trainable to support the voices of 4 different persons. So a typical 8-button keypad would need room for 40 voice commands (5 per switch):

- click (toggle)
- on (explicit on for convenience)
- off (explicit off for convenience)
- holddown (for dim brt)
- letgo (to end dim brt)

(or whatever voice commands make sense for what your fingers would be doing)

times 4 people, so 160 programmable voice commands.
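The slot arithmetic above can be sketched in a few lines of Python. This is only an illustration: `build_slot_table` and the 0-based user and button indices are made-up names, not part of any real Insteon or voice-chip API.

```python
from itertools import product

# Figures taken from the post: 8 buttons, 5 spoken commands each, 4 trained voices.
BUTTONS = 8
COMMANDS = ["click", "on", "off", "holddown", "letgo"]
USERS = 4


def build_slot_table(users, buttons, commands):
    """Assign one template slot to every (user, button, command) combination."""
    keys = product(range(users), range(buttons), commands)
    return {key: slot for slot, key in enumerate(keys)}


slots_per_user = BUTTONS * len(COMMANDS)
slot_table = build_slot_table(USERS, BUTTONS, COMMANDS)
print(slots_per_user, len(slot_table))
```

A table like this makes the memory requirement explicit: every extra trained voice adds another full set of 40 templates.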

With switches like this it would be a snap to make an entire home voice controllable, as the numerous localized microphones would increase dependability with ambient noises and be more intuitive as to what part of the wall/room you need to talk to.

The deluxe model of such a switch can also contain a proximity sensor that will only enable "listening" when there is a human within 10 feet or so (adjustable).

These would be great for handicapped homes, etc., where getting to the switch alone might be a chore. Downstream loads can then respond to normal Insteon or X10 switch mappings.

Can you make one of these given that Insteon is a licenseable chip and you already seem to have a handle on the voice chip?

rtandems 18 years ago

I tried voice recognition using Mr House on a PC that I leave running all the time anyway. It worked OK, if you were directly in front of the mic. After some experimentation, I decided the only way I was going to get a "whole house" solution was to use a "StarTrek" type communicator.

-Brian

D&SW 18 years ago

I don't know that speech recognition is "a simple task". Try googling and ebay'ing for "Mastervoice" or "Butler-in-a-Box", old technology but worked well.

Robert Green 18 years ago

It's simple - in a lab. And it's readily available, too. I checked Google [link] and there's a whole host of solutions out there. The biggest problems, IMHO, are that some lifestyles and homes are not compatible with voice commands. If you have music or TV playing in the background, the problem gets pretty complex and requires the speaker be very close to the microphone.

Probably the best method currently would be to use a cordless phone system to link to the speech controller since most people have phones in multiple rooms these days. Not quite the hands free technology-as-magic but probably far more reliable than any current roomwide mic setups. People have been able to get such roomwide installations to work, but it's a lot of effort and still isn't immune (or as immune) to loud ambient noise as a good phone mic placed close to the speaker's mouth.

It's possible to mute the electronic audio or attempt to mute it via noise cancellation, but I think the voice commanded home is still a bit off in the future. It's a little bit like the leap that was supposed to occur nearly 30 years ago when holograms were invented. The pundits predicted that in a few short years, all photos would be holographic and all movies made in 3D. Aside from a segment in the movie "Logan's Run" I haven't seen much progress in the world converting to 3D imaging. (-:

Eventually, we'll be able to separate the loud background noise from the speaker's voice - hell, my *dog* can do it right now! More likely we'll all be wearing invisible headphone and mics that tie us into the net, our phones and our house, but it's taking longer than expected, at least from where I'm sitting. But it's starting. I see a lot more people every time I go out with Bluetooth earpieces stuck in their ears.

-- Bobby G.

Soren 18 years ago

Research has already reached a point where you can get very good speech recognition down to an SNR of 10 dB. This article got around 80% correct for a 10 dB SNR: [link]

OK, it's in the lab, but they also tested different noise types, like restaurants, airports, etc. Now, noise at home is rarely at the level of a restaurant.. or an airport :) (at least if you don't have kids). There's no doubt in my mind that signal processing algorithms will reach a level where speech recognition will even work well in heavy noise, and it's not in the too distant future. Of course, beating human hearing is not around the corner.. if you can't understand what a person is saying while the TV is on, then most likely neither can the computer. There are already home speech recognition systems, like the butler-in-a-box, that provide noise-robust speech recognition to some extent. I'm certain that the algorithms today have reached a level that makes robust speech recognition in your home, with moderate noise, possible, and I wouldn't be surprised if such systems already exist and work well. I just want to see how well or badly a REALLY cheap DIY system works :)
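For anyone wondering what a 10 dB SNR means in numbers, here is a minimal sketch using the standard definition (ratio of mean signal power to mean noise power, in decibels). The function names and the toy sample values are invented for illustration and are not taken from the article.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from two non-empty lists of samples."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)


def amplitude_ratio(db):
    """Amplitude ratio corresponding to a given SNR in dB."""
    return 10 ** (db / 20)


# Toy data: the "speech" has 10 times the power of the "noise".
noise = [1.0, -1.0, 1.0, -1.0]
speech = [math.sqrt(10) * n for n in noise]
print(round(snr_db(speech, noise), 2), round(amplitude_ratio(10), 2))
```

So at 10 dB the speech carries ten times the power of the background, or roughly three times its amplitude, which is less of a margin than it may sound like.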

Soren

Robert Green 18 years ago

Lab results for SR or VR have to be taken with a grain of salt. In the real world, people slur their speech far more than they ever realize. Probably a lot more than researchers do. I doubt these are double blind tests, either. The researchers are often on the development team and have a bit of a bias. But I don't think that's news to you.

I've used at least 5 different incarnations of Dragon Dictate and other similar programs over the years and the improvement has been phenomenal, but I'm also aware that I speak a very different way to the dictation mike than I do to another human being. The - spacing - between - each - word - is - very - pronounced - and - my - wife - certainly - is - not - pleased - when - I - use - the - same - halting - speech - on - her. Why do I talk to DD that way? Well, while I was training DD, it was also training me! I learned that the best way to avoid misrecognition was to - speak - like - this - and - enunciate - very - clearly.

The 80% figure they cite means one out of every five words is misheard. For dictation, that's not even close. Even at 95+% recognition for Dragon, I still find myself unhappy at the amount of manual correcting I have to do. For home automation, I don't think I'd accept every 5th command going unheard, or worse, misheard and the wrong action taken.
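To put rough numbers on that, here is a small sketch. It assumes every command is recognized independently with the same accuracy, which is a simplification (real errors tend to cluster around noise and particular words), and the function names are made up for illustration.

```python
def expected_errors(accuracy, n_commands):
    """Average number of misrecognized commands out of n_commands."""
    return n_commands * (1 - accuracy)


def p_at_least_one_error(accuracy, n_commands):
    """Chance that at least one of n_commands is misrecognized."""
    return 1 - accuracy ** n_commands


# Compare the accuracy levels mentioned in this thread over a run of 10 commands.
for acc in (0.80, 0.95, 0.99):
    print(acc, expected_errors(acc, 10), round(p_at_least_one_error(acc, 10), 3))
```

Under that assumption, even a recognizer that is right 95 times out of 100 will probably get at least one command wrong over the course of an evening's use.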

I agree wholeheartedly. I think home SR and VR (word recognition vs. actually ID'ing the speaker's voice) will both be along soon, but the improvements will come from better, faster, smarter, lower power CPUs, cheaper memory and algorithm improvements based on the feedback developers gain from studying large corporate systems.

I use a number of SR-based services, and some of them are quite good at natural speech processing. But most of them, like my pharmacy refill system, are restricted to a very, very narrow set of commands, usually 0 through 9, the pound and the star key, and sometimes "yes" or "no." Whether that's because these were simply fast ports from touch tone customer response systems or because recognizing only those commands boosts reliability tremendously, I can't say.

I'd be more surprised than you to find such systems working well outside the lab or without the kind of computing horsepower that puts the cost or size or complexity outside the reach of your requirements.

There are a lot of reasons working against it, and those reasons are pretty common in the tech arena. When early adopters take on a technology and it seems "almost there" but never quite "all the way" those technology leaders can actually become brakes on large-scale acceptance of the technology.

It's very similar to LCD TVs and CFL bulbs. The earliest revisions of these two technologies worked, but with lots of warts. Early LCD TVs were dim, had refresh rate issues, low contrast and narrow viewing angles. So the "stink" got on LCD TVs and the better, but in their own way troublesome, plasma TVs surged ahead. The newest LCD panels coming out of equally new huge factories [link] are light years ahead of those produced just two or three years ago. But the stink will stay on them for quite a bit longer because of the poor performance of the first efforts. Part of the LCD problem was that their performance was tied to another relatively new technology, the CFL. Both technologies have matured quite a bit. Even so, some LCD TV makers have switched to LED backlighting to overcome the few, though gnarly, remaining issues with CFLs. [link]

What does this have to do with sound recognition? It's still not mature enough to reach the point where the "stink" of early failures doesn't follow it. Yes, you can make it work but you have to "want to" - it's not going to be bulletproof out of the box without some adjustment on the part of the user.

I hope you find what you're looking for and if it works, share the results with us. The time for cheap, reliable and standalone SR is fast approaching. My new $400 HD LCD TV with 7 inputs including VGA has convinced me that LCD HD TVs have arrived at a price point that will make the abandonment of old CRT TVs pretty painless. Driven by a fairly new PC with an ATI Radeon 7000 with a puny 32MB, it produces some of the best still and video images I've ever seen on a large screen. It wasn't until I went out and actually looked at the very newest models that I overcame my own prejudice about LCDs based on recent, but not absolutely current, experience.

LCD technology has mostly cleared the hurdles that plagued early products. I'm hoping home SR will get there, too, without requiring the use of tiny tracking shotgun microphones in every room or a permanent Star Trek communicator badge. Ironically, though, I think the resolution of the problem *will* be the badge because so many other technologies are converging on the endpoint of wireless connection between the electronic world and a person's eyes, ears and mouth.

-- Bobby G.

Robert L Bass 18 years ago

I sure wish the SR used by my GPS system was that good. If there's any extraneous noise at all it either doesn't respond or does the wrong thing. There are also a few humorous recognition errors it makes. While experimenting with it I once used a certain vulgar epithet suggesting one perform an anatomically impossible feat. The machine responded, "Would like to find ... a hotel?" :^)

RickH 18 years ago

Whatever you do, the quality of the microphone is most critical. I would suggest using the Crown pressure zone mics: [link]

A cheap little condenser mic capsule from Radio Shack (if they even sell them anymore since they are now a damn cell phone store) would probably not cut it.

Doug 18 years ago

So this bloke had a new car with fully automated SR controls. He had a stutter which was brought on by words starting with ST, such as start and stop, so he set it up using "Jesus" for start and "God" for stop. One day, while showing it off to a friend, he forgot the word for stop and the car was careering towards the edge of a cliff. In desperation and panic he shouted "Oh God, we're going to die", and the car recognised the command "God" and in the nick of time stopped on the edge of the cliff. "Jesus, that was close," said his friend.

Yeh, I know it's an old joke

Doug

Robert Green 18 years ago

But one I haven't heard! I just deployed a Sanyo GPS I got last week new for under $200 and while the trip out was flawless the trip back was marred by a user interface designed for the Egyptians. No text labels, just teensy-weensy icons and counter-intuitive touch screen functions until I, too, found myself using the word "Jesus" as in "Jesus, didn't they test this thing on real people?" I'm sure after a few hours with the instruction manual on CD I'll get the hang of it, but today I was ready to fling it out the sunroof.

I had been looking at more expensive models with more readable screens but in DC there's been an epidemic of GPS thefts (they apparently look for suction cup marks on the windshield to know which cars to rob - a word to the wise!). I can live with a $200 loss, but I'd hate myself if I lost a $500 unit. Although the screen's not the easiest to read in sunlight, the voice prompts are loud enough to hear, and there's an earphone jack (and MP3 player - although I don't think they work simultaneously).

As I was driving, I thought: "maybe the hard-to-see screen is a blessing in disguise since I won't be tempted to try to read it while driving." Since I can just slide it in my shirt pocket, that's perhaps the most theftproof mounting available. I've already had my "stealth" faceplate CD player get pried out of the dashboard. That made me say "Jesus" too. )-:

-- Bobby G.

Soren 18 years ago

I've actually worked on some SR when I did my thesis. Usually you have large databases with many people saying the same words; some researchers make their own, others download the available ones on the net. I've read a few where they'd only use a single database, but the serious ones use utterances from different databases to prove the robustness of the algorithm. Slurring of speech is really common and it can really be a problem. I've listened to many utterances of different sentences, and if you did not know the exact words they spoke, you could actually be in doubt yourself. Especially if the sentence was just random words. That's a huge problem with SR: when we as humans hear a mumbled word, we might not notice it at all, since the brain perfectly understands the context of what was said, and can "guess" the correct word, even though it actually sounds like 5-10 similar words. It's a bit like the old famous "you dnot hvae to wirte the ltteres in the crroect oderr, the bairn sees the wrod as a wolhe". Teaching a computer to understand context is an enormous task.. maybe possible in a short sentence, but in an entire conversation!? Not today. That's the next step :)

As for bias, yes it certainly happens. One should always make an effort to spot weaknesses in articles, but it can be very tricky unless you've actually done the work. I've been fooled a couple of times. :)

I've read somewhere (sorry, no references) that some of the newest SR software is able to detect words in fluent speech. I bought the HM2007 chip, which is old and discontinued. I will definitely have to speak very clearly and with "large" spacing if I want to say full sentences; 1-2 secs in between words, I believe the datasheet said. But what I am after in the beginning is really just a robust on/off SR switch. The HM2007 has a 40 word memory. You could train "on" and "off" 20 times each. The more the better, as is also shown in the article: the more utterances they average over, the better the error rate becomes. They averaged over 300 utterances and got 99% accuracy in some cases.

Exactly, for a robust system that people would actually want to use in their daily lives you'd almost need 99% accuracy. Not there yet, but getting there :) False triggering can be avoided somewhat by using special sequences of words that are not too similar.. that'll be my initial approach. Something along the lines of a trigger word.. and then the command.
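The trigger-word-then-command idea can be sketched as a tiny state machine sitting behind whatever recognizer is used. Everything below is an assumption for illustration: the class name `TriggerGate`, the trigger word "computer", the 3 second window, and the idea that the recognizer hands over one recognized word at a time with a timestamp.

```python
class TriggerGate:
    """Accept a command only if it directly follows the trigger word in time."""

    def __init__(self, trigger, commands, window_s=3.0):
        self.trigger = trigger
        self.commands = set(commands)
        self.window_s = window_s
        self.armed_at = None  # time the trigger was last heard, or None

    def feed(self, word, now):
        """Feed one recognized word; return the accepted command or None."""
        # Disarm if the trigger was heard too long ago.
        if self.armed_at is not None and now - self.armed_at > self.window_s:
            self.armed_at = None
        if word == self.trigger:
            self.armed_at = now
            return None
        if self.armed_at is not None:
            # Only one attempt per trigger: any other word disarms the gate.
            self.armed_at = None
            if word in self.commands:
                return word
        return None


gate = TriggerGate("computer", ["on", "off"])
heard = [("on", 0.0), ("computer", 1.0), ("on", 2.0), ("off", 2.5)]
print([gate.feed(word, t) for word, t in heard])
```

A stray "on" picked up from the TV is ignored unless the trigger was heard just before it, so a false action needs two misrecognitions in a row instead of one.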

Recognizing only a very limited set of commands does boost reliability, and keeps costs of standalone systems to a minimum.

I believe that any really robust system would probably be very complex and expensive.. I don't know how well the mastervoice (butler-in-a-box) works.. but at a price tag of 3000$, I hope it works very well. It's probably also a full PC with some attachments. A cheap robust system? Robustness is a highly valued quality in SR or VR, and people would be willing to pay big bucks for those extra 5-15% improvement in error rate. My hope is that you could construct a simple system with very few features that works reasonably well.. I'll share my error rate when I get there :) A bit like the "clapper" on/off switch that was so popular in the 80's, only more robust :)

I think that's a good comparison, and I couldn't agree more.

Thanks :)

Interesting point.

As a matter of fact, I'm still not really decided on LCD technology yet. OK, I switched from my old CRT to a 22" wide LCD for my computer, and it looks great! But I'm kind of wary of buying a large LCD TV. I still see old "LCD effects" in rapid camera movement, even on some of the new LCDs. And of course, to make matters worse, HD TV has not reached Danish broadcasters yet. I think SEDs look promising.. but LCDs can probably improve tremendously in the time it takes for SED TVs to reach today's LCD prices.

Regards, Soren

Robert Green 18 years ago

You've hit on the key: context. The later versions of Dragon Dictate did have a fairly good understanding of context. You could actually see it making contextual corrections on screen with a fourth or fifth word in a sentence causing the program to "change" what it had originally "heard" and displayed in the edit box for the first few words. The addition of context sensitivity made for remarkable improvements in recognition and it really brought home to me how the human brain works. You can see it when you start talking to someone about a totally new subject from a previous one. There's a moment when they "catch up" to the subject change and the "ah ha" experience takes place.

There are all sorts of traps to look out for in studies. This week the LA Times had a remarkable article about epidemiology and a study that "proved" that Canadians under the star sign Sagittarius are 38% more likely to break their leg than other star signs: [link] (aka [link])

"SAGITTARIANS are 38% more likely to break a leg than people of other star signs -- and Leos are 15% more likely to suffer from internal bleeding. So says a 2006 Canadian study that looked at the reasons residents of Ontario province had unplanned stays in the hospital."

Of course, that's probably not really true - although it might be - because of the quirks of the way the numbers are crunched. A friend suggested a possible chain of causation. It's the time of year when the weather first turns cold in Canada and people celebrating their birthdays out drinking might find themselves walking or driving on ice. Later in the year, people are more likely to wear boots or shoes with non-slip soles. What troubled me most about the articles is that they were able to insert magical corrections to smooth out the numbers. If you can "fix" outcomes you don't like, how valid is the entire process?

I wonder what happens if you use all the slots to train the words ON and OFF under all the likely noise conditions you'll encounter? I'd love to be able to control at least some of the lights by voice, especially the switches I am likely to encounter with my hands full of tools or laundry or dogs. But the reality is that a large paddle switch that I can operate with my elbow will probably be more reliable, over all.

I was considering using my cordless phone system as an input to the speech control but the dilemma was obvious. If my hands were so full that I couldn't operate a wall light switch, picking up a cordless phone, punching some keys and THEN speaking the command wouldn't make sense. It's like the $3.76 wall clock I bought from Walmart today. The box says "for warranty service send the unit back, prepaid along with a $5 check." Uh huh. (-: They should also request a certificate of stupidity!

That sounds like a good approach. I thought about using whistles, clicks, or a simple loud "Hey!" which seems to work well on the dog, even in noisy conditions.

Agreed. I've hung up on some of the more ambitious systems because they got so far out of whack compared to what I was trying to say that I couldn't get back to the beginning. The error recovery process was not very good. That particular company switched back to a numerically-based system, but they now use so many options per tier that it's still very hard to use. Not many people can remember the first few choices when a machine spits out ten different options in a row.

The modern multi-core CPUs have incredible processing horsepower compared to the 300MHz PCs that I started doing SR with. I use a 600MHz Pentium class machine without too much time lag for voice dictation, which I use when my bad hand tendons act up. It works well until I get a sore throat from speaking so much!

You've touched on something interesting. The clapper's success was probably due, in part, to the loud clapping of hands being a fairly distinct aural event, even in a noisy environment. The trick is to figure out how to create a similarly unique sound with your voice. A yodel, a wolf howl, something very distinct from normal human speech yet not so weird that your neighbors will call the local insane asylum might work.

I've been ambushed by that phenomenon more than once and probably will be again!

I'm pretty sure that's where we are going to end up. Phones and ear pieces keep getting smaller and smaller. Eventually they'll be implants although some recent studies have implied that implanted micro-electronic devices may increase the risk of cancer. Sorry, I'm too tired to look that up, but I believe it was in the NY Times Science section for anyone remotely interested in following up.

I did virtually exactly the same thing. I got a 22" wide LCD and it looked so good compared to my 4 year old laptop that I decided that LCDs had indeed improved tremendously in the last few years, as Lewis G. had suggested. $400 later and I own a no-name 32" LCD TV that looks remarkable with no noticeable ghosting and very vivid colors and contrast. And since it has a VGA input (and 6 others, including HDMI, component, composite and CATV), I can pipe anything I own that produces video into it. Programs like Stargate Atlantis, which are filmed in HD, look outrageously good at 1080. One thing's for sure about HDTV. Prop masters and makeup artists are going to have to work a lot harder now that everything is under a magnifying glass.

-- Bobby G.

Soren 18 years ago

Thanks for the link! I've been thinking about this a lot.. I totally agree that the microphone is crucial for success, especially if you are not directly in front of the microphone. This will definitely be the most expensive part of the system altogether. I've been looking at good mikes, saw a couple on eBay that I thought would maybe do it.. about 20$ I think.. I'll take a look into this pressure zone mic.

Regards, Soren

Soren 18 years ago

A quick Google search on voice activated switches came up with a couple of results:

http://209.85.135.104/search?q=cache:gdbMcaVf3doJ:[link] - 25$

[link] - 80$

So they are there; how well they work, however, I don't know. 25$ is hard to beat. 80$ that includes a response can be done for less than 80$, DIY.

I think these products are good news for my project :D I'm looking forward to testing it.. now I just need some free spare time to assemble it!

Soren

Soren 18 years ago

I must admit, the microphones are cool! :) But at a 200$ price tag for the cheapest ones, and above 1000$ for the medium quality series, it's waaay above my budget for this project :D

Soren

Marc_F_Hult 18 years ago

[snip]

Soren,

I've played with this for years and conclude that for me, the failure rate is still usually too high. Depends in large part on one's tolerance level for mistakes and, especially, the need to repeat. I have 3-4 different Crown PZM models, and other mics, gated mixers and have tried most of the VR programs available through about 2005.

Some suggestions:

1) Read this article about the "Myth of microphone reach" [link] and this approach to predicting speech to background noise ratio: [link]

2) If you want to have more than one microphone, a gated mixer is a near necessity. I have a bunch of extra 8-In 4-Out Ivie 784 and 884 mixers for sale on my personal Internet Home Automation and Porch sale here for about 7 cents on the dollar. The 884's have built-in mic preamps and auto gating. [link]

I also have some new/unused small flat Sony pzm-style mics that are excess to my needs that I will add to the Porch Sale for $5 each.

There is a write-up on the use of mixers in home automation here: [link]

This may be more than you want now, but these projects have a way of growing ;-) It outlines a flexible, modest-cost approach to integrating all audio signals, input and output, used in HA including those for VR, announcements, background music, intercom AV and so on -- here's the important part for your need to control from across the room -- in a way that maximizes the likelihood of successful VR because all the other signals could be made to auto-mute when an initial VR command is received. Of course it can be made to do a lot more than that.
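As a rough software picture of the two ideas above (gating several mics, and muting everything else once a command starts), here is a minimal sketch. It is not how the Ivie mixers work internally; the threshold value, the room names and the function names are all invented for illustration, and it assumes at least one mic is passed in.

```python
import math


def rms(frame):
    """Root-mean-square level of one block of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def gate_mix(frames, threshold=0.05):
    """Open only the loudest mic, and only if it is above the threshold.

    frames maps a mic name to an equal-length list of samples.
    Returns (name of the open mic or None, output samples).
    """
    levels = {name: rms(frame) for name, frame in frames.items()}
    loudest = max(levels, key=levels.get)
    if levels[loudest] < threshold:
        n = len(next(iter(frames.values())))
        return None, [0.0] * n  # all gates closed: send silence
    return loudest, list(frames[loudest])


def program_gain(listening, normal=1.0, muted=0.0):
    """Gain for music/TV/announcements: muted while a command is being heard."""
    return muted if listening else normal


frames = {
    "kitchen": [0.01, -0.01, 0.01, -0.01],
    "living": [0.30, -0.30, 0.30, -0.30],
}
open_mic, out = gate_mix(frames)
print(open_mic, program_gain(listening=open_mic is not None))
```

Sending only the single open mic to the recognizer keeps the noise from every other room out of the signal, which is the main reason gating helps with more than one microphone.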

HTH ... Marc

Marc_F_Hult

Marc_F_Hult 18 years ago

Hi Rick,

This would have to be built at very high volume to be commercially feasible. And as much as I like INSTEON (and use it, and am a registered developer) a direct-connection (electrical) hook to INSTEON dimmer or switch would not be covered under the existing code approval for the INSTEON device. That in itself is a deal-breaker even if there were a handy plug/patch/connection available (which I don't think there is).

A different approach that I've worked on for years is a home-brew lighting system that uses local motor-driven mechanical potentiometers for lighting control.

The latest version of the on-motor controller board has a pair of pins that control the UP / DOWN motor function that is intended for local, add-on control. That is to say, they are not needed by the centralized control system and can be used in conjunction with local sensors.

My immediate application is to add autonomous (not controlled by central PC or MCU) local ambient light control and local IR remote. And they would serve just as well for local control by RF signal or by speech.

The architecture of the system is such that regardless of whether changes to lighting level are made by rotating a knob, moving a slider, voice control, a local ambient light sensor, IR remote or RF signal, the central PC is updated about 40 times a second as to the actual dimmer control signal (+/- 0.5%). This is done with an analog signal that is easily trouble-shot and DMX (DMX512a; DMX-512), which is a ubiquitous international standard for hard-wired entertainment (theatre, club etc) and architectural lighting. I don't know of any other system -- commercial or not -- that is as flexible as what I've designed for home use.
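For readers unfamiliar with DMX, a short sketch of how a dimmer level maps onto one 8-bit DMX channel may help. This is not the actual design described above; the function names are made up, and it simply assumes one plain 8-bit channel per dimmer.

```python
DMX_MAX = 255      # a DMX512 channel carries one 8-bit value, 0 to 255
UPDATE_HZ = 40     # update rate quoted in the post


def percent_to_dmx(percent):
    """Convert a dimmer level in percent to the nearest 8-bit DMX value."""
    percent = min(100.0, max(0.0, percent))  # clamp out-of-range requests
    return round(percent * DMX_MAX / 100.0)


def dmx_to_percent(value):
    """Convert an 8-bit DMX value back to percent."""
    return value * 100.0 / DMX_MAX


step = 100.0 / DMX_MAX  # size of one DMX step in percent
worst = max(abs(dmx_to_percent(percent_to_dmx(p)) - p) for p in range(101))
print(round(step, 3), round(worst, 3), 1000 / UPDATE_HZ)
```

One 8-bit step is about 0.4% of full scale, so a single channel is fine enough to report the level to within the +/- 0.5% mentioned, and 40 updates a second means a fresh value every 25 ms.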

Of probably wider interest, I am also working on a potentiometer version that provides a _local_ AC dimmer retrofitted to unmodified, existing AC wiring using a UL-listed solid state dimmer/relay and thus is compliant with the National Electrical Code -- AHJ willing.

An existing switch (eg: X-10 WS467, or manual SPST toggle) is replaced with a single or dual random phase, UL-listed solid state relay inside the existing switch box. Only two AC wires -- no neutral -- are required. The zero-crossing signal is derived remotely and so only a pair of low-voltage wires carrying the switch signal, optically isolated inside the SSR, is needed. As this geologist understands it, this configuration is fully 1996-and-later NEC compliant.

One way the potentiometer(s) would/could be mounted is to a new faceplate attached to the existing switch box with a blank (no opening) over the AC portion. The faceplate would be one gang wider than the existing plate. The pots would be on the added portion and fit into a cut-out in the wallboard to the left or right of the switch box depending (usually) on which side of the switch box the stud was. But no rearrangement of the AC wires or switch box itself would be needed. One would, of course, need to run CAT5 to the switch, but that is typically much easier than replacing AC wiring.

These are planned for installation initially in my house in those locations where INSTEON cannot be used without running new AC electrical owing to the absence of a neutral conductor, and ultimately, to replace INSTEON.

I'll post some of the PCBs and descriptions on my web site.

... Marc Marc_F_Hult

shahrinima 18 years ago

Hi, I would like to ask where I can find the HM2007. I also need that chip for my final year project. Thank you.

regards, nima

Marc_F_Hult 18 years ago

[link] claims to have them in stock.

Images SI does ship internationally but I don't know if they ship to Malaysia.

This is an old chip. HMC was purchased by Elan which was purchased by Babel and then merged into Babel Technologies. There are no follow-on ICs or other hardware offerings as best I know. So I'd buy at least two, or none at all.

... Marc Marc_F_Hult
