GetHuman loses its founder, and its focus

  Posted by Laura Chumley on February 27, 2008

I see that Paul English, of GetHuman.com fame (or perhaps notoriety is the better term), is throwing in the towel. Remember him? Only a few years ago he led the angry mob in a fight against the gnarly evils of telephone automation, providing a list of ways to get around the IVR and reach a human agent. Today he is apparently too busy to continue to champion the movement he started, and has turned it over to Walt Tetschner, a self-styled ASR specialist and industry curmudgeon. Walt publishes an online newsletter with slightly whimsical pans and plugs of IVR applications, as well as well-researched articles on events and trends in the speech technology arena.

I first ran across Walt when I was reeling from an encounter of the hideous kind with a Social Security Administration self-service application. Hapless me, I just wanted to find out how to change the name on my Social Security record to my married name. After 20 minutes of repeated, fumbled attempts, I gave up and drove 45 minutes to the nearest office. It was a waste of my time and energy. And such frustration!

I am well schooled in my IVR responses. They are crisp and without disfluencies. But I was stuck in a revolving nightmare of broken steps, recursive paths, illogical phrasing and overwhelming bureaucratic traps. Walt gave it a less stinging review than I would have, but overall he had the same negative perception of the experience that I did.

Since then I have read Walt’s posts in various forums. It will be interesting to see what he does with GetHuman, and whether his vinegar-rather-than-honey approach invigorates or alienates the VUI design standard movement.

300 Knot Club

  Posted by Mark Abramson on February 27, 2008

Well, yesterday it happened. I was cruising from Dallas to Atlanta at 17,000 feet, west of Atlanta near the Vulcan (VUZ) VORTAC. I picked up a hefty tailwind and managed to squeak out a 300 knot ground speed. That’s about 345 mph. That’s a first for this plane! The picture shows the ground speed on the Garmin G1000 MFD; it’s a little out of focus due to a little chop. Fast is fun. ‘Nuff said.

[Image: 300kt_200802264.JPG]

Dewey Decimal in MMVIII (that’s Latin for 2008)

  Posted by Justin Simkavitz on February 19, 2008

Take a moment and picture a librarian. Did you picture a young, hip person with a cell phone in one hand and an iPod in the other? Probably not. Traditionally, libraries have been slow adopters of new technology. The Dewey Decimal system is still widely used in libraries today despite being 130 years old. A recent trip to my local library has me convinced that you need the assistance of a helpful librarian if you want to find a book in less than an hour. Things in the library world are finally starting to change due to increased public pressure to update antiquated technology. Libraries are now taking some practical steps to improve customer service. Many public libraries have moved at glacial speed when it comes to updating technology, but glaciers are melting much faster nowadays.

Libraries are no different from any other business that adopts a new process or technology: growing pains are inevitable. My trip to the library reminded me of an article I read last year.

As I recall, the library had recently deployed an automated IVR system that would place outbound telephone calls to remind people when books were past due. The gentleman in the article received a call from the local library and the message went something like “Hello, this is the Bumble County Public Library. Judy The 1000th Dixon, our records indicate that Gone With the Wind is past due….” It is important to note that Mr. Dixon’s wife’s full name is Judy Melissa Dixon. In the library database, her name is probably stored as Judy M Dixon. Just in case you are not from ancient Rome, recall that M is the Roman numeral for 1000. One of four things is going on here:

  • 1000 Judy Dixons have library cards in Bumble County
  • If the Dixons have around 20 years between generations and the sequence of Judy Dixons is uninterrupted, Judy Dixon the first was born nearly 20,000 years ago. Amazing.
  • Mr. Dixon is quite the ladies’ man.
  • The most likely scenario is that the automated system has a little problem distinguishing middle initials from Roman numerals.

Text-to-speech (TTS) technology has improved significantly in recent years, with companies like Nuance providing cutting-edge technology that greatly improves the user experience. In the library example, the use of TTS to read the name is completely justified because the system is reading dynamic text and it probably is not feasible to have every name recorded by a professional voice talent.

The Roman numeral / middle initial problem can be resolved in more than one way:

  • All of the data can be scrubbed so it adheres to the predefined format that the TTS engine expects (time-consuming and not very reliable).
  • Logic can be applied within the application that parses the text, enforces business rules, and then reformats the string before sending it to the engine (this would work).
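
The second option can be sketched in a few lines of Python. This is only an illustration, not the library's actual fix: the function name `normalize_name_for_tts` is made up, and the rule it applies (put a period after a bare single-letter middle initial) assumes that your particular engine reads "M." as a letter. Engines differ on exactly this point, so test against your own before trusting it.

```python
# Letters that are also Roman numeral symbols; a TTS engine may read a bare
# one of these as a number (for example "M" as "the 1000th").
ROMAN_LETTERS = set("IVXLCDM")


def normalize_name_for_tts(name: str) -> str:
    """Return the name with bare single-letter middle initials given a period.

    Assumption: the target engine reads "M." as the letter M. Verify this
    against your own engine before relying on it.
    """
    tokens = name.split()
    # Only tokens strictly between the first and last are candidates, so a
    # genuine trailing suffix such as the "II" in "Henry Ford II" is untouched.
    for i in range(1, len(tokens) - 1):
        token = tokens[i]
        if len(token) == 1 and token.upper() in ROMAN_LETTERS:
            tokens[i] = token.upper() + "."
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize_name_for_tts("Judy M Dixon"))
```

Restricting the rule to middle tokens is a deliberate choice: it keeps real generational suffixes at the end of a name available for the engine to read as numerals.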

When designing applications that use TTS technology, it is important to know how your TTS engine will behave in different scenarios. Often the difference between a correct rendering of the text and a “bug” is a period or a space. One TTS engine may read Judy M. Dixon as “Judy M Dixon” while another will read the text as “Judy the 1000th Dixon”.

Until the system is fixed, anyone in Bumble County with the middle initials I, V, L, X, C, D or M may want to avoid checking out books.

The opinions expressed in this blog are purely and personally those of myself, Justin; they are not the official views of Message Technologies.

My Intro

  Posted by Mark Abramson on February 17, 2008

Welcome to my first post! I’m Mark Abramson, CEO/CTO and co-founder of Message Technologies, Inc. (MTI).

My goal will be to discuss what’s happening in my professional life, which spans about 37 years (unbelievable) and includes over 35 years of practical experience using speech recognition, text-to-speech, and touch-tone systems. Yes, the technology has been around quite a while and I’ve seen and done a lot with it. I want to include observations and perhaps a few “pearls” I’ve discovered over the years of working with this technology and the people who develop, deploy and tweak it.

I guess I am officially a serial entrepreneur, although I wouldn’t call what I do “rapid succession.” I’ve been involved in six or so startups, made money on a few and broke even on a few. Overall, my track record has been good and my instincts have proven more right than wrong.

My blog may occasionally include some aviation items, since I am an avid private pilot flying a 2007 Columbia (now Cessna) 400SX. I’ve logged about 1,100 hours so far. For those who like alphabet soup, I am officially a PP, ASEL, AMEL, IA. I have high performance, complex, and tailwheel endorsements and I’ve had a little experience flying aerobatics. All this means I normally can fly single and multi-engine land-based aircraft in the clouds. One day I hope to get my seaplane rating.

So I have two passions: work and flying. Happy blogging.

P.S. I guess I need to deliver the standard disclaimer stuff, like the opinions in my blog are my own and not those of my company. I assume full responsibility for the content of my blog and if you take issue with anything I write, please take it up with me.

What Lies Beneath

  Posted by Laura Chumley on February 17, 2008

New and old VUI designers alike are always looking for tips on how to improve their scripting. As with any other field of endeavor, there are conflicting opinions; dissonance and debate abound. In my experience, we are passionate in our arguments, intense in our rationalizations. Design isn’t just a dry, analytical laying out of the prompts; it is an emotional interweaving of technique and form, nuance and balance. I like being a part of such a group, artists working their magic, taking words and sound and crafting a personal interaction with the caller: Geppetto creating Pinocchio, a real boy.

This, of course, requires two things—that the customer allows free and open discussion and implementation of the information given and paths chosen, and that we as designers keep our hearts and minds open to the evolving sophistication and needs of our target population.

On the customer end, there has to be give and take between the demands of marketing, branding, business requirements and usability. I’ll say it outright: there needs to be far more giving and far less taking than usually happens. VUI designers are often brought in after the initial requirements gathering has happened. Well-meaning folks with expertise in other specialties within the customer’s company have already laid out scripting rules and language based on experience gleaned from bad or banal interfaces in the past, ensuring that more such experiences follow for the rest of us. Like lemmings, we are forced to continue that flight over the cliff of bad decisions into the sea of bad design.

And honestly, we ourselves have gotten into the ill-conceived habit of using these same tired gambits over and over. We know so much better, yet we are the worst offenders. Whether by sin of commission or omission, we let ourselves be drawn down the paths of convenience, conformity, laziness and acquiescence.

Bruce Balentine of EIG posted the following in the Yahoo VUID group on 02/04/2008. He points the finger directly, and appropriately, at us.

“…I ascribe it to the somewhat small population of companies and people doing the implementation work. Since everyone used to work for someone else and “this is the way we did it then,” these kinds of ideas get inbred and then become dogma. It’s a kind of “convergence to a local minimum” like in neural networks or quantum systems. It takes energy to tunnel back out once we’ve converged.

I think the same thing is true of those unhelpful recovery techniques that continue to persist – “I didn’t recognize that, I didn’t get that, I didn’t hear you;” – and the exclamatory grounding expressions, “Got it!” and “Great!” What happens is that everyone’s ear becomes accustomed to the sound of a given solution, and in the absence of any rigorous debate or viable alternative, it becomes “comfortable” and subsequently “invisible” to the design team’s ears. “This is just how these things sound, and we used to work for XYZ so we know best by definition, and these other proposed solutions sound a little “weird” or offputting – they couldn’t possibly be an improvement.” So our designs converge to a local minimum and it’s very hard to tunnel out…”

That same Yahoo VUID group has been grousing over these issues of late. Some of them I have been guilty of myself, shamefully. We have compiled a list of phrases never to be heard in modern, professional interfaces again. Let’s band together and make it happen, I say!

1. Please listen carefully as our menu options have changed.
2. For more information, please see our website at www.whatever.com.
3. Your call may be recorded for quality assurance purposes.
4. My name is Beth, your virtual agent.
5. Press 1 for English.
6. You can speak or press your answers to each question.
7. Sales pitches.
8. Menu options that go on and on.
9. Lengthy legal disclaimers.
10. It’s my fault, I’m sorry.

And I am sure you can think of more offenders.

When you are writing your script, eliminate the non-informational bits that interfere with the primary aim of the caller: to accomplish the task in the shortest, easiest way she can. The virtue of automation is tarnished by embellishment. Self-service, like the drive-in window, should be fast, efficient and painless.

Outbound Calls in a SIP / VOIP Environment

  Posted by Lowell Clark on February 13, 2008

When making an outbound call using an automated system, it is likely that you will want to know the status of the call. Was the line busy? Was the call answered? Was it a person or an answering machine that answered the phone?

The term Call Progress Analysis (CPA) covers the answers to all of these questions. For a more formal definition of CPA, see the Executive Summary section of the document “Call Progress Analysis: Global Call API Usage and Protocol Configuration”.

In past years, it was common for automated systems to use a piece of hardware to communicate with the telephony network. When an automated system placed an outbound call, the hardware would normally handle the CPA processing and return the result to the software application. Dialogic was and still is a major player in this market.

In today’s SIP/VOIP and VXML environments, things have changed a bit. It is no longer necessary to have a piece of hardware in your computer to communicate with the telephony network. Most Voice XML platform vendors provide pre-connection CPA but not post-connection CPA. You can expect to see pre-connection CPA results similar to the following:

• Busy
• Ring No Answer (RNA)
• Special Information Tones (SIT)

These results are great for creating business logic around call attempts and call back times, but what about after the call is answered?
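
As a rough illustration of that kind of business logic, here is a small Python sketch that turns a pre-connection result into a call-back time. Everything in it is an assumption made for the example: the result labels (`"busy"`, `"rna"`, `"sit"`), the delays, the attempt cap, and the decision to never retry after SIT are placeholders to tune for your own campaign, not values from any platform.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical call-back delays per pre-connection CPA result.
RETRY_DELAY = {
    "busy": timedelta(minutes=15),  # line in use, try again soon
    "rna": timedelta(hours=2),      # ring no answer, try again later
}
MAX_ATTEMPTS = 3  # hypothetical cap on attempts per number


def next_attempt(result: str, attempts_so_far: int, now: datetime) -> Optional[datetime]:
    """Return when to call again, or None if the number should not be retried."""
    if result == "sit":
        # Treated here as not retryable; adjust if you distinguish SIT types.
        return None
    if attempts_so_far >= MAX_ATTEMPTS:
        return None
    delay = RETRY_DELAY.get(result)
    return now + delay if delay is not None else None


if __name__ == "__main__":
    print(next_attempt("busy", 1, datetime(2008, 2, 13, 18, 0)))
```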

Let’s say that I want to create an outbound campaign which targets people while they are at home and will run an interactive Voice XML speech application when the call is answered by a human. However, if the call is answered by an answering machine, I just want to leave a message on the answering machine. How will I know what answered the phone if the Voice XML platform does not provide my Voice XML application this information?

Well, determining the length of the greeting when the phone is answered is one way to identify the answering party. Since this example campaign will be calling people at their homes, you can expect that a human will answer with a very brief greeting. For example: “Hello” or “Smith residence”. If an answering machine were to answer the phone then you would expect a longer greeting. For example: “Thank you for calling the Smith residence. We are currently unavailable, but if you leave your name and number we will get back with you as soon as possible”. As you can see, the answering machine is quite wordy and we can use this to our advantage.

Here is some sample code that can be used to determine if the answering party is a human or not based on the length of the greeting. I do not make any claims that this solution will provide 100% accurate results, but from my experience, neither did the hardware solutions.

The solution starts by recording the answering party’s greeting. When the greeting is complete, the length of the recording is analyzed to determine if a human answered the phone.

This specific example indicates that if the answering party’s greeting is longer than 3.5 seconds (human_threshold), then the answering party is assumed to be an answering machine. All of the attributes for the record tag can be adjusted as well as the human_threshold value to change the experience.


<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- VoiceGenie-specific property, kept from the original example -->
  <property name="COM.VOICEGENIE.USECONNECTIONEVENT" value="TRUE"/>
  <!-- Assume a human answered until the greeting suggests otherwise -->
  <var name="IsHuman" expr="true"/>
  <var name="duration" expr="0"/>
  <!-- 3.5 seconds in milliseconds; VoiceXML 2.0 defines recording$.duration in ms -->
  <var name="human_threshold" expr="3500"/>
  <form id="CPA">
    <!-- Record the greeting; beginsilence and mintime are VoiceGenie extensions -->
    <record name="recording" beep="false" beginsilence="3s" finalsilence="400ms" mintime="250ms" maxtime="45s">
      <noinput>
        <!-- Nothing usable was heard, so treat the answering party as non-human -->
        <prompt>No Input</prompt>
        <assign name="IsHuman" expr="false"/>
        <goto next="#message"/>
      </noinput>
      <filled>
        <assign name="duration" expr="recording$.duration"/>
        <!-- A long greeting suggests an answering machine -->
        <if cond="duration &gt; human_threshold">
          <assign name="IsHuman" expr="false"/>
        </if>
        <goto next="#message"/>
      </filled>
    </record>
  </form>
  <form id="message">
    <block>
      <!-- Diagnostic prompts; replace with the real dialog or message -->
      <prompt>The recording duration was <value expr="duration"/> milliseconds</prompt>
      <prompt>Is Human equals <value expr="IsHuman"/></prompt>
    </block>
  </form>
  <catch event="connection.disconnect.hangup">
    <log>event connection.disconnect.hangup fired</log>
    <exit/>
  </catch>
</vxml>

Figure 1.0 - This VXML snippet was written to run on a Voice Genie Voice XML platform (now Genesys).

This example also handles the following scenarios:

• If there is silence for the first 3 seconds of the call then a no input event will be thrown. (beginsilence)
• If the answering party’s greeting is less than 250ms then a no input event will be thrown. (mintime)
• When the answering party stops speaking for more than 400ms a filled event will be thrown. (finalsilence)
• If the answering party speaks for longer than 45 seconds a filled event will be thrown. (maxtime)
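
If you want to experiment with thresholds before touching the platform, the decision rules above can be mirrored offline. The Python below is a minimal sketch under stated assumptions: `classify_answer` and its parameter names are hypothetical, the measured greeting length is supplied by you (for example from logged recording durations), and the defaults simply restate the 250 ms, 3.5 second and 45 second values from the example.

```python
from typing import Optional


def classify_answer(
    speech_ms: Optional[int],
    human_threshold_ms: int = 3500,
    mintime_ms: int = 250,
    maxtime_ms: int = 45000,
) -> str:
    """Classify the answering party from the length of its greeting.

    speech_ms is the greeting length in milliseconds, or None when nothing
    was heard before the initial silence timeout (beginsilence) expired.
    """
    if speech_ms is None or speech_ms < mintime_ms:
        # The VXML example marks this case as not human.
        return "noinput"
    # The recording is cut off at maxtime, so cap the measured length there.
    capped_ms = min(speech_ms, maxtime_ms)
    if capped_ms > human_threshold_ms:
        return "machine"
    return "human"


if __name__ == "__main__":
    for sample in (None, 100, 600, 6000):
        print(sample, classify_answer(sample))
```

Running logged greeting lengths through a function like this is a cheap way to see how many calls would flip category if you moved the threshold.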

This VXML snippet was written to run on a Voice Genie Voice XML platform (now Genesys). Because of this, the following non-standard VXML attributes are used in the record tag.

beginsilence – “The time to wait, if no speech occurs, before throwing a noinput event.”
mintime – “If the duration of the recording is less than this attribute, then the recording is assumed to be empty and a noinput is thrown.”

I would love to know if anyone else has found a better or different solution to this problem.

Disclaimer: The information, ideas, and opinions expressed in this blog are mine alone, and do not necessarily reflect those of Message Technologies, Inc.

Dawn of the WUI

  Posted by Laura Chumley on February 10, 2008

Recently I read an article speculating that we would soon have identified the elements of canine speech. Yes, the secrets of doggie language are about to be revealed to us mere mortals. As the proud owner of three good-looking pups of above-average intelligence, I began to think about how we could now communicate, and what that might mean for our household. For example, should Mr. Buck notice that Milky Way’s cough has returned, he can immediately call the vet for a prednisone refill. Imagine…a WUI (Woof User Interface)…

Virtual Vet: Hello, thanks for calling Cherokee Animal Hospital. To continue in Canine, just say woof.
Mr. Buck: Wwoof
Virtual Vet: Thanks, is your family one of our clients?
Mr. Buck: Woof
Virtual Vet: With whom am I speaking?
Mr. Buck: Wooooof
Virtual Vet: Ah, hello Mr. Buck. One moment while I look up your file. Are you calling for yourself or one of the other pets?
Mr. Buck: Woof woof
Virtual Vet: Milky Way, eh? Is he coughing again?
Mr. Buck: Woof
Virtual Vet: Alright, you can have Laura pick up his medication on her way home. Will there be anything else?
Mr. Buck: Wooffff
Virtual Vet: You’re welcome, good bye!

And with a stunning 43% accuracy rate, the woof recognition software would be comparable to speech recognition only a few years ago. How far we have come in such a short time! We started dabbling with speech recognition in the ’90s. It was dreadful. But soooo intriguing.

Dragon NaturallySpeaking 1.0 required hours of training, not just for the program to learn your voice, but for you to learn how to speak in a way it would recognize, for you to learn how to behave. Quirky. Unpredictable. Inaccurate. Slow. Today Dragon NaturallySpeaking 9 boasts 99% accuracy. 99%!

Now our esteemed CEO often uses it for casual email as well as contracts. Just kidding, Mark. Just email and white papers. And we have built a thriving VXML hosting business with enterprise-level Genesys servers whose recognition capabilities will knock your socks off!

A 2007 presentation at the Radiological Society of North America reported research finding that ASR (automated speech recognition) programs have exceeded the accuracy of human transcription. Yes, recognizing and interpreting human speech was done better by a machine than a human. Read the article here. The best is yet to come. And we are ready!

Alright you enterprising entrepreneurs out there…who is going to hire me to write the first WUI?

One man’s hair is another man’s harrow.

  Posted by Laura Chumley on February 2, 2008

My mother spoke like Scarlett O’Hara, with an elegant, deep Southern drawl. She was exceedingly proud to be a 4th generation Atlantan, and her life was steeped in that tradition–charm and drama, drama and charm.

While she was not exactly a Luddite, she would eschew most things that smacked of modern technology. She accepted ballpoint pens only grudgingly, preferring the smooth ink spread of the fountain pen. She dreamed of debutante balls and ladies’ club meetings, magnolia-perfumed encounters and genteel discourse. So of course, she bore a changeling–a redneck geek.

We were taught to speak precisely even as small children, with good grammar and crisp diction. So when I told my new nephew-in-law what I do for a living–script design for speech-enabled applications–I was taken aback when he said “Oh, that is why you talk so funny!”, and then he blushed, stammering, “I mean, all proper sounding.” I talk funny? Man, the hillbilly family I recently married into is the one that talks funny!

I am learning a whole new language these days, using immersion techniques–do or die! Sure, they grew up only 50 miles from where I did, but believe me, there is such a distinct difference between my urban dialect and their rural one that it is as if we lived in different countries on opposite sides of the world!

Most of the time, I can now understand my husband without asking him to repeat himself, at least not too often. But the other day when we were removing the bush hog implement from the John Deere tractor, he told me to get the cutting hairs and put it on. OK, I figured I would find something that looked like a tangle of wires or something. I looked and I looked. Nothing fit the bill. Nor did I understand what value something like that would have for the garden, but hey, he is the farmer, so I tried.

“Hon? I don’t see it.”
“O’er thar.”

I looked over there. Nope. Lots of different attachable things, but no hair-like things.

“Ummmmm.”

He looked over at me with some impatience and indicated a long bar with large scallop-shaped discs at closely spaced intervals. I smiled politely and dragged the thing over to the tractor. “And this is called…what?”

“Cuttin’ hare.”
“Hair?”

Then he realized what was wrong, and once again started to laugh at me.

“Tractor harrow. We call it a hare ‘round heah.”

Ah, another illuminating moment. One man’s hair is another man’s harrow.