January 24, 2011

Intonations - Part 1: Sound Bytes

The human voice is hard to mimic or replace. No surprise, then, that digitizing human speech (to near perfection) seems like a dream. Well, perhaps not?

If you really think about it, it’s the modulation and intonation that are the hardest to grasp and reproduce. I mean, you, I and a million others could be saying the same sentence, but the way we say it (i.e. the tone) makes all the difference. I remember attending a talk a few years ago where the speaker demonstrated this with a very simple example. I shall include it for your benefit.

Let’s take a very simple sentence – She has my pen. Merely changing which word you stress changes the ‘meaning’ of that sentence. Stress the word in capitals as you read the following (remember, the sentence is the same).

SHE has my pen.
She HAS my pen.
She has MY pen.
She has my PEN.
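
If you wanted a speech synthesizer to read those four versions aloud, you'd have to tell it which word to lean on. Here is a minimal sketch in Python, assuming the engine accepts SSML (a W3C markup for speech) and honours its emphasis tag. The function name stress_variants is made up for illustration, and the speak wrapper is simplified; a real engine may want extra attributes on it.

```python
from xml.sax.saxutils import escape


def stress_variants(sentence):
    """Return one SSML string per word, with that word wrapped in <emphasis>.

    Hypothetical helper for illustration; the <speak> wrapper is simplified.
    """
    # Drop the final full stop so it does not end up inside the last tag.
    words = sentence.rstrip(".").split()
    variants = []
    for i in range(len(words)):
        marked = [
            '<emphasis level="strong">' + escape(w) + "</emphasis>"
            if j == i
            else escape(w)
            for j, w in enumerate(words)
        ]
        variants.append("<speak>" + " ".join(marked) + ".</speak>")
    return variants


# One line of markup per stressed word.
for ssml in stress_variants("She has my pen."):
    print(ssml)
```

Same four words every time; only the tag moves. Whether the result actually sounds natural depends entirely on the engine.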

I’m sure you get the point. And that’s this many implications from just one factor: which word you choose to stress. Imagine all the other permutations and combinations with a host of other factors! It could be anything from emotions and attitude to the situation and how you feel. And, oh yes, how can we forget regional/national differences in accents, not to mention the cultural differences? And sometimes the silences convey more than words. What then?

Another, pretty common, instance is when we make a statement sound like a question. Let’s take the same sentence - She has my pen. The speaker probably means to enquire whether she has the pen, but only the modulation (a raised inflection at the end) suggests that it’s a question. It’d probably be written as - She has my pen? Correct English would demand that the sentence be constructed as – Does she have my pen? But as you can see, many a time we don’t bother; and maybe humans are better than computers at understanding what was implied.

And I’m sure that all the menfolk out there agree that women have perfected this art THE best - manipulating what they say (or what they mean by what they say)! Yeah, yeah, all those forwards on women meaning ‘no’ when they say ‘yes’ and vice versa… (Well, we are pretty darn good at it, aren’t we?)

But things are said to be changing, and Googling gave me results on IBM’s success (the closest, I believe); how voice interfaces could improve (?) human-computer interaction; the possibility of synthetic voices affecting ‘voice-acting’ jobs (yes, there are such jobs!); and the use of voice and speech recognition software across industries – just to name a few.

Imagine everything electric/electronic around you beginning to ‘speak’ to you OR being able to comprehend what you’re telling it (them?) OR BOTH (input-output)...
  • Your cellphone/digicam/music player telling you they’re gonna run out of battery soon?
  • Your toaster warning you about the toast that will burn in the next 10 seconds?
  • The GPS system (on your phone or the car) instructing you to take a left or right?
  • Various e-readers reading your children bed-time stories???
  • Not needing to type, instead just dictating (a possible relief for those with Carpal Tunnel Syndrome. Heck, even a means to avoid it)?
  • Being able to switch on/off the lights, regulate the intensity of light or the speed of the fan (home automation) by merely speaking, as opposed to doing it manually?
And remember, these are not just gonna be flat, monotonous, insipid tones that you’ll hear. It’ll hum and haw, pause and stutter, be coy or classy … all of that, as the need may be! Something near-human; and it’s gonna be ‘things’ and not people who are the source. I guess that means that if Vicki were to be portrayed as a robot right now, she would’ve been more convincing. Ted, her ‘father’, would’ve had some good speech recognition software in there too, which means that she would’ve sounded more human and (for once) not taken figures of speech literally!

And as far as voice input goes, I’m sure that some of you have already experienced the existing version when you call customer care. Often they (it?) ask you to repeat the option of your choice because your speech isn’t clear enough for the system. To add to it, there might be a lot of ambient noise (you’re on the road, in an office, in a mall, etc.) and things begin to get irritating.

But the opportunities are immense. The technology finds use in a wide range of industries, and the best use I can think of is for the blind and physically disabled. And as with any technology, the chance of it being misused is something to watch out for.

More sound bytes from InfyBloggers:

Text-to-Speech (TTS) systems are much more advanced now.
Something you can try on your own Windows PC/laptop (volume please!):
Open Command Prompt, paste this and press Enter -

mshta vbscript:CreateObject("SAPI.SpVoice").Speak("You're on www.blogger.com")(Window.close)

Cool, huh?
