Microsoft

Microsoft Shows Off Adaptive, Multilingual Text to Speech System 171

MrSeb writes about a really cool project from Microsoft's speech research group. From the article: "Microsoft Research has shown off software that translates your spoken words into another language while preserving the accent, timbre, and intonation of your actual voice. In a demo of the prototype software, Rick Rashid, Microsoft's chief research officer, said a long sentence in English, and then had it translated into Spanish, Italian, and Mandarin. You can definitely hear an edge of digitized 'Microsoft Sam,' but overall it's remarkable how the three translations still sound just like Rashid. The translation requires an hour of training, but after that there's no reason why it couldn't be run in real time on a smartphone, or near-real-time with a cloud backend. Imagine this tech in a two-way setup. You speak into your smartphone, and it comes out in their language. Then, the person you're talking to speaks into your smartphone and their voice comes out in your language." The Techfest 2012 keynote has a demo of the technology around minute 13:00.
This discussion has been archived. No new comments can be posted.

  • Arby 'n' the Chief wouldn't be the same without him!
    • Perhaps, but Microsoft Bob [wikipedia.org] was the more memorable.
    • I've never been the greatest Microsoft advocate, but if that system indeed works as well as promised, MS will have risen in my esteem. Not that they will care at all. But compared to the anaemic progress made by Google Translate over the past years (in both design and efficiency), Microsoft would deserve some increased consideration.
      • I tried it, but it keeps repeating the same sentence over and over:
        "Don't run, we are your friends"
      • by AJH16 ( 940784 )

        Microsoft Research comes up with some pretty awesome concepts. Not all of them ever see the light of day, but they are one of the best R&D shops around in the tech world.

  • AZN (Score:2, Insightful)

    by willie3204 ( 444890 )

    Japanese please!!!!

  • Will they license this for PBX systems other than their own?

    I would love a multilingual system like this. The audio is really good compared to the paid software that I have access to.

    • by malakai ( 136531 )

      It's built using the MS Speech platform. There may be a port for Mac (does Office for Mac have TTS?), but in general, for a PBX system to use this, this part of the system has to be running Windows.

      That said, the voices have been free. You can buy MS Speech voices from 3rd parties for lots of money if you want some more natural voices. This seems like a step towards the eventual downfall of highly trained specialized voices. The concept here is, hire a new voice actress, spend an hour, and translate her into

  • by Anonymous Coward on Monday March 12, 2012 @10:18PM (#39334757)

    "Programmeurs, programmeurs, programmeurs, programmeurs, programmeurs!"

    • by grcumb ( 781340 )

      SAM: "Ich bin ein Developer! Developer! Developer! Developer! Developer! Developer! Developer! Developer!STOP 80000X21 OOM_MONKEYDANCE_INFINITE_LOOP"

  • by MobileTatsu-NJG ( 946591 ) on Monday March 12, 2012 @10:33PM (#39334879)

    Remember a couple of weeks ago when we had that story about scifi nitpicks and someone griped about aliens in Star Trek always speaking English?

  • was for me at university anything that could make that go away is a good thing as far as I'm concerned. (Well, that's got to be at least 0 mod but I've got karma to spare so I don't care.)
    • I took several different languages. I am admittedly biased in that I'm a dyed in the wool linguaphile, but maybe you just had a shitty professor. In a couple of my classes there were people who wanted nothing to do with learning a language, but a good professor is what made the experience (for them anyway) bearable or even at times enjoyable. Well, as enjoyable as a class can be anyway.

  • ...can they explain to me what "do the needful" means? That's English to English, and I don't fully understand the subtext of it.
    • by jfengel ( 409917 )

      It just means "do what needs to be done". There's no particular subtext to it, though I'm sure it's probably more common in some regional dialects than others.

    • It means prepare to revert the same.

  • Isn't this the same thing that Project Festival has been doing since about 2004?

    http://www.cstr.ed.ac.uk/projects/festival/ [ed.ac.uk] (try the demo)

  • by theNAM666 ( 179776 ) on Monday March 12, 2012 @11:21PM (#39335213)

    1) The translations aren't semantically equivalent (as pointed out by commenters above). I can already say "Ich bin ein dummer Amerikaner" in my own voice, without machine help. If the meaning isn't there, who cares?

    2) The machine accent ain't that great, either.

    All of this makes me think this is still somewhat of a pipe dream. The AI guys have been selling the idea of machine translation for years and years-- at least since the 50s, when it was promised to eliminate the need for trained State Department linguists. It's never emerged because it's still a hard problem. Even Google's translate, which beats the MS stuff by some yards, produces results which range from awkward phrasing to just plain inaccurate and misleading.

    He's selling a great idea, but it's kind of like the Fountain of Youth. It ain't there, vaporware.

    • All of this makes me think this is still somewhat of a pipe dream. The AI guys have been selling the idea of machine translation for years and years-- at least since the 50s, when it was promised to eliminate the need for trained State Department linguists. It's never emerged because it's still a hard problem.

      Yeap. If you can't solve the hard problem, solve an easier one that looks similar. That's what these guys have done.

    • 3) You have to train it for an hour?

      I was actually slightly interested until I got to this bit and realized, like any other Microsoft "innovation," it wasn't really at all. Anyone can make a custom voice sample in about an hour. Hooking up simple voice recognition and text-to-speech is incredibly dull.

      Had they actually interpreted intonation for semantics, and simulated and learned your voice in real time, it would have been pretty neat.

      • by dave420 ( 699308 )

        Shut down innovation, folks, as nothing's perfect! Close it down, boys, and head back to the caves.

        Seriously, this is a ridiculously-early look at the technology. Calling it a fail is incredibly premature. FUD's not cool when anyone spreads it, remember?

    • by NoKaOi ( 1415755 ) on Tuesday March 13, 2012 @12:44AM (#39335697)

      He's selling a great idea, but it's kind of like the Fountain of Youth. It ain't there, vaporware.

      Is he actually trying to sell a mature product, or is he just showing something cool? I'm not sure where the innovation is, if it's in being able to train text-to-speech to sound like your voice, preserving intonations and such across the translation (even though it's obviously not great at it yet), or if it's just in putting a few existing technologies together, but you have speech recognition, and a translator, and text to speech that sounds like your voice, then this is what you can have. Include preserving the intonation and you have something cool. So what if it's just showing off a cool application of existing technologies?

      Translators aren't great but are getting better...speech recognition isn't great but is getting better. Preserving intonation across the translation and including in text-to-speech in a voice that sounds kinda like your own can probably get better too. Put the 3 together and you get something useful. I think that's all it's trying to show, and I think as these technologies get better we could end up with something pretty cool.

      If this was something out of any other company, would the same people be criticizing it?
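The "put the 3 together" idea in the comment above can be sketched as a simple composition of stages. This is only an illustration, not how Microsoft's prototype works: every function name, the phrase table, and the `voice_id` parameter are hypothetical stand-ins, and the toy stages just pass text around instead of handling real audio.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
Recognizer = Callable[[bytes], str]        # audio -> source-language text
Translator = Callable[[str, str], str]     # (text, target language) -> translated text
Synthesizer = Callable[[str, str], bytes]  # (text, voice id) -> audio


def make_pipeline(recognize: Recognizer,
                  translate: Translator,
                  synthesize: Synthesizer) -> Callable[[bytes, str, str], bytes]:
    """Chain speech recognition, translation, and text-to-speech."""
    def speak_in(audio: bytes, target_lang: str, voice_id: str) -> bytes:
        text = recognize(audio)                    # stage 1: speech to text
        translated = translate(text, target_lang)  # stage 2: text to text
        return synthesize(translated, voice_id)    # stage 3: text to speech
    return speak_in


# Toy stand-ins so the sketch runs without any speech library.
def toy_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


TOY_PHRASES = {("hello", "es"): "hola"}


def toy_translate(text: str, target_lang: str) -> str:
    # Fall back to the untranslated text when the phrase is unknown.
    return TOY_PHRASES.get((text.lower(), target_lang), text)


def toy_synthesize(text: str, voice_id: str) -> bytes:
    # Tag the output with the voice it would be rendered in.
    return f"[{voice_id}] {text}".encode("utf-8")


pipeline = make_pipeline(toy_recognize, toy_translate, toy_synthesize)
result = pipeline(b"hello", "es", "rashid")
```

Each stage can be swapped for a better one without touching the others, which is the sense in which the individual technologies "getting better" would improve the whole.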

      • >If this was something out of any other company, would the same people be criticizing it?

        Ehhn. I dunno. I'll say this. I'll give your answer 10 microLenats.

  • by guttentag ( 313541 ) on Monday March 12, 2012 @11:42PM (#39335351) Journal
    American Businessman (via translated phone call): "I think we can safely say our company would like to use your factory to produce our useless stuff people think they need."
    Chinese Businessman (via translated phone call): "An excellent idea! I suggest we sign the papers over dinner at Translate Server Error [boingboing.net]. They have the best HuMan chicken in town. And the owner prides himself on his bilingual staff."

    So, two problems.

    One, our text translation software isn't foolproof, but people expect it to be. What happens when the software confuses "galleta" (Spanish for "cookie") with "cállate" (Spanish for "shut up")? They do sound similar if you say them out loud, but no one notices because you'd almost never use both in the same conversation. I foresee someone attempting a friendly gesture by offering to share her mother's recipe for "shut up."

    Two, live conversations depend upon both parties building on a shared experience. If each one has a different account of the experience, conversations break down very quickly. Ever tried to carry on a conversation with a schizophrenic? And that's just assuming the errors are innocent. What happens when corporations start using this? Your bank requires you to call a number to activate your new card and during the call they have the software "translate" some required disclosure for you, only the translation doesn't really convey what they are supposed to be disclosing. Don't think it won't happen... whoever implements this first on purpose will be running the company one day.

    Then again, this whole discussion is purely academic. Gene Roddenberry's estate will just claim prior art [memory-alpha.org] and prevent this from ever becoming a reality. Hopefully.
    • by malakai ( 136531 ) on Tuesday March 13, 2012 @12:48AM (#39335723) Journal

      I foresee someone attempting a friendly gesture by offering to share her mother's recipe for "shut up."

      Context is context. Obviously, an English speaker hearing a Spanish speaker offer to share a recipe for "shut up" on an (up until this point) benign and friendly conference call is going to assume translation error. Better than that, translation software knows about these little mix-ups better than you do. With text-to-speech, there's not much to do but suffer the mistranslation (or maybe they play an audible 'ping' when they warn about a context or idiosyncrasy error), but in a system that displays something on a device, these things tend to be shaded a different color, with options offered as to what other meanings may have been intended, based on context.
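A minimal sketch of the flagging idea described above, assuming a hand-written table of easily confused words. The table, the function names, and the `[?...]` marker format are all made up for illustration; a real system would presumably derive candidates from its recognizer and from context rather than from a fixed list.

```python
# Hypothetical confusion table: words a recognizer might mix up.
CONFUSABLE = {
    "galleta": ["cállate"],
    "cállate": ["galleta"],
}


def flag_ambiguous(words):
    """Pair each word with its possible alternatives (empty list if none)."""
    return [(word, CONFUSABLE.get(word.lower(), [])) for word in words]


def render(flagged):
    """Render text, marking flagged words with their alternatives."""
    parts = []
    for word, alternatives in flagged:
        if alternatives:
            # Stand-in for "shaded a different color" on a display.
            parts.append(f"{word}[?{'|'.join(alternatives)}]")
        else:
            parts.append(word)
    return " ".join(parts)


shown = render(flag_ambiguous("la receta de galleta".split()))
```

On a display, the marker would be replaced by shading and a tappable list of alternatives; in a speech-only setup the same flag could trigger the audible 'ping'.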

      One, our text translation software isn't foolproof, but people expect it to be.

      No, they don't. No one even expects paid human translators to be perfect.

      Two, live conversations depend upon both parties building on a shared experience. If each one has a different account of the experience, conversations break down very quickly. Ever tried to carry on a conversation with a schizophrenic?

      Honestly, with a schizophrenic, chances are I have, at some point in my life, on IRC. But more to your point, I've played games where opposing sides are communicating from different languages via Google Translate. Think Russia vs US, and the only way to talk to them is via delayed Google Translate results. It's slow, it's tedious, and yet we somehow managed to have amazing rapport with people of like mind. The assholes were still assholes via Google Translate, and the people we wanted to work with we managed to communicate with. Again, you are ignoring the fact that incrementally better translation is still better than its predecessor. For now. Sure, one day we'll identify some uncanny valley with voice translation, and we'll all spend lots of time plotting how bad the translation software has to be for us to feel it's robotic... but for now, any small step forward is better than the previous one.

      Then again, this whole discussion is purely academic. Gene Roddenberry's estate will just claim prior art [memory-alpha.org] and prevent this from ever becoming a reality. Hopefully.

      Yup, god forbid someone spends time and money on a problem that sci-fi writers got to magically make disappear in one sentence, and a prop. Maybe someday some brilliant young chap will figure out how to make warp drive not require 3x the mass of the universe for power, and Gene's children can make some more cash. Hopefully.

  • Microsoft Research comes up with a prototype that barely works. Apple wraps it up and gives it a foreign name and sells it like crazy.

  • by a_hanso ( 1891616 ) on Tuesday March 13, 2012 @01:04AM (#39335795) Journal
    Do you know who the scientist [microsoft.com] is? Because of this man's work, his grandson will never be able to get Data to pronounce contractions properly.
  • Monty Python's Entry:
    I will not buy this record; it is scratched.
    I will not buy this TOBACCONIST, IT is scratched!
    Would you laaahik... would you LIKE to come back to my place, bouncy bouncy?
    My nipples explode with delight!
    Aah just go watch it yourself! http://www.youtube.com/watch?v=G6D1YI-41ao [youtube.com]

    Frank Zappa's entry:

    This is my left hand.
    This is my right hand.
    I have a big bunch of dick.
    Aah, just go watch it yourself! http://www.youtube.com/watch?v=CkCYJ6FK0T4 [youtube.com]

    Isn't teh internets great?

  • Accent?
    The summary says "preserving the accent, timbre, and intonation of your actual voice". Now I can get timbre and intonation, but accent? It made me wonder what Mandarin with a Scottish accent sounds like. Does it apply Scottish speech tones, which would make it unintelligible, or is it clever enough to find a social equivalent, maybe an accent of a small semi-autonomous region of China?

    Unfortunately checking TFA reveals this "accent" part to be the slashdot reporter's fantasy.
  • ...as there exists already an international phonetic alphabet [wikipedia.org], an alphabet that includes annotations for lilts, guttural intonations and such. Why not just add the IPA pronunciation of each word to a given language dictionary, and have the computer read that? This would greatly reduce the 'training' work needed by the end user. It would also open new possibilities for text-to-speech translation, or even speech-to-speech translation.

    To date I have found no text-to-speech reader on any platform that can under

    • The training has to handle the way real people speak as opposed to the idealized way the words are transcribed. The sounds of words change when they are pronounced in a single sentence as opposed to individually. A single word is often pronounced multiple ways in a single English sentence. The IPA dictionary is also unlikely to be able to handle accents. From that article on the IPA it mentions that not all tones are supported. Chances are there are various other phonemes that the IPA doesn't support.

      Above

    • The closest text-to-speech program today is eSpeak which uses an ASCII variant of IPA phonemes. The problem with this is that it has a voice for each language (which is essentially the same voice) with a subset of the IPA phonemes available. I am intending to use IPA fully in my own text-to-speech program and associated voices (http://rhdunn.github.com/cainteoir/) but haven't gotten to implementing the text to phoneme and phoneme to audio parts yet, nor the associated tools for working with them and the dif

    • by Ksevio ( 865461 )
      IPA symbols are tricky because they're not standard ASCII. The SAMPA [wikipedia.org] alphabet takes the IPA symbols and replaces them with 1-2 ASCII characters. There are a few TTS readers that are capable of speaking SAMPA symbols.
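The IPA-to-ASCII substitution described above can be sketched as a lookup table. The mapping below covers only a small subset of symbols and follows the commonly documented SAMPA conventions for English; the function name is made up for illustration, and diacritics, affricates, and language-specific variants are not handled.

```python
# Partial IPA -> SAMPA table (a small illustrative subset, not the full alphabet).
IPA_TO_SAMPA = {
    "ʃ": "S", "ʒ": "Z", "θ": "T", "ð": "D", "ŋ": "N",   # consonants
    "ə": "@", "æ": "{", "ɪ": "I", "ʊ": "U",             # vowels
    "ɛ": "E", "ɔ": "O", "ɑ": "A", "ʌ": "V",
    "ː": ":", "ˈ": '"', "ˌ": "%",                       # length and stress marks
}


def ipa_to_sampa(ipa: str) -> str:
    """Replace each known IPA symbol; pass everything else through unchanged."""
    return "".join(IPA_TO_SAMPA.get(symbol, symbol) for symbol in ipa)


ship = ipa_to_sampa("ʃɪp")    # "SIp"
thing = ipa_to_sampa("θɪŋ")   # "TIN"
```

Symbols that are already plain ASCII letters in IPA (such as p, t, k) are the same in SAMPA, which is why the fallback simply passes them through.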
  • Dear aunt, let's set so double the killer delete select all

  • by tenco ( 773732 ) on Tuesday March 13, 2012 @06:31AM (#39336861)

    ... if only my software could translate a bytestream of type video/x-ms-asf into a video.

    In light of this experience, why should I believe that someone actually invented a unidirectional universal translator? Nice try.

  • Microsoft has shown more than it has shipped, and that is bad.
  • ... when released, will it run on Linux? Or will it be open-sourced?
