Encrypted VoIP Meets Traffic Analysis


Posted by CmdrTaco on Tuesday March 15, 2011 @11:42AM from the i-only-speak-in-binary dept.

Der_Yak writes "Researchers from MIT, Google, UNC Chapel Hill, and Johns Hopkins published a recent paper that presents a method for detecting spoken phrases in encrypted VoIP traffic that has been encoded using variable bitrate codecs. They claim an average accuracy of 50% and as high as 90% for specific phrases."


- TFA != Wiretap (Score:2)
  
  by Barryke ( 772876 ) writes:
  
  No, it does not work like that (wiretapping encrypted video calls).
  It does not tap the signal; it only improves your odds when guessing whether a particular phrase was spoken.
- Re: (Score:1)
  
  by kmoser ( 1469707 ) writes:
  
  I'd tap that.
- Re:Bleh (Score:5, Informative)
  
  by Anthony Mouse ( 1927662 ) writes: on Tuesday March 15, 2011 @11:29AM (#35492090)
  
  I'm pretty sure that identifying a specific word with 50% accuracy is better than random chance. There are more than two words in the English language.
  
  - Re:Bleh (Score:5, Funny)
    
    by Chrisq ( 894406 ) writes: on Tuesday March 15, 2011 @11:47AM (#35492350)
    
    Once they discover a method to wire trap encrypted video calls, that would open a new era in porn scene.
    ...
    I'm pretty sure that identifying a specific word with 50% accuracy is better than random chance. There are more than two words in the English language.
    Maybe he's talking about the porn film. 90% seem to be "oh" or "yes" (or so I am told)
    
    - - Re:Bleh (Score:5, Funny)
        
        by ciderbrew ( 1860166 ) writes: on Tuesday March 15, 2011 @12:34PM (#35493002)
        
        The pitch is the main thing in the art form.
        A low German voice - "ooohhh yaaaaa", over and over. then you have the high pitched Japanese squeak sound - "ii, ii, ii, kimochi". Which really gets annoying these days. It took a few years; but it IS annoying.
        
  - Re: (Score:2)
    
    by Virtual_Raider ( 52165 ) writes:
    
    You mean it doesn't amount to "fuck" and "shit"? The media and the internet have fooled me again!
- - Re:Bleh (Score:5, Funny)
    
    by zill ( 1690130 ) writes: on Tuesday March 15, 2011 @11:39AM (#35492260)
    
    A'LA'IH [xkcd.com]
    
- Re: (Score:2)
  
  by ByOhTek ( 1181381 ) writes:
  
  People only use two phrases when they talk?
  - Re: (Score:2)
    
    by Dalzhim ( 1588707 ) writes:
    
    Especially when being wiretapped.
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    People only use two phrases when they talk?
    The phrases that it detects are "Badda-bing" and "Badda-boom."
    - Re: (Score:2)
      
      by ciderbrew ( 1860166 ) writes:
      
      This should have got at least one +funny.
  - Re: (Score:3)
    
    by NotQuiteReal ( 608241 ) writes:
    
    The two phrases are "can you hear me?" and "I have a bad connection, let me call you back."
- Re: (Score:3)
  
  by gstoddart ( 321705 ) writes:
  
  So on average that can't do any better than chance. Wow such great results!
  I think if half the time you can identify a phrase in a supposedly encrypted stream ... that's better than 'chance'.
  - Re: (Score:1)
    
    by Lumpy ( 12016 ) writes:
    
    They are looking for specific words and phrases...
    Bomb, president, freedom, take back control, uprising, constitutional....
    You know, only words that the evil terrorists would use.
    - Re: (Score:2)
      
      by fnj ( 64210 ) writes:
      
      Oops ... wait a minute ...
- Re:Bleh (Score:5, Funny)
  
  by batquux ( 323697 ) writes: on Tuesday March 15, 2011 @11:31AM (#35492146)
  
  Come on, 50% is better than most unencrypted voice recognition!
  
- Re: (Score:1)
  
  by Lumpio- ( 986581 ) writes:
  
  I think there's a big difference in the probabilities of a coin toss and the probability of guessing the correct phrase of who-knows-how-many alternatives.
- - Re: (Score:1)
    
    by AlienIntelligence ( 1184493 ) writes:
    
    How many words are there in the English language - many tens of thousands at least.
    Many tens of thousands???
    I hope English is your second language.
    There are over 1 MILLION English words in common and uncommon use.
    [ http://www.languagemonitor.com/no-of-words/ [languagemonitor.com] ]
    Yes.... many, many, many tens of thousands.
    -AI
    FWIW, in response to TFA... I realize their research is on phrases. Which
    very quickly reduces the set. Since many of those words would only exist
    in very few spoken phrases.
- Re:Bleh (Score:5, Interesting)
  
  by bennomatic ( 691188 ) writes: on Tuesday March 15, 2011 @11:38AM (#35492246) Homepage
  
  This reminds me of the guy Colbert interviewed regarding the Large Hadron Collider who thought there was a 50% chance that it would destroy the universe. When questioned as to how he got those odds, he said, "Well, there's two options... either it will happen or it won't happen. 50%."
  
  - Re: (Score:3)
    
    by lwsimon ( 724555 ) writes:
    
    I remember following this logic... when I was three. No shit, I have a vivid memory of trying to figure out how proportions worked - I knew that a penny tossed would give a 50/50 split, but that other problems with two states - e.g., when I threw a rock, I'd either hit the matchbox car or I wouldn't - didn't. I gave up, and figured it out later, when I was five or so.
  - Re: (Score:2)
    
    by Magnus Pym ( 237274 ) writes:
    
    Well, assuming that he has no knowledge about how the thing works and has no other information, his computation of probabilities is technically correct :)
- - Re: (Score:1)
    
    by AlienIntelligence ( 1184493 ) writes:
    
    but they're recognizing individual words, from a set of many thousands of potential words, half the time or better.
    That's really quite impressive. And you're an idiot.
    From a set of many thousands of words...
    and he's the idiot?
    -AI
That's not good (Score:1)

by Anonymous Coward writes:

Better stick to a constant bitrate then :)
- Re: (Score:1)
  
  by WorBlux ( 1751716 ) writes:
  
  Exactly. Or just add enough random data to the stream, on top of the voice channel, to make it look like a constant stream of random data.
So...obvious solution then? (Score:5, Interesting)

by Anthony Mouse ( 1927662 ) writes: on Tuesday March 15, 2011 @11:30AM (#35492122)

Use fixed-bitrate encoding for VoIP.

- Re: (Score:2)
  
  by ackthpt ( 218170 ) writes:
  
  Use fixed-bitrate encoding for VoIP.
  Better still, two cans and a length of string.
  - Re: (Score:3)
    
    by Bengie ( 1121981 ) writes:
    
    Until someone gets a warrant to string tap you. You'd think the string connecting the two cans is protected by quantum randomness from string theory, but it is not.
- Re: (Score:3, Interesting)
  
  by bsquizzato ( 413710 ) writes:
  
  Not so obvious --- now you have a much less efficient use of bandwidth to deal with.
  The article describes the method used to detect phrases ...
  At a high level, the success of our technique stems from exploiting the correlation between the most basic building blocks of speech -- namely, phonemes -- and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.
  Essentially, you gather enough information about how a VBR codec could encode a speech phrase you are looking for, then predict where it was spoken by looking at the "data bursts" being sent in the media stream. We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eat up unneeded bandwidth.
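To make the "sub-sequence of packet lengths" idea concrete, here is a deliberately naive sketch. It is not the authors' method: the quoted passage describes a model built from the most likely phonemes and packet lengths, whereas this simply slides a fixed template over the observed lengths and counts near matches. The function name `find_phrase`, the `tolerance` and `min_match` parameters, and all packet lengths below are made up for illustration.

```python
from typing import List, Sequence


def find_phrase(
    observed: Sequence[int],
    model: Sequence[int],
    tolerance: int = 2,
    min_match: float = 0.8,
) -> List[int]:
    """Return start offsets where `observed` packet lengths resemble `model`.

    A window counts as a hit when at least `min_match` of its packet
    lengths are within `tolerance` bytes of the template. This is a crude
    stand-in for a probabilistic model, used only to show the idea.
    """
    n = len(model)
    if n == 0:
        return []
    hits = []
    for start in range(len(observed) - n + 1):
        window = observed[start:start + n]
        close = sum(
            1 for got, want in zip(window, model) if abs(got - want) <= tolerance
        )
        if close / n >= min_match:
            hits.append(start)
    return hits


# Hypothetical template: packet lengths (bytes) a VBR codec might emit
# for the phonemes of one target phrase.
model = [28, 35, 46, 46, 38, 28]

# Hypothetical lengths sniffed from an encrypted stream. The payload is
# unreadable, but the lengths are visible to anyone on the path.
observed = [20, 20, 28, 36, 46, 45, 38, 27, 20, 33, 20]

print(find_phrase(observed, model))  # [2]
```

The point is that the attacker never touches the ciphertext itself; everything above works on sizes alone, which is why encryption by itself does not help.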
  - Re:So...obvious solution then? (Score:5, Informative)
    
    by Anonymous Coward writes: on Tuesday March 15, 2011 @12:09PM (#35492684)
    
    OpenSSH had a similar problem: it would leak information about your login password through the timing and size of the packets:
    http://www.ece.cmu.edu/~dawnsong/papers/ssh-timing.pdf
    I believe their solution was to introduce random NOP packets into the stream. This approach could work here too.
    
    - Re: (Score:2)
      
      by modemboy ( 233342 ) writes:
      
      I immediately thought of this exploit as well. Seems to me you would need a lot of NOP packets comparatively, the login info is just a few keystrokes. Plus login info is not time sensitive on the receiving end, delays in a voice stream might not be acceptable.
  - Re: (Score:2)
    
    by buback ( 144189 ) writes:
    
    So I guess it's like how dentists understand their patients when they have their hands and tools in their mouths.
  - Re: (Score:2)
    
    by tixxit ( 1107127 ) writes:
    
    Some encrypted systems actually specify how much data can be "leaked" out per some amount of time. The idea is that, practically, you'll always lose something, so you need to determine a limit that is acceptable. I guess that while voice/sound "data" is very complex, speech is much less so and it doesn't take much data being leaked to get the gist of what was said. Since their method is essentially looking at a sequence of numbers, the more obvious solution may be to add some padding to the packets to foil
  - Re: (Score:3)
    
    by Jah-Wren Ryel ( 80510 ) writes:
    
    We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
    Any fix is going to "waste" some amount of bandwidth.
    One solution to this attack may be to semi-randomly inject "nops" to bridge phoneme breaks. So instead of being able to identify individual phonemes by bandwidth spikes, attackers will be limited to identifying entire word clusters - like filling the "space" between the phonemes in the first three words of a sentence to make it look like one really long phoneme.
    But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that t
    - Re: (Score:1)
      
      by Anonymous Coward writes:
      
      But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that they are transmitted somewhat out of order and then re-ordered on the receiving end. That probably won't use up much extra bandwidth but would increase latency.
      Might not even need to re-order the audio, just burst it so that multiple phonemes are all "packed" together for transmission so there are much fewer phoneme breaks visible via traffic analysis. You burn latency that way too, but it would be much simpler to implement than a randomizing algorithm.
      - Re: (Score:2)
        
        by dgatwood ( 11270 ) writes:
        
        Agreed that the problem is the packing, not the data. However, grouping multiple short packets together is still leaking information. The only difference is that instead of looking at the length of packets, you have to look at the timing between packets.
        I would suggest that the right solution is to modify your code so that instead of sending out packets of varying length isochronously, you instead send out packets of the same length isochronously, and adjust the average length every... say ten seconds, a
      - QoS (Score:2)
        
        by sourcerror ( 1718066 ) writes:
        
        Thus you increase latency, which is the single most important thing in a phonecall.
    - Re: (Score:2)
      
      by NateTech ( 50881 ) writes:
      
      Using a VBR codec and then inserting NOPs sounds like... using a non-variable streaming codec.
  - Re: (Score:2)
    
    by Anthony Mouse ( 1927662 ) writes:
    
    We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
    It seems like there might be some promise in improving the compression method itself using the same techniques, so that the things that currently take more bandwidth would take less and therefore become less distinguishable, but if the compression is already near-optimal then this won't work without an efficiency loss because the change would correspondingly make the things that currently take less bandwidth take more, and those things might be more common.
    The only general solution is some kind of padding s
  - Re: (Score:2)
    
    by Peter Simpson ( 112887 ) writes:
    
    It's very clever. Seems like using a CBR encoder would defeat this method, because every packet would have the same number of samples. Being *too* efficient might save you bandwidth, but it reveals something about your speech patterns.
  - Re: (Score:2)
    
    by psydeshow ( 154300 ) writes:
    
    At a high level, the success of our technique stems from exploiting the corre-lation between the most basic building blocks of speech—namely, phonemes—and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.
    Awesome.
    It's like listening to the "Mwa mwaa mwaa mwa mwa" voice that adults use in the old Peanuts television specials, and figuring out what they are saying based on the length of the "mwas" and their order in the conversation.
    - Re: (Score:2)
      
      by PReDiToR ( 687141 ) writes:
      
      You mean like trying to decipher Kenny from South Park's words?
      
      I wonder what my kids would compare it to ...
  - Re: (Score:3)
    
    by Kjella ( 173770 ) writes:
    
    Not so obvious --- now you have a much less efficient use of bandwidth to deal with.
    Enough to matter? According to my cell phone bill, I had over 100MB of data traffic last month. That's about 10 hours of 24 kbps CBR encoded voice, which is the highest possible CBR setting speex has. If it's on my DSL/cable/whatever line, who cares? Even if I did that 24x7 for a month it'd be 7-8 GB and I'm pretty sure even a teenage girl with mouth diarrhea has to sleep sometimes. If that's what it takes, I don't see CBR as being a dealbreaker.
    - Re: (Score:1)
      
      by bsquizzato ( 413710 ) writes:
      
      Now take hundreds of thousands of calls like yours running through your service provider's network, being transferred to other providers' networks, etc. Or hundreds or thousands of calls running within a large enterprise, such as from branch offices to HQ. Bandwidth costs money. In situations like these, you try to conserve bandwidth any way you can.
    - Re: (Score:2)
      
      by Eivind ( 15695 ) writes:
      
      Not enough to matter.
      VBR *does* save bandwidth for equivalent quality, but not a lot of it.
      Your 100MB gives you 10 hours of 24kbps CBR-encoded voice, and at a guess, VBR would maybe give you 13-15 hours of voice in the same bandwidth.
      Certainly trivial, and certainly the answer to this problem is that encrypted voice should be encoded CBR to make traffic analysis impossible.
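The back-of-the-envelope numbers in this subthread are easy to check. The sketch below assumes decimal units (1 MB = 1,000,000 bytes, 1 kbps = 1,000 bits/s) and counts codec payload only; it ignores per-packet header overhead, which would push real traffic higher. The helper names are made up for illustration.

```python
def hours_of_voice(megabytes: float, kbps: float) -> float:
    """Hours of audio that fit in `megabytes` at a constant `kbps` payload rate."""
    bits = megabytes * 1_000_000 * 8
    seconds = bits / (kbps * 1000)
    return seconds / 3600


def gigabytes_per_month(kbps: float, days: int = 30) -> float:
    """Payload volume in GB for a call running nonstop for `days` days."""
    seconds = days * 24 * 3600
    return kbps * 1000 * seconds / 8 / 1_000_000_000


print(round(hours_of_voice(100, 24), 1))   # 9.3
print(round(gigabytes_per_month(24), 1))   # 7.8
```

So "about 10 hours" per 100MB and "7-8 GB" for a month of 24x7 talking both hold up under these assumptions.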
- Re:So...obvious solution then? (Score:5, Interesting)
  
  by Cthefuture ( 665326 ) writes: on Tuesday March 15, 2011 @12:22PM (#35492852)
  
  Actually, most people are using G.711 these days, which is in fact fixed bitrate (it's the same encoding used on your normal "hard" voice line).
  But most VoIP providers do not offer SRTP or any encryption whatsoever so this whole thing is not even a question. More than likely anyone can listen in on your VoIP calls. We need to put more pressure on VoIP providers to offer encryption.
  
- Re: (Score:1)
  
  by TuringCheck ( 1989202 ) writes:
  
  Working in telephony and VoIP for the last 8 years, I don't remember seeing a VBR codec in actual use - ever. At most silence detection is used, but that has unpleasant side effects too. I also find it useless to save 2-3 bytes when the UDP+RTP overhead is 40 (plus at least 4 if SRTP is used).
Stalin's Dream II (Score:3)

by ackthpt ( 218170 ) writes: on Tuesday March 15, 2011 @11:31AM (#35492136) Homepage Journal

Teh Recognisining.
"I'd like to order pizza, with pepperoni, pineapple, mushroom and an Iludium Pu-36 space modulator delivered to Hall of Justice."

- - Re: (Score:3)
    
    by bmo ( 77928 ) writes:
    
    http://www.youtube.com/watch?v=7A4HeawmE6A [youtube.com]
    Not knowing what an Illudium Pu-36 Explosive Space Modulator means you had a deprived childhood.
    --
    BMO
    - Re: (Score:1)
      
      by AlienIntelligence ( 1184493 ) writes:
      
      http://www.youtube.com/watch?v=7A4HeawmE6A [youtube.com]
      Not knowing what an Illudium Pu-36 Explosive Space Modulator means you had a deprived childhood.
      --
      BMO
      Hear, hear!
      Marvin is the man! I mean, he's the silly thought and pseudo I use
      for this nickname.
      -AI
Duh! (Score:2, Insightful)

by Anonymous Coward writes:

When you want to secure something, you must think carefully about how you might be leaking information. You can't just slap some encryption on and call it a day.
3 years old work (Score:3)

by slashdotmsiriv ( 922939 ) writes: on Tuesday March 15, 2011 @11:59AM (#35492548)

The conference version of the paper appeared in IEEE S&P 2008.
http://cs.unc.edu/~fabian/papers/oakland08.pdf [unc.edu]

No shit? (Score:1)

by Anonymous Coward writes:

You mean when you vary a quality of your signal (in this case bitrate) based on content, people can read information about the content from those variations??? OMFG!
then it's shitty encryption (Score:3)

by cellocgw ( 617879 ) writes: <cellocgw.gmail@com> on Tuesday March 15, 2011 @12:25PM (#35492890) Journal

The definition (somewhere in the 'net archives) of encryption quality is how distinguishable the encrypted message is from random noise. Clearly setting bitrates, or any other parameter, based on the input, is not random.
Pick a better algorithm and/or suck it up and waste a little bandwidth.

- Re: (Score:2)
  
  by dachshund ( 300733 ) writes:
  
  The definition (somewhere in the 'net archives) of encryption quality is how distinguishable the encrypted message is from random noise. Clearly setting bitrates, or any other parameter, based on the input, is not random.
  (A common) definition of symmetric encryption is that a message should be indistinguishable from an equal-length string of random bits. In that sense, there's nothing wrong with this encryption scheme.
  What is wrong here is that encryption does not hide message length, and in many cases mes
Google Voice (Score:1)

by Arykor ( 966623 ) writes:

Google is involved in this? Perhaps encryption could help them improve the accuracy of transcription in Google Voice... [twitter.com]
What phrases? (Score:2)

by stillnotelf ( 1476907 ) writes:

I'm hoping it's best at picking up obvious spy phrases, like "the eagle has landed", "the moon fish squicks wickedly at midnight", "long is the gap between cacti"... Somehow I think it's probably best at "hello".
- Re: (Score:2)
  
  by DriedClexler ( 814907 ) writes:
  
  Somehow I think it's probably best at "hello".
  I'm one step ahead of these known-plaintext attacks -- no longer do I use the same, small set of voice greetings. No no -- I prepend a nonce.
  "Hello?"
  "Shgr'gl'hm-v'va Hi Mom, it's Clyde ... and you're not supposed to answer the phone like that!!!"
- Re: (Score:2)
  
  by NateTech ( 50881 ) writes:
  
  Who answers with "Hello" still? Waste of time. Look at Caller ID, "Hi XXX."
  Or... "This is XXX." That one always throws the telemarketers... "Is X there?" "Didn't I just say that?"
  Or my favorite, old military and any kind of "Operations" job folks... we just answer with our last name. One word, contact established, identity verified... go with your traffic.
  "Goodbye" is silly too. Just hang up.
Variable bit rate? (Score:2)

by s_p_oneil ( 795792 ) writes:

Did you note that they specified variable bit rate? In this case, I'll bet it had more to do with the timing and flow of the packets and bytes than with the actual content of the bytes. When there's a pause in a person's speech, there is a pause in the network traffic. Imagine someone trying to send morse code through an encrypted voice channel. Someone watching a bandwidth graph that had a high enough frequency would know exactly what coded message you sent regardless of the compression or encryption algor
RTP blinding (Score:2)

by WaffleMonster ( 969671 ) writes:

A few solutions...
Add some number of pad bytes to each packet to fill in blanks.
Tweak existing high-complexity codecs (iLBC, Speex, etc.) to maintain a persistent bitrate by dynamically scaling quality to even out the per-packet bits.
Use a fixed-bitrate codec (most of these really suck from a bandwidth efficiency vs. quality perspective).
Switch variability to the time domain, adding jitter to mask the signal and control the latency/security tradeoff.
SRTP scares me because it was invented for a single narrow purpose. Would
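
The first option, padding, can be sketched in a few lines. This is an illustration only, not part of SRTP or any codec: `pad_payload`, `unpad_payload`, and the 64-byte bucket are hypothetical, and the padding would have to be applied before encryption so that an observer only ever sees the padded size.

```python
def pad_payload(payload: bytes, bucket: int = 64) -> bytes:
    """Pad `payload` up to the next multiple of `bucket` bytes.

    The final byte stores the pad length, so at least one byte is always
    added and the receiver can strip the padding unambiguously.
    """
    if not 1 <= bucket <= 255:
        raise ValueError("bucket must fit in one length byte (1..255)")
    pad_len = bucket - (len(payload) % bucket)
    return payload + bytes(pad_len - 1) + bytes([pad_len])


def unpad_payload(padded: bytes) -> bytes:
    """Remove padding added by pad_payload."""
    if not padded:
        raise ValueError("empty input")
    pad_len = padded[-1]
    if pad_len == 0 or pad_len > len(padded):
        raise ValueError("corrupt padding")
    return padded[:-pad_len]


# Hypothetical VBR frame sizes that would otherwise be distinguishable.
frames = [bytes(n) for n in (20, 38, 46)]
print([len(pad_payload(f)) for f in frames])  # [64, 64, 64]
```

A larger bucket hides more but wastes more bandwidth; with a bucket at least as large as the biggest frame, every packet has the same size and you have effectively rebuilt a fixed-bitrate stream.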
useless, and easy countermeasures (Score:3)

by t2t10 ( 1909766 ) writes: on Tuesday March 15, 2011 @01:17PM (#35493600)

First of all, statements like "50% accuracy" are nearly useless; you need to know both precision and recall. And to the degree that "50% accuracy" tells you anything, it tells you that the system is pretty bad.
Finally, the countermeasure for this is the same as the countermeasure for other automated speech analysis techniques: play some singing or theater in the background.

- Re: (Score:2)
  
  by uid7306m ( 830787 ) writes:
  
  Exactly. The phrases used are fairly long, for instance: "Laugh, dance, and sing if fortune smiles upon you." In the TIMIT corpus, there are 122 of them. In the English language, there are hmm, lots of sentences of that length. There are about 1000 different syllables in English, and I count 11 syllables in that sentence. Thus, there are some fraction of 10^33 sentences of that length.
  So, if you tried this on English, one of two things would happen. If you used that recognizer without any modifi
Nexidia (Score:1)

by randyjparker ( 543614 ) writes:

Nexidia has been selling proprietary tech to do this for years
Average accuracy of 50%? (Score:1)

by fishbowl ( 7759 ) writes:

On any digital signal, comparing a random source of bits should get you 50% accuracy.
Better than guessing? (Score:2)

by KnownIssues ( 1612961 ) writes:

I'm sure there's a mathematical/statistical reason why 50% accuracy is better than guessing in this case, but that would be very counter-intuitive. Same with as high as 90% under certain conditions. I could get to 90% accuracy if I could select out everything that reduced my accuracy as well. I don't doubt the full article explains better though. I'm not suggesting MIT, Google, etc scientists are stupid.
An exercise of pattern detection (Score:2)

by c0lo ( 1497653 ) writes:

Seems that I started to detect a pattern between the current TFA and this [slashdot.org] one.
Now, DHS, I know I'm not at MIT, but other [wikipedia.org] cases showed I don't need to... So, just where is my grant for advanced research of the subject?
