
Gmail Now Rejects Emails With Misleading Combinations of Unicode Characters

Soulskill posted about a month ago | from the we-look-forward-to-being-caught-in-your-new-web dept.

Communications 79

An anonymous reader writes: Google today announced it is implementing a new effort to thwart spammers and scammers: the open standard known as Unicode Consortium's "Highly Restricted" specification. In short, Gmail now rejects emails from domains that use what the Unicode community has identified as potentially misleading combinations of letters. The news today follows Google's announcement last week that Gmail has gained support for accented and non-Latin characters. The company is clearly okay with international domains, as long as they aren't abused to trick its users.


GNAA (-1)

Anonymous Coward | about a month ago | (#47658359)

Gay Niggers Association of America.

Are you gay!? YES!
Are you a nigger!? YES!
Are you a gay nigger!? YES!!

Re:GNAA (-1)

Anonymous Coward | about a month ago | (#47659401)

weev, is that you?

Re:GNAA (-1)

Anonymous Coward | about a month ago | (#47659687)

????????
Profit!

ftfy

FÜÇK ÿèàh (5, Funny)

Anonymous Coward | about a month ago | (#47658379)

...

Re:FÜÇK ÿèàh (1)

Wootery (1087023) | about a month ago | (#47662367)

Could've sworn Slashdot had zero support for unicode characters.

(I appear to be unable to paste in a 'Trademark' symbol. What is this magic, AC?!)

Re:FÜÇK ÿèàh (1)

Eunuchswear (210685) | about a month ago | (#47664389)

iso-8859-1

Slashdot supports the first 256 unicode characters (except some of C0 and C1)
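A quick way to check that claim against a given string, as a minimal Python sketch. It assumes "the first 256 characters" means anything encodable as ISO-8859-1 (Latin-1) and ignores the C0/C1 exceptions mentioned above; `fits_latin1` is a name made up for this example.

```python
def fits_latin1(text):
    """True if every character is one of the first 256 code points."""
    try:
        text.encode("latin-1")  # Latin-1 covers exactly U+0000..U+00FF
        return True
    except UnicodeEncodeError:
        return False

print(fits_latin1("\u00d3h r\u00e9\u00e4ll\u00fd?"))  # accented Latin letters
print(fits_latin1("\u2122"))                          # the trademark sign
```

The trademark sign sits at U+2122, well outside that range, which would explain why it cannot be pasted.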

Good that this applies to from: and not the body (4, Interesting)

CRCulver (715279) | about a month ago | (#47658391)

...of the e-mail. Any attempt to block spam or phishing on the basis of mixing character sets would have to confront the fact that some people do need to mix character sets. Typically representations of Mari in the Latin alphabet, for example, also make use of the Greek letters beta and eta. In fact, eta is used in Latin representations of several minority languages of Russia. And the Reddit crowd loves making weird smilies in their English-language writing by means of symbols drawn from Indian scripts.

Re:Good that this applies to from: and not the bod (3, Funny)

Russ1642 (1087959) | about a month ago | (#47658415)

If this spells death to those ridiculous smilies then it's ok with me.

Re:Good that this applies to from: and not the bod (0)

Anonymous Coward | about a month ago | (#47658727)

Unfortunately those aren't likely to be mistaken for latin characters... They probably get a free pass.

Re:Good that this applies to from: and not the bod (1)

ericloewe (2129490) | about a month ago | (#47658827)

They're reason enough for me to almost believe whoever designed ASCII was a genius.

Re:Good that this applies to from: and not the bod (0)

Anonymous Coward | about a month ago | (#47669877)

...and should be hanged by his nuts

Re:Good that this applies to from: and not the bod (3, Interesting)

mi (197448) | about a month ago | (#47658417)

I routinely substitute Cyrillic letters for Latin on Disqus and other forums to get around their filters (which block for more than mere "profanity").

Slashdot does not allow non-ASCII characters — although it does not attempt to screen out profanity either.

Re:Good that this applies to from: and not the bod (2)

Ichijo (607641) | about a month ago | (#47658757)

Slashdot does not allow non-ASCII characters...

...unless they're in code page 1252.

Re:Good that this applies to from: and not the bod (1)

hondo77 (324058) | about a month ago | (#47659037)

Slashdot does not allow non-ASCII characters...

Óh réällý?

Re:Good that this applies to from: and not the bod (1)

mi (197448) | about a month ago | (#47659131)

Óh réällý?

That's pretty cool. I guess, the entire ISO-8859-15 is Ok? But not Cyrillics :-( Or else, you would've seen some Ukrainian-Russian conflict right here...

Re:Good that this applies to from: and not the bod (0)

Anonymous Coward | about a month ago | (#47660997)

Boo, slashdot discards Esperanto and Polish letters; surprisingly lame!

It's not as if UTF-8 is some crazy new thing, sheesh.

Re:Good that this applies to from: and not the bod (0)

Anonymous Coward | about a month ago | (#47661007)

I'm curious about why you need to get around the filters. If you disagree with filtering in general, do you really think rebelling against it on some Internet forums is going to make a difference? Why not just move on to less-restricted forums or stay and follow the rules?

Re:Good that this applies to from: and not the bod (0)

Anonymous Coward | about a month ago | (#47672791)

He's trying to post on local news websites (TV stations) is my guess, they all have their comments farmed out to Disqus or Topix these days. The problem is you can't engage in a conversation without editing your comment 10 times, with no indication at any point what is actually being flagged as inappropriate, or else by subverting the filter as GP mentioned. They aren't just filtering vulgarity. Words like bribe and corrupt are blocked by my TV station's Topix comments, so it's hard to discuss politicians for example.

Re:Good that this applies to from: and not the bod (1)

tlhIngan (30335) | about a month ago | (#47658465)

...of the e-mail. Any attempt to block spam or phishing on the basis of mixing character sets would have to confront the fact that some people do need to mix character sets. Typically representations of Mari in the Latin alphabet, for example, also make use of the Greek letters beta and eta. In fact, eta is used in Latin representations of several minority languages of Russia. And the Reddit crowd loves making weird smilies in their English-language writing by means of symbols drawn from Indian scripts.

Or perhaps more practically, needing to send email with multiple translations in them. Either as a courtesy to your audience who may speak English or French, or German, and you're not quite sure which they're more comfortable with. So you send your email with all three languages in it.

North American based companies may do English, French and Spanish in their email.

Though perhaps one area where they could block in the body is in HTML tags - if there's a restricted character in a link, perhaps that's a reason to block.

Re:Good that this applies to from: and not the bod (1)

Dutch Gun (899105) | about a month ago | (#47659893)

Heuristics could pretty easily determine if someone communicates only in English in their e-mails, and as such, any legitimate e-mails that contain large amounts of non-English words or characters should be viewed with greater suspicion. For those that routinely communicate in more than one language and use non-ASCII sets, the heuristic should be able to account for that fact.

These sorts of rules are always fuzzy by nature. Obviously, whether an e-mail is determined to be legitimate or not is due to many different factors. This could simply be one of those contributing factors.

Re:Good that this applies to from: and not the bod (1)

TubeSteak (669689) | about a month ago | (#47658497)

Good that this applies to from: and not the body of the e-mail.

That's not at all good, and filtering the body is exactly what I want.
Spammers already spoof the from: domain and then link you out to exactly the type of domain that Gmail is now filtering.

There's no reason Gmail can't flag [body] links to domains that use mixed character sets.

You're a poopy head :P (0)

Anonymous Coward | about a month ago | (#47659489)

I found it amusing that you are aware of the existence of different fields in email and then used your post to demonstrate that you have no fucking clue how to use them. Sentences should not be split across the subject and body fields.

Homoglyph protection at last, sort of. (1)

Animats (122034) | about a month ago | (#47658419)

OK, good. Now if ICANN applied that tougher standard to domain name registrars, we'd make progress. But no, ICANN still allows registrars to register domain names without forcing them to comply with the most restrictive profile.

all of them then? (1)

hurfy (735314) | about a month ago | (#47658421)

This looks like fun, I probably wouldn't catch that bank example and family certainly wouldn't. Looks like pretty much any word could substitute one letter.

No idea exactly what these "combinations" are. The example used one letter substitution. Using this example and the little display of new letters there would appear to be billions of potentially misleading combinations.

Re:all of them then? (2)

TheGavster (774657) | about a month ago | (#47658583)

The "restrictive profile" that Google is using for the filtering is defined in Unicode as any combination of the Latin character set with another set or sets, with the exception of very specific combinations (selected legitimate combinations of Asian sets that contain radically different letter forms and thus are unlikely to cause confusion).
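A rough sketch of that rule in Python, for illustration only. The standard library has no script property, so this guesses each letter's script from the first word of its Unicode character name (with "CJK" standing in for Han). The `ALLOWED_WITH_LATIN` table and both function names are made up here; the actual profile is defined by Unicode's security specification, not by this code.

```python
import unicodedata

# Script mixes tolerated alongside Latin ("CJK" stands in for Han).
ALLOWED_WITH_LATIN = [
    {"LATIN", "CJK", "HIRAGANA", "KATAKANA"},
    {"LATIN", "CJK", "BOPOMOFO"},
    {"LATIN", "CJK", "HANGUL"},
]

def scripts_of(label):
    """Approximate the set of scripts used by the letters in a label."""
    found = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

def looks_highly_restricted(label):
    """True if the label is single-script or one of the allowed Latin mixes."""
    scripts = scripts_of(label)
    if len(scripts) <= 1:
        return True
    return any(scripts <= allowed for allowed in ALLOWED_WITH_LATIN)

print(looks_highly_restricted("mybank"))        # all Latin
print(looks_highly_restricted("myb\u0430nk"))   # Cyrillic a hidden in Latin
```

Name-prefix matching is crude (digits, hyphens and unusual letters are not handled properly), so treat this as a way to see the shape of the rule rather than something to filter mail with.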

Re:all of them then? (1)

godrik (1287354) | about a month ago | (#47663411)

I'd like to see the precise rules (but too lazy to RTFA now). There are many non-English words that can be highly confusing. In French "telephone" is "téléphone", which could be thought of as a way to trick users. Also, Turkish has a dotless i; I would not be surprised if it appears in words with similar spelling in English.

Re:all of them then? (1)

Eunuchswear (210685) | about a month ago | (#47664415)

ITYM téléphone.

Sounds bad (2, Insightful)

Anonymous Coward | about a month ago | (#47658429)

If I start a business with a unicode domain, and if later a scammer registers an ascii domain that is similar looking, then Gmail will blackhole my business, not the scammer, because I'm the one using unicode.

Re:Sounds bad (0)

Anonymous Coward | about a month ago | (#47658481)

I was also thinking about private people buying look-alike domains because "their" name was taken, especially people wanting their own name or a cool nick as their email address.

Re:Sounds bad (1)

Immerman (2627577) | about a month ago | (#47658773)

Probably a bad idea - what exactly is the legitimate point of having a cool web address if *everyone* will *always* mistype it and go to the original site instead?

Re:Sounds bad (0)

Anonymous Coward | about a month ago | (#47659773)

Probably a bad idea - what exactly is the legitimate point of having a cool web address if *everyone* will *always* mistype it and go to the original site instead?

no, it's just like yahoo.com except you have to use ALT + 224 for the a. What could be easier to remember than that?

Re:Sounds bad (1)

Desolation Row (550944) | about a month ago | (#47659903)

In the Russian borrowed word "radio", the Cyrillic characters a and o look identical to the same English letters (the rest are completely different).

The Russian word "radio" should be in (the specific Russian Cyrillic subset of) Unicode/UTF, while the English word "radio" should be in Unicode/ASCII.

Mixing and matching character sets in URLs or email address typically indicates "intent to confuse". Within text, it usually just confuses translators and spell checkers.
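To see what such a mixed word looks like to software, here is a small Python sketch that lists the code point and Unicode name of each character. The variable names and the `describe` helper are invented for the example.

```python
import unicodedata

latin = "radio"
mixed = "r\u0430di\u043e"  # Cyrillic letters swapped in for Latin a and o

def describe(word):
    """Return (code point, Unicode name) for each character in the word."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

for code_point, name in describe(mixed):
    print(code_point, name)

print(latin == mixed)  # the two strings render alike but are not equal
```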

Re: Sounds bad (0)

Anonymous Coward | about a month ago | (#47675879)

I've read your comment several times and still don't understand.

First радио would be allowed as well as radio, but paдиo etc. would be disallowed.

Could you simplify and restate your point?

whack-a-mole 3.0 (2, Insightful)

Anonymous Coward | about a month ago | (#47658439)

And the latest round of whack-a-mole begins...

This is going to do... (1)

sudden.zero (981475) | about a month ago | (#47658441)

...absolutely nothing! The scammers will just find some other way to create their automated email garbage.

Re:This is going to do... (1)

pla (258480) | about a month ago | (#47658913)

This is going to do...absolutely nothing! The scammers will just find some other way to create their automated email garbage.

You kidding? Thanks to allowing these new email addresses, I have an entirely new category of auto-deletable spam. These won't "confuse" me because I'll never see them. Win/Win!

Go ahead, spammers, get cute. Just makes my life thaaat much easier.

Re:This is going to do... (1)

AHuxley (892839) | about a month ago | (#47659089)

It looks after working with ads in English.
It looks after other interested parties looking for expected keywords.

&^*308cbpBO)780i76D$^*.//.we0-fw (0)

pigiron (104729) | about a month ago | (#47658443)

q898(^*$*EUIDXEZ{Pm;vd80eGUIOIO:>P{
{}.

det6767ir6768P)I*)&%B(()_}K>?YIBV$WCJ!!!!!

Re:&^*308cbpBO)780i76D$^*.//.we0-fw (1)

sudden.zero (981475) | about a month ago | (#47658475)

Ah but did you use Unicode to make those characters?

Re:&^*308cbpBO)780i76D$^*.//.we0-fw (1)

pigiron (104729) | about a month ago | (#47658545)

ÂÂâ¥
Ã¥â(TM)â(TM)âs--ðYfâSâ±âââOEâSoeâ...â'ââoe

Finally! (1)

nospam007 (722110) | about a month ago | (#47658511)

Damn, now i see it's just domains, i tought they killed all my german and french spammers.

Re:Finally! (0)

Anonymous Coward | about a month ago | (#47658569)

That would be murder

Re:Finally! (0)

Anonymous Coward | about a month ago | (#47658703)

He said "spammers". Murder victims have to be human.

Re:Finally! (1)

Drumhellar (1656065) | about a month ago | (#47658771)

At worst, it's illegal dumping.

Re:Finally! (0)

Anonymous Coward | about a month ago | (#47662849)

Untrue. Go kill a police dog and they'll put you away for life.

Re:Finally! (0)

Anonymous Coward | about a month ago | (#47659023)

i tought they killed all my german and french spammers

I tought I taw a puddy tat.

Re:Finally! (0)

Anonymous Coward | about a month ago | (#47663071)

Romulan...

Why are we still blocking spam ? (3, Interesting)

Anonymous Coward | about a month ago | (#47658573)

90% of the population would be better off with a white listed email account, i.e. if you are not on their list the email does not get through. END OF STORY.

It would seem to be more efficient to filter mail IN than to filter it out. Most people would have 20 or so people they actually want mail from.
I have mail accounts strictly for family and my local email rules enforce this
I have mail accounts for "sign up" sessions for competitions that I know are going to get spammed to hell
I have mail account for work, another for my business , etc etc all with differing contacts.

White listing would pretty much kill off spam, if there is zero chance of it getting through, what is the point. Currently spammers get through because of outdated spam lists, new tricks to get around Bayesian filters, etc etc etc. White lists would negate the need.

Google, if you set up a white listed email system, my friends and family will happily sign up.

... because we make new friends (1)

Anonymous Coward | about a month ago | (#47658621)

Seriously,
most filters are now "very good". And, I make new acquaintances, connections and friends. They have new email addresses that aren't in the whitelist. But, the filters pretty much just work.

Re:... because we make new friends (1)

Anonymous Coward | about a month ago | (#47659129)

One way you could make whitelists work is to have a "secret handshake", a word that you require in the subject of mail from addresses that aren't whitelisted yet. You would regularly change that word and give it to new acquaintances along with your email address.

The problem with the whitelist approach is something else: A lot of spam already pretends to be from someone you know. Spammers don't just collect individual email addresses anymore. They collect email address pairs: Who knows who.
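The secret-handshake idea could be sketched roughly like this in Python. Everything here is hypothetical: the `accept` function, its parameters and the example addresses are made up, and the check trusts the sender address as given, so it does nothing about the forged-sender problem just described.

```python
def accept(sender, subject, whitelist, handshake):
    """Accept mail from known senders, or from strangers who know the word."""
    sender = sender.lower()
    if sender in whitelist:
        return True
    if handshake.lower() in subject.lower():
        # A stranger used the current word: let it through and remember them.
        whitelist.add(sender)
        return True
    return False

known = {"mum@example.org"}
print(accept("new.friend@example.net", "Hi (walrus)", known, "walrus"))
print(accept("spammer@example.com", "Cheap pills", known, "walrus"))
```

Changing the word is then just a matter of passing a different `handshake` value; senders already added stay on the list.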

Re:... because we make new friends (0)

Anonymous Coward | about a month ago | (#47659877)

Yes, but a white list would help stop the morons from clicking on the link of "xxxyyyy-celebrity-nude.exe"

A whitelist that also confirms where it was actually sent from would also prevent that form of abuse.

Re:... because we make new friends (0)

Anonymous Coward | about a month ago | (#47659403)

It's easy to add them to a white list

Re:Why are we still blocking spam ? (1)

Dutch Gun (899105) | about a month ago | (#47660075)

E-mail authentication seems like a better solution than whitelisting in the long term. Whitelisting can kill off spam, but that's sort of like saying you can fix a broken arm by amputation. It's technically true, but removes a lot of useful functionality.

The big problem with e-mail spam is that the e-mail sender can be trivially forged. If we employed ubiquitous authentication systems that proved a specific domain was used, and blocked non-authenticated users (or at the very least, flag them with a big warning), it would go a long way to solving the spam problem. Moreover, if a particular domain is repeatedly being used by spammers or scammers, that can provide additional heuristic information to the filters.

Unfortunately, there are too many competing authenticating standards and (presumably) far too much legacy code that would be broken by moving to such a system. Given the ridiculous amounts of spamming and scamming going on by e-mail, it really seems like it would be worth the short-term pain to buckle down and select a single, robust solution, and block anything that doesn't use it.

The world just isn't the same as it was when the SMTP protocol was invented. It's ridiculous, not to mention slightly worrisome, that the only way we can practically use e-mail is if the combined technical might of Google or some other large enterprise helps us to filter out 99% of the crap so we can view the 1% that isn't.

Re:Why are we still blocking spam ? (0)

Anonymous Coward | about a month ago | (#47660275)

Think of white listing like giving people access to your house.

A few people will get keys
If you are not there it is locked and therefore you need a key to get in (a white list)
If you are there you can admit other people, the choice is yours, however it is simply a white list with 1 time admittance.

(yes I know about burglars, however that does not change the point being made)

So with a white list, even if someone hacks the system to send spam, it will most likely only go to 20-30 people, a lot of effort for little gain
Currently spambots send out billions of spams, this could not easily happen with a white list that authenticates the sender. And the easy solution is to remove them from the white list until they fix their machine. If a secure web page with 2 factor authentication is used to modify the whitelist with the ISP then adding/changing the whitelist is also difficult.

Spam would effectively die a natural death because with so few people getting any, and so difficult to send it to anyone the money chain would break. However
Nigerian scams... gone
Penis enlarger spams ....gone
Fake Pharmacies ..... gone
penny stock scams .... gone
fake invoice scams .... gone
phishing scams ..... severely damaged

What the internet needs is a Unix like approach where security and user separation is built in as opposed to the security as an add-on Windows approach. One is much harder to break into than the other, neither is perfect, but one is definitely a better first choice.

Re:Why are we still blocking spam ? (0)

Anonymous Coward | about a month ago | (#47660871)

90% of people don't have the time to manage a whitelist. 9% can't afford the risk of a false positive. Congratulations! You finally made it to the 1%!!

Re:Why are we still blocking spam ? (1)

IamTheRealMike (537420) | about a month ago | (#47661399)

Google, if you set up a white listed email system, my friends and family will happily sign up.

They already happily sign up. Gmail is the largest email provider in the world.

BTW the Gmail spam filter, like any good one, does have per-user whitelists. If you reply to mail or mark mail from a sender as not spam, the filter will leave mail from those senders alone (modulo caveats like the sender properly authenticating). Thus the filter spends almost all of its effort on email from senders you haven't interacted with, like, for example, the password reset mail from the website you used 3 years ago and forgot how to log in. You wouldn't want to lose those, would you?

Re:Why are we still blocking spam ? (1)

AmiMoJo (196126) | about a month ago | (#47662107)

A whitelist would break site sign-up and password reset emails. You could never whitelist every legit site as hundreds are launched every day. Users will never figure out how to add sites to their whitelists before signing up, and can barely cope with such emails ending up in their spam folders.

Having said that, gmail filters 99.9% of spam for me, and I can tolerate hitting delete for the 1 in 1000 that gets through.

Why are we still blocking spam ? (1)

perryizgr8 (1370173) | about a month ago | (#47700839)

Google, if you set up a white listed email system, my friends and family will happily sign up.

They did, it's called Google+. Nobody seems to like it.

Don't use Unicode for network stuff (1)

Anonymous Coward | about a month ago | (#47658579)

If you use Unicode for domains, addresses, certificates and whatnot you are begging for an endless cascade of support problems and glitches, not to mention security vulnerabilities. Let others exercise all these broken code paths for you while you avoid the fail. Eventually, after most of the broken code gets cycled out of use, many years from now, you may then safely allow this stuff into real systems.

Unicode breaks all sorts of stuff in subtle and unfixed ways. A fine example from a widely used Microsoft system (W2K8 R2 SP1, three years old) is this gem: http://support.microsoft.com/kb/2597665; IIS can't handle Unicode attributes in x509 certs. You have to "hotfix" that broken OS to deal with Unicode.

Just leave it be another decade or so, if you can.

For those of you frothing at the mouth to write "BUT BUT I HAVE TOO!!!!1" re-read the end of that last sentence over and over till it sinks in; not everyone can avoid dealing with this. My sympathies. I'm writing for those that can.

Re:Don't use Unicode for network stuff (1)

Immerman (2627577) | about a month ago | (#47658837)

But why would anyone waste resources properly fixing a bug that doesn't affect anyone? The only way these things will get fixed properly, is if they start causing a lot of problems. And the only way they'll cause problems is if people start using them.

Meanwhile, why should most of the world's population have to deal with an internet incapable of handling addresses in their language? How would you like it if you woke up tomorrow to discover that all web addresses could only be written in Arabic? The Web may have been invented in the US, but it belongs to the world now.

Re:Don't use Unicode for network stuff (0)

Anonymous Coward | about a month ago | (#47659075)

why should most of the world's population have to deal with an internet incapable of handling addresses in their language?

The world has been using the internet for close to two decades without domains and email addresses in anything but ASCII. It works, and that is more important than vanity. The web was invented in Switzerland (not in the US), but if computers and the internet had been invented in the Middle East, we would all have learned enough Arabic to use them. Addresses must be interoperable for a world wide network to function. The content may well be in any number of native languages and scripts, but the addresses must be simple, and Unicode isn't. It is better for all to learn simple addresses than for all to learn complex addresses.

Slippery slope (1)

Dishwasha (125561) | about a month ago | (#47658595)

As much as I can appreciate the intent and the fact that this will solve 99.999% of people's problems for this type of spamming and create 00.0000000001% of problems for legitimate users, it still feels a little like Google is trying to be the thought police on this one; you know free speech and all.

More generally (1)

SigmundFloyd (994648) | about a month ago | (#47658635)

IME, Gmail is rejecting a lot of legitimate mail nowadays.

Their filters used to be good, but they completely fucked it up lately.

non Latin characters? (0)

Anonymous Coward | about a month ago | (#47658739)

I never did see a domain with non-Latin characters in spam. I have seen Russian, Chinese or Japanese text in the body and subject line.

But how... (0)

Anonymous Coward | about a month ago | (#47658761)

will I talk to ZALGO!

Al (1)

jones_supa (887896) | about a month ago | (#47658791)

As an interesting background fact, I heard that Google has an advanced Al doing all this stuff completely autonomously.

His real name is Albert, by the way.

Unicode for addresses is a bad idea (0)

Anonymous Coward | about a month ago | (#47658839)

Addresses should be simple and easy to learn and transmit over as many means of transport as possible. We had a working world-wide de-facto standard: 7-bit ASCII. Sure, there were no accented letters, no support for Asian scripts, etc., but it worked. Addresses are infrastructure. You can send anything you want as content. If you need to write Hindi in an email, then do so. That should not require all mail masters to upgrade their software to handle Hindi.

(I write this as someone whose native language has letters beyond ASCII.)

GMail doesn't take everything... (1)

The New Guy 2.0 (3497907) | about a month ago | (#47658901)

GMail doesn't accept all comers. Get too many complaints and they'll reject you... this is just new ideas to add to that filter. There's a list of words you can't say on GMail without it getting read, they don't publish those lists because that'll never be said to them.

Re:GMail doesn't take everything... (0)

Anonymous Coward | about a month ago | (#47658981)

There's a list of words you can't say on GMail without it getting read

"Shit, piss, fuck, cunt, cocksucker, motherfucker, and tits"?

Unicode the standard .. (1)

lippydude (3635849) | about a month ago | (#47658955)

And so this "standard" was designed in this way because country A didn't want its script mixed up with country B, introducing vulnerabilities into the DNS system in the process. As in '' '' and 'A' all encode to different unicode er .. codes.

Re:Unicode the standard .. (1)

Anonymous Coward | about a month ago | (#47659027)

They're called "code points" actually. A particular code point can be encoded in different ways (for example, the encoding of 'ß' in UTF-8 is different from the encoding in UTF-16, but they both represent the same code point.) Yeah, something like that ought to be used for network addresses...
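The distinction is easy to see in Python, where a string holds code points and `encode` turns them into bytes. A minimal sketch using the same letter as the example above; the variable names are arbitrary.

```python
ch = "\u00df"  # LATIN SMALL LETTER SHARP S

code_point = ord(ch)             # the abstract number, U+00DF
utf8 = ch.encode("utf-8")        # bytes C3 9F
utf16 = ch.encode("utf-16-be")   # bytes 00 DF

print(f"U+{code_point:04X}", utf8.hex(), utf16.hex())
```

Decoding either byte sequence with its own codec gives back the same one-character string.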

Re:Unicode the standard .. (0)

Anonymous Coward | about a month ago | (#47659555)

Why yes, we should collapse all similar looking letters into one, why didn't anyone think about that before!

Not that it matters, for example, that tolower("AT") should properly return "at", "a(cyrillic t)" or "(alpha)(tau)" depending on whether it was originally typed in English, Russian or Greek.

Re:Unicode the standard .. (1)

lippydude (3635849) | about a month ago | (#47659657)

"Why yes, we should collapse all similar looking letters into one, why didn't anyone think about that before! Not that it matters, for example, that tolower("AT") should properly return "at", "a(cyrillic t)" or "(alpha)(tau)" depending on whether it was originally typed in English, Russian or Greek."

Have one set of 'code points' for every language on the planet and remove the duplicates. That way they wouldn't have needed to hack unicode in order to allow for the following:

'the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet)' ref [wikipedia.org]
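That equivalence can be checked with Python's standard `unicodedata` module. A minimal sketch; the variable names are arbitrary.

```python
import unicodedata

decomposed = "n\u0303"   # n followed by the combining tilde
precomposed = "\u00f1"   # the single code point for n with tilde

same_raw = decomposed == precomposed                                    # False
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed      # True
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed      # True

print(same_raw, same_nfc, same_nfd)
```

The raw strings differ, so any comparison of names or addresses has to normalize both sides to the same form first.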

Re:Unicode the standard .. (0)

Anonymous Coward | about a month ago | (#47660033)

I see you've already drifted away from "introducing vulnerabilities into DNS".

Combining characters are extremely helpful for text input - I need a lot of accented characters, and I have them all with a bit of customization and without special support from OS or complex layout including every character. I've simply added combining ~/`/'/" on AltGr+corresponding key, and got myself full set of diacritics. I can easily do this in any OS that supports Unicode.

With your proposal, I'd either need a special IME that converts letter+AltGr+... into accented character, or a layout that accommodates all the variants I need.

Also, don't forget that combining marks are not limited to umlauts/acute marks and so on. There are more complex writing systems than Latin alphabet with specific and more involved combining characters in Unicode.

TL;DR: What you consider "simplifying" is actually only a dubious simplification of a single facet of text processing to a detriment of other facets covered by Unicode.

They are right - Uses of unicode ambiguous letters (1)

enriquevagu (1026480) | about a month ago | (#47659157)

They are right doing so. There are letters in different alphabets whose typing is very very similar -- or in fact they are written exactly the same, depending on the font used.

This can be exploited for interesting uses. For example, "E" and "ÃZ"** are respectively the Latin "e" and the Greek "epsilon" vowels, but they are indistinguishable in caps, at least in Arial font. The second one is Unicode code point U+0395. My name has an "E" in it, and for my email signature I spell my name using the traditional Latin letter from the keyboard when the email is important and should be archived. By contrast, when the email is mostly irrelevant for future use (such as meeting arrangement emails, which are useless after the meeting takes place) I spell my name using the Greek epsilon letter (hint: 395 followed by Alt+X in most Windows programs). There is no obvious difference for the receiver, but a search tool can be used to quickly find all sent emails which can be deleted safely.

While the previous is a somehow "legit" use, in general any word which combines letters from different alphabets could be used to confuse and trick the receiver, for example by creating an email account which reads exactly the same as the one from another person. There is a nice image of 5 letters a-b-c-d-e in different alphabets in the linked post. I agree with Google in preventing such combinations for email accounts. It would be interesting to know the exact policy used to forbid account names, which is not detailed.

** At the time of writing, these two letters look exactly the same. Classic Slashdot lacks Unicode support and does not represent the greek Unicode letter from my comment. I tried logging into Slashdot Beta (first time, I swear it!!) and it seems to represent a different letter... Please try this on your own computer!

How about starting with dropping obvious spam? (0)

Anonymous Coward | about a month ago | (#47659563)

I must have about 50 filters to auto-delete some of the really basic, obvious spam that Google accepts to my gmail account, and I still get 20-30 spams a day. Auto-filtered into my spam folder, but even so I still have to look at it because I do get the occasional false positive.

In contrast to my DNS-RBL+SpamAssassin+procmail I have for my own domain MTA which successfully turns away or drops about 99.99% of the spam that arrives at that email address.

I thought those guys at google were supposed to be smart. If they're so smart, why can't their mail system recognize the obvious spam?

Sounds rather ethnocentric (2)

Chrisq (894406) | about a month ago | (#47661003)

It allows combinations of Latin + Han + Hiragana + Katakana; Latin + Han + Bopomofo; or Latin + Han + Hangul.

There are a lot of equally safe combinations - what about Latin + Devanagari + Tamil? There would be no look-alike characters and it would allow a lot of people to put their name in multiple scripts that are likely to be meaningful to certain audiences (e.g. someone from Tamil Nadu sending an email to people throughout India and internationally). I'm sure that there are many other combinations that wouldn't have "look alike" issues but which would be useful

Insufficient (1)

taikedz (2782065) | about a month ago | (#47661299)

The "highly restricted" spec is meant to catch suspicious combos like in the mybank example - but does not catch full-ascii (which is an even more restrictive level) trickery like tvvitter.com (notice the two "v" chars). that combo in particular is now known, but goes to demonstrate that trickery does not need charsets larger than 7-bit... some people simply get caught by hsbc.net...
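Catching the pure-ASCII kind of trick needs a different check, for instance collapsing known look-alike sequences before comparing. A toy Python sketch: the `LOOKALIKES` table and both function names are invented for this example and are nowhere near complete.

```python
# Hypothetical table of ASCII sequences that can be read as another character.
LOOKALIKES = [("vv", "w"), ("rn", "m"), ("0", "o"), ("1", "l")]

def skeleton(domain):
    """Collapse known look-alike sequences to one common form."""
    s = domain.lower()
    for fake, real in LOOKALIKES:
        s = s.replace(fake, real)
    return s

def resembles(candidate, trusted):
    """True if candidate differs from trusted but collapses to the same form."""
    return (candidate.lower() != trusted.lower()
            and skeleton(candidate) == skeleton(trusted))

print(resembles("tvvitter.com", "twitter.com"))
```

A table like this also rewrites innocent names (the "rn" in "modern" becomes "m"), so it only makes sense when comparing against a list of domains worth impersonating, and it does nothing for cases like a real brand on the wrong TLD.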

Do observe (0)

Anonymous Coward | about a month ago | (#47661415)

That we have a supposedly "universal" characterset that is not universally usable without considerable bolted on as an afterthought blacklists and whitelists. In fact, spotify already learned the hard way that the standard ways to compare unicode strings just don't cut it, and inventing your own is fraught with peril. There's much more slightly, subtly, insidiously "off" with unicode, before we consider the cost in code size and its associated costs.

In other words, it's not really suitable for real-world use, for you can only (and then only so-so) trust it if you generated it yourself. As soon as the unicode comes from elsewhere it's a liability to safely reading the input.
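For comparing names, one commonly used approach is to apply NFKC normalization and then case folding to both sides. A minimal Python sketch of that approach, assuming this canonical form is what you want; it is not a description of what Spotify did, and `canonical` and `same_name` are names made up here.

```python
import unicodedata

def canonical(name):
    """Fold compatibility forms and case so variant spellings compare equal."""
    return unicodedata.normalize("NFKC", name).casefold()

def same_name(a, b):
    return canonical(a) == canonical(b)

print(same_name("\uff22igBird", "bigbird"))   # fullwidth B folds to plain b
print(same_name("b\u0456gbird", "bigbird"))   # Cyrillic look-alike does not
```

The second case shows the limit: normalization merges compatibility variants, but cross-script look-alikes remain distinct strings and need a separate confusables check.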
