Why Unicode Won't Work on the Internet

Hemos posted more than 13 years ago | from the -Linguistic,-Political,-and-Technical-Limitations dept.

The Internet 416

We received this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1), has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world's languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technical: the recently announced Unicode 3.1 won't work either.)" Read the full article.

Is this a problem? (1)

Anonymous Coward | more than 13 years ago | (#175348)

English is not only the de facto standard of the internet but also a world language.

Introducing foreign language character sets and languages only splinters the internet into artificial factions and we end up having borders on the net. Is that what you want?

Re:another drawback of unicode (1)

Anonymous Coward | more than 13 years ago | (#175349)

That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.

Must...resist... making... lame ... Perl readability joke... arrrghhh....

Re:Is this a problem? - FYI (1)

Anonymous Coward | more than 13 years ago | (#175350)

FYI, there are more Chinese who speak English than there are Americans who speak English. (I saw that on the Discovery Channel :p )

Define two unicode escape chars = 196000 chars. (1)

Anonymous Coward | more than 13 years ago | (#175351)

What's the big whoop? Just define two measly escape characters in the 16 bit unicode set that mean "look at next two bytes for real character". This way unicode stays 16 bit for the bulk of the world and can expand when needed.

In fact I'd propose 8 bit ASCII as the standard with say, 4 escape characters, each of which is followed by two bytes. This allows 252 + 64K + 64K + 64K + 64K or roughly 262,000 characters and does so WITHOUT breaking most ASCII based services and code out on the net and in the world.
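
The arithmetic behind this proposal can be checked with a short Python sketch. The function name `escape_scheme_capacity` is made up for illustration, and it assumes that each escape code is followed by a fixed-width payload and that the escape codes themselves no longer stand for characters.

```python
def escape_scheme_capacity(base_bits: int, escapes: int, payload_bits: int = 16) -> int:
    """Count characters in a base code space where `escapes` codes are
    reserved as prefixes, each followed by a `payload_bits`-wide value."""
    direct = 2 ** base_bits - escapes        # codes that still stand for themselves
    extended = escapes * 2 ** payload_bits   # one extra table per escape code
    return direct + extended

sixteen_bit = escape_scheme_capacity(16, 2)  # two escapes in a 16-bit set
eight_bit = escape_scheme_capacity(8, 4)     # four escapes in an 8-bit set
print(sixteen_bit, eight_bit)
```

The first figure is where the "196000" in the subject line comes from; the second is the 8-bit variant with four escapes.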

Keep it compatible, stupid!

Unicode Character Set vs Character Encoding (5)

Jordy (440) | more than 13 years ago | (#175370)

The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard).
The biggest problem with Unicode is that no one understands what it is. Unicode defines two things, a character set that maps a character into a character code and a number of encoding methods that map a character code into a byte sequence.

ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.

UCS-2, as mentioned by this article, is essentially UTF-16 without surrogate pairs and is severely limited by its 16 bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs it can actually map over 1 million characters.

Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.

UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.
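
For a rough feel of the space trade-off described here, the following Python sketch encodes one short sentence three ways. The sample sentence is invented for illustration, and the little-endian codec names are used only so that no byte order mark is added to the counts.

```python
# Mostly-ASCII Latin text with two accented letters (e-acute and u-umlaut).
sample = "The quick brown fox visits a caf\u00e9 in Z\u00fcrich."

for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = sample.encode(codec)
    ratio = round(len(encoded) / len(sample), 2)
    print(codec, len(encoded), "bytes,", ratio, "bytes per character")

# Pure ASCII text is byte-for-byte identical in UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```

The exact ratio for UTF-8 depends on how many accented letters the text contains, so the "roughly 1.1" figure should be read as typical rather than fixed.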

A good document on this is available at UTF-8 And Unicode FAQ.

Re:Duh. (2)

jandrese (485) | more than 13 years ago | (#175371)

Just because you can't read other languages doesn't mean multi-language support is useless. Oh, and inputting Kanji on a keyboard is quite feasible, try using the Windows IME sometime (It's built into 2000).

Down that path lies madness. On the other hand, the road to hell is paved with melting snowballs.

Uh, I Don't Get It (1)

Aaron M. Renn (539) | more than 13 years ago | (#175372)

It looks like the argument is that since ancient Chinese texts can't be fully reproduced in Unicode, the standard is flawed. I disagree. There is already a four byte character set out there - UCS-4 I believe, which is ISO something or other - that will easily handle all characters as necessary. This set can be used for replicating all XX,000 old school Chinese characters. Think of it as the SGML of character sets. But for common applications, XML (ie, Unicode) will continue to do just nicely.

Unicode includes all common Asian character sets (3)

Per Abrahamsen (1397) | more than 13 years ago | (#175375)

I.e. all the character sets *in common use* in Asia today map into a subset of Unicode. They even map into the 16 bit subset, but overlap in a way that makes slightly different characters from different character sets share the same code point. That is why an extended version of Unicode is used, so Chinese/Japanese/Korean characters have different codepoints.

Unicode does not contain all characters ever used, for example it does not contain the Nordic runes. These are not used today except by scholars, who will need special software (most likely using the "reserved to the user" part of Unicode). The same is true for many ancient Asian characters.

UCS-4 (1)

Iffy Bonzoolie (1621) | more than 13 years ago | (#175377)

Isn't this what UCS-4 is for? I can't imagine there are more than a billion characters. Of course, most of the Unicode software that deals with wide characters won't work with UCS-4. But any decent UTF-8 based program should support up to 6 bytes per character.

But I guess internally most programs use 16-bit characters, because it's easier to deal with, and just convert into more compact forms like UTF-8 when they want to save or transfer it.

Re:Duh. (1)

Malc (1751) | more than 13 years ago | (#175378)

You have to install character support for those other languages because most fonts don't contain complete coverage of the Unicode character set. If you install "Arial Unicode MS" off the Office 2000 CD, you get character support for a lot of languages. Sorry, I can't remember what option to choose in the Office 2000 setup. Don't forget, Win9x/ME is multi-byte only via code pages with Unicode being a per application thing, and WinNT/2K/XP is Unicode only but with little support as all the Windows applications try to be Win9x compatible with the least amount of effort.

Use UTF-8, don't worry about sizes (3)

iabervon (1971) | more than 13 years ago | (#175380)

UTF-8 encodes 7-bit ASCII characters as themselves and all of the rest of UCS-4 (the unicode extension to 32-bits) as sequences of non-ascii characters. This means that apps which can't handle anything but ascii can simply ignore non-ascii and get all of the ascii characters (and, with minimal work, report the correct number of unknown characters).
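
A minimal Python sketch of that idea follows. The helper name `ascii_view` is hypothetical, and the code assumes the input is well-formed UTF-8: it keeps the ASCII bytes and counts one unknown character per lead byte, skipping continuation bytes (those of the form 10xxxxxx).

```python
def ascii_view(data: bytes):
    """Return the ASCII characters in UTF-8 `data` plus a count of the
    non-ASCII characters that were skipped."""
    kept = []
    skipped = 0
    for byte in data:
        if byte < 0x80:
            kept.append(chr(byte))   # plain 7-bit ASCII, keep as-is
        elif byte & 0xC0 != 0x80:
            skipped += 1             # lead byte: starts one non-ASCII character
        # otherwise: continuation byte of a character already counted
    return "".join(kept), skipped

# One accented Latin letter plus three CJK characters, written as escapes.
text, unknown = ascii_view("na\u00efve \u65e5\u672c\u8a9e text".encode("utf-8"))
print(repr(text), unknown)
```

An ASCII-only program needs no decoding tables for this, only two bit tests per byte.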

The only issue is that there's not a good way to set a mask for the characters such that 0-127 (which take up a single byte) are the common characters for the language, and so on, so English is more compact than other languages, even languages which don't require more characters.

Re:2 + 1 bytes? (2)

Jeremy Erwin (2054) | more than 13 years ago | (#175381)

Maybe use only 20 bits and leave 4 bits for something else (font style, inverse, etc.).

Typically, one shouldn't apply font styles on a character by character basis.

Re:Overstating and misunderstanding the problem (2)

tjansen (2845) | more than 13 years ago | (#175383)

>>the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V No, the handwritten one in Germany looks more like the 1 in an Arial font. bye...

Re:After some skimming... (1)

Zagadka (6641) | more than 13 years ago | (#175407)

Yes, his analogy was a bit off. It would be more accurate to say "imagine if English-speakers were restricted to an alphabet which is missing characters like Æ or fi (the ligature)". While Unicode is missing lots of Chinese characters, the vast majority of the characters which are missing are characters that only historians use. One only needs to know about 2000 characters to be considered fluent in Chinese, and if you know 7000-8000 characters, you're way above average.

Re:umm (1)

Zagadka (6641) | more than 13 years ago | (#175408)

Chinese uses unicode by combinations of roots and the other parts of the characters

While many of the more complex Chinese characters do consist of simpler radicals used in combination, they're not encoded that way in Unicode. For example, the word "ma" used at the end of many questions consists of the character for mouth and the character for horse. In Unicode, the encoding for "ma" is completely unrelated to the character for mouth and the character for horse though.
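
This can be seen from Python's standard `unicodedata` module. The sketch below uses the simplified forms of the three characters (an assumption, since the comment does not say which forms it means) and prints their code points, then contrasts them with an accented Latin letter, which Unicode does decompose into parts.

```python
import unicodedata

ma, mouth, horse = "吗", "口", "马"   # question particle, "mouth", "horse"
for ch in (ma, mouth, horse):
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))

# The ideograph has no decomposition into its visual components,
# while e-with-acute decomposes into a base letter plus a combining accent.
print(repr(unicodedata.decomposition(ma)))
print(repr(unicodedata.decomposition("\u00e9")))
```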

(i know that they use some other type of syllabic system for teaching the writing system, or so i recall from Chinese lessons on TV... maybe that should be used to replace the pictograph system in place now, a system which was kept by the emperors in order to keep the masses illiterate)

Just because you can't read Chinese characters, it doesn't mean Chinese people can't.

The "syllabic system" you're talking about is probably pinyin (or perhaps bopomofo, but it doesn't really matter - they're isomorphic). Converting Chinese text to pinyin actually results in information loss. It isn't a really viable solution. The Chinese people also like their linguistic system, despite what American public schools have taught you.

Re:It works (2)

h2odragon (6908) | more than 13 years ago | (#175411)

No, that makes too much sense; it's not all inclusive so let's trash the whole thing, start over from scratch, and revert to 7bit ASCII in the meantime. We need a system that can handle every glyph that has ever had meaning to somebody, somewhere.

...for the sarcasm impaired, the above should be read as "good point".

Input devices are much more of an issue (1)

sacherjj (7595) | more than 13 years ago | (#175415)

Eventually, with much nudging along in the territories of high-resolution color and graphics, better input devices (such as the scanner, which can be thought of a fax machine for computers), better output devices such as the inkjet and laser printer, and even bastardized keyboards and software which could generate thousands of characters - if only one can remember each and every one of the input codes. Graphics tablets eased the pain of having to get something into and out of the computer. But none of this is yet fully satisfactory, and perhaps it will remain in this state until the advent of the intelligent, voice-understanding, "computer" finally comes into our daily lives.

I recently saw a story about Japanese reporters who send their stories in via a phone call and dictation on the other end, because it is so much faster than trying to get it into the computer for digital transmission. I don't think the common character representation is the only issue here. As the article states, some languages are just much, much harder to digitize.

Re:But for Java (1)

sacherjj (7595) | more than 13 years ago | (#175416)

It won't work if you try to write your Java in Mandarin... :p

What about the artist formerly known as Prince? (2)

mattkime (8466) | more than 13 years ago | (#175418)

Does the artist formerly known as Prince get his own character space as well?

Will I need to download a new character set on windows to view it?

Re:After some skimming... (2)

K. (10774) | more than 13 years ago | (#175424)

I suspect Unicode is a lot more upsetting to
a "reference writer specializing in rare Taoist
religious texts and medical works" than to
ordinary Chinese users who want to run Photoshop
or put their wedding pictures on a web page.

Let me get this straight - you think people
should be prepared to accept having restricted
access to the literature that underpins their
culture in exchange for their very own


Some errors (5)

BJH (11355) | more than 13 years ago | (#175437)

Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.

In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

Thus is can be said that Hiragana can form pictures but Katakana can only form sounds...

That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."

Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.

Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.

After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.

Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is "pasokon", not "PersaCom". (Where did he get that from?!)

The rest of the 1,950 have to been memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.

That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:

- No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possibility of changing the characters used.

- A draconian unification of CJK characters.
The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.

- The ugly "extensions".
Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.

I could go on, but I should get some sleep...
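
The conversion-table point in the first item of the list above can be illustrated with two Shift JIS tables that ship with Python, `shift_jis` and Microsoft's `cp932`. This is only a sketch of one well-known divergence (the wave-like dash at bytes 0x81 0x60), based on the behaviour of CPython's bundled codecs as far as I know it; other converters may use yet other tables.

```python
raw = b"\x81\x60"  # one double-byte Shift JIS sequence (a wave-like dash)

via_jis = raw.decode("shift_jis")   # Python's JIS-standard-based table
via_ms = raw.decode("cp932")        # Microsoft's variant of Shift JIS

print(f"shift_jis -> U+{ord(via_jis):04X}")
print(f"cp932     -> U+{ord(via_ms):04X}")

def round_trips(ch: str, codec: str) -> bool:
    """True if `ch` encodes back to the original bytes with `codec`."""
    try:
        return ch.encode(codec) == raw
    except UnicodeEncodeError:
        return False

mismatch = via_jis != via_ms
# Decode with one table, re-encode with the other: does the text survive?
print(mismatch, round_trips(via_jis, "cp932"), round_trips(via_ms, "shift_jis"))
```

Each table round-trips with itself; the trouble starts when the decoding and encoding sides were built from different tables.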

printing [ Chinese ] vs dictionary [ Chinese ] (2)

peter303 (12292) | more than 13 years ago | (#175441)

It's a lot like the Oxford English Dictionary
versus Webster's Collegiate - Chinese printers have
gotten by with 7-10K characters versus the 60-80K
in the full language. Synonyms and homonyms are
used for the more obscure words. The standard
modern Chinese dictionaries only have this smaller
number of characters.

totally unconvinced (2)

kaisyain (15013) | more than 13 years ago | (#175443)

One of the author's main propositions seems to be that Communist Chinese and Taiwanese/Overseas Chinese want different spaces in Unicode for the same characters.

I don't see every Western nation asking for its own encoding of "w" or accented characters. The author doesn't give any explanation for why we should pay attention to IMHO silly political whining in this particular case.

The author further implicitly assumes that it is reasonable to include the deprecated K'ang Hsi characters in addition to the official characters, but gives no justification for this view. I don't see unicode trying to include all possible historical graphings of Western characters.

Re:Solution - Everybody use Euro-English! (1)

augustz (18082) | more than 13 years ago | (#175446)

Funny funny.... why does this post get lamed?

Re:another drawback of unicode (1)

Ole Marggraf (20336) | more than 13 years ago | (#175448)

Actually, someone thought about hieroglyphs in UCS (this was mentioned in the quickies section some time ago):

[] 16 37.htm

I don't know whether it is/will be implemented at the end. Looking at the limited character space, probably not.

Re:UTF8 (1)

lordpixel (22352) | more than 13 years ago | (#175456)

no. not only 8 bits.

Variable length encoding, 8, 16 or 24 bits depending on how common the character is
Lord Pixel - The cat who walks through walls

Re:UTF8 (1)

lordpixel (22352) | more than 13 years ago | (#175457)

Indeed, but the character numbers were initially ordered in something resembling common usage. At least, all of the 1 byte characters correspond (very closely or exactly) to latin-1 encoding, of which the first half corresponds to ascii, of course

This is a pretty good assumption on probability of occurrence on today's Internet, I won't try to predict the future :)

Lord Pixel - The cat who walks through walls

This is so wrong (2)

jfedor (27894) | more than 13 years ago | (#175466)

The author of the article and the guy who submitted the story clearly don't have a clue about Unicode. Unicode can encode over one million characters, as stated here.

Unicode may have its problems, but this is not one of them.


Compaction and Traction (2)

JJ (29711) | more than 13 years ago | (#175469)

The 64,000 should suffice. Ideographic scripts, like Chinese, are where the problem arises. The number of characters in Chinese is not fixed, unlike the number in most alphabets. I have a Chinese novella which was written in just 300 characters. 10,000 would be a good place to start, a few thousand more would cover all but specialized texts. Japanese could fold into Chinese, since there are only 2000 kanji characters and a few hundred kana.
Throw in Arabic, Cyrillic, Sanskrit, Dravidian, Hangul (Korean) and Navaho and you still add only a few thousand. The odd European characters (the 'ss' in German, the extra Danish vowels, . . .) add a few hundred tops. Even the special linguist marks and punctuation don't add much.
If you have to double the Chinese, now you run into trouble. It's classical characters vs. simplified. The latter is for the PRC. If you also bloat the number of characters required so that specialized religious characters are required, now you start to push the system. 64K would be fine if a special marker character could be used which signifies that the next character is from the special table. Unicode has resisted this effort.

Nordic Runes? (2)

reverse solidus (30707) | more than 13 years ago | (#175475)

Nordic Runes.

Re:Is this really such a problem? (2)

revscat (35618) | more than 13 years ago | (#175488)

Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevant.

It's totally irrelevant for poor rural populations, true. But as more and more of the world's population moves towards being centered around urban areas this is indeed relevant. It is relevant to those who desire the full functionality of the Internet in their native character set. I believe (and this is a belief, not a fact) that one way to help out those who are poor is by opening them up to the modern economy and make it as accessible as possible. One way to do this is by making sure they can use the latest technology in their native tongue, lowering the slope of the learning curve.

Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendents having ever learned their original language.

This is indeed tragic, but it quite simply cannot be helped. It's so common as to be a cliche: "Life Sucks", or "Shit Happens", or even "C'est la vie." I hope that there are linguists and philologists who are archiving these languages for future generations and our general cultural awareness. BUT: People must eat, and they have a strong desire to make themselves and their families prosperous. If, when all things are considered, making sure that you live your life only speaking language X turns out to be counterproductive, then that language will become less important. There have been many languages that have come and gone throughout the millennia; humanity continues to advance. Would the world be a richer place if all those languages were still around? Certainly. But it would also be more confusing. And remember: If people can speak to each other, there is less of a chance they'll start killing each other. (LESS of a chance, mind you.)

I'm a Taoist at heart in matters such as this. For every yin, there is a yang, for every good, there is a bad. Life goes on.

- Rev.

It works (2)

Kohath (38547) | more than 13 years ago | (#175491)

Imperfect != "does not work"

Re:Alrighty (2)

hernick (63550) | more than 13 years ago | (#175512)

Have you ever seen an IME ? The program a Japanese person would use to enter their 10,000 characters ?

You spell out the word phonetically, and press space as you complete each word - the computer will show possible kanji, and you can cycle through them with the space key.
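
As a toy model of the selection step described here, consider the Python sketch below. The candidate table and the function name `pick` are made up for illustration; a real IME uses large dictionaries and context, not a two-entry lookup.

```python
# Hypothetical candidate table: phonetic reading -> possible spellings.
CANDIDATES = {
    "kanji": ["漢字", "感じ", "幹事"],
    "hashi": ["橋", "箸", "端"],
}

def pick(reading: str, space_presses: int) -> str:
    """Return the candidate shown after pressing space `space_presses`
    times (at least once) for a phonetically typed `reading`."""
    options = CANDIDATES.get(reading, [reading])        # unknown readings pass through
    return options[(space_presses - 1) % len(options)]  # cycling wraps around

print(pick("kanji", 1), pick("kanji", 2), pick("kanji", 4))
```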

It actually works pretty well. Their keyboards pretty much look just like ours.

UTF-8 should be fine for almost any application (2)

AdamBa (64128) | more than 13 years ago | (#175514)

The purists who want 4-byte characters go beyond just wanting to allow 50,000 Kanji or insisting that Japanese and Chinese Kanji with the same stroke pattern not share the same character. They want a separate character for the English lower-case 'e', the French lower-case 'e', the German lower-case 'e', etc. This is not at all necessary. YES, there may be some Kanji that fall out of use if the set listed in the Unicode standard becomes the only one used, but you have to counter that with the fact that suddenly these languages can have a universally-recognized way to encode them, as opposed to the 5 or whatever ways that previously existed to encode Japanese (which all had limited character sets anyway).

UTF-8 is very nice because 7-bit characters encode as one byte. Also it is defined so there won't be a NULL or a hex 1B (decimal 27 -- the telnet escape character) anywhere in the data stream, even in the second or third byte of an encoded character. So it will generally be passed through correctly by programs expecting straight 8-bit ASCII. UTF-8 is also encoded and decoded via a trivial algorithm, as opposed to the DBCS used in Windows which needs lookup tables.

One negative of UTF-8 is that Unicode characters at 0x0800 or above (using more than 11 bits) encode in UTF-8 as 3 bytes, not 2 as in 16-bit Unicode. I think that range includes things like some Indian written languages and the CJK characters (Arabic still fits in 2 bytes). But I think that tradeoff is worth it.
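
Both properties are easy to check with Python's built-in UTF-8 encoder. The helper name `utf8_len` is made up for this sketch, and the sample code points are arbitrary picks from the Latin-1, Arabic, Devanagari, CJK and supplementary ranges.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# Length changes at the 7-bit, 11-bit and 16-bit boundaries.
boundaries = {cp: utf8_len(cp) for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000)}
print(boundaries)

# Every byte of a multi-byte sequence has its high bit set, so no stray
# NUL (0x00) or ESC (0x1B) can appear inside an encoded character.
multibyte_ok = all(
    byte >= 0x80
    for cp in (0xE9, 0x627, 0x905, 0x4E2D, 0x10000)
    for byte in chr(cp).encode("utf-8")
)
print(multibyte_ok)
```

The 3-byte forms begin at U+0800, so a script's cost depends on which side of that line its block sits.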

- adam

Oh dear - error ridden (1)

orblee (66225) | more than 13 years ago | (#175515)

This article, although pointing out something I didn't know - that of there being 170,000 characters in existence on this planet, is wrong about Unicode. Firstly, there is now a UTF-32 standard which should be able to deal with just about everything, but that aside, there are more than 65,536 possible characters in UTF-16.

Out of those 65,536 possible characters, 20,000 characters or so are reserved so that we can use pairs of words to double the set. In fact, the specification allows us to add to it almost ad infinitum, continuously adding more characters. Just two words covers over 100,000 characters, three will do the lot.

Okay, the writers of unicode may have been slightly short-sighted, but also they probably considered the problems of using a 32-bit character set and decided against it (and 24-bit for that matter). They have added an extending property to the 16-bit unicode standard and that should cope with much. I don't know how the chinese/japanese/korean population deal with their HUGE character sets now (you couldn't have a keyboard big enough) but they must have a shorter, simpler method of coping with everyday data input. Surely, this double and triple pairing of UTF-16 will do?

Re:After some skimming... (1)

CloudWarrior (75680) | more than 13 years ago | (#175524)

Let me get this straight - you think people should be prepared to accept having restricted access to the literature that underpins their culture in exchange for their very own
People can already have access to their literature - you put it up as an image file. The only advantages of using text instead are that it allows you to search and edit it. Since it's classic literature, you don't need to edit it, and since there isn't a sensible way to input the characters, there wouldn't be a sensible way to search it in any case.

CloudWarrior .o. "I may be in the gutter but I look to the stars"

Re:It works (2)

TommyW (75753) | more than 13 years ago | (#175525)

True, but "does not work" implies "imperfect."

The point the article is making is that this system cannot be made to work for everybody at once.

So you either put up boundaries, and have systems that work perfectly, but only within those boundaries, or you need a system with wider scope at the outset.
Too stupid to live.

Flamebait :) (3)

phunhippy (86447) | more than 13 years ago | (#175532)

Learn english.. 26 letters 10 numerals.. assorted punctuation.. ;)

Re:After some skimming... (2)

(87560) | more than 13 years ago | (#175534)

Special letter forms don't need to be coded into unicode to be viewable. SVG, Postscript and other languages do a perfectly good level of presentation. So unless you can convince me that a Korean/Chinese person will be trying to do a word search through an historical Japanese/Taiwanese/Vietnamese document and will always inadvertently find the Korean ACK/Chinese SPOO when what he was really looking for was the Japanese FOOFLE/Taiwanese FLUM, I don't see the problem.

Personally I can't understand why anyone in the world would want to search in a character set of more than 60,000 characters. I'd personally be pissed off if the UNICODE committee started adding special letter forms for US product trademarks (so they would render correctly) when as a user I'd rather just have them be findable.

Really, the author needs to understand the use of the ALT tag.

How simple is English? (1)

The Trinidad Kid (96681) | more than 13 years ago | (#175543)

Sorry to rain on your parade but English is not a particularly simple written language.

The western tradition uses 2 complementary (but distinct) alphabets - the Latin, Majuscule or upper case alphabet and the hunnish, Minuscule or lower case one.

These 2 alphabets have a 100% redundancy between them, and about a 50% overlap and their mixed-usage is context dependent and purely conventional and dates from the renaissance. Their usages prior to that were in substantially non-overlapping geographical areas (and/or time periods).

In addition to this the English tradition chucks in an ideogram set to represent numbers, except that unlike the latinate or hunnish alphabets, this ideogram set reads right to left like the Arabic from whence it was bodged.

So, let's recapitulate, 2 alphabets with 100% semantic redundancy and 50% overlap of form which read left to right, and an ideogram set that reads right to left. Simple? Or just what you are used to?

Re:UTF8 (2)

Mendax Veritas (100454) | more than 13 years ago | (#175553)

No, because the aliens are all so technologically and socially advanced that they've standardized on Esperanto.

Wrong, wrong! (4)

Mendax Veritas (100454) | more than 13 years ago | (#175554)

UCS-2 is not the only form of Unicode, and it's well known that 64k characters isn't enough. Besides, why should ordinary ISO-8859 (Latin-1) text be doubled in size by making every character 16 bits? UTF-8 is a much better solution, and it is good enough. Granted, string handling with variable-length characters is a bit of a pain (especially if you're used to assuming that a buffer of N bytes is long enough for a string of N characters, or you want to scan the string backwards), but it's the best solution we've got. It's the recommended encoding for XML documents, and is used today in web browsers (check out that "Always send URLs as UTF-8" option in Internet Explorer).

It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.

Re:UTF8 (1)

egomaniac (105476) | more than 13 years ago | (#175561)

*sigh*. No.

UTF-8 is an encoding format, which specifies a means of encoding Unicode characters using variable-length byte sequences. The number of bytes it uses to encode characters does not dictate how many characters Unicode supports.

Unicode, as I've stated elsewhere, supports a little over a million characters. There are ~50,000 characters in Plane 0, and 2^20 (~1 million) in Plane 1. Plane 1 is made up of surrogate pairs, which are two special characters next to one another (a high surrogate and a low surrogate). There are 1024 of each, leading to 2^20 Plane 1 characters.
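
The 1024 x 1024 arithmetic can be sketched in Python as follows. The function name `to_surrogates` is made up for illustration; the constants are the standard UTF-16 surrogate ranges. Strictly speaking, the 2^20 code points reached this way span Planes 1 through 16 rather than Plane 1 alone.

```python
def to_surrogates(code_point: int):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits: one of 1024 high surrogates
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits: one of 1024 low surrogates
    return high, low

pair_count = 1024 * 1024               # 2**20 addressable code points
high, low = to_surrogates(0x10000)
print(pair_count, hex(high), hex(low))

# Cross-check against Python's own big-endian UTF-16 encoder.
units = chr(0x1D11E).encode("utf-16-be")
print(units.hex(), [hex(u) for u in to_surrogates(0x1D11E)])
```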

Re:Unicode has this covered. (1)

egomaniac (105476) | more than 13 years ago | (#175562)

No, the private use area is inappropriate for this sort of thing. Private use characters are (as the name implies) not intended to be visible to other applications; they are for encoding weird data within a single application.

There is a much larger block of public code points, which allows for over a million characters (none of which have been assigned yet, but the code points are there).

unicode does *not* encode 65,536 characters (4)

egomaniac (105476) | more than 13 years ago | (#175565)

It encodes over one million codepoints, actually (the erroneous statements of other posters notwithstanding). All currently assigned Unicode characters exist within the basic Unicode Plane 0, as it's called, which handles ~50,000 characters. Twenty-some-odd-thousand of those characters are in the CJK block (Chinese, Japanese, and Korean characters).

Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in one of the supplementary planes (Planes 1 through 16). None of these supplementary codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) of them, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).
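
As a rough illustration of how a surrogate pair maps to a single code point, here is a short Python sketch. The function names are hypothetical; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). The sample code point U+1D11E is just a convenient supplementary character picked for the example.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x1D11E
high, low = to_surrogate_pair(cp)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))

# Cross-check against Python's own UTF-16 encoder.
print(chr(cp).encode("utf-16-be").hex())
```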

So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.

Unicode (1)

ralmeida (106461) | more than 13 years ago | (#175567)

Check this link to see why unicode characters won't work on the internet:



2 + 1 bytes? (1)

whovian (107062) | more than 13 years ago | (#175568)

Hm. log(170000)/log(2) = 17.4, so at least 18 bits is needed, as I cursorily understand this, to encode present human languages. Clearly a 3-byte unicode standard is needed. Maybe use only 20 bits and leave 4 bits for something else (font style, inverse, etc.).
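
The bit-count estimate above is easy to reproduce. A quick Python check, taking the 170,000 figure from the submission at face value:

```python
import math

characters = 170_000                   # figure quoted in the submission

bits_exact = math.log2(characters)     # fractional number of bits
bits_needed = math.ceil(bits_exact)    # whole bits required

print(round(bits_exact, 1), bits_needed)

# Which fixed widths are large enough?
for width in (16, 18, 20):
    print(width, 2**width >= characters)
```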

Unicode has this covered. (3)

tjwhaynes (114792) | more than 13 years ago | (#175573)

Had this researcher bothered to read the Unicode technical introduction, the following would have been obvious.

In all, the Unicode Standard, Version 3.0 provides codes for 49,194 characters from the world's alphabets, ideograph sets, and symbol collections. These all fit into the first 64K characters, an area of the codespace that is called basic multilingual plane, or BMP for short.

There are about 8,000 unused code points for future expansion in the BMP, plus provision for another 917,476 supplementary code points. Approximately 46,000 characters are slated to be added to the Unicode Standard in upcoming versions.

The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.

Plenty of room.
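
One way to arrive at the quoted supplementary figures, sketched in Python. This assumes 16 supplementary planes of 65,536 code points each, with the last two code points of every plane set aside as noncharacters, and planes 15 and 16 reserved for private use. The variable names are mine; the arithmetic happens to reproduce the numbers in the quoted text exactly.

```python
PLANE_SIZE = 0x10000            # 65,536 code points per plane
NONCHARACTERS_PER_PLANE = 2     # the last two code points of each plane

usable_per_plane = PLANE_SIZE - NONCHARACTERS_PER_PLANE

supplementary_general = 14 * usable_per_plane   # planes 1 through 14
supplementary_private = 2 * usable_per_plane    # planes 15 and 16

print(supplementary_general, supplementary_private)
```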


Toby Haynes

Bummer (1)

SpanishInquisition (127269) | more than 13 years ago | (#175595)

I always wanted to have greek letters AND hebrew letters AND a smiley face in my email address.

All Character sets simultaneously?? (1)

-tji (139690) | more than 13 years ago | (#175603)

Why would one want to represent all character sets of the world simultaneously??

In the WWW, doesn't the HTTP header contain character set information, so the client knows which of the many character sets/languages to use? Then, only the size of that one character set is important (which will always be FAR less than 64K).
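
For reference, this is roughly what the per-document approach looks like on the client side. The sketch below uses Python's standard `email.message.Message` class to pull the charset out of a Content-Type header value; the function name and the fallback default are assumptions made for the example.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default


# A Greek page served in a legacy single-byte encoding.
body = "\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1".encode("iso-8859-7")
charset = charset_from_content_type("text/html; charset=ISO-8859-7")

print(charset, body.decode(charset))
```

This works as long as the whole document fits in one character set; it says nothing about mixing several scripts in the same page.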

UTF8 (2)

Srin Tuar (147269) | more than 13 years ago | (#175610)

UTF8 is capable of encoding up to 31 bits per character, which is 2,147,483,648 distinct glyphs. This should be plenty for all languages, and at least for linux/*nix, it is well recognized as the way to go.

One upside of it is that there is almost no cost for English/ASCII, which will remain 1 byte per character. You don't even have to recompile most apps to support it; only those that format character glyphs.
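
The 31-bit figure follows from the original UTF-8 layout, which allowed sequences of up to six bytes. Here is a small Python sketch of the bit budget; `payload_bits` is a hypothetical helper, not a library function. Note that Python's own UTF-8 codec follows the later restriction to four bytes (nothing above U+10FFFF), so the five- and six-byte forms are shown here only as arithmetic.

```python
def payload_bits(sequence_length: int) -> int:
    """Number of code point bits carried by a UTF-8 sequence of the given length."""
    if sequence_length == 1:
        return 7                          # 0xxxxxxx
    lead_bits = 7 - sequence_length       # e.g. 110xxxxx carries 5 bits
    return lead_bits + 6 * (sequence_length - 1)   # each 10xxxxxx carries 6


for n in range(1, 7):
    print(n, payload_bits(n))

print(2 ** payload_bits(6))

# ASCII text costs exactly one byte per character.
sample = "plain ASCII text"
print(len(sample), len(sample.encode("utf-8")))
```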

You bring up a good point (2)

Srin Tuar (147269) | more than 13 years ago | (#175611)

Does anyone know a real language that has a simpler writing system than English?

Almost every other European language I have seen uses some set of accent marks or diacriticals. And having studied Japanese and Vietnamese, they have orders of magnitude more complexity. Even Esperanto has a larger alphabet than English.

Is it just a coincidence that the simplest writing system was the first to be digitized? Too bad pronunciation of English isn't equally simple.

(reply to AC) (2)

Srin Tuar (147269) | more than 13 years ago | (#175612)

Wrong, look here: unicode faq

quote: All possible 2^31 UCS codes can be encoded.

Re:After some skimming... (1)

kurisuto (165784) | more than 13 years ago | (#175620)

The author's example of treating "M" and "N" as the same character is just plain wrong. Changing one for the other changes the meaning (e.g. "smack" vs. "snack"). Not so for the various presentation forms of a single Chinese character.

A better example would be the ampersand character (&). I can think of several ways to write that character, but I challenge anyone to come up with a sentence where changing one presentation form of the ampersand for another changes the meaning of the sentence.

Re:All Character sets simultaneously?? (1)

kurisuto (165784) | more than 13 years ago | (#175621)

I have to represent multiple character sets all the time in my line of work (linguistics). For example, my dissertation included Roman, Greek, Cyrillic, IPA, and Runic characters, among others.

You're correct that it is possible to mark ranges of text as belonging to a particular character set, but there are many drawbacks to this solution. For example, I was recently trying to grep Greek words from a text in both Greek and Latin; both languages were encoded in the same character space, with tags to show what text was in what language. I got all sorts of spurious matches from the Latin words, which wouldn't happen if the Greek and Roman letters weren't sharing a single character space.
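
By contrast, when Greek and Latin letters have distinct code points, picking out the Greek words is a one-line pattern. A minimal Python sketch, assuming the text is Unicode and that the Greek and Greek Extended blocks (U+0370..U+03FF and U+1F00..U+1FFF) cover the letters of interest; the function name and sample sentence are made up for the example.

```python
import re
import unicodedata

# Greek and Coptic block plus Greek Extended (polytonic) block.
GREEK_WORD = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]+")


def find_greek_words(text: str) -> list:
    """Return runs of Greek letters, ignoring everything in other scripts."""
    # Normalize to NFC so accented letters are single precomposed code points.
    return GREEK_WORD.findall(unicodedata.normalize("NFC", text))


mixed = "In principio erat verbum; ἐν ἀρχῇ ἦν ὁ λόγος."
print(find_greek_words(mixed))
```

No Latin word can produce a spurious match here, because no Latin letter falls inside either range.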

There are workarounds, but they are a huge hassle for anyone who has to regularly work with multilingual text.

Re:Is this really such a problem? (1)

kurisuto (165784) | more than 13 years ago | (#175622)

Your arguments fail on languages such as Russian, Greek, Arabic, Hebrew, Hindi, etc. These languages use non-Roman character sets, but each has many millions of speakers; none is faced with language death in the foreseeable future. None of these languages can be considered politically or economically insignificant.

Even if a language does die, there is often still a need to work with it. For example, I specialize professionally in various pre-modern languages, some of which have not survived to the present. I still need a way to encode these languages as I use computers to produce dictionaries and online corpora.

Re:another drawback of unicode (2)

kurisuto (165784) | more than 13 years ago | (#175625)

There is in fact a group working on Unicode encodings for the Egyptian hieroglyphic character set. The codes will go in the "surrogate characters" range of Unicode. Regular Unicode uses the codes from 0 thru 2^16-1; the range reached through surrogates runs from 2^16 thru 2^16+2^20-1 (U+10FFFF), and has been designated by the Unicode Consortium for exactly this kind of case, i.e. large, rarely used character sets.

Overstating and misunderstanding the problem (4)

kurisuto (165784) | more than 13 years ago | (#175628)

This article mischaracterizes the issue concerning the Chinese characters. To take a western example as an illustration, the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V. However, folks in America and Germany agree that this is "the same character"; we simply have a different way of writing it. Unicode recognizes this sameness by assigning the same code to the character "one"; the way to display it locally is a presentation issue, not an encoding one.

This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.

For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.

Re:All Character sets simultaneously?? (1)

OhPlz (168413) | more than 13 years ago | (#175635)

What if you wanted to view two documents at once and their mappings conflicted? Granted, present day HTML restricts a document to one character set, but that may not be the case forever (or maybe it's not even the case now; what version of HTML are they up to now?).

Re:Solution - Everybody use Euro-English! (2)

Skuto (171945) | more than 13 years ago | (#175636)

a) This is (adapted?) from Mark Twain, it's in
most fortunes.

b) No matter how funny it looks, if you read
it aloud it's perfectly understandable...

c) ...but it keeps reminding me of 'Allo Allo?'


Solution - Everybody use Euro-English! (4)

saider (177166) | more than 13 years ago | (#175638)

The European Commission has just announced an agreement whereby English will be the official language of the EU rather than German which was the other possibility. As part of the negotiations, Her Majesty's Government conceded that English spelling had some room for improvement and has accepted a 5 year phase-in plan that would be known as "Euro-English".

In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the "k". This should klear up konfusion and keyboards kan have 1 less letter.

There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.

In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.

By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.

After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!

Re:No it's not (1)

Pru (201238) | more than 13 years ago | (#175643)

>>>>> Yeah, let's get down to the lowest common denominator and make laws that require all internet content to be at least in X number of languages. A bit like the silly EU regulation because of which every fucking document, web site and audio recording concerning the Union must be available in at least.

You have to be kidding, right? Laws that govern the internet? Isn't that dumber than multiple languages?

Re:After some skimming... (1)

Decado (207907) | more than 13 years ago | (#175649)

Well, since written Irish uses just 18 of the 26 letters of the English alphabet and gets by just fine, I assume it wouldn't matter that much. For the record, the missing letters are j, k, q, v, w, x, y and z.

Prejudice? Or technical hurdle... (1)

fleeb_fantastique (208912) | more than 13 years ago | (#175652)

I find myself somewhat frustrated with the viewpoint that western programmers "discriminate" against other cultures because the culture has too many characters, where "discriminate" implies a political, social, or personal conflict. The problem, frankly, seems more technical in nature to me.

Many operating systems have a design that uses a smaller character set, if for no other reason than to help conserve space. Take your average file system; the character set doesn't permit Unicode characters in most cases, and even the C++ STL doesn't have a spec for streaming files with wchar_t names.

Then consider that you have several evolving programs that have to be modified to use a different character set. From experience, I can tell you that, particularly for complex programs, this is not a trivial job.

Finally, imagine that a political body imposes a deadline on imported programs.. that they must support their new standard by such-and-so a date or it won't be permitted within the country. The Chinese did this, extending the deadline to Sept. 2001. I only found out about this yesterday.

It doesn't make a job easier.

Is this really such a problem? (1)

Dan Hayes (212400) | more than 13 years ago | (#175657)

While it's a noble and practical goal to eventually allow every language to be rendered as part of a website to allow for maximum access, I don't think that this limitation will really be much of a problem in the long run for two reasons.

Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevant. And in many of these places, political and economic instability caused by civil wars, corporate greed and a lack of resources will mean this situation will continue for some time.

Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendents having ever learned their original language. Cultural assimilation has proceeded at a brisk pace, with western countries only too willing to help with the "modernisation" of other cultures, which invariably results in a loss of their original heritage and linguistic uniqueness. And already globalisation is turning English into the de facto second language of the world.

By the time the 65K limit would become a problem, I estimate it won't be a problem any more - there will be far fewer languages around, and only a subset of those will require online access. If all else fails, many of the majority remaining will speak English anyway.

Mac OSX has character mapping problems (1)

wmulvihillDxR (212915) | more than 13 years ago | (#175658)

In using OSX from the start, it has always had problems mapping characters. Even the "normal" weird ASCII characters get mapped strangely. Upside-down question mark is one. I could deal with it changing, but what frustrates me is that it doesn't change back whenever you go back to other UNIX systems. For instance, downloading a text file with weird ASCII characters with OSX's scp will make things go awry. But then transferring that file back up does not switch it back. Weird stuff!

Re:After some skimming... (1)

update() (217397) | more than 13 years ago | (#175659)

I don't presume to say what people should accept.

I'm stating my impression about what they do accept (that Chinese users and standards bodies are far less troubled about Unicode than is the author) and speculating on why that might be (that out-of-the-box support to edit ancient texts in Word is more important to a scholar than to the vast majority of users).

Unsettling MOTD at my ISP.

After some skimming... (4)

update() (217397) | more than 13 years ago | (#175661)

I planned to read this through before posting. I really did. But then, in the second paragraph I hit:
Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
and I decided to skim the rest.

To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.

Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

Unsettling MOTD at my ISP.

In other news... (3)

ackthpt (218170) | more than 13 years ago | (#175662)

Bush bolts GOP to join Democrats, fires entire Whitehouse staff

Linus Torvalds to join Microsoft as OfficeXP advocate

NASA on Moonshots, "Ok, ok, they were all actually faked on a soundstage in Toledo, Ohio and the ISS is really in a warehouse in Newark, New Jersey"

Oracle CEO, Larry Ellison to give fortune to charity, dumps japanese kimonos for Dockers and GAP T-shirts

RIAA to drop all charges against Napster, "All a big fsck-up, we'll all get rich together"

Taiwan throws in towel, joins PRC, turning over massive US military and intelligence assets

Rob Malda signed by Disney, epic picture planned, based upon this short. Sez Malda, "Anime's not mainstream enough anyway."

All your .sig are belong to us!

So... (4)

ackthpt (218170) | more than 13 years ago | (#175663)

4av3 3v3r0n3 1n t4e w0r1d 13arn t0 typ3 l33t!

All your .sig are belong to us!

ASCII stupidity all over again... (2)

Matthias Wiesmann (221411) | more than 13 years ago | (#175664)

It's not new, and alas not surprising.

When they did ASCII, it was a standard by the US, for the US; the mess it created in the high-ASCII range (128-255) is still not resolved, and I'm talking about diacritical characters like those used in western European languages (French, German, Spanish etc.), nothing fancy or very exotic. The problem was, of course, that the Europeans were not involved in the process.

Now they do a universal standard that should correct all problems and, surprise, they don't actually bother to check with the people concerned. Even if they did, it would make sense to have provisions for a few unknown character sets (like ancient civilisations or the myriad of small groups of people living in lost parts of the world).

Anyway, if computer history has taught us anything, it is that a 16-bit range is never sufficient for practical uses. Well, just another sad example of one size does not fit all... But I suppose the Slashdot response will be: why the hell don't they all speak/write English...

Re:Duh. (1)

devnullkac (223246) | more than 13 years ago | (#175665)

I disagree. I don't see Unicode (or its alternatives) as a way to resolve language barriers. Rather, it defines a framework within which all programmers can use the same libraries and programming languages to develop applications using their own language.

To use a gardening analogy: it doesn't make us all plant the same things or help us understand the meaning of what someone else has planted; it just lets us all use the same tools for working in our gardens.

Alrighty (2)

rabtech (223758) | more than 13 years ago | (#175666)

The guy obviously has an anti-western mindset.

But to simplify, the crux of his argument seems to be that in order to read ancient works from the Chinese/Japanese/etc, they need about 40,000 to 50,000 characters each.

But in reality, the average Japanese person would use less than 10,000 characters. In fact, probably much less.

Besides -- it is mostly a moot point until you can show me a keyboard capable of entering 50,000 unique symbols efficiently.

His solution seems to be allocating 32-bits of storage per character, rather than the 16-bit Unicode standard we have now.

For the foreseeable future, it would seem that Latin-esque alphabets have the upper hand. It just makes more sense, especially in terms of programming and protocols. Do we really need web servers that understand how to read "GET / HTTP/1.1" in thirty different character sets?

-- russ

umm (1)

zephc (225327) | more than 13 years ago | (#175667)

isn't that why there are all those different encodings, so there IS no overlap? i don't think there is any language that cannot use the 65K characters to construct its written language on a computer.

kanji has what, 3000 characters? Chinese uses unicode by combinations of roots and the other parts of the characters (its kinda complex, but it works! :P)

So what is the problem? I see no conflicts in any encoding scheme, where even really complex ones like Chinese still work?

(i know that they use some other type of syllabic system for teaching the writing system, or so i recall from Chinese lessons on TV... maybe that should be used to replace the pictograph system in place now, a system which was kept by the emperors in order to keep the masses illiterate)

Re:You bring up a good point (1)

zephc (225327) | more than 13 years ago | (#175668)

Esperanto may have a slightly larger alphabet, but it has perfectly logical and regular rules for spelling as well as grammar, and is FAR easier to learn than English

Re:After some skimming... (2)

bmongar (230600) | more than 13 years ago | (#175672)

Of course the display sets could develop diphthongs, i.e. more than one Unicode char to represent a Chinese character. Part of the problem with the Chinese character set is that it is not a character set so much as a dictionary, with words having only one character, and no restriction on adding new ones. So don't make a single character for each of them, use letter combinations. Of course that is my western bias.

A tough problem... (2)

RareHeintz (244414) | more than 13 years ago | (#175675)

This is a problem that the Chinese gov't has realized in the past, and the development of the Pin-Yin phonetic romanization system was originally started with an eye toward phasing out the (admittedly more cumbersome, but significantly more beautiful) ideograms. (Of course, they had no idea about the Unicode issue back then, but I'm speaking of the larger issues of having a huge, ideogrammatic written language, of which the Unicode problem is just a new manifestation.)

I don't know where these plans for conversion to a phonetic written language stand now, though I'm sure it wouldn't be hard to find out.

- B

Duh. (2)

Shoten (260439) | more than 13 years ago | (#175684)

This should be obvious to anyone who has ever looked at a unicode chart or has had to click "Cancel" when asked to install character support for any of the myriad languages that need language packs to be displayed in Windows. Ok, so they built a way to theoretically support all of these characters. This does not mean that I can read Japanese, however, and making it possible to see it in my browser will not change that fact, nor will it make Google searchable in Japanese, cause IRC to support katakana or hiragana characters (and just freaking forget kanji unless you want to chat with a graphics tablet). Unicode has purposes (besides making it easier to hack web servers, that is), but the hopes and dreams built around it are a classic case of throwing tech at a social barrier to try and make it go away.

Re:Is this a problem? (1)

Hektor_Troy (262592) | more than 13 years ago | (#175685)

Why not have the standard language be Chinese? It's spoken and written by more people than English, so that would make more sense.

But for Java (1)

Husaria (262766) | more than 13 years ago | (#175686)

They made it work for Java; I'm sure an interpreter for HTML or ASP could be worked out using Unicode.
Download once, read anywhere

4 bytes per character? (1)

oogoody (302342) | more than 13 years ago | (#175690)

Don't think so.

Unicode Surrogates (1)

Mumbly_Joe (302720) | more than 13 years ago | (#175691)

I've been writing Unicode code lately (using UTF-8 encoding) and I've been reading the 3.0 standard.

The Unicode standard supports surrogates, which are pairs of 16-bit code units. These pairs define about 1 million additional code points within the standard. A "code point" is a unique value for some character.

There is plenty of room in the Unicode space for all the characters.

Re:Pictographic icons are not letters! (1)

Alanus (309106) | more than 13 years ago | (#175697)

In Japanese house has 10 strokes. The problem is that the strokes cannot be mapped to individual "characters" since the individual strokes have no standardized position, direction or size. To encode the strokes would take a lot of information (a bitmap would probably be easier).

Re:Alrighty (2)

vidarh (309115) | more than 13 years ago | (#175698)

Input methods for Chinese, Japanese and Korean exist, and can efficiently handle the number of characters required. Some do it by typing out the romanized sound, and mapping it to the characters.

And actually, the "Unicode standard we have now" does not fit in UCS-2 (16 bit). It requires one of the UTF-* encodings (which are variable length encodings), or UCS-4 (32 bit).

As for his gripes about Unicode 3.1, sure, there are things you can't write with it. But it's a good step forward. And it doesn't fill the entire glyph-space, by far. The 32 bit encodings, because of the way they are arranged can "only" handle about a million characters if I remember correctly, but that is still way more than is needed.

Re:ASCII stupidity all over again... (2)

vidarh (309115) | more than 13 years ago | (#175699)

Get your facts straight. Unicode isn't written in stone. It is an evolving standard. And one of the reasons it is taking so long is precisely because everyone affected can get involved - there's been a lot of infighting about which glyphs should make it and how to organize them. The result, however, is that most commonly used scripts can be handled by the current version of Unicode. More will most likely be handled in the future.

Re:totally unconvinced (2)

vidarh (309115) | more than 13 years ago | (#175700)

One of the reasons they want different glyphs is that the characters actually look different in present day use.

As for including all possible historical versions of Western characters, there are very few that are sufficiently different from present day renderings to be easy to confuse.

But I agree that his criticism is mostly whining. Most of all because Unicode 3.1 has shown that unicode absolutely is not a static standard, but one that is evolving to encompass more characters on a regular basis. Perhaps some people will have problems using it today. In that case those people should interact with the standards committee instead of whining, and get their characters into the next version.

But for most people (including most Chinese and Japanese people) the current Unicode standard will be comprehensive enough for most use.

Re:2 + 1 bytes? (2)

vidarh (309115) | more than 13 years ago | (#175701)

Uhm. Unicode already has at least three representations that allow for about a million characters each: UTF-8 (8 bit for US-ASCII, 2-4(?) bytes for everything else), UTF-16 (usually 16 bit, 32 bit for alternate "planes") and UCS-4 (32 bit).
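
To see how these encodings differ in practice, here is a short Python comparison of the bytes needed per character. It uses the codec names `utf-8`, `utf-16-be` and `utf-32-be` from the standard library (UTF-32 holds the same 32-bit values as UCS-4); the helper name and the sample characters are picked arbitrarily for illustration.

```python
def encoded_sizes(ch: str) -> dict:
    """Bytes needed for one character in each encoding (no byte order mark)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


samples = {
    "ASCII letter A": "A",
    "Latin-1 e-acute": "\u00e9",
    "CJK ideograph U+4E2D": "\u4e2d",
    "supplementary U+1D11E": chr(0x1D11E),
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

Only the last sample lies outside the first 64K code points; it is the one case where UTF-16 needs two 16-bit units, which plain UCS-2 cannot express.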

In other words, the limitation currently isn't lack of space in the Unicode encodings (unless you use UCS-2), but the fact that they simply haven't gotten around to specifying any more characters yet - unicode is still a work in progress.

Re:Uh, I Don't Get It (2)

vidarh (309115) | more than 13 years ago | (#175702)

UCS-4 is not a character set. It is an encoding of Unicode, similar to UCS-2 (UCS-2 is 16 bit, UCS-4 is 32 bit), and UTF-7, UTF-8 and UTF-16 (variable-length encodings).

Except for UCS-2 (and perhaps UTF-7? I don't remember), all of them can encode about a million glyphs (the reason it's not more is due to the way the codespace is laid out, separating things in "planes", and reserving a lot of space for private use etc.)

Hmm.. I must have been using something else then? (4)

vidarh (309115) | more than 13 years ago | (#175711)

I've been using Unicode in various incarnations for a long time. And UCS-2 is not the only way to encode Unicode. UTF-8 is perhaps a lot more widespread, as it is the de facto standard encoding for exchange of XML documents over the web.

UCS-4 is also quite common, and allows for the new extensions.

UTF-16 is used by some who need to extend their UCS-2 applications, or who mostly need text that works with UCS-2 but want to be prepared for more.

Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, unicode is what people use.

Re:More Flamebait :) (2)

tb3 (313150) | more than 13 years ago | (#175712)

Klingon into Unicode? I knew those people were obsessed, but that's just asinine! Fictional languages shouldn't even be considered, where would it end?

"What are we going to do tonight, Bill?"

Re:All Character sets simultaneously?? (1)

Ubi_NL (313657) | more than 13 years ago | (#175713)

Is that a problem? I mean a *real* problem? Having my documents in unicode means my files get 18 times larger than 'normal' only for the off chance I want to put a funny character in.

There are better ways for this. Why not just put a language directive in the header of a HTTP file?

More Flamebait :) (3)

bark76 (410275) | more than 13 years ago | (#175724)

Maybe if people didn't try to get character sets like Klingon, Cirth and Tengwar added into unicode we wouldn't have this problem!

another drawback of unicode (3)

Magumbo (414471) | more than 13 years ago | (#175729)

And we must not forget about hieroglyphics. Unicode certainly has forgotten about them. That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.


I'll take that challenge (1)

MarkusQ (450076) | more than 13 years ago | (#175736)

First off, I agree with you. But you post such an interesting challenge I can't resist:

A better example would be the ampersand character (&). I can think of several ways to write that character, but I challenge anyone to come up with a sentence where changing one presentation form of the ampersand for another changes the meaning of the sentence.

How about:

"To delimit a path, *nix uses a slash, whereas MS* uses a backslash; if you get these confused it helps to remember that ampersand is a rounded "E" with a slash through it."

Contrived, I will admit, but I think it answers your challenge.


Yes, it is (1)

absurd_spork (454513) | more than 13 years ago | (#175738)

Just because English is the most popular language on the Internet at the moment, that doesn't mean that either other languages were not used or that other languages might not take over that role in the future. If, for example, the growth of Internet accessibility in China keeps up at that rate, Chinese will be language #1 in the Internet by 2007, especially since Chinese will be read and understood by Koreans and Japanese as well.

You don't really KNOW about unicode, do you? (2)

absurd_spork (454513) | more than 13 years ago | (#175739)

Honestly, you don't really KNOW about Unicode and how it works, do you?

The idea behind Unicode is to have a uniform encoding for all the world's scripts, not for all the world's languages. The necessity of this is evident for anyone who has experience with the insufficiencies of the individual codepage systems (Windows CPxxx, ISO 8859-x, ISCII etc.) currently in use. Have you ever tried to send an Arabic e-mail through a non-Arabic mailserver or run a program with German character support on a codepage 450 windows? Unicode is designed to make programs and data interoperable regardless of either's language encoding.
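
The codepage problem is easy to reproduce. In this small Python sketch the same single byte is decoded under three legacy encodings from the standard library (`latin-1`, `iso-8859-7`, `cp1251`) and comes out as three unrelated letters, whereas in Unicode each of those letters has its own code point. The byte value 0xE4 was picked arbitrarily for the example.

```python
raw = b"\xe4"   # one byte, meaning depends entirely on the assumed codepage

for codec in ("latin-1", "iso-8859-7", "cp1251"):
    ch = raw.decode(codec)
    # Show the letter and the distinct Unicode code point it maps to.
    print(codec, ch, "U+%04X" % ord(ch))
```

A mail server or program that guesses the wrong codepage has no way to notice the mistake, because every one of these byte values is valid in every one of these encodings.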

Just because you don't know Japanese it doesn't make the rendering of Japanese pointless. Just because you don't have a clue how a Chinese or Japanese Kanji input system works doesn't render the idea of being able to chat in IRC using Japanese characters entirely pointless.

Workaround (1)

oliveloaf (454539) | more than 13 years ago | (#175740)

Why not use unicode for everyday use, and a PDF'ish format that could have every character of said language for special purposes, i.e. historical documents.

Re:Is this really such a problem? (1)

trash eighty (457611) | more than 13 years ago | (#175748)

Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access.

such as china, taiwan and japan? =P

And already globalisation is turning English into the de facto second language of the world.

this is not and never will be true nevermind how many times people repeat this fallacy

Re:totally unconvinced (1)

trash eighty (457611) | more than 13 years ago | (#175749)

it's not the same thing though, and in thousands of cases the characters are not the same

Re:Is this a problem? (2)

trash eighty (457611) | more than 13 years ago | (#175753)

would you make the same comments if you were not able to speak/read english?

millions of pages of the web are not in english y'know?
