Migrate a MySQL Database Preserving Special Characters

kdawson posted more than 7 years ago | from the encoding-juju dept.

Databases 98

TomSlick writes "Michael Chu's blog provides a good solution for people migrating their MySQL databases and finding that special characters (like smart quotes) get mangled. He presents two practical solutions to migrating the database properly."
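The usual failure mode the article addresses is double-encoding: UTF-8 bytes re-interpreted as latin1 (which MySQL actually maps to cp1252) somewhere in the dump/import pipeline. A minimal Python sketch of the mechanism — an illustration of why smart quotes turn into "â€™", not necessarily Chu's exact procedure:

```python
# Illustration of the double-encoding mechanism; not TFA's exact fix.
# A curly quote stored as UTF-8 but mis-read as cp1252 (MySQL's "latin1"):
good = "\u2019"                                    # RIGHT SINGLE QUOTATION MARK
mangled = good.encode("utf-8").decode("cp1252")    # the classic mojibake
assert mangled == "\u00e2\u20ac\u2122"             # renders as "â€™"

# Reversing the mis-decode recovers the original character:
fixed = mangled.encode("cp1252").decode("utf-8")
assert fixed == good
```

In practice the same round-trip is done at the SQL level, by dumping with an explicit character set and re-importing the dump as UTF-8.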


98 comments


smart quotes? more like stupid fucked up quotes. (-1, Troll)

Anonymous Coward | more than 7 years ago | (#19017089)

am i right or what?

Re:smart quotes? more like stupid fucked up quotes (1)

MillionthMonkey (240664) | more than 7 years ago | (#19017583)

I share your attitude. Sometimes people where I work send snippets of code and whatnot over Outlook, and it gets corrupted with smiley faces.

Re:smart quotes? more like stupid fucked up quotes (2, Funny)

frisket (149522) | more than 7 years ago | (#19018457)

Real IT professionals don't use Outlook.

Re:smart quotes? more like stupid fucked up quotes (0)

Anonymous Coward | more than 7 years ago | (#19039181)

Real IT professionals don't use Outlook.

I agree. They use Outlook Express, because it comes with every Pee Cee and is completely free.

Re:smart quotes? more like stupid fucked up quotes (1)

brunascle (994197) | more than 7 years ago | (#19018785)

speaking of which, does anyone know where i can find info on the perl module named Audio:[smiley-sticking-his-tongue-out]lay? CPAN seems to be missing this one.

Re:smart quotes? more like stupid fucked up quotes (1)

Stonent1 (594886) | more than 7 years ago | (#19039219)

Send as plain text and not HTML or Rich Text.

Migration (4, Informative)

dfetter (2035) | more than 7 years ago | (#19017095)

Better still, install DBI-Link http://pgfoundry.org/projects/dbi-link/ [pgfoundry.org] inside PostgreSQL, migrate once and have done ;)

Your sig. (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#19017171)

http://www.guncite.com/gc2ndmea.html [guncite.com]

This link covers pretty much what there is to understand about the passage. Could you tell me what piece you had in mind when you asked the question?

Re:Your sig. (0)

Anonymous Coward | more than 7 years ago | (#19019017)

“Freedom is untidy.”
            —Donald Rumsfeld

Remember, kids: Violence and mayhem [cnn.com] are symptoms of our nation's greatness. May the gun-control terrorists never rob us of our untidy freedoms, nor our foreign colonials of theirs.

Re:Migration (-1, Offtopic)

Anonymous Coward | more than 7 years ago | (#19017409)

Indeed, but the creators of Enterprise completely painted themselves into a corner. They had a cute idea, hoping to revive the original Trekkie spirit, but they soon realized that they didn't have enough material for a series without stepping into realms already covered in the previous 'future' series. The Borg are a good example of this.

According to VOY, the Borg have the ability to travel almost anywhere in the Galaxy in fairly short periods of time. THIS explains the outposts.

Annika was assimilated when she was a small child, 6 if I remember correctly. That was about 12-14 years before her time on Voyager. This puts it, time-wise, shortly after the Enterprise-D's first encounter with the Borg (Q). And if you look at the Hansens' ship on VOY, it's very TNG-esque in its style.

Putting the Borg on Enterprise was a bad move and totally messes up continuity. Their way around it is simply to never have the Borg identify themselves. And if this was as major an event as Enterprise made it out to be, EVERY captain, not to mention every cadet, would have known about the 'cybernetic creatures'. And one more minor thing, but something every trekkie can quote:

'We are the Borg. You will be assimilated. Resistance is futile.'

unless you're appearing on Enterprise. Then it's:

'You will be assimilated. Resistance is futile.'

I enjoyed some of the episodes of Enterprise. But really, the whole show was a bad concept. They really restricted themselves to a short period of ST history, with all of the events on both sides of that period being known. Basically they jumped on the Star Wars prequel bandwagon. And truth be told, it didn't work that well for SW either.

If they're going to make another series, and I sincerely hope they do, it needs to be set well in the future beyond the TNG-DS9-VOY timeframe. Otherwise you get tripped up on big plot holes like this that simply turn off fans, damage the franchise, and lead to the demise of the series.

We've had enough glimpses of futures beyond the 24th century that they could start to build a framework based on those things. I can't wait to see it!

Re:Migration (0, Offtopic)

Hal_Porter (817932) | more than 7 years ago | (#19017643)

Thank you for this post.

I'm going through a very bad divorce at the moment. I've actually moved back into my parents' basement, but my soon-to-be ex-wife still keeps visiting. I've tried various ways to discourage her, including putting up a picture of my new partner, Cletus, and me in our matching fursuits on the cellar door, but none has worked.

But since I put a printout of your post up, she hasn't come around at all.

If women were vampires, your post would be garlic.

Re:Migration (0, Offtopic)

try_anything (880404) | more than 7 years ago | (#19017719)

Unfortunately, there are some women who resemble werewolves, and they are attracted to Star Trek like it was made out of ice cream.

Re:Migration (0, Offtopic)

Hal_Porter (817932) | more than 7 years ago | (#19018337)

Stick a Serenity poster on your door, that'll keep them away.

Useful? This is damned awesome! (1)

Architect_sasyr (938685) | more than 7 years ago | (#19017099)

There are about 8,000 WordPress blogs out there that could use this. Pity I can't mod an article insightful.

Re:Useful? This is damned awesome! (5, Informative)

jamshid (140925) | more than 7 years ago | (#19017415)

Then send the wordpress developers this link:

http://www.joelonsoftware.com/articles/Unicode.html [joelonsoftware.com]
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Re:Useful? This is damned awesome! (1)

frisket (149522) | more than 7 years ago | (#19018439)

But some of them Just Don't Get It [tm]. "We're Americans, we don't use fancy foreign letters, so we just want ASCII." Sigh.

--
The best cure for seasickness is to go sit under a tree. [Spike Milligan]

Joel is often awesome. (1)

argent (18001) | more than 7 years ago | (#19018727)

I would say "always" but nobody's *always* anything... but often enough that, off the top of my head, I can't pull up a specific example of a Joel On Software article that's not. :)

Re:Useful? This is damned awesome! (2, Interesting)

Hognoxious (631665) | more than 7 years ago | (#19018933)

I'm not American, and I'm sitting here supporting a multinational IT system (Italy, Belgium, Netherlands, UK, Spain & Portugal) and it works fine without Unicode. While I'm generally a fan of Joel, I think he overstates the case here.

Re:Useful? This is damned awesome! (1)

Krischi (61667) | more than 7 years ago | (#19019103)

Good luck in supporting EU member countries such as Bulgaria or Greece, then. You will need it.

Re:Useful? This is damned awesome! (0)

Anonymous Coward | more than 7 years ago | (#19029653)

The problem with joel is that he's a fudge packer. And he thinks everyone is a fudge packer. Now, if you're fudge packer, fine, read joel, follow his advice, spam his links, suck off another dude, whatever. But personally, I prefer vagina.

Big Trouble in Little China. Don't use UCS-2. (5, Informative)

argent (18001) | more than 7 years ago | (#19018979)

UCS-2 only covers plane zero (the Basic Multilingual Plane, or BMP). It doesn't cover code points outside that. Unicode actually supports the entire UCS, all 1.1 million (and growing) code points.

In other words, Joel has made the same mistake as the people who wrote software that only works in 7-bit ASCII or 8-bit UTF-8 or the IBM or Apple or Adobe 8-bit extended ASCII sets or the 9-bit extended ASCII set that ITS used, or...

And it's already too late to try to cram everything into 2 bytes. After the Han unification mess (the attempt to force Chinese, Japanese, and everything else that used some variant of Chinese ideograms (Kanji, etc...) into a common subset of similar characters that fit in the 65,535 available codes in the BMP), the People's Republic of China decided to require their computers to support their national encoding anyway, as of 2000.

So you have to support the full UCS encoding anyway.

There are three storage formats that are practical to use: UCS-4 (4 bytes per character, with the same byte-ordering problems as UCS-2), UTF-16 (2-4 bytes per character, same as UCS-2 for the BMP), or UTF-8 (1-4 bytes per character). Internally, you can use UCS-4 as your wide character type and translate on the fly; use UTF-8 and take care to avoid breaking strings in the middle of glyphs; or use UTF-16, translate on the fly, and take care to avoid breaking strings in the middle of glyphs.

If Joel is lucky the libraries he's using are actually operating on UTF-16 strings instead of UCS-2 strings. If he's *really* lucky they're designed to avoid breaking up codes outside the BMP. If he's *super* lucky he's managed to avoid creating any code that just operates on strings as a stream of wchar_t anywhere.

Personally, I think that UTF-16 gets you the worst of both worlds: your data is almost certainly less compact than if you use UTF-8; you still have to deal with multi-wchar_t strings so your code is no easier to write than if you used UTF-8... you're just less likely to find bugs in testing; and you get byte order issues in files just like you would with UCS-4. Unless you think UCS-2 is "good enough" and you just ignore everything outside the BMP and discover that people in China are suddenly getting hash when they use your program.
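The BMP limitation is easy to demonstrate. A small Python check (Python's str is code-point based, so it stands in for a correct wide-character library here):

```python
clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, a code point outside the BMP

assert len(clef) == 1                    # one code point...
assert len(clef.encode("utf-8")) == 4    # ...four bytes in UTF-8
utf16 = clef.encode("utf-16-be")
assert len(utf16) == 4                   # two 16-bit code units: a surrogate pair
assert utf16 == b"\xd8\x34\xdd\x1e"      # high surrogate D834, low surrogate DD1E

# UCS-2 simply has no representation for this character:
assert ord(clef) > 0xFFFF                # doesn't fit in a single 16-bit code unit
```

Code that treats UTF-16 data as a flat array of 16-bit characters will split that surrogate pair and produce garbage, which is exactly the bug described above.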

Re:Big Trouble in Little China. Don't use UCS-2. (0)

Anonymous Coward | more than 7 years ago | (#19021885)

Note that even with UCS-4 you still can't just treat the text like a stream of four-byte entities that you can chop up at will. With the possibility of decomposed characters, you still have to scan the text for multi-unichar units and avoid breaking up decomposed characters.
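The point about decomposed characters can be made concrete with Python's unicodedata module (a sketch of canonical equivalence, independent of any particular UTF):

```python
import unicodedata

composed = "\u00e9"        # é as a single precomposed code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

assert composed != decomposed    # different code point sequences...
assert len(composed) == 1 and len(decomposed) == 2
# ...but canonically equivalent under Unicode normalization
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

So even in a fixed-width encoding, chopping between the 'e' and the combining accent corrupts the text.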

Re:Big Trouble in Little China. Don't use UCS-2. (1)

argent (18001) | more than 7 years ago | (#19022531)

Why can't you maintain characters in composed form internally? You would only need to convert when importing an external format, and you should be using UTF-8 for that since UCS-2 and UCS-4 have byte-order dependencies.

Re:Big Trouble in Little China. Don't use UCS-2. (2, Informative)

tepples (727027) | more than 7 years ago | (#19026021)

Why can't you maintain characters in composed form internally?
Because the set of composite characters that have a composed form differs from Unicode version to Unicode version.

Re:Big Trouble in Little China. Don't use UCS-2. (0)

Anonymous Coward | more than 7 years ago | (#19027203)

Not all decomposed character combinations have a composed equivalent. Most do, but it's possible to create legal combinations which have no single unichar replacement. Not to mention that if you did this, you suddenly need a whole bunch of code to take all incoming text and fix it up so that everything is precomposed, which once again means that you can't just treat things like bags of characters.
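That claim is easy to verify: some perfectly legal combining sequences have no precomposed code point, so NFC leaves them decomposed. A small Python check (q-with-tilde is one such sequence, assuming current Unicode tables):

```python
import unicodedata

q_tilde = "q\u0303"   # 'q' + COMBINING TILDE: legal, but no precomposed form exists
assert unicodedata.normalize("NFC", q_tilde) == q_tilde   # NFC cannot compose it
assert len(unicodedata.normalize("NFC", q_tilde)) == 2

# Compare 'n' + COMBINING TILDE, which does compose to ñ:
assert unicodedata.normalize("NFC", "n\u0303") == "\u00f1"
```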

Re:Big Trouble in Little China. Don't use UCS-2. (2, Interesting)

argent (18001) | more than 7 years ago | (#19031977)

Not to mention that if you did this, you suddenly need a whole bunch of code to take all incoming text and fix it up so that everything is precomposed

That's no different from "you need a bunch of code to take the incoming text and convert it to UCS-2 (or UTF-8, or UTF-16, or UCS-4)".

it's possible to create legal combinations which have no single unichar replacement.

Are they meaningful as well as legal, or are they like the "n with an umlaut" in "Spinal Tap"?

I'm of the opinion that there should be *no* precomposed characters, or they should *all* be precomposed.

In any case, you can always use a guaranteed unused code and use a lookup table on input and output.

Re:Big Trouble in Little China. Don't use UCS-2. (0)

Anonymous Coward | more than 7 years ago | (#19037267)

That's no different from "you need a bunch of code to take the incoming text and convert it to UCS-2 (or UTF-8, or UTF-16, or UCS-4)".
Actually it's quite different. If the incoming text is some form of unicode, writing a converter to convert it to another form of unicode is trivial. I'm pretty sure I could write something to interconvert between all of the formats you mention in about an hour, with minimal need for references. Converting between precomposed and decomposed characters requires large equivalence tables which are going to change depending on what version of unicode you're targeting.

Of course tracking precomposed character sets also takes large equivalence tables, but the point is that using only precomposed characters doesn't make the job any easier, even if it's not necessarily any harder.

Are they meaningful as well as legal, or ar they like the "n with an umlaut" in "Spinal Tap"?
I'm not sure, but as soon as you decide that you're only going to support some unicode because parts of it are too hard, you suddenly put yourself in a position to be incompatible with some incoming text, and presumably the reason you're using unicode in the first place is to avoid all of that. After all, maybe somebody decided to use Spinal Tap's correct diacritical in your data!

I'm of the opinion that there should be *no* precomposed characters, or they should *all* be precomposed.
I agree. And given how things work, having no precomposed characters would make the most sense, so that you avoid the combinatorial explosion. But then you sacrifice things like Latin1 code point equivalence. As usual, these things are largely defined by backwards compatibility.

In any case, you can always use a guaranteed unused code and use a lookup table on input and output.
Until you run out of such codes, anyway.

Re:Big Trouble in Little China. Don't use UCS-2. (1)

argent (18001) | more than 7 years ago | (#19041061)

Converting between precomposed and decomposed characters requires large equivalence tables which are going to change depending on what version of unicode you're targeting.

If Unicode is so broken, then...

I'm not sure, but as soon as you decide that you're only going to support some unicode because parts of it are too hard, you suddenly put yourself in a position to be incompatible with some incoming text, and presumably the reason you're using unicode in the first place is to avoid all of that.

It would seem inevitable that you're going to be incompatible with some incoming text no matter what.

I see this as not supporting some Unicode because parts of it are too stupid.

Re:Big Trouble in Little China. Don't use UCS-2. (0)

Anonymous Coward | more than 7 years ago | (#19102199)

Precomposed forms only exist for lossless round-trips to/from legacy character sets. NFC is frozen [unicode.org], so there aren't going to be precomposed forms for any new diacritical marks. You can use NFD or NFKD to remove everything precomposed from your data. They recommend NFC (everything precomposed where possible) for the Web, though.

What you were proposing is a dictionary compression scheme to fit each combining character sequence into a single array slot, but that doesn't even buy you much. You still can't sort or display strings by splitting them arbitrarily or processing one combining character sequence at a time because of ligatures, digraphs (e.g., "ch" sorts after "h" in a Slovak locale), bidi, and weird stuff like soft hyphen and combining grapheme joiner. The library routines need to see the whole string at once to give the right answers in context.

PS (1)

argent (18001) | more than 7 years ago | (#19041229)

In any case, you can always use a guaranteed unused code and use a lookup table on input and output.

Until you run out of such codes, anyway.


You have planes 15 and 16 available. If you have a *single* object that is using more than 131072 unique combinations of composing characters I'll be impressed, and you might even have to start using the as-yet-unused 12 bits of UCS-4.

Re:Big Trouble in Little China. Don't use UCS-2. (1)

wiredlogic (135348) | more than 7 years ago | (#19021969)

Actually, Windows shifted to UTF-16 (with full surrogate pair support) as of Win2K. From the description in Joel's article it would seem that he is relying on the Windows string APIs and is safe unless his customers are trying to run on older versions.

Re:Big Trouble in Little China. Don't use UCS-2. (2, Informative)

argent (18001) | more than 7 years ago | (#19023047)

Assuming he's ONLY using Windows string APIs.

First, you need to be religious about it. But if you are, then the choice of internal encoding is really a performance issue only, and the choice of external encoding is a matter of following the principle of least astonishment. Your code shouldn't know nor care what encoding the string APIs use internally. The program should work the same whether wchar_t is (unsigned char), (unsigned short), (unsigned long), or even (double).

Second, there's a lot of overhead in canned string code. Depending on the problem space that may be OK. For short strings or simple operations, or where there's significant per-token overhead otherwise, the overhead of dipping into the API for each character isn't significant. When performance matters, though, even inlined code can slow the critical path in the program down orders of magnitude. Getting decent performance requires a parser be character-set aware.

So the advice should be:

1. "Use a canned string handling library that's unicode-aware". That means using at least the ISO C wide character libraries and NOT depending on implementation details like whether they're UCS-2 or UTF-16 or UCS-4 or UTF-8 or for god's sake UTF-1 internally.
2. "Only export data in UTF-8".
3. "When performance matters, figure out what's most efficient for your problem space, and use that." And if you're not sure, performance probably doesn't matter as much as you think it does. And don't forget that cache matters and branches cost dozens of instructions.
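Point 2 ("only export data in UTF-8") amounts to never relying on the platform default encoding at an I/O boundary. A minimal Python illustration (the file name is arbitrary):

```python
import os
import tempfile

text = "na\u00efve \u2014 \u201csmart quotes\u201d and \u6f22\u5b57"
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Always name the encoding explicitly when crossing an I/O boundary
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    assert f.read() == text   # round-trips losslessly, regardless of locale
```

If either open() call omitted the encoding, the result would depend on the machine's locale, which is exactly how mangled dumps like the one in TFA happen.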

Re:Big Trouble in Little China. Don't use UCS-2. (0)

Anonymous Coward | more than 7 years ago | (#19024473)

Your comments vis-a-vis UTF-8 vs. UTF-16 are based on a Western bias. UTF-16 is much more efficient at encoding the vast range of BMP characters; UTF-8 is much more suitable for what I think of as "ASCII+": the ASCII set, with extensions to encode the rest of the Unicode characters. It's a simple trade-off of any variable-length encoding scheme. You can make some encodings shorter, at the expense of making many others longer.

In general, UTF-16 is recommended for multilingual applications, and it's what Java, Windows, etc. use internally for wide character data, since it's not quite as wasteful as UCS-4 and it's more efficient than UTF-8 for the majority of useful characters. UTF-8's strong suit is as a UTF which is mostly compatible with the enormous number of ASCII systems that can take some form of 8-bit character. (Then there's the abortion known as UTF-7 for those that don't, but Base64 seems like a better solution in these cases. :))

I was planning on adopting UCS-4 for all of my internal strings, but I've had second thoughts about it. Unicode is a complicated standard to implement, and just using UCS-4 isn't good enough. And there's always the possibility that they might need to define more than 4 billion code points in the future. :) (Though if that ever happened, we'd probably have to build a whole new standard anyway.)

The best thing to do is use a library like ICU, and let it worry about the internals of everything. As painfully large as it is for something (like string processing) which should be as simple, that's just the nature of the Unicode beast. That just leaves you with the external encoding, and here UTF-8 seems to be the winner, assuming storage/transfer size isn't an issue, due to the lack of byte ordering issues. As long as all your UTF-16 documents have a byte ordering mark, though, it's not too big a deal.
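The size tradeoff the parent describes is easy to quantify in Python:

```python
ascii_text = "hello world"
cjk_text = "\u4f60\u597d\u4e16\u754c"   # 你好世界

assert len(ascii_text.encode("utf-8")) == 11
assert len(ascii_text.encode("utf-16-le")) == 22   # twice the bytes for ASCII
assert len(cjk_text.encode("utf-8")) == 12         # 3 bytes per BMP CJK character
assert len(cjk_text.encode("utf-16-le")) == 8      # 2 bytes per BMP CJK character
```

Which encoding is smaller therefore depends entirely on the mix of scripts in the actual data, not on which one is "multilingual".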

Re:Big Trouble in Little China. Don't use UCS-2. (0)

Anonymous Coward | more than 7 years ago | (#19030059)

You're correct about the bias, but it's important to emphasize the nature of the tradeoff. It's not "is my app multilingual" but "how much of the text I actually encounter is in European languages" that makes UTF-16 or UTF-8 a win. When space is a big enough issue you may actually be better off running UCS-4 through an adaptive compression algorithm to find a more customized representation.

The problem I have with advocating UTF-16 everywhere is that the longer a developer works with text without being confronted with variable-length encodings, the more they tend to think they can sort or split or display text simply by processing one codepoint at a time. Many libraries also have little-known problems handling surrogate pairs (such as transcoding each surrogate, producing "modified" aka invalid UTF-8) because characters outside the BMP are still so rare.

I don't see any advantages for base64 or quoted-printable over UTF-7, other than inertia. (Then again, if we ignore inertia, we can go ahead and use UTF-18 because it's one louder...)

Small teapot in Big China. UCS-2 slices and dices. (3, Interesting)

epine (68316) | more than 7 years ago | (#19029849)

That was a good post, but I don't understand your premise whatsoever. There seem to be two tactics at work here: arbitrary line-drawing, and the belief that if you can't make everyone happy, the best compromise is to make everyone unhappy. I read that post by Joel long ago, and I just read it again. I don't think he could have done a better job in the space devoted to it.

My one criticism of Joel was passing himself the "get out of jail free" card: "Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified." This is a fair disclaimer, but it makes it impossible to judge where Joel was simplifying deliberately and where he simplified because he didn't know any better. The correction would be for Joel to state "I'm going to simplify issues X, Y, and Zed". Then mistakes in the middle of the alphabet would be entirely his own. Just as there is no such thing as a string without a coding system, there is no such thing as a useful disclaimer that doesn't specify precisely what it disclaims. It amused me to see Joel invoke the ASCII standard of accountability.

Concerning the claim that Joel has made the same mistake [over again], this same claim comes up all the time concerning address arithmetic. How much existing code is portable to a 128 bit address size? We're sure to need this by 2050. Or perhaps not. People tend to neglect the observation that we're talking about a doubly exponential progression in codespace: (2^2^3)^2^N, with the values N=0,1,2,3,4 plausible in photolithographic solid state. On the current progression, for N=5 transistors would need to be subatomic. As far as the present transition from 32 bits to 64 bits of address space, it makes sense that operating systems and file systems are 64-bit native, while 99% of user space applications continue to run in less time and space compiled for 32 bits. Among the growing sliver of applications that do run better in 64-bits are a few applications of especially high importance.

I worked extensively with CJK languages in the early 1990s, and my opinion at the time was that UCS-4 primarily catered to the SETI crowd, and potentially, belligerent Mandarins in mainland China. I recall more argument at the time about Korean, which is a syllabic script masquerading as ideographic blocks.

http://en.wikipedia.org/wiki/Hangul [wikipedia.org]

I've always had a lot of trouble understanding the opposition to Han unification. Many distinctions in the origins of the English language were lost in the adoption of ASCII, such as the ae ligature and the old-English thorn (which causes many Hollywood sets to feature "Ye old saloon").

http://en.wikipedia.org/wiki/Han_unification [wikipedia.org]
http://en.wikipedia.org/wiki/Thorn_(letter) [wikipedia.org]

... Unicode now encodes far more [Han] characters than any other standard, and far more than were listed in any dictionary, with many more being processed for inclusion as fast as the scholars can agree on their identities.

Some characters used only in names are not included in Unicode. This is not a form of cultural imperialism, as is sometimes feared. These characters are generally not included in their national character sets either.


And all this fits quite nicely in UCS-2 as advocated by Joel.

A slight difference in rendering characters might be considered a serious problem if it changes the meaning or reflects the wrong cultural tradition. Besides a simple nuisance like Japanese text looking like Chinese, names might be displayed with a different glyph -- the same character in the sense of encoding but a different character in the view of the users. This rendering problem is often employed to criticize Westerners for not being aware of subtle distinctions, even though Unification is being carried out by Easterners. The display error occurs only when rendering plain text in a single font, and not when rendering language-specific text and names in language-appropriate fonts.


Here is the true nub of the matter: there is a community of Chinese users who view themselves as inconvenienced unless all the nuances of the ideographic Han language group can be fully distinguished when rendered in a single, language-neutral font. If a Han character string merely indicates, in addition to its encoding, which Han language it renders most correctly, the vast majority of potential display problems (fairly rare to begin with) are reduced by another order of magnitude.

In any case, I have a tough time accepting the premise that the British Commonwealth was well served by ASCII while east Asia got jilted on Han unification. The main difference is that ASCII has been with us for so long we've forgotten how to complain. Try programming in C on an IBM terminal lacking curly braces; one recalls how to complain very quickly.

Is it a small inconvenience that 7-bit ASCII lacks the ability to represent the dollar, pound, and Euro symbols with a single codepage? Hardly. No Han unification was as brutal as the unification of the dollar and pound sign. What about the missing degree symbol? Or the typographic en and em dashes?

From a pure computer science perspective, ASCII caused the profession no end of grief by failing to distinguish between the unary minus operator and the negation symbol used to designate a negative constant value.

From http://mathforum.org/kb/message.jspa?messageID=5658100&tstart=0 [mathforum.org]

[With the advent of formal computation] you couldn't get along any longer with the fuzzy thinking in this respect that had governed development of standard pencil-and-paper notation for the previous few centuries, an environment in which the '-' sign, for example, is used both to indicate negation (a unary operation) and subtraction (a binary operation). And also sometimes as a simple marker of negativeness, not an operation at all. Maybe some remember a movement in the '60's to introduce into textbooks (and student writing) an alternate symbol, a small "high minus", to indicate the latter as a matter of clarity. Good idea, but apparently too cumbersome a distinction to maintain in practice, and today we all still get along with a plain old minus symbol and just muddle along with the logical ambiguity involved.


How much better would the C language have been if we had a high minus and a high plus to denote signed constants, and employed the minus/short hyphen unambiguously as a unary/binary negation operator?

ASCII was never terribly convenient, but over the years we invested a lot in learning to accept its limitations. In my view UCS-2 is far less cumbersome to Asian nations than ASCII, even within the nations of British heritage that it served best.

I was thinking to myself that a string of length N over an alphabet of cardinality A can represent A^N distinct strings. For N=0 (strings of zero length) you can represent exactly one thing, the only problem being, as Clarke once observed, we've given this single thing 9 billion different names, and those won't even fit in UCS-4, and once the SETI people chime in, not even UCS-8 will suffice any longer.

Re:Small teapot in Big China. UCS-2 slices and dic (1)

argent (18001) | more than 7 years ago | (#19031939)

My objection is that he's saying "use UCS-2, it solves the problem" when UCS-2 doesn't solve the problem and it creates new ones.

What he should have said is "use a wide character library that supports Unicode-2, no matter what it uses internally, and make sure your code still works if they change the encoding behind your back".

I don't know why you're going on about "I have a tough time accepting the premise that the British Commonwealth was well served by ASCII". I didn't say that anyone was well served by ASCII.

please do not mod anything by Joel up again (0)

Anonymous Coward | more than 7 years ago | (#19029185)

It's been asked here several times and I beg the moderators: do not mod up any link pointing to Joel Spolsky again. He's pretending to know something, but he's basically wrong about everything.

It was stated here in a very rational argument that the guy knows nothing besides some Windows-centric VB and VC++ (and perhaps now C#).

He's full of himself because he worked on the MS Office codebase. Please, stop the insanity.

Once more we get a link to an article by Joel spreading misconceptions: this time it's about what Joel poorly understood about Unicode.

Such an article only highlights what several people here have been saying for quite some time now: Joel Spolsky is a mister nobody spreading misconceptions.

His article about Unicode is very flawed.

As argent (/. id 18001) pointed out: the explanation of Unicode by Joel is completely shortsighted. He's thinking that Unicode 3.0 (which only supports 65,536 codepoints) is "the one true Unicode". This is a gigantic mistake. Moreover, he's confusing character sets, code points, and encodings.

The random babbling he's making only works because, as has been stated, he's lucky to work with tools that are broken in the same ways: this is simply inexcusable.

Let me state it once and for all: anything that Joel writes is only half-truth. The guy completely lacks rigor and his writings are not interesting. And he always comes up with dramatic titles like "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", though this should really be reworded to something like "The misconceptions Joel Spolsky has about Unicode in his outdated Windows-centric world". And this is not just about Unicode; it's like that with everything this uninteresting person writes.

So moderators, please, mod down posts linking to Joel and mod -1 /. ids of people posting links to Joel's lame wrong rantings.

Re:please do not mod anything by Joel up again (0)

Anonymous Coward | more than 7 years ago | (#19030281)

He makes a point of saying the most common misconception about Unicode is that it's a 16-bit encoding with 65536 characters. He even distinguishes between ISO 8859-1 and "Windows-1252". The only mistakes I see are saying UTF-16 and UCS-2 are the same thing and glossing over surrogates. He knows more than most devs I've had to work with, and I wish they had read this even if I would have added more.
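(An added illustration, not part of the original comment: the surrogate point glossed over above is easy to demonstrate in a few lines of Python. It shows why UTF-16 and UCS-2 are not the same thing - code points outside the Basic Multilingual Plane need two 16-bit code units in UTF-16, while UCS-2 cannot represent them at all.)

```python
# A character outside the BMP (U+1D11E, MUSICAL SYMBOL G CLEF) cannot fit
# in one 16-bit code unit, so UTF-16 encodes it as a surrogate pair.
clef = "\U0001D11E"

utf16 = clef.encode("utf-16-be")
print(len(utf16))        # 4 -- two 16-bit code units, not one
print(utf16.hex())       # d834dd1e -- high surrogate D834, low surrogate DD1E

# UCS-2 simply has no representation for this code point at all.
```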

Re:Useful? This is damned awesome! (1)

Krischi (61667) | more than 7 years ago | (#19019121)

This ticket [wordpress.org] contains a patch that more or less allows you to use Wordpress blogs with UTF-8 encodings.

Pffft Easy... (4, Funny)

Frogbert (589961) | more than 7 years ago | (#19017163)

First you get the names of every table in the old database

Then you create these tables in the new one. Just so there are no problems with data types you should probably just make every field varchar(100) in the new one.

Then you fire up MS Access, the older the better. I try to stick to Access 95.

Then you create two ODBC links, one to your old one and one to the new one.

Then you use the linked table manager to link each table to ms access.

Then you open both the new table and the old table and select all, copy and paste the data into the new table.

It's so simple even a child could do it!

MOD PARENT UP (0)

Anonymous Coward | more than 7 years ago | (#19017191)

+5 Insightful.

Re:Pffft Easy... (1)

PenguSven (988769) | more than 7 years ago | (#19017233)

and this helps anyone using linux, unix, mac, or without an office license HOW?

Re:Pffft Easy... (4, Funny)

Negatyfus (602326) | more than 7 years ago | (#19017453)

Linux doesn't use databases. Flat text files and grep work just as well as this overly complicated "SQL" crap.

Re:Pffft Easy... (1)

Tablizer (95088) | more than 7 years ago | (#19021289)

Linux doesn't use databases. Flat text files and grep work just as well as this overly complicated "SQL" crap.

Don't laugh, I've encountered a techie with just that very opinion. I suppose with enough script hacking, you end up reinventing a sort of database engine such that rather than avoiding a database you are essentially rolling your own. The downside is that one has to learn your conventions from scratch when you move on, while RDBMS are semi-standardized. (Yes, SQL sometimes sucks and should be updated IMO, but it is still better than rolling your own except in rare situations.)
       

You can have both! (1)

Krishnoid (984597) | more than 7 years ago | (#19047245)

  1. Try out DBD::CSV [cpan.org]
  2. grep the flat file
  3. SQL with joy!

Easy... (-1, Redundant)

Anonymous Coward | more than 7 years ago | (#19017491)

You don't fall into the trap of using Access. Access is a nice single user database, but that is all.

Re:Easy... (0)

Anonymous Coward | more than 7 years ago | (#19106721)

Access is okay if you link it to a real SQL backend database and never link any "editable" fields to the table. I use all unbound controls and then populate everything in code, then at "save" time run a check of every control, and only if everything passes the check does an UPDATE or INSERT SQL statement get generated in my code then executed to the back-end. ADO and even DAO in VBA are your friends!
Locking is no problem. If needed, I implement it myself or use SQL direct to allow use of transactions.

Re:Pffft Easy... (1)

trianglman (1024223) | more than 7 years ago | (#19019153)

Don't worry; it doesn't help anyone using Windows either.

Re:Pffft Easy... (1)

madbawa (929673) | more than 7 years ago | (#19017349)

Access 95 is out???!!! Wow!!!

Re:Pffft Easy... (0)

Anonymous Coward | more than 7 years ago | (#19017419)

Some people may find this humorous but it is a very common occurrence.

Re:Pffft Easy... (2, Funny)

Jacco de Leeuw (4646) | more than 7 years ago | (#19018167)

So very simple,
that only a child can do it!

how about ... (0)

Anonymous Coward | more than 7 years ago | (#19017205)

... simply moving/copying binary data files?

Re:how about ... (3, Informative)

Majik Sheff (930627) | more than 7 years ago | (#19017265)

It usually works as long as you're staying on the same architecture. I successfully pulled this off when a client's DB server died horribly with no functional backups in sight. I salvaged the vast majority of the binary tables and dropped them into a fresh install of MySQL. After the migration of the binaries I renamed the tables to *_bak and told MySQL to dump the contents into freshly created data files. Then I set my client on the task of assessing damage to the data. I would only recommend this tactic if you're doing crash recovery on a borked system, as there are risks. P.S. Have you hugged your backups today?

Re:how about ... (1)

cheater512 (783349) | more than 7 years ago | (#19017489)

I've done it recently to a database which was around the 60gig mark with 2 million tables.

Worked beautifully going from 4.1 to 5.0.
Different archs too: Intel 32-bit to AMD64. I wouldn't do the reverse though.

Re:how about ... (0, Troll)

BigDumbAnimal (532071) | more than 7 years ago | (#19017589)

I've done it recently to a database which was around the 60gig mark with 2 million tables.
Who in the heck has 2 million tables in a database?

Re:how about ... (1)

cheater512 (783349) | more than 7 years ago | (#19017681)

http://www.freepowerboards.com/ [freepowerboards.com]

Re:how about ... (1)

CableModemSniper (556285) | more than 7 years ago | (#19019021)

But why? No, I changed my mind I don't want to know.

Re:how about ... (1)

tepples (727027) | more than 7 years ago | (#19026211)

But why?

Could it have something to do with 30 tables per installed instance of BB software, and 30,000 instances, perhaps?

Re:how about ... (1)

Majik Sheff (930627) | more than 7 years ago | (#19017745)

Wow, it's far more resilient than I thought. There were dire warnings about it when I first learned to handle MySQL, but that was version 3.something. I suppose endianness is/was the source of their concerns...

Re:how about ... (1)

cheater512 (783349) | more than 7 years ago | (#19017893)

Yeah it (usually) works fine. Far faster than mysqldump.

I'd always use mysqldump for smaller datasets though.

Re:how about ... (1)

Loconut1389 (455297) | more than 7 years ago | (#19020375)

In a desperation move, when my Sparc server went down, I managed to move the whole database filesystem over to an SGI box- so that's from TurboSparc to MIPS 64-bit. Amazingly, it worked.

I wouldn't count on that kind of luck ;)

Uhhhhhh... (-1, Flamebait)

Anonymous Coward | more than 7 years ago | (#19017263)

1) determine the encoding of your existing data (hint: if it has two or three junk characters in a row where you expected one non-ASCII character, it's probably UTF8).

2) set your table to use that encoding

3) import your data

If you're a web programmer reading this right now, do the world a favor and learn about the following two things: #1 character encodings #2 time zones. (Oh and #3 basic data management theory, but I know, that's stretching it). I see these things done wrong. All. The. Fucking. Time.
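(An added illustration, not part of the original comment: the "two or three junk characters" hint from step 1 is easy to reproduce in plain Python, since each non-ASCII character's multi-byte UTF-8 sequence turns into several latin1 characters when misread.)

```python
# One UTF-8-encoded character misread as latin1 becomes 2-3 junk characters.
original = "é"                           # U+00E9
utf8_bytes = original.encode("utf-8")    # b'\xc3\xa9' -- two bytes

mojibake = utf8_bytes.decode("latin-1")  # what a latin1 reader "sees"
print(mojibake)                          # Ã© -- the telltale junk pair

# As long as nothing re-encoded it again, the damage is fully reversible:
assert mojibake.encode("latin-1").decode("utf-8") == original
```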

How is this news? (1)

cabinetsoft (923481) | more than 7 years ago | (#19017273)

How is this news??? Later on today: "Data from POST/GET and special characters"?

Re:How is this news? (5, Funny)

kestasjk (933987) | more than 7 years ago | (#19017427)

If there's a chance of starting a PostgreSQL vs MySQL flamewar, it's news.

Re:How is this news? (2, Interesting)

arivanov (12034) | more than 7 years ago | (#19017627)

True.

As well as a chance of posting an arcane method of database transition involving MySQL to start an ACID war.

As well as on the original subject of the article - the best way to migrate an application is to load all of the data from one datasource and dump it into another datasource. If the application fails this trivial test, its database access libraries are broken. If the app sticks strictly to dynamic SQL and high-level DBI functions and does no manual escaping - it just works. The escaping portion of the SQL libs takes care of it and ensures it is mapped correctly both ways. If the app tries to escape by hand, sticks data into the SQL statement itself, etc. - it fails. Same for utf/latin transitions and the like.

Re:How is this news? (1)

ajs (35943) | more than 7 years ago | (#19023889)

I have a serious question to ask, but I'm sure it's going to sound like an invitation to a flame war. Please refrain.

Does anyone actually use PostgreSQL? I mean, I know it's the defacto database that we wave around when we want to bash MySQL, but that doesn't mean anyone uses it. I've yet to run into anyone who used PostgreSQL except as a rapid-prototype for an Oracle environment. Anyone have data points here? Does anyone know the rough sizes of the user bases? Are we really just waving PostgreSQL like a flag, or is it a widely used database in its own-right?

Re:How is this news? (1)

TheBracket (307388) | more than 7 years ago | (#19024631)

Not much of a data-point, but we use it for just about every project that needs an RDBMS. It's overkill for much of what we do, but it's very solid. I've also seen it embedded into a few commercially available systems, such as a wimax provisioning system, and a point-of-sale program. It's not as ubiquitous as MySQL, but it does get around.

Re:How is this news? (1)

larry bagina (561269) | more than 7 years ago | (#19029783)

I use postgres. Things like views, foreign keys, check constraints, stored procedures, and functions make life a lot easier. When I was using mysql (4.x), I had to fake that stuff in user code, which was twice the work and not nearly as clean (or reliable). mysql 5 has some of that stuff now, but they're still a decade behind.

Re:How is this news? (1)

ajs (35943) | more than 7 years ago | (#19040417)

mysql 5 has some of that stuff now, but they're still a decade behind.
I guess it was unavoidable that any question asked about PostgreSQL usage had to involve MySQL bashing... sigh.

Re:How is this news? (0)

Anonymous Coward | more than 7 years ago | (#19038885)

> Does anyone actually use PostgreSQL? I mean, I know it's the defacto database that we wave around when we want to bash MySQL, but that doesn't mean anyone uses it.

Sony Online Entertainment uses it (modified) for their MMORPG's. Afilias and PIR (the registrars for .info and .org) run on it.

Re:How is this news? (-1, Flamebait)

Anonymous Coward | more than 7 years ago | (#19017621)

How is this news??? Later on today: "Data from POST/GET and special characters"?
It's a bug report for "yet another way MySQL corrupts your data". People thought MySQL fixed that when they finally got around to supporting basic ACID compliance, but they are rather nostalgic about the old ways and decided to keep around a corruption or two.
 

Unicode integration woes (4, Interesting)

DJ Rubbie (621940) | more than 7 years ago | (#19017301)

As I understand it, the problem arises from the fact that mysqldump uses utf8 for character encoding while, more often than not, mysql tables default to latin1 character encoding. (If you were smart enough to manually set the character encoding to utf8, then you'll have no problems - but everyone running mysql 4.0 or earlier will be using latin1 since it didn't support any other encodings.) So let's say we have a database named example_db with tables that have varchar and text columns. If you have special characters that are really UTF-8 encoded characters stored in the db, it works just fine until you try to move the db to another server.

That bit me one time when one of my live servers crashed and I had to load the data from the backup onto a different server. I remember wondering to myself: whatever happened to the good old days, when a database was a dumb (smart, depending on your POV) engine that only worried about a string of bytes (chars)? Seriously, it should only become smarter and start talking in unicode when I want it to.

Issues with unicode are not limited to MySQL; Python has similar issues. However, they are almost always caused by poor programmers who mix the two, or who don't check against the proper type (basestring).
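(An added illustration, not part of the original comment: the transcoding mistake described above can be reproduced without MySQL at all. This is plain Python sketching the mechanism, not mysqldump itself - UTF-8 bytes sitting in a column labelled latin1 get transcoded to UTF-8 a second time on export.)

```python
# A right smart quote (U+201D), stored as raw UTF-8 bytes in a column
# that the server believes is latin1.
stored = "\u201d".encode("utf-8")              # b'\xe2\x80\x9d', 3 bytes

# On export, the dump tool decodes the column "as latin1" and re-encodes
# the result as UTF-8 -- a second, bogus layer of encoding.
double_encoded = stored.decode("latin-1").encode("utf-8")
print(len(stored), len(double_encoded))        # 3 6 -- the data has grown

# Undoing that one bogus layer recovers the original character:
assert double_encoded.decode("utf-8").encode("latin-1").decode("utf-8") == "\u201d"
```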

Re:Unicode integration woes (1, Insightful)

Anonymous Coward | more than 7 years ago | (#19017449)

However they are almost always caused by poor programmers who mixes usage between the two
Yes, and in this case the poor programmers wrote the mysqldump program.

Re:Unicode integration woes (1)

Blakey Rat (99501) | more than 7 years ago | (#19019339)

The real MySQL problem with Unicode is that it hardly supports it. Every other database engine on Earth has proper Unicode support (even SQLite, which isn't even really a database), why is MySQL so far behind everything? The more I use MySQL, the more I hate it. Hate, hate, hate.

Of course, since Dreamhost refuses to install a half-decent database on their servers, I'm stuck using it. Does anybody know how to install Firebird on a Dreamhost account and make it work? Is it even possible?

Re:Unicode integration woes (1)

EssenceLumin (755374) | more than 7 years ago | (#19022583)

At least with a standard dreamhost account it's against the tos. No servers of your own allowed. I wrote them asking the identical question except with postgres and that is the answer I got.

Re:Unicode integration woes (1)

jadavis (473492) | more than 7 years ago | (#19022105)

good old days when a database was a dumb (smart, depending your POV) engine that only worries about a string of bytes (chars)

Databases use data types for the same reason data types are used in programming languages.

Relational databases offer a lot more as well that I won't go into. But if you don't care about any of those things and just want to store bytes, there are plenty of ways to do that, and there have been for a long time (files are the most obvious example).

Re:Unicode integration woes (1)

DJ Rubbie (621940) | more than 7 years ago | (#19028685)

I am well aware of the extra functionality presented by SQL and I make liberal use of it, which is why I use SQL in the first place. However, I don't need the database to pretend it can be really smart about certain things unless I tell it to, and to echo Blakey Rat's (99501) sentiment, the more I use MySQL, the more I hate it.

As for my dumb/smart comment, I only want my database to be smart at what it was supposed to do, and only do what I want it to do, not guessing what I might want to do also and mess up the output and resulting database dump.

Re:Unicode integration woes (1)

jadavis (473492) | more than 7 years ago | (#19030955)

and only do what I want it to do

Fair enough.

not guessing what I might want to do also and mess up the output and resulting database dump.

That's the result of a very bad implementation of "smart" ;)

I think correct use of character encodings and locales are the way to go with software, but only if done correctly. And you certainly can't count on MySQL AB to do something correctly.

mysqldump is too think (2, Interesting)

hpavc (129350) | more than 7 years ago | (#19017411)

This guy's mysqldump statement could use some args; too much is packed into his my.cnf defaults for this to be truly useful as a how-to. He could easily cause more problems than he solves.

Re:mysqldump is too think (1)

El_Muerte_TDS (592157) | more than 7 years ago | (#19017799)

Agreed, the proper mysqldump command should be at least:
mysqldump --opt --default-character-set=latin1 database

Something like this might be more interesting
mysqldump --opt --default-character-set=latin1 database1 | sed "s/SET NAMES latin1/SET NAMES utf8/" | mysql database2

And then simply switch the application to use database2 instead of database1 (or rename the databases)
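(An added illustration, not part of the original comment: what that sed does to the dump stream, sketched in Python. The dump's SET NAMES header tells the restoring server how to decode the bytes that follow, so rewriting it changes how the data is interpreted on import.)

```python
# A miniature stand-in for a mysqldump stream: only the header line matters
# here (the INSERT carries raw UTF-8 bytes for "café").
dump = "/*!40101 SET NAMES latin1 */;\nINSERT INTO t VALUES ('caf\xc3\xa9');\n"

# The sed from the parent, done in Python: swap the declared charset so the
# restoring server decodes the byte stream as utf8 instead of latin1.
fixed = dump.replace("SET NAMES latin1", "SET NAMES utf8")
print("SET NAMES utf8" in fixed)   # True
```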

What's with the sudden influx of gnubies? (4, Insightful)

rylin (688457) | more than 7 years ago | (#19017575)

Not a single day seems to go by without a gnubie or two posting things that are really basic knowledge.
If you do insert unicode data into a latin1 table, you will get unexpected results.

What you do is make sure that your:
a) database(s) are set to utf8 by default
b) table(s) are set to utf8 by default
c) column(s) are set to utf8 by default
d) connection defaults to utf8
(provided, of course, that it's utf8 you're after)

That way, it'll "Just Work"(tm)

We've gone through upgrades from 3.23 -> 4.0 -> 4.1 -> 5.0 and never had a problem; and yes, our tables were all latin1 from the beginning.

just work!?!? (0)

kpharmer (452893) | more than 7 years ago | (#19019519)

> What you do is make sure that your:
> a) database(s) are set to utf8 by default
> b) table(s) are set to utf8 by default
> c) column(s) are set to utf8 by default
> d) connection defaults to utf8

Please tell me that if I specify a database is UTF8 that I don't also have to tell it that each column is utf8 as well!

And why should I have to tell the connections? Doesn't that get resolved automatically when the client connects to the server?

And why should I have to tell a utility what the target database codepage is? Can't they talk? Why the opportunity to manually do something trivial and screw it up?

And...'by default'? Doesn't mysql know how to automatically perform codepage conversions?

And, please tell me that MySQL wouldn't let somebody insert utf8, say asian characters, into a latin1 database!

> That way, it'll "Just Work"(tm)
ugh, i don't think that expression means what you think it means

does mysql have a list of prioritized missing basic features and functionality so that you can get a sense of when this kind of stupidity will be resolved?

Re:just work!?!? (1)

rylin (688457) | more than 7 years ago | (#19040965)

I guess I should've clarified or looked back at the thread sooner... :P
You can set the database (or default connection, etc.) to use utf-8 as the standard.
Then again, you might be using a third-party admin utility to create a table, and the util might ignore the db default and create a latin1 table... or a latin1 column.

In other words, things can go wrong (obviously) - but if you specify that the DB is utf8, then any unspecified changes to the tablespace in it will default to utf8.

Hopefully that clears it up ;-)

Re:What's with the sudden influx of gnubies? (2, Insightful)

cortana (588495) | more than 7 years ago | (#19020225)

Not a single day seems to go by without a gnubie or two posting things that are really basic knowledge.
If you do insert unicode data into a latin1 table, you will get unexpected results.
Ah, I love MySQL. They should fix it so that if you insert unicode data where latin1 data is expected, you get an error instead of silent data corruption.

Re:What's with the sudden influx of gnubies? (2, Interesting)

AaronLawrence (600990) | more than 7 years ago | (#19021981)

The whole point of UTF-8 is that it can silently be inserted in places that were designed to handle ASCII. So no, there is no way for something which is handling latin1 to know that what you gave it is actually UTF8 and therefore not legal.

Re:What's with the sudden influx of gnubies? (0)

Anonymous Coward | more than 7 years ago | (#19028879)

The UTF-8 encodings of many Unicode characters between U+0080 and U+07FF (a range covering virtually all non-ASCII European text) contain octets between 128 and 159, which ISO 8859-1 reserves for control characters. If an app accepts that encoding without giving an error, it doesn't support ISO 8859-1 (in which those aren't printable characters), it supports "whatever meaningless garbage you want followed by a zero", and it shouldn't ever attempt transcoding as MySQL apparently does.

Re:What's with the sudden influx of gnubies? (1)

FooBarWidget (556006) | more than 7 years ago | (#19023509)

How is MySQL supposed to know that the data it gets is unicode instead of latin-1? Latin-1 characters are 1 byte, what MySQL receives may be an actual unicode string or may just be a sequence of latin-1 characters. There's no way to reliably find out.
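(An added illustration, not part of the original comment: strictly speaking that's right - every byte string is valid latin1 - but in practice a decode attempt is a decent heuristic, since non-ASCII latin1 text rarely happens to be well-formed UTF-8. A Python sketch; `looks_like_utf8` is a made-up name, not anything MySQL actually does.)

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic guess: non-ASCII latin1 text is rarely well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("naïve".encode("utf-8")))    # True
print(looks_like_utf8("naïve".encode("latin-1")))  # False -- lone 0xEF byte

# ...but it is only a guess: latin1 "Ã©" is also well-formed UTF-8 ("é").
print(looks_like_utf8("Ã©".encode("latin-1")))     # True
```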

Re:What's with the sudden influx of gnubies? (1)

raynet (51803) | more than 7 years ago | (#19024091)

Perhaps you could set the character encoding being used while connecting to the database. At least PostgreSQL supports this, though it will actually do a conversion from Unicode to Latin1 in this case and barf an error if it cannot map some Unicode chars to Latin equivalents.

Re:What's with the sudden influx of gnubies? (1)

FooBarWidget (556006) | more than 7 years ago | (#19024557)

MySQL supports that too. If people don't use it - well - then it's their own fault.

Re:What's with the sudden influx of gnubies? (1)

newt0311 (973957) | more than 7 years ago | (#19020257)

How about: CREATE DATABASE blah ENCODING 'UNICODE';
Or does that not work in MySQL?

Awesome! (1)

mattgreen (701203) | more than 7 years ago | (#19020319)

So I have to remember to do EACH and EVERY one of those things so it "Just Works?"

Unbelievable.

Tagged slashdigg (1)

Cutie Pi (588366) | more than 7 years ago | (#19018407)

Anyone noticing a strong similarity between this article and about 95% of the stuff posted on digg every day? What's next, "The Top 10 Photoshop Filters of All Time?" :) Tagged slashdigg.

Re:Tagged slashdigg (1, Funny)

Anonymous Coward | more than 7 years ago | (#19019139)

No, it's "The Top 10 Photoshop Filters for Ron Paul!"

Re:Tagged slashdigg (1)

PinkPanther (42194) | more than 7 years ago | (#19020125)

"How to sort MS-Word tables"...

"Excel & the power of CSV files"

"Bottleneck your CPU: How to gain MASSIVE performance improvements in any computer: defrag.exe"

"Make your data manageable: separating content from style the CSS way!!"

"FREE SOFTWARE! The best 10...er...20...er...100...free (and open source) software products"

PinkPanther bangs head on wall

NEWS: Slashdot bought by webnoobs.com (1)

Riquez (917372) | more than 7 years ago | (#19018707)

From TFA:

If you were smart enough to manually set the character encoding to utf8, then you'll have no problems
That's all that's needed here; the whole article & this page could be condensed into a one-line question & a one-line answer - doh!
Oh, & you can change the encoding at any time, so even if you initially forget, you can just change it later, then dump.

I did it like this (0)

Anonymous Coward | more than 7 years ago | (#19018983)

Upgrade and copy the old datadir to new server
mysqldump --default-character-set=binary --compat=mysql40 olddb | mysql --default-character-set=cp1250 newdb

New default charset doesn't matter; the data is converted to whatever charset you set for the newdb when you created it. The important bit is only the cp1250 - that's the real charset of the data you had in the olddb. If it's different from yours, you should reject it and substitute your own ...

Why use mysqldump? Why not just scp+mysqlrepair ? (1)

v3xt0r (799856) | more than 7 years ago | (#19048539)

when doing (non-'real-time'-critical) migrations... I find it much less troublesome to just scp the files and then run mysqlcheck to repair the tables on the new server (if required).

http://www.seanodonnell.com/code/?id=66 [seanodonnell.com]

That process avoids the syntax-conversion issues and the storage, bandwidth, and processing-time overhead of using mysqldump.