Mailing List Archive


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[tlug] Database collations



Any database experts here? (specifically mysql)

I have a number of closely related questions about setting character sets and collations. I have a DB application that has been running for >20 years, originally before there was any support for UTF8. So I pretended to be using latin1-Swedish ("Swedish" for short); sending byte sequences encoded in UTF8 which I let the server believe was Swedish. At some point I tried to untangle this and change the collation on the DB tables to UTF8, but unfortunately the server helpfully converted my "Swedish" to what it thought was UTF8, leaving double-encoded UTF (i.e. each byte of my UTF8 with the top bit set was interpreted as Latin1, then encoded as three bytes of UTF8). But this kept on working; somehow I was getting back the bytes I had sent, until something got updated and it all broke. I have more or less fixed things, but I want to sort it all out properly. From previous experience I am nervous about "just changing settings", in case Mysql tries to "help" by immediately changing stuff in the DB.

1. Server: this shows "Server connection collation", currently 'Swedish'. I'm not entirely clear what this means, apart from being the default encoding for a new database. But does it also mean the encoding of literal strings in any command I send to MySQL? Can I assume that if I change it this will not change anything in the DB?

2. Similarly the DB default is set to Swedish; I guess I can safely just change it. And likewise the Table defaults?

3. There are two classes of text in the data: general strings (basically En and Ja) which should be UTF8mb4, and "codes", which must be alphanumeric (plus possibly _ and so on). So I feel I should make these ASCII, to enforce no funny characters. Is there any reason not to do this, having two different encodings in the same table?

4. Oh dear, the big one. What is in practice the best collation for Japanese and English? I have read some of the MySql stuff, and they make it fairly clear they are not entirely sure what they are doing. I do not expect to sort any Japanese text, and I don't want anything "clever". It would help to have case-ignore for Roman letters (not that Japanese has case, but I expect they reinterpret "case" to mean something else); but not to have the びょういん・びよういん problem for example. OTOH, I suppose it would help if bogus characters like "zenkaku Roman" were equated with the real Roman characters.

ps: can someone tell me how to change the subscription settings?


Home | Main Index | Thread Index