TLUG Mailing List

[tlug] Database collations

Date: Sat, 21 Sep 2024 03:14:41 +0900

From: Brian Chandler <brian@example.com>

Subject: [tlug] Database collations

User-agent: Mozilla Thunderbird
Any database experts here? (specifically MySQL)
I have a number of closely related questions about setting character sets and collations. I have a DB application that has been running for >20 years, originally before there was any support for UTF8. So I pretended to be using latin1-Swedish ("Swedish" for short), sending byte sequences encoded in UTF8 which I let the server believe were Swedish. At some point I tried to untangle this and change the collation on the DB tables to UTF8, but unfortunately the server helpfully converted my "Swedish" to what it thought was UTF8, leaving double-encoded UTF8 (i.e. each byte of my UTF8 with the top bit set was interpreted as Latin1, then encoded as two or three bytes of UTF8). But this kept on working; somehow I was getting back the bytes I had sent, until something got updated and it all broke. I have more or less fixed things, but I want to sort it all out properly. From previous experience I am nervous about "just changing settings", in case MySQL tries to "help" by immediately changing stuff in the DB.
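To make the double-encoding concrete, here is a minimal sketch in Python rather than in MySQL itself. It is only an illustration under stated assumptions: the function names are hypothetical, and Python's "latin-1" codec (ISO 8859-1) stands in for MySQL's latin1. MySQL's latin1 is closer to cp1252, so on a real server some bytes in the 0x80-0x9F range can come out as three UTF8 bytes rather than two.

```python
# Illustration only: Python's "latin-1" codec is an assumed stand-in for
# MySQL's latin1, and both function names are hypothetical.

def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread those bytes as Latin-1, then encode again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")   # every byte becomes one character
    return misread.encode("utf-8")           # high bytes now take two bytes each


def repair(stored: bytes) -> str:
    """Undo one round of double-encoding by reversing the steps above."""
    return stored.decode("utf-8").encode("latin-1").decode("utf-8")


original = "日本語"
stored = double_encode(original)
print(len(original.encode("utf-8")), len(stored))  # 9 18
print(repair(stored) == original)                  # True
```

Plain ASCII passes through unchanged, which is one reason this kind of damage can go unnoticed in mostly English data.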
1. Server: this shows "Server connection collation", currently 'Swedish'. I'm not entirely clear what this means, apart from being the default encoding for a new database. But does it also mean the encoding of literal strings in any command I send to MySQL? Can I assume that if I change it this will not change anything in the DB?
2. Similarly the DB default is set to Swedish; I guess I can safely just change it. And likewise the Table defaults?
3. There are two classes of text in the data: general strings (basically En and Ja) which should be utf8mb4, and "codes", which must be alphanumeric (plus possibly _ and so on). So I feel I should make these ASCII, to enforce no funny characters. Is there any reason not to do this, having two different encodings in the same table?
4. Oh dear, the big one. What is in practice the best collation for Japanese and English? I have read some of the MySQL stuff, and they make it fairly clear they are not entirely sure what they are doing. I do not expect to sort any Japanese text, and I don't want anything "clever". It would help to have case-ignore for Roman letters (not that Japanese has case, but I expect they reinterpret "case" to mean something else); but not to have the びょういん・びよういん problem for example. OTOH, I suppose it would help if bogus characters like "zenkaku Roman" were equated with the real Roman characters.
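The behaviour wanted in questions 3 and 4 can be pinned down outside the database. The Python sketch below is not a MySQL collation and makes no claim about which collation matches it; it only spells out the desired comparisons using Unicode NFKC normalization plus case folding. Both function names are hypothetical.

```python
import unicodedata

# Illustration only: these helpers describe the desired behaviour in Python.
# They are hypothetical and do not correspond to any MySQL collation.

def is_valid_code(code: str) -> bool:
    """True if code is non-empty and uses only ASCII letters, digits and _."""
    return bool(code) and all(
        ch.isascii() and (ch.isalnum() or ch == "_") for ch in code
    )


def comparison_key(text: str) -> str:
    """Fold zenkaku Roman to ASCII (NFKC) and ignore case of Roman letters."""
    return unicodedata.normalize("NFKC", text).casefold()


print(is_valid_code("AB_12"))                                           # True
print(is_valid_code("ＡＢ"))                                            # False
print(comparison_key("ＡＢＣ") == comparison_key("abc"))                # True
print(comparison_key("びょういん") == comparison_key("びよういん"))     # False
```

A collation that behaves like comparison_key would equate zenkaku Roman with ordinary Roman letters and ignore their case, while still keeping びょういん and びよういん distinct.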
PS: Can someone tell me how to change the subscription settings?
Follow-Ups:

Re: [tlug] Database collations
From: Josh Glover
