Mailing List Archive



[tlug] Re: Weird sorting of japanese filenames



Tobias Diedrich writes:

 > Indeed "LC_COLLATE=ja_JP.UTF-8 ls" does help. I wonder why I didn't
 > think of trying that myself...

Because you were never bitten by the "LANG=en_US ls [A-Z]*" bug,
perhaps?  (The bug is that in that locale, the collation order is
"AaBbCc...Zz", so the range [A-Z] includes all lowercase ASCII except
"z".)
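
To make that concrete, here is a minimal sketch that simulates the range
under a made-up table.  It assumes the collation order really is exactly
"AaBbCc...Zz" as described above; COLLATION_ORDER and chars_in_range are
hypothetical names for illustration, not anything the shell or libc
provides.

```python
import string

# Assumed collation order "AaBbCc...Zz", as described in the post.
COLLATION_ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)


def chars_in_range(start, end, order=COLLATION_ORDER):
    """Return the characters collating between start and end, inclusive."""
    lo, hi = order.index(start), order.index(end)
    return order[lo:hi + 1]


matched = chars_in_range("A", "Z")
lowercase_missed = [c for c in string.ascii_lowercase if c not in matched]
print("".join(lowercase_missed))  # prints "z"
```

Under that assumed table, [A-Z] covers 51 of the 52 ASCII letters, and
only "z" falls outside the range.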

 > I just expected that with utf-8 sorting decisions would only marginally
 > differ between locales...

No, IIRC the Japanese locale sorts in JIS order, which is more or less
gojuon-jun (the "50 sounds" kana order) through JIS X 0208, while Chinese
locales sort according to bushu (radical) and stroke count.  Radically
different!

You see, LC_COLLATE is one of the reasons why locales require so much
space and effort in Your Linux System.  Sorting is table-driven, not
code-point-driven.  Probably what happens is that all of the
characters not in the current LC_COLLATE table get the same value,
lower precedence than any character in the table.  So what you get is
a sort which gives higher precedence to a string whose first Latin
character is early, and when the first Latin position is the same,
sorts according to the locale's table.  See strxfrm(3) (and on Linux
systems, strfry(3), which may have something to do with Dave M G's
Firefox mojibake woes ;) for more information.  You'll have to read
between the lines: the man page *specifies* strxfrm rather than telling
you how it's used.
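
As for how strxfrm is typically used: you transform each string once and
then compare the transformed keys with an ordinary comparison.  A rough
sketch using Python's locale module, which wraps the C function, follows.
sort_names and the file names are hypothetical, and whether a locale such
as ja_JP.UTF-8 is installed depends on your system, so the sketch returns
None when it is not.

```python
import locale


def sort_names(names, locale_name):
    """Sort names by the collation rules of locale_name.

    Returns None if that locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm maps each string to a key whose plain comparison
        # matches the locale's collation order.
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


names = ["b.txt", "a.txt", "B.txt"]  # illustrative file names
c_sorted = sort_names(names, "C")
print(c_sorted)  # ['B.txt', 'a.txt', 'b.txt'] (code-point order in "C")
ja_sorted = sort_names(names, "ja_JP.UTF-8")  # None if not installed
```

Swapping the locale name changes the order without touching the sort
itself, which is the table-driven behaviour described above.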

