Mailing List Archive


Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]

Jim Breen writes:

 > I have this issue with the wwwjdic spaghetti code (mea culpa).  Parts of it
 > date from before Unicode and UTF-8 existed, and all the internal data
 > structures are built around the assumption that kana and kanji take
 > two bytes.

That is REALLY important.  Don't even think of giving up fixed width!

 > I'd love to move it over to using UTF-8 internally,

Only if you hate your successor maintainers.  Take it from somebody
who's worked with the internals of both Emacsen and Python,
fixed-width beats variable-width the way Clinton beat Trump in
California.  Like Trump nationally, UTF-8 squeaks out a win for
interprocess interchange, but for the internal implementation of
characters (if you have them[1]) and strings, use fixed-width.  BTW,
UTF-16 is close enough to fixed-width in general[2], but might not be for
wwwjdic due to emoji and "rare" characters assigned to astral planes
being common dictionary lookups.  OTOH, if it's just a matter of
matching short strings pretty much exactly (and you won't have the
hira/kata/halfwit issue with anything in the astral planes, I think),
you can probably just treat surrogates as funny characters that always
occur in pairs.
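
To make the "surrogates always occur in pairs" point concrete, here is a
minimal Python sketch.  The helper names utf16_code_units and is_surrogate
are made up for illustration and have nothing to do with the wwwjdic code.
It splits a string into 16-bit code units: a BMP kanji comes out as one
unit, an astral kanji as a high/low surrogate pair.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


bmp_kanji = "\u6f22"         # U+6F22, inside the BMP
astral_kanji = "\U00020BB7"  # U+20BB7, CJK Extension B (plane 2)

for label, ch in (("BMP", bmp_kanji), ("astral", astral_kanji)):
    units = utf16_code_units(ch)
    print(label, [hex(u) for u in units], [is_surrogate(u) for u in units])
```

For exact matching of short strings, comparing the code-unit lists gives
the same answer as comparing the characters, which is why treating
surrogates as "funny characters" is probably good enough there.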

Also, UTF-8 represents most of the BMP in 3 octets per character,
including all the JIS X 0208 kana and kanji.  I don't know if this gets
awkward for
wwwjdic, but it was in Emacs (Mule code has the same kinds of issues
as UTF-8 because its character structure was designed to the same
specifications of ASCII compatibility and so on).
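
A quick way to see the 2-octet versus 3-octet difference is the sketch
below.  It assumes the legacy two-byte encoding is EUC-JP, which is only a
guess (the post just says "two bytes"), and encoded_widths is a
hypothetical helper name.

```python
def encoded_widths(text, encoding):
    """Return the number of octets each character of text takes in encoding."""
    return [len(ch.encode(encoding)) for ch in text]


sample = "A\u3042\u30a2\u6f22"  # ASCII letter, hiragana, katakana, kanji

for encoding in ("euc_jp", "utf-8"):
    # ASCII stays at one octet in both; kana and kanji widen under UTF-8.
    print(encoding, encoded_widths(sample, encoding))
```

Any structure that hard-codes "two octets per kana or kanji" would have to
be resized, and mixed ASCII/kanji text no longer has just two possible
widths to step over.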

[1]  In Python only strings are primitive, and characters are
represented by strings of length one.

[2]  Python 2's 'unicode' type is encoded as UTF-16 internally (in
narrow builds, anyway), although it's treated as an array of 16-bit
code units rather than variable width characters.  This has worked
"good enough" for almost 2 decades.  That said, Python 3 (as of 3.3)
moved to a content-dependent fixed-width type.  If your string is all
ISO-8859-1, it's encoded as an array of octets.  If it contains even
one astral character, it's UTF-32.  Everything else is UCS-2 (aka the
subset of UTF-16 excluding
surrogates).  Be warned guys: even one emoji in a Python 3 str will
double, maybe quadruple, the memory requirements.  Hm.  I think I see
a DoS attack ... just append 👺!
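
The memory claim in [2] can be checked roughly with the sketch below.  It
relies on sys.getsizeof and on CPython's string representation, so treat
it as an implementation-specific measurement, not a language guarantee.
bytes_per_code_point is a hypothetical helper name.  It differences two
string sizes so that the fixed object header cancels out.

```python
import sys


def bytes_per_code_point(tail, n=1000):
    """Estimate storage per code point for ASCII text followed by tail.

    Two strings that differ only by n extra ASCII characters are measured;
    the size difference divided by n is the per-code-point cost.
    """
    small = sys.getsizeof("a" * n + tail)
    large = sys.getsizeof("a" * (2 * n) + tail)
    return (large - small) / n


for label, tail in (("ASCII only", ""),
                    ("plus one kana", "\u3042"),
                    ("plus one emoji", "\U0001F47A")):
    print(label, bytes_per_code_point(tail))
```

The point of the experiment is that the single trailing character, not the
bulk of the text, decides the width used for every code point in the
string.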
