
Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]
Jim Breen writes:
> I have this issue with the wwwjdic spaghetti code (mea culpa). Parts of it
> date from before Unicode and UTF-8 existed, and all the internal data
> structures are built around the assumption that kana and kanji take
> two bytes.
That is REALLY important. Don't even think of giving up fixed width!
> I'd love to move it over to using UTF-8 internally,
Only if you hate your successor maintainers. Take it from somebody
who's worked with the internals of both Emacsen and Python,
fixed-width beats variable-width the way Clinton beat Trump in
California. Like Trump nationally, UTF-8 squeaks out a win for
interprocess interchange, but for the internal implementation of
characters (if you have them[1]) and strings, use fixed-width. BTW,
UTF-16 is close enough to fixed-width in general[2], but might not be for
wwwjdic due to emoji and "rare" characters assigned to astral planes
being common dictionary lookups. OTOH, if it's just a matter of
matching short strings pretty much exactly (and you won't have the
hira/kata/halfwit issue with anything in the astral planes, I think),
you can probably just treat surrogates as funny characters that always
occur in pairs.
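
To make the surrogate point concrete, here is a minimal Python 3 sketch. The helper names utf16_code_units and is_surrogate are hypothetical, made up for illustration, and have nothing to do with wwwjdic's actual code. It shows a BMP kanji fitting in one 16-bit unit, while an astral kanji and an emoji each become a pair of surrogates:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def is_surrogate(unit):
    """True if a 16-bit unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# U+6F22 is in the BMP; U+20BB7 and U+1F47A are in the astral planes.
for ch in ["\u6f22", "\U00020BB7", "\U0001F47A"]:
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X}", [f"{u:04X}" for u in units])
```

If exact matching of short strings is all that is needed, comparing the lists of code units works without ever decoding the pairs.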
Also, UTF-8 represents most of the BMP in 3 octets per character,
including all JIS characters. I don't know if this gets awkward for
wwwjdic, but it was in Emacs (Mule code has the same kinds of issues
as UTF-8 because its character structure was designed to the same
specifications of ASCII compatibility and so on).
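
A quick way to see the size difference is to compare encoded lengths. This sketch assumes the legacy two-byte encoding is EUC-JP, which is a guess on my part about what wwwjdic uses; encoded_widths is a hypothetical helper name:

```python
def encoded_widths(ch):
    """Return (UTF-8 length, EUC-JP length) in octets for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("euc_jp"))


# Hiragana, katakana, kanji, and an ASCII letter for comparison.
for ch in ["\u3042", "\u30a2", "\u6f22", "a"]:
    utf8_len, eucjp_len = encoded_widths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} octets, EUC-JP {eucjp_len} octets")
```

Any data structure sized for two octets per kana or kanji would need half again as much room for the same text in UTF-8.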
Footnotes:
[1] In Python only strings are primitive, and characters are
represented by strings of length one.
[2] Python 2's 'unicode' type is encoded as UTF-16 internally (in
"narrow" builds, at least), although it's treated as an array of
16-bit code units rather than variable-width characters. This has
worked "good enough" for almost 2 decades. That said, Python 3 (as of
3.3) moved to a content-dependent fixed-width type. If your string is
all ISO-8859-1, it's encoded as an array of octets. If it contains even
one astral character, it's UTF-32. Everything else is UCS-2 (aka the
subset of UTF-16 excluding
surrogates). Be warned guys: even one emoji in a Python 3 str will
double, maybe quadruple, the memory requirements. Hm. I think I see
a DoS attack ... just append 👺!
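
The memory effect is easy to measure. The following is a rough sketch that assumes CPython 3.3 or later, since sys.getsizeof reports implementation-specific sizes; bytes_per_char is a hypothetical helper name:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a str made only of ch.

    Comparing two lengths of the same string cancels the fixed
    object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


print(bytes_per_char("a"))           # ASCII
print(bytes_per_char("\u3042"))      # hiragana, in the BMP
print(bytes_per_char("\U0001F47A"))  # emoji, astral plane

# One astral character widens the whole string.
plain = "a" * 1000
print(sys.getsizeof(plain), sys.getsizeof(plain + "\U0001F47A"))
```

On CPython the three widths come out as 1, 2, and 4 octets, so a single emoji appended to a long ASCII string roughly quadruples its size.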