TLUG Mailing List

Mailing List Archive

Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]

Date: Thu, 28 Jun 2018 13:52:02 +0900

From: "Stephen J. Turnbull" <turnbull.stephen.fw@example.com>

Subject: Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]

References: <23345.41167.951877.900876@turnbull.sk.tsukuba.ac.jp> <23345.44414.330392.350450@turnbull.sk.tsukuba.ac.jp> <CAKXLc7c-LzgY5AtE8XrZzKUrr206nXtmxdtKQC0q8PkcMjiF7A@mail.gmail.com> <CABHGxq6CkEeQVHy7rjjbTP72mOm_QQ5xtthPuPR-QASYQoS_ag@mail.gmail.com> <07A05935-BBD8-4C13-AEF6-667D653EBE45@brightblack.net> <23346.65438.401753.15741@turnbull.sk.tsukuba.ac.jp> <CABHGxq5mnJgiSxGKEXZ4KYBAuVB0YBUwMqi+duoTk1iSeXj9PQ@mail.gmail.com>
Jim Breen writes:

 > I have this issue with the wwwjdic spaghetti code (mea culpa).  Parts of it
 > date from before Unicode and UTF-8 existed, and all the internal data
 > structures are built around the assumption that kana and kanji take
 > two bytes.

That is REALLY important.  Don't even think of giving up fixed width!

 > I'd love to move it over to using UTF-8 internally,

Only if you hate your successor maintainers.  Take it from somebody
who's worked with the internals of both Emacsen and Python,
fixed-width beats variable-width the way Clinton beat Trump in
California.  Like Trump nationally, UTF-8 squeaks out a win for
interprocess interchange, but for the internal implementation of
characters (if you have them[1]) and strings, use fixed-width.  BTW,
UTF-16 is close enough to fixed-width in general[2], but might not be for
wwwjdic due to emoji and "rare" characters assigned to astral planes
being common dictionary lookups.  OTOH, if it's just a matter of
matching short strings pretty much exactly (and you won't have the
hira/kata/halfwit issue with anything in the astral planes, I think),
you can probably just treat surrogates as funny characters that always
occur in pairs.
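
To make the fixed-width versus variable-width point concrete, here is a minimal Python sketch. The helper names (utf16_code_units, utf8_offset_of_char, widths) are made up for illustration and have nothing to do with the actual wwwjdic code. It shows how many code units each encoding needs per character, what the surrogate pair for an astral character looks like, and why finding the n-th character in UTF-8 means scanning from the start.

```python
# Illustrative sketch only; none of these names come from wwwjdic.

def utf16_code_units(s):
    """Return the 16-bit code units of s as a list of ints."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf8_offset_of_char(data, n):
    """Byte offset of the n-th character (0-based) in UTF-8 bytes.

    The offset cannot be computed directly: we scan from the start and
    count lead bytes (anything that is not a 10xxxxxx continuation byte).
    """
    seen = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == n:
                return offset
    raise IndexError(n)


def widths(ch):
    """(UTF-8 octets, UTF-16 code units, UTF-32 code units) for one character."""
    return (
        len(ch.encode("utf-8")),
        len(utf16_code_units(ch)),
        len(ch.encode("utf-32-le")) // 4,
    )


if __name__ == "__main__":
    for ch in ["a", "あ", "漢", "👺"]:
        print(ch, widths(ch))
    # The astral character comes out as one high and one low surrogate.
    print([hex(u) for u in utf16_code_units("👺")])
```

With a fixed-width array the n-th character is just element n. In UTF-16 that stays true as long as a surrogate pair is allowed to count as two "funny characters".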

Also, UTF-8 represents most of the BMP in 3 octets per character,
including all JIS X 0208 kana and kanji.  I don't know if this gets awkward for
wwwjdic, but it was in Emacs (Mule code has the same kinds of issues
as UTF-8 because its character structure was designed to the same
specifications of ASCII compatibility and so on).
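
The size difference is easy to check with Python's standard codecs. This is only a sketch: octets_per_char is a hypothetical helper, and the sample string is arbitrary. It encodes each character on its own, which is safe for the three stateless, BOM-free encodings used here but would mislead for something like ISO-2022-JP.

```python
# Sketch: octets per character in legacy JIS encodings versus UTF-8.

def octets_per_char(text, encoding):
    """Encode each character separately and return the list of byte lengths."""
    return [len(ch.encode(encoding)) for ch in text]


sample = "日本語のかな"  # arbitrary kanji and kana, all in JIS X 0208

if __name__ == "__main__":
    for encoding in ("euc_jp", "shift_jis", "utf-8"):
        print(encoding, octets_per_char(sample, encoding))
```

Data structures sized for two bytes per kana or kanji therefore need half again as much room if the same text is stored as UTF-8.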


Footnotes: 
[1]  In Python only strings are primitive, and characters are
represented by strings of length one.

[2]  Python 2's 'unicode' type is encoded as UTF-16 internally (on
"narrow" builds, anyway), although it's treated as an array of 16-bit
code units rather than variable-width characters.  This has worked
"good enough" for almost 2 decades.  That said, Python 3 (as of 3.3)
moved to a content-dependent fixed-width type.  If your string is all
ISO-8859-1, it's encoded as an array of octets.  If it contains even
one astral character, it's UTF-32.  Everything else is UCS-2 (aka the
subset of UTF-16 excluding surrogates).  Be warned, guys: even one
emoji in a Python 3 str will double, maybe quadruple, the memory
requirements.  Hm.  I think I see a DoS attack ... just append 👺!
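
The memory effect described in footnote [2] can be observed with sys.getsizeof. This is a rough sketch that assumes CPython 3.3 or later; the helper name storage_per_char is invented here, and the absolute sizes depend on the interpreter version and platform, so only differences and ratios are meaningful.

```python
import sys

# Sketch, assuming CPython 3.3+; other implementations may store str differently.

def storage_per_char(ch, n=1000):
    """Estimate per-character storage by differencing two string sizes.

    The fixed object header is the same for both strings, so it cancels.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


if __name__ == "__main__":
    for ch in ["a", "あ", "👺"]:
        print(ch, storage_per_char(ch))

    plain = "a" * 1000
    with_emoji = plain + "👺"  # one astral character widens every element
    print(sys.getsizeof(plain), sys.getsizeof(with_emoji))
```

On such a build I would expect the three estimates to come out as 1, 2 and 4 octets per character, and the second string to be roughly four times the size of the first.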
 
