Mailing List Archive



Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]



On 28 June 2018 at 14:52, Stephen J. Turnbull
<turnbull.stephen.fw@example.com> wrote:
> Jim Breen writes:
>  > I have this issue with the wwwjdic spaghetti code (mea culpa).  Parts of it
>  > date from before Unicode and UTF-8 existed, and all the internal data
>  > structures are built around the assumption that kana and kanji take
>  > two bytes.
>
> That is REALLY important.  Don't even think of giving up fixed width!

I wouldn't. In fact I do have to deal with variable width a teeny bit now,
as the EUC-JP encoding of JIS X 0212 code-points is done with 3 bytes.
Mercifully there are few of them that I have to worry about at the
character level.
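
To illustrate the byte widths mentioned above, here is a minimal Python
sketch that measures EUC-JP encoded lengths with Python's built-in "euc_jp"
codec. The sample characters and the helper name euc_jp_width are
illustrative assumptions and are not taken from wwwjdic.

```python
# Byte widths in EUC-JP, measured with Python's built-in "euc_jp" codec.
# The sample characters are illustrative choices, not wwwjdic data.
SAMPLES = {
    "ASCII letter": "A",
    "hiragana (JIS X 0208)": "あ",
    "kanji (JIS X 0208)": "漢",
    "half-width katakana (JIS X 0201)": "\uff71",
    "kanji (JIS X 0212)": "\u4e02",
}


def euc_jp_width(ch: str) -> int:
    """Return the number of bytes one character occupies in EUC-JP."""
    return len(ch.encode("euc_jp"))


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"{label}: {euc_jp_width(ch)} byte(s)")
```

JIS X 0212 characters are introduced by the single-shift byte 0x8F, which
is where the third byte comes from.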

>  > I'd love to move it over to using UTF-8 internally,
>
> Only if you hate your successor maintainers.  Take it from somebody
> who's worked with the internals of both Emacsen and Python,
> fixed-width beats variable-width the way Clinton beat Trump in
> California.

Since the things I'm trying to handle are all 3 bytes in UTF-8 I'd be
just as fixed-width as I am now, i.e. mostly.
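
The "mostly fixed-width" point can be checked with a short Python sketch.
The helper names and the sample strings are assumptions made for
illustration; the only claim relied on is that kana and BMP kanji encode
to 3 bytes in UTF-8, while ASCII takes 1 and astral-plane characters take 4.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def is_uniform_width(text: str, width: int = 3) -> bool:
    """True if every character in text encodes to exactly `width` bytes."""
    return all(utf8_width(ch) == width for ch in text)


if __name__ == "__main__":
    # Hiragana, katakana and common kanji are all 3 bytes each.
    print(is_uniform_width("かなカナ漢字"))
    # Mixing in ASCII breaks the fixed-width assumption.
    print(is_uniform_width("漢字A"))
    # An astral-plane kanji (U+20BB7) needs 4 bytes.
    print(utf8_width("\U00020bb7"))
```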

> ...  Like Trump nationally, UTF-8 squeaks out a win for
> interprocess interchange, but for the internal implementation of
> characters (if you have them[1]) and strings, use fixed-width.  BTW,
> UTF-16 is close enough to fixed-width in general[2], but might not be for
> wwwjdic due to emoji and "rare" characters assigned to astral planes
> being common dictionary lookups.  OTOH, if it's just a matter of
> matching short strings pretty much exactly (and you won't have the
> hira/kata/halfwit issue with anything in the astral planes, I think),
> you can probably just treat surrogates as funny characters that always
> occur in pairs.

I've certainly thought about the UTF-16/UCS-2 alternative, and if I was
starting from scratch I'd probably do it that way. In terms of transitioning
the present wwwjdic code to it I'd probably find working in UTF-8
easier.
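
For comparison, a small Python sketch of what the UTF-16 view looks like:
BMP characters are one 16-bit code unit, and anything in the astral planes
becomes a surrogate pair. The function names are hypothetical, and the
sample characters are chosen only for illustration.

```python
import struct


def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text in UTF-16 (little-endian)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def is_surrogate(unit: int) -> bool:
    """True if a code unit is one half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


if __name__ == "__main__":
    for ch in ("漢", "\U00020bb7"):
        units = utf16_units(ch)
        print([hex(u) for u in units], [is_surrogate(u) for u in units])
```

Code that treats every code unit as a "character" keeps working as long as
it never splits a string between the two halves of a pair.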

> Also, UTF-8 represents most of the BMP in 3 octets per character,
> including all JIS characters.  I don't know if this gets awkward for
> wwwjdic, but it was in Emacs (Mule code has the same kinds of issues
> as UTF-8 because its character structure was designed to the same
> specifications of ASCII compatibility and so on).

I don't think that would be awkward at all. Much of my recent text-processing
work has used UTF-8 throughout and it's not been a problem.

> Footnotes:
[..]
> [2]  Python 2's 'unicode' type is encoded as UTF-16 internally,
> although it's treated as an array of 16-bit code units rather than
> variable width characters.  This has worked "good enough" for almost 2
> decades.  That said, Python 3 moved to a content-dependent fixed-width
> type.  If your string is all ISO-8859-1, it's encoded as an array of
> octets.  If it contains even one astral character, it's UTF-32.
> Everything else is UCS-2 (aka the subset of UTF-16 excluding
> surrogates).  Be warned guys: even one emoji in a Python 3 str will
> double, maybe quadruple, the memory requirements.  Hm.  I think I see
> a DoS attack ... just append 👺!

That approach sort-of makes sense, but I'd hate to be maintaining it.
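
The memory effect described in the footnote can be observed with a rough
Python sketch. It assumes CPython 3.3 or later (the flexible string
representation); the helper bytes_per_char is hypothetical, and absolute
sys.getsizeof values differ between versions, so only differences and
ratios are looked at here.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate per-character storage by differencing two string sizes.

    Subtracting the sizes of ch*2n and ch*n cancels the fixed object
    header, leaving only the cost of n extra characters.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n


if __name__ == "__main__":
    print("ASCII:", bytes_per_char("a"))
    print("kana: ", bytes_per_char("あ"))
    print("emoji:", bytes_per_char("\U0001F47A"))

    text = "あ" * 1000
    with_emoji = text + "\U0001F47A"  # one astral character appended
    print(sys.getsizeof(text), sys.getsizeof(with_emoji))
```

On CPython the kana-only string is stored at 2 bytes per character, and
appending a single astral character moves the whole string to 4.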

Anyway there'll be no "successor maintainers" for wwwjdic. I'll instruct
my executors to put it on the bonfire, along with my used toothbrushes
and underpants.

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

