Mailing List Archive

Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]

Jim Breen writes:

 > I don't think [fixed-width 3-octet] would be awkward at all. Much
 > of my recent text-processing work has used UTF-8 throughout and
 > it's not been a problem.

OK.  A lot of the issues with Emacs and odd octet widths come from
generic memory management, where many systems really like power-of-2
alignment, and from certain kinds of string matching, which it turns
out can be sped up considerably if you do them 32 or 64 bits at a
time :-).

 > > Python 3 moved to a content-dependent fixed-width type.  If your
 > > string is all ISO-8859-1, it's encoded as an array of octets.  If
 > > it contains even one astral character, it's UTF-32.  Everything
 > > else is UCS-2 (aka the subset of UTF-16 excluding surrogates).
 > That approach sort-of makes sense, but I'd hate to be maintaining
 > it.

A plausible take, but that kind of code has been very stable in my
experience.  Once you have the (simple) code for accessing and mutating
the array of characters correct, and the (also simple) widening and
narrowing code correct, optimizations tend to be very local and easy
to do correctly.  Of course you have to go through the API, which
slightly limits how efficiently you can access and mutate the
underlying storage, but it's still wicked fast compared to Emacs. ;-)
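
For concreteness, here is a minimal sketch of the widening and
narrowing idea in Python.  It is a toy, not CPython's actual
implementation: FlexString, width_for, and the method names are
invented for illustration, and unlike real Python strings the toy is
mutable so that the widening step is visible.  It also makes no
attempt to reject surrogates in the 2-octet case.

```python
from array import array

# Typecodes for 1-, 2- and 4-octet unsigned storage.  Assumption: "I" is
# 4 octets wide, which holds on mainstream platforms.
TYPECODES = {1: "B", 2: "H", 4: "I"}


def width_for(code_point):
    """Smallest fixed width (in octets) able to hold this code point."""
    if code_point <= 0xFF:
        return 1      # ISO-8859-1 range
    if code_point <= 0xFFFF:
        return 2      # BMP
    return 4          # astral characters


class FlexString:
    """Toy mutable string stored as a fixed-width array of code points."""

    def __init__(self, text=""):
        points = [ord(ch) for ch in text]
        self.width = max((width_for(p) for p in points), default=1)
        self.units = array(TYPECODES[self.width], points)

    def __len__(self):
        return len(self.units)

    def __getitem__(self, index):
        # Fixed width means character access is plain array indexing.
        return chr(self.units[index])

    def __setitem__(self, index, ch):
        needed = width_for(ord(ch))
        if needed > self.width:
            self._resize(needed)          # widen before storing
        self.units[index] = ord(ch)

    def narrow(self):
        """Shrink storage if no remaining character needs the current width."""
        needed = max((width_for(p) for p in self.units), default=1)
        if needed < self.width:
            self._resize(needed)

    def _resize(self, width):
        # Copy every code point into an array of the new element width.
        self.units = array(TYPECODES[width], self.units.tolist())
        self.width = width

    def __str__(self):
        return "".join(chr(p) for p in self.units)
```

Everything that knows about the storage width lives in width_for and
_resize; indexing and assignment never need to care which of the three
representations is in use, which is roughly why that kind of code
tends to stay put once it is right.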

 > Anyway there'll be no "successor maintainers" for wwwjdic. I'll instruct
 > my executors to put it on the bonfire, along with my used toothbrushes
 > and underpants.

As Kori Schake[1] likes to say, "Jim, I did not need that visual!"
