Mailing List Archive
Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]
- Date: Thu, 28 Jun 2018 13:52:02 +0900
- From: "Stephen J. Turnbull" <turnbull.stephen.fw@example.com>
- Subject: Re: [tlug] Bogus Japanese zipfiles [was: Kudos to Jim Breen]
- References: <23345.41167.951877.900876@turnbull.sk.tsukuba.ac.jp> <23345.44414.330392.350450@turnbull.sk.tsukuba.ac.jp> <CAKXLc7c-LzgY5AtE8XrZzKUrr206nXtmxdtKQC0q8PkcMjiF7A@mail.gmail.com> <CABHGxq6CkEeQVHy7rjjbTP72mOm_QQ5xtthPuPR-QASYQoS_ag@mail.gmail.com> <07A05935-BBD8-4C13-AEF6-667D653EBE45@brightblack.net> <23346.65438.401753.15741@turnbull.sk.tsukuba.ac.jp> <CABHGxq5mnJgiSxGKEXZ4KYBAuVB0YBUwMqi+duoTk1iSeXj9PQ@mail.gmail.com>
Jim Breen writes:

 > I have this issue with the wwwjdic spaghetti code (mea culpa). Parts of it
 > date from before Unicode and UTF-8 existed, and all the internal data
 > structures are built around the assumption that kana and kanji take
 > two bytes.

That is REALLY important. Don't even think of giving up fixed width!

 > I'd love to move it over to using UTF-8 internally,

Only if you hate your successor maintainers. Take it from somebody
who's worked with the internals of both Emacsen and Python:
fixed-width beats variable-width the way Clinton beat Trump in
California. Like Trump nationally, UTF-8 squeaks out a win for
interprocess interchange, but for the internal implementation of
characters (if you have them[1]) and strings, use fixed-width.

BTW, UTF-16 is close enough to fixed-width in general[2], but might
not be for wwwjdic, because emoji and "rare" characters assigned to
the astral planes may be common dictionary lookups. OTOH, if it's
just a matter of matching short strings pretty much exactly (and you
won't have the hira/kata/halfwit issue with anything in the astral
planes, I think), you can probably just treat surrogates as funny
characters that always occur in pairs.

Also, UTF-8 represents most of the BMP in 3 octets per character,
including all JIS characters. I don't know if this gets awkward for
wwwjdic, but it was in Emacs (Mule code has the same kinds of issues
as UTF-8 because its character structure was designed to the same
specifications of ASCII compatibility and so on).

Footnotes:
[1] In Python only strings are primitive, and characters are
represented by strings of length one.

[2] Python 2's 'unicode' type is encoded as UTF-16 internally (in the
usual "narrow" builds), although it's treated as an array of 16-bit
code units rather than variable-width characters. This has worked
"good enough" for almost 2 decades. That said, Python 3 moved to a
content-dependent fixed-width type. If your string is all ISO-8859-1,
it's encoded as an array of octets. If it contains even one astral
character, it's UTF-32. Everything else is UCS-2 (aka the subset of
UTF-16 excluding surrogates). Be warned guys: even one emoji in a
Python 3 str will double, maybe quadruple, the memory requirements.
Hm. I think I see a DoS attack ... just append 👺!
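
For anyone who wants to check footnote [2] on their own machine, here is a minimal sketch. It assumes CPython 3.3 or later, where str uses the content-dependent storage described above; other implementations may report different numbers. The helper names bytes_per_char and utf16_units are made up for this illustration and are not from wwwjdic, Emacs, or the Python standard library.

```python
import sys


def bytes_per_char(sample: str, n: int = 1000) -> int:
    """Per-character storage CPython uses for text made of `sample`.

    Subtracting two sizes cancels the fixed object header, leaving
    only the cost of the extra characters.
    """
    big = sys.getsizeof(sample * (2 * n))
    small = sys.getsizeof(sample * n)
    return (big - small) // (n * len(sample))


def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    for label, ch in [("ASCII", "a"), ("kana", "あ"), ("astral", "👺")]:
        print(label + ":",
              bytes_per_char(ch), "bytes/char in memory,",
              len(ch.encode("utf-8")), "octets in UTF-8,",
              utf16_units(ch), "UTF-16 code unit(s)")

    # Appending a single astral character widens every character
    # in the string to 4 bytes.
    plain = "あ" * 1000
    print(sys.getsizeof(plain), sys.getsizeof(plain + "👺"))
```

The kana case shows both points from the post at once: 2 bytes per character in memory and one UTF-16 code unit, but 3 octets in UTF-8. The astral character needs a surrogate pair in UTF-16, and one of them is enough to roughly double the in-memory size of an otherwise all-kana string.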