Mailing List Archive

Re: tlug: Two Qs re translation project



"Frank Bennett (フランクべネット )" wrote:

> I also have a not-unrelated question that someone (Steve
> Turnbull?) will be able to help with.  The Jse data is stored
> in EUC.  In EUC encoding, could a one-byte search engine
> capable of indexing 8-bit text be used?  In other words,
> if there is a string made up of four bytes:
>
>   [A] [B] [C] [D]
>
> where A and C are the first bytes of two-byte characters
> in EUC-JP encoding, and we run a search using a single-byte
> search engine for a single arbitrary two-byte character, is it
> possible that our character's underlying encoding could
> be [B] [C]?  Or is it logically impossible in EUC-JP
> encoding to get crossed up in this way?
>
> In other words, what are the legal bounds of the first and
> the second bytes in EUC-JP encoding?

A1..FE for both bytes. So yes, a false positive is possible: the trail
byte of one character and the lead byte of the next can together look
like a different two-byte character. Also, if you index a lot of web
data, which tends to use Latin 1 characters even in English (for degree
signs and the occasional accented vowel: Pokemon! Sake!), you'll
probably run into problems there as well, since those bytes fall in the
same A1..FE range, unless you're sure that the non-JIS data will always
be ASCII.
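
To make the [B] [C] case concrete, here is a minimal Python sketch. It
assumes Python's built-in "euc_jp" codec and text limited to ASCII plus
two-byte JIS X 0208 characters (no SS2/SS3 prefixed sequences). The
sample string and the helper find_aligned are only illustrative, not
part of any search engine.

```python
# "あい" encodes to A4 A2 | A4 A4 in EUC-JP.
text = "あい"
data = text.encode("euc_jp")

# [B] [C]: trail byte of the first character + lead byte of the second.
straddle = data[1:3]
needle = straddle.decode("euc_jp")  # decodes to a valid, unrelated character

byte_hit = straddle in data  # what a byte-oriented search sees
char_hit = needle in text    # what the text actually contains


def find_aligned(haystack: bytes, pattern: bytes) -> int:
    """Return the offset of the first match that starts on a character
    boundary, or -1. Assumes valid EUC-JP containing only ASCII and
    two-byte characters (hypothetical helper, no SS2/SS3 handling)."""
    i = 0
    while i < len(haystack):
        if haystack.startswith(pattern, i):
            return i
        # Bytes A1..FE start a two-byte character; anything else is one byte.
        i += 2 if haystack[i] >= 0xA1 else 1
    return -1


print(byte_hit, char_hit, find_aligned(data, straddle))
```

The naive byte search reports a hit even though the character never
occurs in the text. Walking the buffer character by character, as
find_aligned does, is one way to reject the misaligned match.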

UTF-8 doesn't suffer from this problem, btw. By design, the lead byte is
structurally different from the trail byte(s), so an 8-bit clean string
search won't deliver a false positive.
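
The same check can be sketched for UTF-8. Trail (continuation) bytes
always have the bit pattern 10xxxxxx, while ASCII and lead bytes never
do, so an encoded character can only match at a character boundary. The
sample text and helper names below are illustrative assumptions.

```python
def is_continuation(b: int) -> bool:
    """True for UTF-8 trail bytes, which always look like 10xxxxxx."""
    return (b & 0xC0) == 0x80


def match_offsets(haystack: bytes, pattern: bytes) -> list:
    """Return every offset where pattern occurs in haystack (plain byte search)."""
    offsets = []
    i = haystack.find(pattern)
    while i != -1:
        offsets.append(i)
        i = haystack.find(pattern, i + 1)
    return offsets


text = "あい°é"
data = text.encode("utf-8")

# Offsets where a character starts: every byte that is not a trail byte.
boundaries = [i for i, b in enumerate(data) if not is_continuation(b)]

# Every byte-level hit for an encoded character lands on a boundary.
all_aligned = all(
    offset in boundaries
    for ch in text
    for offset in match_offsets(data, ch.encode("utf-8"))
)

# A slice that straddles two characters starts with a trail byte,
# so it is not a valid encoded character at all.
try:
    data[1:4].decode("utf-8")
    straddle_decodes = True
except UnicodeDecodeError:
    straddle_decodes = False

print(boundaries, all_aligned, straddle_decodes)
```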


--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan