
Mailing List Archive
Re: [tlug] Searching for kanji strings: Use UTF-8
The first answer is:
Use UTF-8.
David Riggs wrote:
> I need to find short kanji strings in a giant haystack of texts.
> I am not sure that sed is really doing a character replacement
> (the real punctuation is unicode two byte maru and space). If it is
> doing a byte-by-byte replacement, it could mangle kanji by taking the
> second byte of one and the first byte of the following ji.
> On the other hand, instead of searching each time, is there a text
> indexing and search system which works with unicode? All I find googling
> around is commercial stuff which seems orientated towards western languages.
Use UTF-8 for all strings involved.
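UNIX gurus Thompson & Pike anticipated exactly the "out of sync"
situation you (correctly!) worry about, and designed UTF-8 as the
solution to make multi-byte characters play well with the classic
UNIX filters that just think "one byte at a time".
In UTF-8 the first byte of a multi-byte character and its continuation
bytes come from disjoint ranges, and neither range overlaps ASCII, so a
byte-wise match for a complete character can only begin on a character
boundary.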
http://en.wikipedia.org/wiki/UTF-8
Their solution is elegant, just as one would expect of UNIX.
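
To make the difference concrete, here is a small Python sketch. It is only
an illustration: the sample strings and the helper name byte_find are my
own, not anything from your texts. It searches encoded bytes the way a
byte-oriented filter would, first in a fixed two-byte encoding (UTF-16,
big-endian) and then in UTF-8. The needle is U+4230, an ideograph that does
not occur in the haystack at all.

```python
def byte_find(haystack: str, needle: str, encoding: str) -> int:
    """Search on the encoded bytes, as a byte-at-a-time filter would.

    Returns the byte offset of the first match, or -1 if there is none.
    """
    return haystack.encode(encoding).find(needle.encode(encoding))


# Hypothetical sample data: two hiragana, U+3042 followed by U+3044.
text = "\u3042\u3044"
# U+4230 is NOT in text, so a correct search must not find it.
stray = "\u4230"

# In UTF-16BE the text is the bytes 30 42 30 44 and the needle is 42 30:
# the second byte of one character plus the first byte of the next.
utf16_hit = byte_find(text, stray, "utf-16-be")

# In UTF-8 the text is E3 81 82 E3 81 84 and the needle is E4 88 B0.
utf8_hit = byte_find(text, stray, "utf-8")

# A character that really is present is still found, at a boundary.
real_hit = byte_find(text, "\u3044", "utf-8")

print(utf16_hit, utf8_hit, real_hit)
```

The two-byte search reports a bogus hit at byte offset 1, straddling the two
characters, while the UTF-8 search reports no match for the stray character
and finds the real one at byte offset 3, the start of the second character.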
------------------------------------------------------------------------------
I'll address the nastier newline-spanning issue separately.