
Mailing List Archive
Re: [tlug] Searching for kanji strings: Use UTF-8
The first answer is: 
   Use UTF-8. 
David Riggs wrote:
> I need to find short kanji strings in a giant haystack of texts. 
> I am not sure that sed is really doing a character replacement 
> (the real punctuation is unicode two byte maru and space). If it is 
> doing a byte-by-byte replacement, it could mangle kanji by taking the 
> second byte of one and the first byte of the following ji.
> On the other hand, instead of searching each time, is there a text 
> indexing and search system which works with unicode? All I find googling 
> around is commercial stuff which seems orientated towards western languages.
Use UTF-8 for all strings involved. 
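UNIX gurus Ken Thompson & Rob Pike anticipated exactly the "out of sync" 
situation you (correctly!) worry about, and designed UTF-8 as the 
solution to make multi-byte characters play well with the classic 
UNIX filters that just think "one byte at a time". 
In UTF-8 the first byte of a character and its continuation bytes 
(always 10xxxxxx) come from disjoint ranges, so a byte-wise match of 
one valid UTF-8 string inside another can never start in the middle 
of a character. 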
   http://en.wikipedia.org/wiki/UTF-8
Their solution is elegant, just as one would expect of UNIX. 
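
To make the "out of sync" worry concrete, here is a small Python sketch 
(my own illustration, not anything from David's setup). The two-kanji 
sample text and the helper names byte_find and is_utf8_boundary are 
made up for the example, and I am assuming "unicode two byte" means 
something like big-endian UTF-16. It takes the second byte of one kanji 
and the first byte of the next, treats that pair as a search string, 
and shows that a byte-wise search finds this phantom in UTF-16 but not 
in UTF-8. 

```python
# Sketch only: the sample text and helper names are hypothetical.

def byte_find(haystack: bytes, needle: bytes) -> int:
    """Byte-wise search, the way a classic one-byte-at-a-time filter sees it."""
    return haystack.find(needle)

def is_utf8_boundary(data: bytes, offset: int) -> bool:
    """True unless offset points at a UTF-8 continuation byte (10xxxxxx)."""
    return offset >= len(data) or (data[offset] & 0xC0) != 0x80

text = "\u5c71\u5ddd"                  # two kanji: U+5C71 followed by U+5DDD
utf16 = text.encode("utf-16-be")       # bytes 5C 71 5D DD
utf8 = text.encode("utf-8")            # three bytes per kanji

# "Phantom" character: second byte of the first kanji plus the
# first byte of the following one, read as a UTF-16 code unit.
phantom = utf16[1:3].decode("utf-16-be")

# In UTF-16 the byte-wise search reports a hit at an odd offset,
# i.e. in the middle of a character, although the phantom is not in the text.
utf16_hit = byte_find(utf16, phantom.encode("utf-16-be"))

# In UTF-8 the same search finds nothing.
utf8_hit = byte_find(utf8, phantom.encode("utf-8"))

# A character that really is in the text is found on a character boundary.
real_hit = byte_find(utf8, "\u5ddd".encode("utf-8"))

print(utf16_hit, utf8_hit, real_hit, is_utf8_boundary(utf8, real_hit))
```

The same reasoning applies to a byte-wise replacement: as long as the 
haystack, the pattern and the replacement are all valid UTF-8, a match 
cannot straddle two characters. 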
------------------------------------------------------------------------------
I'll address the nastier newline spanning issue separately. 