
Re: [tlug] Searching for kanji strings: Use UTF-8



The first answer is: 

   Use UTF-8. 

David Riggs wrote:

> I need to find short kanji strings in a giant haystack of texts. 

> I am not sure that sed is really doing a character replacement 
> (the real punctuation is a Unicode two-byte maru and a space). If it is 
> doing a byte-by-byte replacement, it could mangle kanji by taking the 
> second byte of one and the first byte of the following ji.

> On the other hand, instead of searching each time, is there a text 
> indexing and search system which works with unicode? All I find googling 
> around is commercial stuff which seems oriented towards Western languages.

Use UTF-8 for all strings involved. 

UNIX gurus Thompson & Pike anticipated exactly the "out of sync" 
situation you (correctly!) worry about, and designed UTF-8 so that 
multi-byte characters play well with the classic UNIX filters that 
think "one byte at a time". 

   http://en.wikipedia.org/wiki/UTF-8

Their solution is elegant, just as one would expect of UNIX. 
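
Here is a minimal Python sketch of that property (my own illustration, 
not from the thread; the sample strings are made up). Because UTF-8 lead 
bytes and continuation bytes live in disjoint ranges, a byte-level find 
or replace can only match at real character boundaries, so it can never 
pair the tail byte of one kanji with the lead byte of the next: 

   # Byte-level search and replace on UTF-8, the way a one-byte-at-a-time
   # UNIX filter sees the data.
   text   = "日本語のテキスト。次の文。"   # hypothetical haystack
   needle = "本語"                         # short kanji string to find

   hay    = text.encode("utf-8")
   target = needle.encode("utf-8")

   # The needle starts with a lead byte (0xE6 here), while every
   # mid-character byte in the haystack is a continuation byte in
   # 0x80-0xBF, so a match can only start on a character boundary.
   assert not (0x80 <= target[0] <= 0xBF)
   pos = hay.find(target)
   print(pos, hay[pos:pos + len(target)].decode("utf-8"))   # -> 3 本語

   # Byte-level replacement is just as safe: swap the ideographic full
   # stop (U+3002, the maru) for an ASCII period plus space and the
   # result still decodes cleanly.
   fixed = hay.replace("。".encode("utf-8"), b". ").decode("utf-8")
   print(fixed)

The same reasoning is why a literal sed substitution or grep search on 
UTF-8 data will not corrupt kanji, provided the pattern and the file use 
the same encoding. 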

------------------------------------------------------------------------------

I'll address the nastier newline-spanning issue separately. 



