TLUG Mailing List

Re: [tlug] Searching for kanji strings: Use UTF-8

Date: Fri, 13 Jan 2006 21:14:52 -0500

From: Jim <jep200404@example.com>

Subject: Re: [tlug] Searching for kanji strings: Use UTF-8

References: <200601130511.k0D5BxWg015897@example.com><43C84B5A.7000703@example.com>

The first answer is:

   Use UTF-8. 

David Riggs wrote:

> I need to find short kanji strings in a giant haystack of texts. 

> I am not sure that sed is really doing a character replacement 
> (the real punctuation is unicode two byte maru and space). If it is 
> doing a byte-by-byte replacement, it could mangle kanji by taking the 
> second byte of one and the first byte of the following ji.

> On the other hand, instead of searching each time, is there a text 
> indexing and search system which works with unicode? All I find googling 
> around is commercial stuff which seems orientated towards western languages.

Use UTF-8 for all strings involved. 

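UNIX gurus Thompson & Pike anticipated exactly the "out of sync" 
situation you (correctly!) worry about, and designed UTF-8 so that 
multi-byte characters play well with the classic UNIX filters that 
just think "one byte at a time". In UTF-8 the first byte of a 
character and its continuation bytes come from disjoint ranges, so 
a byte-by-byte match of a valid UTF-8 string can never start in the 
middle of a character. 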

   http://en.wikipedia.org/wiki/UTF-8

Their solution is elegant, just as one would expect of UNIX. 
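
If you want to convince yourself, here is a rough Python sketch (not 
anything from your setup; the sample text, the search string and the 
helper names are all made up for illustration). It searches the raw 
UTF-8 bytes, the way a byte-oriented filter would, and checks that 
every hit lands on a character boundary: 

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def byte_matches(haystack, needle):
    """Return every byte offset at which needle occurs in haystack."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


# Hypothetical sample data, chosen only for the demonstration.
text = "東京都に住む。京都にも行く。"
needle = "京都"

hay_b = text.encode("utf-8")
needle_b = needle.encode("utf-8")

# Byte-level search, as a "one byte at a time" tool would do it.
offsets = byte_matches(hay_b, needle_b)

# No hit may start on a continuation byte, i.e. inside a character.
assert all(not is_continuation(hay_b[o]) for o in offsets)

# Convert byte offsets to character offsets and compare them with a
# search done on the decoded string.
char_offsets = [len(hay_b[:o].decode("utf-8")) for o in offsets]
assert char_offsets == [i for i in range(len(text))
                        if text.startswith(needle, i)]

print(offsets, char_offsets)   # prints [3, 21] [1, 7]
```

Both the haystack and the search string have to be valid UTF-8 for 
this guarantee to hold, which is why the advice is to use UTF-8 for 
all strings involved. 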

------------------------------------------------------------------------------

I'll address the nastier newline spanning issue separately. 