Mailing List Archive
Re: [tlug] Searching for kanji strings: Use UTF-8
- Date: Fri, 13 Jan 2006 21:14:52 -0500
- From: Jim <jep200404@example.com>
- Subject: Re: [tlug] Searching for kanji strings: Use UTF-8
- References: <200601130511.k0D5BxWg015897@example.com><43C84B5A.7000703@example.com>
The first answer is: Use UTF-8.

David Riggs wrote:

> I need to find short kanji strings in a giant haystack of texts.
> I am not sure that sed is really doing a character replacement
> (the real punctuation is unicode two byte maru and space). If it is
> doing a byte-by-byte replacement, it could mangle kanji by taking the
> second byte of one and the first byte of the following ji.
> On the other hand, instead of searching each time, is there a text
> indexing and search system which works with unicode? All I find googling
> around is commercial stuff which seems orientated towards western languages.

Use UTF-8 for all strings involved: the search patterns as well as the
texts being searched.

UNIX gurus Ken Thompson and Rob Pike anticipated exactly the "out of sync"
situation you (correctly!) worry about, and designed UTF-8 so that
multi-byte characters play well with the classic UNIX filters that
process text one byte at a time. In UTF-8 the first byte of a multi-byte
character and its continuation bytes come from disjoint ranges, so a
byte-wise match of a complete UTF-8 string cannot begin in the middle of
a character.

http://en.wikipedia.org/wiki/UTF-8

Their solution is elegant, just as one would expect of UNIX.

------------------------------------------------------------------------------

I'll address the nastier newline-spanning issue separately.
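To make the "out of sync" worry concrete, here is a minimal Python sketch. It uses EUC-JP purely as an example of a legacy double-byte encoding (the original question does not say which encoding the texts are in), and the helper names straddling_char and byte_search are made up for illustration. It builds a "ghost" kanji out of the second byte of one character and the first byte of the next, then searches for it the way a byte-oriented filter would.

```python
# Illustrative sketch only. EUC-JP stands in for "a legacy double-byte
# encoding"; the original post does not say which encoding the texts use.
# straddling_char and byte_search are hypothetical helper names.

def straddling_char(text: str, encoding: str) -> str:
    """Decode the two bytes that straddle the first two characters.

    Assumes each of the first two characters encodes to exactly two
    bytes (true for kanji in EUC-JP) and that the straddling byte pair
    is itself a valid character in that encoding.
    """
    raw = text.encode(encoding)
    return raw[1:3].decode(encoding)


def byte_search(needle: str, haystack: str, encoding: str) -> bool:
    """Search the way a byte-oriented filter would: on encoded bytes."""
    return needle.encode(encoding) in haystack.encode(encoding)


haystack = "漢字"
ghost = straddling_char(haystack, "euc_jp")

print(ghost in haystack)                       # character-level search
print(byte_search(ghost, haystack, "euc_jp"))  # byte-level search, EUC-JP
print(byte_search(ghost, haystack, "utf-8"))   # byte-level search, UTF-8
```

The ghost character does not occur in the text, so the character-level search fails, yet the byte-level search over EUC-JP reports a match because the ghost's two bytes sit across the boundary between the two real kanji. The same byte-level search over UTF-8 finds nothing, which is the property that lets byte-oriented tools search UTF-8 text without false hits of this kind.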
- Follow-Ups:
- Re: [tlug] Use a shell that groks UTF-8
- From: Jim