Mailing List Archive



[tlug] [tlug-digest] searching for kanji strings, ignoring punctuation and line breaks. Text indexing and retrieval in Unicode.



I need to find short kanji strings in a giant haystack of texts. Plain grep
does not work because the haystack (the CBETA canon of Buddhist texts) has
punctuation, newline characters, and line numbers inserted into the running
text, so a search string can be broken up anywhere.

One approach is a pipeline that greps for the first kanji, strips out the
punctuation characters with sed, and then greps again for the rest of the
string:

grep first-ji * | sed  -e 's/[,;]//g' | grep later-kanjis

This is OK, but it does not work for a string that spans a line break, and
in any case I am not sure that sed is really doing a character-level
replacement (the real punctuation is the full-width maru and space, which
are multi-byte characters in Unicode). If sed is replacing byte by byte, it
could mangle kanji by matching the second byte of one character together
with the first byte of the following ji.
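
One thing that might sidestep both problems is to flatten each text once,
up front, and then grep the flattened copies. A rough sketch of what I
mean (assuming the files are stored as UTF-8; the cbeta/ and flat/ paths
and the search string are only placeholders):

# Flatten each text: strip punctuation (\p{P}), separators such as the
# full-width space (\p{Z}), digits (\p{N}) and line breaks, so a match
# can no longer be interrupted. -CSD makes perl read and write UTF-8.
mkdir -p flat
for f in cbeta/*.txt; do
    perl -CSD -pe 's/[\p{P}\p{Z}\p{N}\s]+//g' "$f" > "flat/$(basename "$f")"
done

# An ordinary fixed-string grep on the flattened copies then finds
# strings that spanned punctuation or line breaks in the originals.
grep -F -l '如是我聞' flat/*

Since each flattened file ends up as one long line, grep -l only tells me
which text contains the string; mapping a hit back to its page and line in
the original would still take more work.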


Is there a way to do this, preferably a fast one? My haystack is hundreds
of megabytes and I have to search it often.

On the other hand, instead of searching each time, is there a text
indexing and search system that works with Unicode? All I find by googling
around is commercial software that seems oriented towards Western languages.

Thanks,

David Riggs

Kyoto




