[tlug] searching for kanji strings, ignore punctuation and end of lines



Thanks for the helpful suggestions, and sorry to be so long in getting
back. I am still looking for a solution.

My first post:

 > I need to find short kanji strings in a giant haystack of texts. Grep
 > does not work because the haystack (the CBETA canon of Buddhist texts)
 > adds punctuation characters, and inserts newline characters and line
 > numbers.
 >
 > One approach is to make a pipeline to grep for the first kanji, strip
 > out the punctuation characters with sed, and search again for the rest
 > of the kanji:
 >
 > grep first-ji * | sed  -e 's/[,;]//g' | grep later-kanjis
 >
 > This is OK, but does not work for a string which spans a new line, and
 > anyway, I am not sure that sed is really doing a character replacement
 > (the real punctuation is unicode two byte maru and space). If it is
 > doing a byte-by-byte replacement, it could mangle kanji by taking the
 > second byte of one and the first byte of the following ji.
 >
 >
 > Is there a way to do this, preferably a fast way to do this? My haystack
 > is hundreds of megabytes and I have to do it a lot.




To clarify: this all assumes utf-8 in data and in locale.
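On the byte-mangling worry in the first post, here is a small check. It assumes the maru is U+3002 and the special space is U+3000 (the post does not name the code points). In UTF-8 each of these is three bytes, and because UTF-8 is self-synchronizing, removing the complete byte sequence of one character cannot clip bytes out of a neighbouring kanji. A Python sketch, not something taken from the thread:

```python
# Assumed code points; the post only says "maru" and "space".
MARU = "\u3002"        # IDEOGRAPHIC FULL STOP
WIDE_SPACE = "\u3000"  # IDEOGRAPHIC SPACE


def strip_fillers(text):
    """Remove both filler characters from an already-decoded string."""
    return text.translate({ord(MARU): None, ord(WIDE_SPACE): None})


def strip_fillers_bytes(raw):
    """Do the same removal directly on UTF-8 encoded bytes."""
    raw = raw.replace(MARU.encode("utf-8"), b"")
    return raw.replace(WIDE_SPACE.encode("utf-8"), b"")


if __name__ == "__main__":
    # Six kanji with one maru and one wide space mixed in.
    sample = "\u5982\u662f\u3002\u6211\u805e\u3000\u4e00\u6642"
    print(MARU.encode("utf-8"), WIDE_SPACE.encode("utf-8"))
    # Character-level and byte-level stripping agree on this sample.
    cleaned = strip_fillers_bytes(sample.encode("utf-8")).decode("utf-8")
    assert cleaned == strip_fillers(sample)
```

Whether a particular sed build works on characters or bytes depends on the locale, but with valid UTF-8 input either way should give the same result for this kind of whole-character deletion.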

I have a quote that is just a string of kanji, and I am looking for
where it came from. I do have an etext version of the canon (several
hundred megabytes and thousands of files), in utf8, which most likely
contains this phrase.

The problem is that the etext inserts a special "space" or a maru
(i.e. a unicode period, little circle) at random places, trying to
make it easier to read, which makes the phrase impossible to find
with grep.

I can assume that two lines is enough to look at, and there is
actually no ascii white space, just those two unicode characters that
get in the way.

Example, using ABCDEF for a six kanji phrase I am looking for,
"ghijklmnopq..." for other kanji that happen to be on the line, and "."
for the maru:

p0001a05(00)-ghi.jklmn.op.rs.AB.
p0001a06(00)-CD.EFtuvw.xyz.

After finding that first kanji with grep, I need to look for the rest, 
ignoring both the "punctuation" maru and unicode space, and continue 
onto the next line, ignoring the initial line numbers.
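One way to express this search is a single regular expression that allows "filler" between every pair of kanji in the phrase. The following Python sketch is only an illustration under stated assumptions: the maru is taken to be U+3002 and the special space U+3000, the line-number prefix is the [-0-9()pabc] set described in this post, and the function names are made up for the example.

```python
import re

# Assumptions, not confirmed by the post: exact filler code points.
FILLER = "[\u3002\u3000]"
# Line-number prefix such as "p0001a05(00)-", using the post's character set.
LINE_PREFIX = "[-0-9()pabc]+"
# Between two kanji: any fillers, then optionally a newline, the next
# line's prefix, and more fillers.
GAP = "(?:" + FILLER + "*(?:\\n" + LINE_PREFIX + FILLER + "*)?)"


def phrase_pattern(phrase):
    """Compile a regex that matches the phrase with optional gaps inside it."""
    return re.compile(GAP.join(re.escape(ch) for ch in phrase))


def find_phrase(text, phrase):
    """Yield (line_id, matched_text) for each hit in one file's full text."""
    for m in phrase_pattern(phrase).finditer(text):
        # Walk back to the start of the line on which the match begins.
        line_start = text.rfind("\n", 0, m.start()) + 1
        prefix = re.match(LINE_PREFIX, text[line_start:])
        yield (prefix.group(0) if prefix else ""), m.group(0)


def search_files(paths, phrase):
    """Yield (path, line_id, matched_text) over a list of UTF-8 files."""
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line_id, hit in find_phrase(fh.read(), phrase):
                yield path, line_id, hit
```

Since each file is read whole, the file name is simply the path that was opened, and the line number is recovered from the start of the line where the match begins. Note that this pattern allows one line break per gap, so it is not limited to two lines.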

The line numbers are easy to ignore: they are a fixed set of
[-0-9()pabc], and the output of grep will include the file name, also
a fixed set of 0-9 and ascii letters.  BUT, I need to get that file
name and the line number!

Ian, thanks for the perl. I will have to study it, since my perl is
terrible, but I do not think it quite fits.


If I could take a two line unit spat out by grep -A2, then process it
as a separate set, I could do it rather easily. Strip out stuff after 
the match for the first kanji: newline, punctuation, and line numbers. 
Then if there is a match print out the working data area.

But pipelines do not work like that: once out of grep it is just a
stream, and I do not see how to chunk it back up again into two line
groups.
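The two-line chunking described above can be done outside of grep by sliding over consecutive pairs of lines. A minimal Python sketch, assuming the input is the raw lines of one etext file (not grep output), that the fillers are U+3002 and U+3000, and that the prefix uses the character set given in this post; the helper names are hypothetical:

```python
import itertools
import re

PREFIX = re.compile(r"^[-0-9()pabc]+")   # line-number prefix, per the post
FILLER = re.compile("[\u3002\u3000]")    # assumed: maru and ideographic space


def clean(line):
    """Drop the line-number prefix and the filler characters from one line."""
    return FILLER.sub("", PREFIX.sub("", line.rstrip("\n")))


def two_line_hits(lines, phrase):
    """Yield each original line on which the phrase starts.

    The phrase may run on into the following line, but no further.
    """
    prev = None
    # The trailing "" lets the last real line be checked on its own.
    for cur in itertools.chain(lines, [""]):
        if prev is not None:
            head = clean(prev)
            pos = (head + clean(cur)).find(phrase)
            # Report only matches that begin on the first line of the pair,
            # so a phrase wholly on the second line is not reported twice.
            if 0 <= pos < len(head):
                yield prev.rstrip("\n")
        prev = cur
```

The yielded line still carries its line number, and the caller knows the file name because it opened the file itself, for example with `open(path, encoding="utf-8")`.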

Still trying to figure it out. But it does seem that there is hope in 
getting perl to do this. Time to sit down with perl, and time to order 
_Mastering Regular Expressions_ . Sigh, I was hoping this was an obvious 
and easy problem.

But, from the silence on the second part of my question, I guess there
is no pre-index program that would handle this kind of thing and do it 
in a flash? Even a simple search on my data set takes a while.
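Short of a real indexing program, one option is to normalize each file once and search the cleaned text afterwards. The Python sketch below is only that, a sketch: it assumes the same filler characters (U+3002, U+3000) and prefix set as described in this post, and `normalize` and `lookup` are names invented for the illustration. It keeps a table of offsets so that a hit in the cleaned text can be mapped back to the line number it started on.

```python
import bisect
import re

PREFIX = re.compile(r"^[-0-9()pabc]+")   # line-number prefix, per the post
FILLER = re.compile("[\u3002\u3000]")    # assumed: maru and ideographic space


def normalize(lines):
    """Return (clean_text, offsets, line_ids) for the lines of one file.

    offsets[i] is the position in clean_text where line i's kanji begin.
    """
    parts, offsets, line_ids, pos = [], [], [], 0
    for line in lines:
        line = line.rstrip("\n")
        m = PREFIX.match(line)
        line_id = m.group(0) if m else ""
        body = FILLER.sub("", line[len(line_id):])
        offsets.append(pos)
        line_ids.append(line_id)
        parts.append(body)
        pos += len(body)
    return "".join(parts), offsets, line_ids


def lookup(clean_text, offsets, line_ids, phrase):
    """Yield the line id on which each occurrence of the phrase starts."""
    start = clean_text.find(phrase)
    while start != -1:
        # Last line whose kanji begin at or before the hit position.
        yield line_ids[bisect.bisect_right(offsets, start) - 1]
        start = clean_text.find(phrase, start + 1)
```

The normalized text and the two tables could be saved per file, so that later searches are plain substring searches over data with no punctuation, line breaks, or line numbers in the way.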

Thanks everyone,

David Riggs, Kyoto

