
[tlug] searching for kanji strings, ignore punctuation and end of lines: Perl Solution and comments
- Date: Wed, 18 Jan 2006 17:28:06 +0900
- From: David Riggs <dariggs@example.com>
- Subject: [tlug] searching for kanji strings, ignore punctuation and end of lines: Perl Solution and comments
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050420 Debian/1.7.7-2
Thanks for help from Edward, Steven, Josh et al.
I have a solution to my surprisingly pesky problem (see the solution
after the >>). To recap the problem:
<<review---------
I have a quote that is just a
string of kanji, and I am looking for where it came from. I do have an
etext version of the canon (several hundred megabytes and thousands of
files), in utf8, which most likely contains this phrase.
The problem is that the etexts insert a special "space" or a maru
(i.e. a unicode period, a little circle) at seemingly random places to
make the text easier to read, which makes the phrase impossible to find
with grep; they also break lines at unlikely places.
I can assume that two lines are enough to look at, and there is
actually no ASCII whitespace, just those two unicode characters that
get in the way.
An example, using ABCDEF for a six-kanji phrase I am looking for,
"ghijklmnopq..." for other kanji that happen to be on the line, and "."
for the maru:
p0001a05(00)-ghi.jklmn.op.rs.AB.
p0001a06(00)-CD.EFtuvw.xyz.
If your display is set to unicode, here is a real snippet from the CBETA canon:
p0001b16(00)| 念彌勒佛緣 念佛三昧緣
p0001b17(00)| 普敬述意緣第一
p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。
p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。
And I am searching for, e.g., 揚之德故十, which runs across the line breaks
and maru.
>>end of review----------------------
Solution:
Following Steven's (and others') general approach, I simply build a search
argument with optional punctuation, newline, and line-number characters
between each and every kanji (the $w character class below). The hard part
was that newline handling is not something perl takes care of in quite such
an easy way. In fact, the text is from DOS and hence has DOS \015\012 line
breaks. Looking at it in emacs shows a simple \012, but perl sees both and
insists on having both \015 and \012 specified. As has so often happened
to me, I got all twisted up with newlines, especially when crossing
platforms. Once I figured that out, it was just a matter of learning enough
perl to get the syntax right.
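In case anyone else trips over the same thing, here is a tiny sketch of the
\015 gotcha, using the ABCDEF placeholders from the recap rather than the
real text:

#!/usr/bin/perl
# A DOS line break (\015\012) sits inside the phrase "ABCD" we want to find.
my $dos_text = "ghAB\015\012CDEFtu";
print "misses\n"  unless $dos_text =~ /AB[\n]*CD/;      # \n alone stops at the \015
print "matches\n" if     $dos_text =~ /AB[\n\015]*CD/;  # allowing \015 as well finds it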
I slurp in the whole file with -0777 (thanks Edward), set my special
ignore-this pattern in $w in the BEGIN block, and loop over all the files
in the glob (thanks to the -n switch). The /xo modifiers on the perl match
are there so I can put in whitespace for readability (/x) and avoid
recompiling the search argument each time the $w variable is interpolated
(/o). I do have to remember to print out the name of the current file, $ARGV!
Here is my little perl-lette, already set up for a particular search (not
the one above, sorry). It lives in my ~/bin and is invoked as: cbsearch
fileglob
#!/usr/bin/perl -0777 -n
# $w matches any run of the characters to ignore: the p0001b18(00)| style
# line prefixes, the maru, spaces, and the DOS \015\012 line breaks.
BEGIN {$w = '[0-9pabc()|。 \n\015]*'}
if (/\n$w.*
相$w弟$w子$w有$w稱$w揚$w之$w德$w
故十
.*/xo){print $ARGV, $&, "\n";}
It prints out (for this example) the file name and the matching text from the file:
t54n2123.txt
p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。
p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。
This kind of thing, to put it mildly, is fabulously useful to me.
The ugly part is that I have to go edit the perl script file each time,
and do a little emacs deal to insert the $w between each kanji. Still,
it works!
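A sketch of how that hand-editing could go away (I have not run this against
the whole canon, and the name cbsearch2 is just for illustration): take the
phrase as the first command-line argument and let perl splice $w between the
characters itself.

#!/usr/bin/perl
# Sketch only: build the search pattern from a phrase given on the command
# line instead of editing the script.  Usage (hypothetical):
#   cbsearch2 揚之德故十 fileglob
use strict;
use warnings;
use Encode qw(decode encode);

die "usage: cbsearch2 <kanji phrase> <files>\n" unless @ARGV >= 2;

my $w       = '[0-9pabc()|。 \n\015]*';                   # same ignore-this class as before
my $phrase  = decode('UTF-8', shift @ARGV);               # work per character, not per byte
my $pattern = join $w, map { encode('UTF-8', $_) } split //, $phrase;

local $/;                                                 # slurp each file whole
for my $file (@ARGV) {
    open my $fh, '<', $file or next;
    my $text = <$fh>;
    close $fh;
    if ($text =~ /\n$w.*$pattern.*/o) {
        print $file, $&, "\n";                            # $& begins with the matched newline
    }
}

That way the only thing that changes from search to search is the command line.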
But hmm, it is slow: a good 60 seconds for the above example on my
three-year-old Toshiba laptop.
Any suggestions about speeding it up would be appreciated. I have looked at
Namazu a bit, but it's not clear to me that it is set up for this kind of
thing. It's not really words we are talking about here, and the point is
to ignore punctuation, not use it to build syntactic units the way Namazu
does. (These texts are not punctuated in the original, and old writers
quote them either without punctuation or with some they make up themselves.)
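One thing I may try for speed, though this is only a sketch I have not
timed: normalize each file once, deleting everything the search ignores,
and use those copies as a quick first pass to narrow down which files are
worth running the slow regex over. The ".norm" suffix is just an example,
and the character class is copied from my script above.

#!/usr/bin/perl
# Sketch, untested on the full canon: write a normalized copy of each file
# with the ignorable characters deleted, so a plain fixed-string grep can
# serve as a fast first pass.
use strict;
use warnings;
use utf8;                                      # the character class below contains 。
use Encode qw(decode encode);

for my $file (@ARGV) {
    open my $in, '<', $file or next;
    local $/;                                  # slurp the whole file
    my $raw = <$in>;
    close $in;
    next unless defined $raw;
    my $text = decode('UTF-8', $raw);          # delete at the character level, not byte level

    $text =~ s/[0-9pabc()|。 \n\015]+//g;       # drop everything the search would skip

    open my $out, '>', "$file.norm" or die "cannot write $file.norm: $!";
    print {$out} encode('UTF-8', $text);
    close $out;
}

Then something like grep -l 揚之德故十 *.norm should list the candidate files
quickly, and the slow regex only has to run over those.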
Steven, are you serious that you can do something like this with egrep and
elisp? That would be great; I would love to hear more.
Thanks everyone, especially for all the perl from Edward.
David Riggs, Kyoto