- Date: Wed, 18 Jan 2006 17:28:06 +0900
- Subject: [tlug] searching for kanji strings, ignore punctuation and end of lines: Perl Solution and comments
Thanks for help from Edward, Steven, Josh et al. I have a solution to my surprisingly pesky problem (see the solution after the >>). To recap the problem:

<<review---------
I have a quote that is just a string of kanji, and I am looking for where it came from. I do have an etext version of the canon (several hundred megabytes and thousands of files), in utf8, which most likely contains this phrase. The problem is that the etext inserts a special "space" or a maru (i.e. a unicode period, little circle) at random places, trying to make it easier to read but making the phrase impossible to find with grep, and it breaks lines at unlikely places. I can assume that two lines is enough to look at, and there are actually no ascii white spaces, just those two unicode characters that get in the way.

Example, using ABCDEF for a six kanji phrase I am looking for, and "ghijklmnopq..." for other kanji that happen to be on the line. And "." for the maru:

p0001a05(00)
p0001a06(00)

If you are set to unicode, here is a real snippet from the CBETA canon:

p0001b16(00)| 念彌勒佛緣 念佛三昧緣
p0001b17(00)| 普敬述意緣第一
p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。
p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。

And I am searching for e.g. 揚之德故十 , which goes over line breaks and maru.
>>end of review----------------------

Solution:

Following Steven's (and others') general approach, I simply make a search argument with optional punctuation, newline, and line number characters between each and every kanji (the $w = [--] below). The hard part was that newline processing is not taken care of by perl in quite such an easy way. In fact, the text is from DOS and hence has DOS \015\012 line breaks. Looking at this in emacs it shows as a simple \012, but perl sees both characters and insists on having both \015 and \012 specified. As has so often happened to me, I got all twisted up with newlines, especially when crossing platforms. Once I figured that out, it was just a matter of learning enough perl to figure out the syntax.
I slurp in the whole file with -0777 (thanks Edward), set my special ignore-this string to $w in the BEGIN block, and loop over all the globbed files (thanks to the -n switch). The /xo switches on the perl match are there so I can put in white space for readability, and so the search argument is not recompiled each time for the $w variable. I do have to remember to print out the name of the current file, $ARGV!

Here is my little perl-lette, already set for a particular search (not the one above, sorry). Put in my ~/bin and invoked with:

-> cbsearch fileglob

#!/usr/bin/perl -0777 -n
BEGIN {$w = '[0-9pabc()|。 \n\015]*'}
if (/\n$w.* 相$w弟$w子$w有$w稱$w揚$w之$w德$w 故十 .*/xo){print $ARGV, $&, "\n";}

It prints out (for this example) the file name and the text in the file:

t54n2123.txt
p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。
p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。

This kind of thing, to put it mildly, is fabulously useful to me. The ugly part is that I have to go edit the perl script file each time, and do a little emacs deal to insert the $w between each kanji. Still, it works!

But hmm, slow. A good 60 seconds for the above example, on my three year old Toshiba laptop. Any suggestions about speeding it up would be appreciated. I have looked at Namazu a bit, but it's not clear to me that it is set up for this kind of thing. It's not really words we are talking about here, and the point is to ignore punctuation, not use it to make syntactic units like Namazu does. (These texts are not punctuated in the original, and old writers quote them either without punctuation or making up some of their own.)

Steven, are you serious, can you do something like this with egrep and elisp? That would be great. I would love to hear more.

Thanks everyone, especially for all the perl from Edward.

David Riggs, Kyoto
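
P.S. On the hand-editing step: the pattern with the ignore-this class between each kanji could probably be built by perl itself from the plain phrase. What follows is only a minimal sketch under some assumptions, not a tested replacement for the script. The helper name build_pattern and the inline sample text are made up for illustration, and the sketch assumes the phrase and the text are already character strings (here via "use utf8" on literals) rather than raw bytes read from a file or the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # the kanji literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');    # print kanji without "wide character" warnings

# Same ignore-this class as $w in the post: line-number characters,
# the bar, the maru, a space, and the DOS/Unix line-end characters.
my $IGNORE = '[0-9pabc()|。 \n\015]*';

# Hypothetical helper: put $IGNORE between each pair of characters
# in the phrase and return the result as a compiled regex.
sub build_pattern {
    my ($phrase) = @_;
    my @chars = map { quotemeta } split //, $phrase;
    my $body  = join $IGNORE, @chars;
    return qr/$body/;
}

# Two lines from the CBETA snippet quoted in the post, with DOS line ends.
my $text = "p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。\015\012"
         . "p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。\015\012";

my $re = build_pattern('揚之德故十');

# Report the matched stretch, which spans the maru and the line break.
if ($text =~ /($re)/) {
    print "found: $1\n";
}
```

To use this on the real files, the phrase taken from the command line and the slurped file contents would both need decoding first (for example with Encode::decode('UTF-8', ...)), otherwise split would cut the kanji into bytes. Whether this is any faster than the hand-edited version is not something the sketch addresses.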
