Mailing List Archive



Re: [tlug] searching for kanji strings, ignore punctuation and end of lines



David Riggs wrote:
> To clarify: this all assumes utf-8 in data and in locale.
>
> I have a quote that is just a
> string of kanji, and I am looking for where it came from. I do have an
> etext version of the canon (several hundred megabytes and thousands of
> files), in utf8, which most likely contains this phrase.
>
> The problem is that the etext inserts a special "space" or a maru
> (i.e. a unicode period, little circle) at random places, trying to
> make it easier to read, and making it impossible to find with grep.
>
> I can assume that two lines is enough to look at, and there is
> actually no ascii white space, just those two unicode characters that
> get in the way.
>
> Example, using ABCDEF for a six kanji phrase I am looking for, and
> "ghijklmnopq..." for other kanji that happen to be on the line.  And
> "." for the maru:
>
> p0001a05(00)-ghi.jklmn.op.rs.AB.
> p0001a06(00)-CD.EFtuvw.xyz.
Since you want to do it with a one-line Perl script:

perl -0777 -nle 's/((?:\np0001a0..00..)*[^\n]*A(?:\np0001a0..00..|\.)*B(?:\np0001a0..00..|\.)*C(?:\np0001a0..00..|\.)*D(?:\np0001a0..00..|\.)*E(?:\np0001a0..00..|\.)*F[^\n]*)/\n--start--$1\n--finish--/m;print' file.txt

will give you the lines bracketed by
--start--
p0001a05(00)-ghi.jklmn.op.rs.AB.
p0001a06(00)-CD.EFtuvw.xyz.
--finish--

Edward

