
Mailing List Archive
Re: [tlug] searching for kanji strings, ignore punctuation and endof lines
- Date: Mon, 16 Jan 2006 18:04:19 +0900
- From: Edward Middleton <edward@example.com>
- Subject: Re: [tlug] searching for kanji strings, ignore punctuation and endof lines
- References: <43CB4F48.1060200@example.com>
- User-agent: Mail/News 1.5 (X11/20060113)

David Riggs wrote:
> To clarify: this all assumes utf-8 in data and in locale.
>
> I have a quote that is just a
> string of kanji, and I am looking for where it came from. I do have an
> etext version of the canon (several hundred megabytes and thousands of
> files), in utf8, which most likely contains this phrase.
>
> The problem is that the etext inserts a special "space" or a maru
> (i.e. a unicode period, little circle) at random places, trying to
> make it easier to read, and making it impossible to find with grep.
>
> I can assume that two lines is enough to look at, and there is
> actually no ascii white space, just those two unicode characters that
> get in the way.
>
> Example, using ABCDEF for a six kanji phrase I am looking for, and
> "ghijklmnopq..." for other kanji that happen to be on the line.  And
> "." for the maru:
>
> p0001a05(00)-ghi.jklmn.op.rs.AB.
> p0001a06(00)-CD.EFtuvw.xyz.
Since you have to do it with a one-line Perl script, something like this should work:
perl -0777 -nle \
's/((?:\np0001a0..00..)*[^\n]*A(?:\np0001a0..00..|\.)*B(?:\np0001a0..00..|\.)*C(?:\np0001a0..00..|\.)*D(?:\np0001a0..00..|\.)*E(?:\np0001a0..00..|\.)*F[^\n]*)/\n--start--$1\n--finish--/m;print' file.txt
This prints the file with the first matching lines bracketed by markers:
--start--
p0001a05(00)-ghi.jklmn.op.rs.AB.
p0001a06(00)-CD.EFtuvw.xyz.
--finish--
Edward