TLUG Mailing List

Re: [tlug] searching for kanji strings, ignore punctuation and end of lines

Date: Mon, 16 Jan 2006 18:04:19 +0900

From: Edward Middleton <edward@example.com>

Subject: Re: [tlug] searching for kanji strings, ignore punctuation and end of lines

References: <43CB4F48.1060200@example.com>

User-agent: Mail/News 1.5 (X11/20060113)

David Riggs wrote:
> To clarify: this all assumes utf-8 in data and in locale.
>
> I have a quote that is just a
> string of kanji, and I am looking for where it came from. I do have an
> etext version of the canon (several hundred megabytes and thousands of
> files), in utf8, which most likely contains this phrase.
>
> The problem is that the etext inserts a special "space" or a maru
> (i.e. a unicode period, little circle) at random places, trying to
> make it easier to read, and making it impossible to find with grep.
>
> I can assume that two lines is enough to look at, and there is
> actually no ascii white space, just those two unicode characters that
> get in the way.
>
> Example, using ABCDEF for a six kanji phrase I am looking for, and
> "ghijklmnopq..." for other kanji that happen to be on the line.  And
> "." for the maru:
>
> p0001a05(00)-ghi.jklmn.op.rs.AB.
> p0001a06(00)-CD.EFtuvw.xyz.
Since it has to be done with a one-line Perl script:
# -0777 slurps the whole file, so the match can cross the line break.
perl -0777 -nle \
's/((?:\np0001a0..00..)*[^\n]*A(?:\np0001a0..00..|\.)*B(?:\np0001a0..00..|\.)*C(?:\np0001a0..00..|\.)*D(?:\np0001a0..00..|\.)*E(?:\np0001a0..00..|\.)*F[^\n]*)/\n--start--$1\n--finish--/m;print' file.txt

will give you the lines bracketed by
--start--
p0001a05(00)-ghi.jklmn.op.rs.AB.
p0001a06(00)-CD.EFtuvw.xyz.
--finish--

Edward