Mailing List Archive
Re: [tlug] searching for kanji strings, ignore punctuation and end of lines
- Date: Mon, 16 Jan 2006 18:04:19 +0900
- From: Edward Middleton <edward@example.com>
- Subject: Re: [tlug] searching for kanji strings, ignore punctuation and end of lines
- References: <43CB4F48.1060200@example.com>
- User-agent: Mail/News 1.5 (X11/20060113)
David Riggs wrote:
> To clarify: this all assumes utf-8 in data and in locale.
>
> I have a quote that is just a
> string of kanji, and I am looking for where it came from. I do have an
> etext version of the canon (several hundred megabytes and thousands of
> files), in utf8, which most likely contains this phrase.
>
> The problem is that the etext inserts a special "space" or a maru
> (i.e. a unicode period, little circle) at random places, trying to
> make it easier to read, and making it impossible to find with grep.
>
> I can assume that two lines is enough to look at, and there is
> actually no ascii white space, just those two unicode characters that
> get in the way.
>
> Example, using ABCDEF for a six kanji phrase I am looking for, and
> "ghijklmnopq..." for other kanji that happen to be on the line. And
> "." for the maru:
>
> p0001a05(00)-ghi.jklmn.op.rs.AB.
> p0001a06(00)-CD.EFtuvw.xyz.

Since it has to be done with a one-line Perl script:

    perl -0777 -nle 's/((?:\np0001a0..00..)*[^\n]*A(?:\np0001a0..00..|\.)*B(?:\np0001a0..00..|\.)*C(?:\np0001a0..00..|\.)*D(?:\np0001a0..00..|\.)*E(?:\np0001a0..00..|\.)*F[^\n]*)/\n--start--$1\n--finish--/m;print' file.txt

The -0777 switch reads the whole file as a single string, so the pattern can run across the line break. Between each pair of letters it allows any number of "." characters and line prefixes such as "p0001a05(00)-". It prints the file with the matching lines bracketed by --start-- and --finish--:

    --start--
    p0001a05(00)-ghi.jklmn.op.rs.AB.
    p0001a06(00)-CD.EFtuvw.xyz.
    --finish--

Edward
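
Typing the skip pattern by hand between every character gets tedious for longer phrases, so the same idea can be generated from the phrase itself. The following is a minimal sketch in Python rather than Perl, not part of the original reply. The names FILLER, LINE_PREFIX, build_pattern and find_phrase are hypothetical, and the filler characters (U+3002 for the maru, U+3000 for the special space, plus the "." stand-in from the example) and the line-prefix format are assumptions read off the two sample lines.

```python
import re

# ASSUMPTION: characters the etext sprinkles between kanji. "." is the
# stand-in used in the example; U+3002 is the ideographic full stop (maru)
# and U+3000 the ideographic space.
FILLER = r"[.\u3002\u3000]"

# ASSUMPTION: every line starts with a prefix shaped like "p0001a05(00)-".
LINE_PREFIX = r"\np\d{4}[a-z]\d{2}\(\d{2}\)-"


def build_pattern(phrase):
    """Allow any run of fillers or line breaks (with prefix) between characters."""
    skip = f"(?:{FILLER}|{LINE_PREFIX})*"
    return re.compile(skip.join(re.escape(ch) for ch in phrase))


def find_phrase(text, phrase):
    """Return a (first_line, last_line) pair, 1-based, for every hit in text."""
    hits = []
    for m in build_pattern(phrase).finditer(text):
        first = text.count("\n", 0, m.start()) + 1
        last = text.count("\n", 0, m.end()) + 1
        hits.append((first, last))
    return hits


sample = (
    "p0001a05(00)-ghi.jklmn.op.rs.AB.\n"
    "p0001a06(00)-CD.EFtuvw.xyz.\n"
)
print(find_phrase(sample, "ABCDEF"))  # → [(1, 2)]
```

For the real corpus, each file would be read whole as UTF-8 text and passed to find_phrase, which mirrors what -0777 does in the Perl version. The prefix pattern would need adjusting if the actual files number their lines differently.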
- References:
- [tlug] searching for kanji strings, ignore punctuation and end of lines
- From: David Riggs