tlug.jp Mailing List Archive

Re: [tlug] Nasty Problem: searching for strings that span newlines
- Date: Fri, 13 Jan 2006 23:14:48 -0500
- From: Jim <jep200404@example.com>
- Subject: Re: [tlug] Nasty Problem: searching for strings that span newlines
- References: <200601130511.k0D5BxWg015897@example.com> <43C84B5A.7000703@example.com>
David Riggs wrote:

> I need to find short ... strings in a giant haystack of texts. Grep
> does not work because the haystack (the CBETA canon of Buddhist texts)
> adds punctuation characters, and inserts newline characters and line
> numbers.
> ... string which spans a new line, ... I am not sure that sed ...

Matches that span newlines are the hard part. Punctuation is not a big
deal. Line numbers are a bit more work, but spanning newlines is the
nasty part. There are workarounds; many are awkward. For the cleaner
solutions, see "Anchoring", p. 81 of Mastering Regular Expressions (MRE)
by Jeffrey E. F. Friedl:

    http://www.oreilly.com/catalog/regex/

Matching strings that span newlines is a common limitation of classic
regex tools. MRE says that in ed and grep, the regex always works one
line at a time (i.e., they don't play well with strings that span
lines). "Other tools, however, often allow arbitrary text to be
matched." p. 82 of MRE has a chart showing that dot matches a newline
in Tcl, sed, and awk. p. 81 of MRE: "Perl can do both line- and
string-oriented matches, ..." I would look hard at those tools,
especially Perl. Perl regexes are known for being more powerful than
most other regex flavors. There are many subtleties in regex work,
especially with the serious complication of spanning newlines.

> One approach is to make a pipeline to grep for the first kanji, strip
> out the punctuation characters with sed, and search again for the rest
> of the kanji:
>
> grep first-ji * | sed -e 's/[,;]//g' | grep later-kanjis

    grep 'first-ji[,;]*later-jis'

would be as effective and have fewer false positives (such as when
later-kanjis precede first-ji on a line), but neither your grep nor my
grep would find strings that span newlines. I'm not good at regexes, so
someone else will have to correct my regex mistakes.
If you can get sed to suck in a whole file in one gulp (as opposed to a
line at a time), I would consider something like (using GNU sed's -E
for extended regexes; I've written the intended "optional whitespace"
as [[:space:]]*):

    sed -E 's/first-ji(([,;]|\n[[:space:]]*[0-9]+:[[:space:]]*)*)second-ji(([,;]|\n[[:space:]]*[0-9]+:[[:space:]]*)*)third-ji/markerfirst-ji\1second-ji\3third-ji/g' <file \
        | grep -A 2 marker | sed -e 's/marker//g'

where marker is some fairly unique string. I might also consider
something like:

    tr -d ';,' <file | sed -E 's/^[[:space:]]*[0-9]+:[[:space:]]*//' \
        | tr '\n' '%' | sed -E 's/first-ji(%*)second-ji(%*)third-ji/markerfirst-ji\1second-ji\2third-ji/g' \
        | tr '%' '\n' | grep -A 2 'marker' | sed -e 's/marker//g'

Of course, that doesn't handle '%'s in the file well. The following
will strip the newlines (or any other single-byte characters) easily
enough:

    tr -d ',;\n\r' | grep first-jilater-kanjis...

but the lines would get too long for grep. Also, if each file becomes
one big long line, then grep would always report a match as being on
line 1, which is not very meaningful.

I would NOT use tr to delete multi-byte characters. tr is one tool for
which I would expect UTF-8 not to solve the multi-byte character
issues. However, tr would be fine for deleting single-byte characters
from a UTF-8 stream. If all the punctuation and newline characters you
would consider deleting are single-byte characters in UTF-8, then tr
would be OK.

> I need to find short ... strings in a giant haystack of texts. Grep
> does not work because the haystack (the CBETA canon of Buddhist texts)
> adds punctuation characters, and inserts newline characters and line
> numbers.

Examples of such short strings and a good-sized (few paragraphs) chunk
of the text (with the annoying punctuation, newline characters, and
line numbers) you're searching through would help. The more we know
about your data, the more the searches can be optimized.

By the way, Jeffrey E. F. Friedl worked in Kyoto for eight years and
has some CJK sprinkled through the book.
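On getting sed to "suck in a whole file in one gulp": with GNU sed one
way is the -z option, which splits input on NUL bytes instead of
newlines, so an ordinary text file arrives as a single record. A
minimal sketch, with a hypothetical haystack.txt standing in for the
real data (GNU sed only; POSIX sed would need an N-loop instead):

```shell
# Hypothetical sample: the target string is broken by punctuation, a
# newline, and a "12:" line number, as described in the post above.
printf 'first-ji,\n12: second-ji\n' > haystack.txt

# -z: NUL-separated records, so the whole file is one pattern space
# and \n in the regex can match the embedded newlines.
# -E: extended regexes, so (...) and + work unescaped.
sed -zE 's/first-ji([,;]|\n[[:space:]]*[0-9]+:[[:space:]]*)*second-ji/MATCHED/' haystack.txt
```

On the sample data this prints "MATCHED", confirming the match crossed
the newline; in the real pipeline the replacement would be the marker
trick shown above rather than a plain "MATCHED".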
I don't know regexes well, so I look forward to learning from others'
corrections of my junk regexes, and from others' completely different
solutions that are just better.