tlug.jp Mailing List Archive

Re: [tlug] Nasty Problem: searching for strings that span newlines
- Date: Fri, 13 Jan 2006 23:14:48 -0500
- From: Jim <jep200404@example.com>
- Subject: Re: [tlug] Nasty Problem: searching for strings that span newlines
- References: <200601130511.k0D5BxWg015897@example.com> <43C84B5A.7000703@example.com>
David Riggs wrote:

> I need to find short ... strings in a giant haystack of texts. Grep
> does not work because the haystack (the CBETA canon of Buddhist texts)
> adds punctuation characters, and inserts newline characters and line
> numbers.
> ... string which spans a new line, ... I am not sure that sed ...

Matches that span newlines are the hard part. Punctuation is not a big
deal. Line numbers are a bit more work, but spanning newlines is the
nasty part. There are workarounds; many are awkward. For the cleaner
solutions, see "Anchoring", p. 81 of Mastering Regular Expressions (MRE)
by Jeffrey E. F. Friedl:

    http://www.oreilly.com/catalog/regex/

Matching strings that span newlines is a common limitation of classic
regex tools. MRE says that in ed and grep, the regex always works one
line at a time (i.e., they don't play well with strings that span
lines). "Other tools, however, often allow arbitrary text to be
matched." p. 82 of MRE has a chart showing that dot matches a newline
in Tcl, sed, and awk. p. 81 of MRE: "Perl can do both line- and
string-oriented matches, ..." I would look hard at those tools,
especially Perl. Perl regexes are known for being more powerful than
most other regex flavors. There are many subtleties in regex work,
especially with the serious complication of spanning newlines.

> One approach is to make a pipeline to grep for the first kanji, strip
> out the punctuation characters with sed, and search again for the rest
> of the kanji:
>
> grep first-ji * | sed -e 's/[,;]//g' | grep later-kanjis

    grep 'first-ji[,;]*later-jis'

would be as effective and have fewer false positives (such as when
later-kanjis precede first-ji on a line), but neither your grep nor my
grep would find strings that span newlines. I'm not good at regexes, so
someone else will have to correct my regex mistakes.
If you can get sed to suck in a whole file in one gulp (as opposed to a
line at a time), I would consider something like (using GNU sed's -E
for extended regexes; I've written the intended "optional whitespace"
as [[:space:]]*):

    sed -E 's/first-ji(([,;]|\n[[:space:]]*[0-9]+:[[:space:]]*)*)second-ji(([,;]|\n[[:space:]]*[0-9]+:[[:space:]]*)*)third-ji/markerfirst-ji\1second-ji\3third-ji/g' <file \
        | grep -A 2 marker | sed -e 's/marker//g'

where marker is some fairly unique string. I might also consider
something like:

    tr -d ';,' <file | sed -E 's/^[[:space:]]*[0-9]+:[[:space:]]*//' \
        | tr '\n' '%' | sed -E 's/first-ji(%*)second-ji(%*)third-ji/markerfirst-ji\1second-ji\2third-ji/g' \
        | tr '%' '\n' | grep -A 2 'marker' | sed -e 's/marker//g'

Of course, that doesn't handle '%'s in the file well. The following
will strip the newlines (or any other single-byte characters) easily
enough:

    tr -d ',;\n\r' | grep first-jilater-kanjis...

but the lines would get too long for grep. Also, if each file becomes
one big long line, then grep would always report a match as being on
line 1, which is not very meaningful.

I would NOT use tr to delete multi-byte characters. tr is one tool for
which I would expect UTF-8 not to solve the multi-byte character
issues. However, tr would be fine for deleting single-byte characters
from a UTF-8 stream. If all the punctuation and newline characters you
would consider deleting are single-byte characters in UTF-8, then tr
would be OK.

> I need to find short ... strings in a giant haystack of texts. Grep
> does not work because the haystack (the CBETA canon of Buddhist texts)
> adds punctuation characters, and inserts newline characters and line
> numbers.

Examples of such short strings and a good-sized (few paragraphs) chunk
of the text (with the annoying punctuation, newline characters, and
line numbers) you're searching through would help. The more we know
about your data, the more the searches can be optimized.

By the way, Jeffrey E. F. Friedl worked in Kyoto for eight years and
has some CJK sprinkled through the book.
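On getting sed to "suck in a whole file in one gulp": with GNU sed one
way is the -z option, which splits input on NUL bytes instead of
newlines, so an ordinary text file arrives as a single record. A
minimal sketch, with a hypothetical haystack.txt standing in for the
real data (GNU sed only; POSIX sed would need an N-loop instead):

```shell
# Hypothetical sample: the target string is broken by punctuation, a
# newline, and a "12:" line number, as described in the post above.
printf 'first-ji,\n12: second-ji\n' > haystack.txt

# -z: NUL-separated records, so the whole file is one pattern space
# and \n in the regex can match the embedded newlines.
# -E: extended regexes, so (...) and + work unescaped.
sed -zE 's/first-ji([,;]|\n[[:space:]]*[0-9]+:[[:space:]]*)*second-ji/MATCHED/' haystack.txt
```

On the sample data this prints "MATCHED", confirming the match crossed
the newline; in the real pipeline the replacement would be the marker
trick shown above rather than a plain "MATCHED".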
I don't know regexes well, so I look forward to learning from others'
corrections of my junk regexes, and from others' completely different
solutions that are just better.