tlug Mailing List Archive
[tlug] Re: tlug-digest Digest V2006 #28
- Date: Thu, 19 Jan 2006 09:49:20 +0900
- From: David Riggs <>
- Subject: [tlug] Re: tlug-digest Digest V2006 #28
- References: <>
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050420 Debian/1.7.7-2
> [tlug-digest] Re: [tlug] searching for kanji strings, ignore punctuation
> "Stephen J. Turnbull" <>
>>>>>>"David" == David Riggs <> writes:
> Perl probably has a split function; make the kanji string a variable
> (see below for why), and split it on "" which will give you an array
> of characters. Then do a join with "\$w".
>
> (defun mung-run-perl (kanji)
--snip--

Thanks for the Lisp. Just the trick!

> But 60 seconds is a long time. You really should find some way to get
> this indexed. Is there any restriction on the strings, or are they
> basically arbitrary sequences of CJK ideographs?

No restriction on the CJK ideographs, which I want to treat simply as a
sequence of UTF-8 characters in the CJK range. The key thing is that all
the added noise (which in another context is extremely valuable markup)
must be ignored.

I typically have a "quote" from an unknown text, which my guy (writing
in the mid-Edo period) is commenting on. He just plops down the string
of kanji, the way it really was. I.e., no punctuation, breaks, or
anything else, all of which are additions to the "real" text. And I need
to find that same string. The kindly added "maru", spaces, and line
numbers are just noise for this purpose.

It seems likely that something to do this with exists somewhere?
Of course I expect the index to be bigger than the data, but what's a
gigabyte or two when you are searching a canon? The additional speed
would be worth the space, and the all-night run to index it. The
Buddhist canon does not change very rapidly.

Thanks,
David Riggs
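For illustration, here is a minimal Python sketch of the two ideas discussed above. It is not the Perl or Lisp code from the thread. The first function follows the quoted suggestion: split the kanji string into characters and join them with a pattern that skips noise. The second strips the text down to ideographs once, keeping a map back to the original positions, so later searches are plain substring lookups. The function names, the sample text, and the definition of "noise" as anything outside U+4E00..U+9FFF are all assumptions made for this example.

```python
import re

# Assumption: "noise" is any character outside the basic CJK Unified
# Ideographs block (U+4E00..U+9FFF). Extension blocks are not handled here.
NOISE = r"[^\u4e00-\u9fff]*"


def noise_tolerant_pattern(kanji):
    """Split the query into characters and join them with a skip-noise pattern."""
    return re.compile(NOISE.join(re.escape(ch) for ch in kanji))


def build_stripped_index(text):
    """Return (ideographs_only, offsets).

    offsets[i] is the position in the original text of the i-th kept
    ideograph, so a hit in the stripped string can be mapped back.
    """
    kept = [(i, ch) for i, ch in enumerate(text) if "\u4e00" <= ch <= "\u9fff"]
    stripped = "".join(ch for _, ch in kept)
    offsets = [i for i, _ in kept]
    return stripped, offsets


def find_in_stripped(stripped, offsets, kanji):
    """Plain substring search on the stripped text.

    Returns the (start, end) span in the original text, or None.
    """
    pos = stripped.find(kanji)
    if pos < 0:
        return None
    return offsets[pos], offsets[pos + len(kanji) - 1] + 1


# Hypothetical sample: line numbers, a "maru", and a line break as noise.
text = "12 如是我聞。一時佛在\n13 舍衛國"
query = "我聞一時佛在舍衛國"

match = noise_tolerant_pattern(query).search(text)
stripped, offsets = build_stripped_index(text)
span = find_in_stripped(stripped, offsets, query)
```

On this sample both approaches locate the same span of the original text. The regular expression needs no preprocessing but rescans the whole text on every query. The stripped copy costs one pass up front and could then be fed to a real index (a suffix array, for instance), which is closer to the "index it overnight" approach.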
- Follow-Ups:
- Re: [tlug] Re: tlug-digest Digest V2006 #28
- From: sjs
- Re: [tlug] Re: tlug-digest Digest V2006 #28
- From: Stephen J. Turnbull