
[tlug] Re: tlug-digest Digest V2006 #28




> [tlug-digest] Re: [tlug] searching for kanji strings, ignore punctuation 
> "Stephen J. Turnbull" <stephen@example.com>

>>>>>>"David" == David Riggs <dariggs@example.com> writes:
> Perl probably has a split function; make the kanji string a variable
> (see below for why), and split it on "" which will give you an array
> of characters.  Then do a join with "\$w".
> 
> (defun mung-run-perl (kanji)--snip--
--

Thanks for the Lisp. Just the trick!
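
For the archive, here is roughly what that split-and-join trick comes
out to in Perl. This is only a sketch of the idea, not Stephen's
snipped code; the function name and the \p{Han} noise class are my own
choices:

    #!/usr/bin/perl
    use utf8;
    use strict;
    use warnings;

    # Turn a kanji string into a regex that matches the same characters
    # in order, tolerating any non-ideograph "noise" between them.
    sub kanji_pattern {
        my ($kanji) = @_;
        my @chars = split //, $kanji;   # one array element per character
        my $noise = '[^\p{Han}]*';      # skip anything that is not a CJK ideograph
        return join $noise, @chars;     # kanji carry no regex metacharacters
    }

    my $pat = kanji_pattern('諸行無常');    # hypothetical target string
    # Slurp each file and test with /$pat/; the noise class also eats
    # newlines, so a quote broken across lines still matches.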

> 
> But 60 seconds is a long time.  You really should find some way to get
> this indexed.  Is there any restriction on the strings, or are they
> basically arbitrary sequences of CJK ideographs?

No restriction on the CJK ideographs, which I want to treat as simply a 
sequence of CJK-range UTF-8 characters. The key thing is that all the 
added noise (which in another context is extremely valuable markup) 
must be ignored. I typically have a "quote" from an unknown text, which 
my guy (writing in the mid-Edo period) is commenting on. He just plops 
down the string of kanji the way it really was, i.e. with no 
punctuation, breaks, or anything else that is an addition to the "real" 
text. And I need to find that same string. The kindly added maru, 
spaces, and line numbers are just noise for this purpose.
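
To make the noise concrete, here is the kind of thing the pattern has
to survive, using the kanji_pattern sketch above (the target string and
the noisy line are invented, not from the canon):

    my $pat  = kanji_pattern('色即是空');   # hypothetical quote
    my $line = "色即是。空";                # a modern editor's maru in the middle
    print "hit\n" if $line =~ /$pat/;       # matches despite the maru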

It seems likely that something already exists to do this, somewhere? Of 
course I expect the index to be bigger than the data, but what's a 
gigabyte or two when you are searching a canon? The additional speed 
would be worth the space, and the all-night run to index it. The 
Buddhist canon does not change very rapidly.
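
In the meantime, a low-tech way to get most of the speed, sketched
below under my own assumptions (the file names and map format are made
up): strip the corpus down to bare ideographs once, remember where each
kept character came from, and then search the stripped copy with a
plain fixed-string match instead of a regex.

    use utf8;
    use strict;
    use warnings;

    # One-time pass: write a noise-free copy of the text plus an offset
    # map, so a hit in the clean copy can be traced back to the original.
    open my $in,  '<:encoding(UTF-8)', 'canon.txt'   or die $!;
    open my $out, '>:encoding(UTF-8)', 'canon.clean' or die $!;
    open my $map, '>', 'canon.map'                   or die $!;

    my $pos = 0;                         # character offset in the original
    while (my $line = <$in>) {
        for my $ch (split //, $line) {
            if ($ch =~ /\p{Han}/) {      # keep only the CJK ideographs
                print {$out} $ch;
                print {$map} "$pos\n";   # Nth clean char came from offset $pos
            }
            $pos++;
        }
    }

After that, index() on the slurped clean file (or a fixed-string grep)
is effectively instant next to a sixty-second regex scan, and the map
file turns a clean offset back into a place in the marked-up text.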


Thanks,

David Riggs


