
[tlug] Re: tlug-digest Digest V2006 #28




> [tlug-digest] Re: [tlug] searching for kanji strings, ignore punctuation 
> "Stephen J. Turnbull" <stephen@example.com>

>>>>>>"David" == David Riggs <dariggs@example.com> writes:
> Perl probably has a split function; make the kanji string a variable
> (see below for why), and split it on "" which will give you an array
> of characters.  Then do a join with "\$w".
> 
> (defun mung-run-perl (kanji)--snip--
--

Thanks for the Lisp. Just the trick!
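
For the archive, here is roughly what that split-and-join trick comes
out to in Perl. This is only a sketch of the idea, not Stephen's
snipped code; the function name and the \p{Han} noise class are my own
choices:

    #!/usr/bin/perl
    use utf8;
    use strict;
    use warnings;

    # Turn a kanji string into a regex that matches the same characters
    # in order, tolerating any non-ideograph "noise" between them.
    sub kanji_pattern {
        my ($kanji) = @_;
        my @chars = split //, $kanji;   # one array element per character
        my $noise = '[^\p{Han}]*';      # skip anything that is not a CJK ideograph
        return join $noise, @chars;     # kanji carry no regex metacharacters
    }

    my $pat = kanji_pattern('諸行無常');    # hypothetical target string
    # Slurp each file and test with /$pat/; the noise class also eats
    # newlines, so a quote broken across lines still matches.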

> 
> But 60 seconds is a long time.  You really should find some way to get
> this indexed.  Is there any restriction on the strings, or are they
> basically arbitrary sequences of CJK ideographs?

No restriction on the CJK ideographs, which I want to treat as simply a 
sequence of CJK-range UTF-8 characters. The key thing is that all the 
added noise (which in another context is extremely valuable markup) 
must be ignored. I typically have a "quote" from an unknown text, which 
my guy (writing in the mid-Edo period) is commenting on. He just plops 
down the string of kanji the way it really was, i.e. with no 
punctuation, breaks, or anything else that is an addition to the "real" 
text. And I need to find that same string. The kindly added maru, 
spaces, and line numbers are just noise for this purpose.
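
To make the noise concrete, here is the kind of thing the pattern has
to survive, using the kanji_pattern sketch above (the target string and
the noisy line are invented, not from the canon):

    my $pat  = kanji_pattern('色即是空');   # hypothetical quote
    my $line = "色即是。空";                # a modern editor's maru in the middle
    print "hit\n" if $line =~ /$pat/;       # matches despite the maru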

It seems likely that something already exists to do this, somewhere? Of 
course I expect the index to be bigger than the data, but what's a 
gigabyte or two when you are searching a canon? The additional speed 
would be worth the space, and the all-night run to index it. The 
Buddhist canon does not change very rapidly.
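
In the meantime, a low-tech way to get most of the speed, sketched
below under my own assumptions (the file names and map format are made
up): strip the corpus down to bare ideographs once, remember where each
kept character came from, and then search the stripped copy with a
plain fixed-string match instead of a regex.

    use utf8;
    use strict;
    use warnings;

    # One-time pass: write a noise-free copy of the text plus an offset
    # map, so a hit in the clean copy can be traced back to the original.
    open my $in,  '<:encoding(UTF-8)', 'canon.txt'   or die $!;
    open my $out, '>:encoding(UTF-8)', 'canon.clean' or die $!;
    open my $map, '>', 'canon.map'                   or die $!;

    my $pos = 0;                         # character offset in the original
    while (my $line = <$in>) {
        for my $ch (split //, $line) {
            if ($ch =~ /\p{Han}/) {      # keep only the CJK ideographs
                print {$out} $ch;
                print {$map} "$pos\n";   # Nth clean char came from offset $pos
            }
            $pos++;
        }
    }

After that, index() on the slurped clean file (or a fixed-string grep)
is effectively instant next to a sixty-second regex scan, and the map
file turns a clean offset back into a place in the marked-up text.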


Thanks,

David Riggs


