TLUG Mailing List

Mailing List Archive

Re: [tlug] searching for kanji strings, ignore punctuation and endof lines

Date: Mon, 16 Jan 2006 18:56:27 +0900

From: "Stephen J. Turnbull" <stephen@example.com>

Subject: Re: [tlug] searching for kanji strings, ignore punctuation and endof lines

References: <43CB4F48.1060200@example.com>

Organization: The XEmacs Project

User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b24 (dandelion, linux)
>>>>> "David" == David Riggs <dariggs@example.com> writes:

    David> The line numbers are easy to ignore, they are a fixed set
    David> of [-0-9()pabc], and the output of grep will include the
    David> file name, also a fixed set of 0-9 and ascii letters.  BUT,
    David> I need to get that file name and the line number!

egrep will emit both the file name and the line number on each
matching line if you give it multiple files and the -n flag (with GNU
grep, -H forces the file name even for a single file).

Since the line numbering and newlines are ASCII, in perl (python,
ruby, elisp) you could do

[[:kanji1:]][\000-\177[:maru:]]*[[:kanji2:]][\000-\177[:maru:]]*[[:kanji3:]]

where [:xyz:] is pseudo-code for a specific named character.  This
will wrap around lines, since \012 is in the ASCII range.  Writing the
perl to take a string and convert it to a regexp like the above is
beyond me, though (perl is a 4-letter word, that's why I use egrep and
elisp).
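
For what it's worth, the string-to-regexp conversion described above
might look roughly like this in python.  This is only a sketch: the
names FILLER, kanji_pattern and find are made up for illustration, and
the punctuation listed in FILLER is an assumption about what occurs in
the data, so adjust it to taste.

```python
import re

# Assumption: anything ASCII (line numbers, spaces, newlines) plus a few
# Japanese punctuation marks may appear between two kanji of the query.
FILLER = r"[\x00-\x7f。、「」・]*"

def kanji_pattern(query):
    """Compile a regexp matching the characters of `query` in order,
    with any amount of filler between consecutive characters."""
    return re.compile(FILLER.join(re.escape(ch) for ch in query))

def find(query, text):
    """Yield (line_number, matched_text) for each match in `text`.
    The line number is 1-based and refers to where the match starts."""
    for m in kanji_pattern(query).finditer(text):
        yield text.count("\n", 0, m.start()) + 1, m.group(0)

# The query wraps across a line break and a line label here.
sample = "12a 東京、\n12b 大学。\n"
for line_no, matched in find("東京大学", sample):
    print(line_no, repr(matched))
```

Since newline falls inside the ASCII range, a match can start on one
line and finish on the next; the line number reported is that of the
first kanji.  Reading each file into one string and printing the file
name alongside would give the file:line information asked for.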

    David> But, from the silence on the second part of my question, I
    David> guess there is no pre-index program that would handle this
    David> kind of thing and do it in a flash? Even a simple search on
    David> my data set takes a while.

namazu and FreeWAIS come to mind.  namazu is pretty common; Frank
Bennett at Nagoya U is an expert on FreeWAIS.  Be aware that your
indices are likely to be bigger than your corpus unless you're very
slick with data structures.
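
To see why a naive index outgrows the corpus, here is a toy
character-position index in python.  It is not how namazu or FreeWAIS
work internally (I make no claim about that); build_index is a
hypothetical name, and indexing every non-ASCII character is an
assumption made to keep the sketch short.

```python
from collections import defaultdict

def build_index(text):
    """Map each non-ASCII character to the list of offsets where it occurs."""
    index = defaultdict(list)
    for offset, ch in enumerate(text):
        if ord(ch) > 0x7F:          # skip ASCII line labels and newlines
            index[ch].append(offset)
    return index

corpus = "1a 東京、\n1b 大学。\n2a 京都。\n"
index = build_index(corpus)

# Offsets at which a search for a string starting with this kanji could begin.
print(index["京"])

# One stored offset per indexed character: this is where the space goes.
print(sum(len(offsets) for offsets in index.values()))
```

Each indexed character costs one stored offset, and an offset into a
large corpus typically takes more bytes than the character it points
at, hence the warning about index size.  The payoff is that a search
only has to look near the offsets listed for its first kanji instead
of scanning everything.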

You might also look at agrep, but last I checked it didn't know about
multibyte characters.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
References:

[tlug] searching for kanji strings, ignore punctuation and end of lines
From: David Riggs
