Re: [tlug] searching for kanji strings, ignore punctuation and end of lines



>>>>> "David" == David Riggs <dariggs@example.com> writes:

    David> The line numbers are easy to ignore, they are a fixed set
    David> of [-0-9()pabc], and the output of grep will include the
    David> file name, also a fixed set of 0-9 and ascii letters.  BUT,
    David> I need to get that file name and the line number!

egrep emits both the file name and the line number on each matching
line when you give it multiple files and the -n flag (with GNU grep,
-H forces the file name even for a single file).

Since the line numbering and newlines are ASCII, in perl (python,
ruby, elisp) you could do

[[:kanji1:]][\000-\177[:maru:]]*[[:kanji2:]][\000-\177[:maru:]]*[[:kanji3:]]

where [:xyz:] is pseudo-code for a specific named character.  This
will wrap around lines, since \012 is in the ASCII range.  Writing the
perl to take a string and convert it to a regexp like the above is
beyond me, though (perl is a 4-letter word, that's why I use egrep and
elisp).
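
In python the string-to-regexp conversion is short.  What follows is a
minimal sketch, not a tested tool: the names FILLER, kanji_regexp and
find_all are made up for illustration, and it assumes the text has
already been decoded to Unicode and that "maru" (U+3002) is the only
non-ASCII punctuation to skip.

```python
import re

# Filler allowed between the kanji: any ASCII character (digits, letters,
# punctuation, and the newline \012) plus the ideographic full stop "maru".
# Assumption: extend this class if other full-width punctuation should
# be skipped as well.
FILLER = "[\\x00-\\x7f\u3002]*"


def kanji_regexp(query):
    """Compile a regexp matching the characters of `query` in order,
    with any amount of filler allowed between consecutive characters."""
    return re.compile(FILLER.join(re.escape(ch) for ch in query))


def find_all(query, text):
    """Yield (line_number, matched_text) for each match in `text`.

    line_number is the physical line (counted from 1) on which the
    match starts; a match may run over into the following lines.
    """
    for m in kanji_regexp(query).finditer(text):
        yield text.count("\n", 0, m.start()) + 1, m.group(0)
```

Calling find_all("東京都", text) on the contents of one file gives the
starting line of every hit, so the file name can be attached by the
caller that read the file.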

    David> But, from the silence on the second part of my question, I
    David> guess there is no pre-index program that would handle this
    David> kind of thing and do it in a flash? Even a simple search on
    David> my data set takes a while.

namazu and FreeWAIS come to mind.  namazu is pretty common; Frank
Bennett at Nagoya U is an expert on FreeWAIS.  Be aware that your
indices are likely to be bigger than your corpus unless you're very
slick with data structures.
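
To make the size point concrete, here is a toy positional index in
python.  It is only a sketch of the general idea and says nothing about
how namazu or FreeWAIS work internally; build_index, lookup and
is_filler are hypothetical names, and treating ASCII plus "maru" as
ignorable filler is an assumption carried over from the regexp above.

```python
from collections import defaultdict

# Assumption: ASCII and the ideographic full stop are ignorable filler.
FILLER_EXTRA = {"\u3002"}


def is_filler(ch):
    return ord(ch) < 128 or ch in FILLER_EXTRA


def build_index(files):
    """files maps a file name to its text.

    Returns char -> {(file_name, ordinal): line_number}, where ordinal
    counts only the non-filler characters of that file.
    """
    index = defaultdict(dict)
    for name, text in files.items():
        ordinal = 0
        line_no = 1
        for ch in text:
            if ch == "\n":
                line_no += 1
            if is_filler(ch):
                continue
            index[ch][(name, ordinal)] = line_no
            ordinal += 1
    return index


def lookup(index, query):
    """Return sorted (file_name, line_number) pairs where `query` starts."""
    chars = [c for c in query if not is_filler(c)]
    if not chars:
        return []
    hits = []
    for (name, pos), line_no in index.get(chars[0], {}).items():
        # Every later character must sit at the next ordinal in the same file.
        if all((name, pos + i) in index.get(c, {}) for i, c in enumerate(chars)):
            hits.append((name, line_no))
    return sorted(hits)
```

Every non-filler character costs one entry holding a file name, an
ordinal and a line number, which is exactly how a naive index ends up
larger than the text it describes.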

You might also look at agrep, but last I checked it didn't know about
multibyte characters.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

