Re: [tlug] search for fulltext-searchengine



On 2008-05-19 04:44 +0200 (Mon), Christian Horn wrote:

> namazu.org looks nice for a fulltext-searchengine.

We (Starling Software) recently built a new full-text search engine for
a web site that searched a relatively small (probably under a million
words) corpus of Japanese text. We used namazu for this, so I thought I'd
share a few experiences with you.

First, the serious pain in the you-know-what for us is that namazu is
designed only to read from files. The code for the index generator is a
rather nasty, intertwined mess, and we ended up exporting all the data
from an RDBMS into a file hierarchy with files named for primary keys
from the database, running the generator, and then post-processing the
results to generate the particular fields we needed to be returned from
the search query.
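
To make the export step concrete, here is a minimal sketch of the idea, not the code we actually ran. It assumes a hypothetical SQLite table `documents(id, body)` and invented function names, writes one file per row named after the primary key, and shows how a key can be recovered from a file name that comes back in a search hit.

```python
import sqlite3
from pathlib import Path


def export_rows(conn, out_dir):
    """Write one text file per row, named after the row's primary key."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Hypothetical schema: documents(id INTEGER PRIMARY KEY, body TEXT).
    query = "SELECT id, body FROM documents ORDER BY id"
    for doc_id, body in conn.execute(query):
        path = out_dir / f"{doc_id}.txt"
        # EUC-JP is assumed here because it is the encoding the indexer
        # handles without extra work.
        path.write_text(body, encoding="euc_jp")
        written.append(path)
    return written


def key_from_path(path):
    """Recover the primary key from a file name reported by the search."""
    return int(Path(path).stem)
```

The file-based indexer would then be pointed at the output directory, and the post-processing step would use something like `key_from_path` to join hits back to database rows.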

For you, however, the standard namazu way of building the search index
might actually work, since it does have the ability to have various
parsers for different kinds of files, based on MIME type, so you can
probably get it to index PDF, Word etc. files without too serious an
amount of work.
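
As a rough illustration of what "parsers chosen by MIME type" means, the sketch below dispatches on a guessed MIME type using Python's standard `mimetypes` module. It is not namazu's filter code; the `FILTERS` registry and the function names are invented for the example, and only a plain-text handler is shown.

```python
import mimetypes


def read_plain_text(path):
    """Handler for text/plain: return the file contents as a string."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


# Hypothetical registry mapping MIME types to text extractors. A real
# setup would add handlers for PDF, Word and so on.
FILTERS = {
    "text/plain": read_plain_text,
}


def extract_text(path):
    """Return extracted text, or None if no handler is registered."""
    mime, _encoding = mimetypes.guess_type(str(path))
    handler = FILTERS.get(mime)
    if handler is None:
        return None
    return handler(path)
```

Files whose type has no registered handler are simply skipped (the function returns `None`), which is one plausible policy among several.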

We also customized the Ruby version of the query engine to do our own
queries, pulling out only the particular data we needed. That was a
lot easier than trying to deal with the index generator, but the code
was still pretty poorly structured, and we ended up learning it right
down to the index format and replacing many parts of the query code
completely.

Namazu at this point appears to easily support only EUC-JP. We'll
probably be looking at doing a UTF-8 version of it at some point, but
I'm not looking forward to that.
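
Short of a real UTF-8 port, one possible stopgap is to transcode at the boundary: convert UTF-8 input to EUC-JP before it reaches the index or the query code, and convert results back. The sketch below assumes that approach and uses only Python's built-in `euc_jp` codec; the function names are made up, and characters with no EUC-JP representation are replaced rather than preserved.

```python
def utf8_to_eucjp(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to EUC-JP, replacing unencodable characters."""
    text = data.decode("utf-8")
    return text.encode("euc_jp", errors="replace")


def eucjp_to_utf8(data: bytes) -> bytes:
    """Transcode EUC-JP bytes (for example, search results) back to UTF-8."""
    return data.decode("euc_jp").encode("utf-8")
```

This loses anything outside the EUC-JP repertoire, so it is a workaround rather than a substitute for proper UTF-8 support.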

So basically, if you want to use exactly what namazu comes with, you
may merely have to deal with some sysadmin pain of setting it up. If it
doesn't work quite as you want, you could well have a somewhat nasty
programming task in front of you.

We might be interested in collaborating with someone working on a better
version of Namazu; in particular one that's much more modular. But you'd
want to bring either a bunch of programming man-hours or some money
to the table; we're not very interested in doing most of it by
ourselves at this point.

One thing that would really help would be documentation of the namazu
file formats and algorithms to build the index and do lookups. We have
some documentation for some of this, if anybody wants it.

cjs
-- 
Curt Sampson       <cjs@example.com>        +81 90 7737 2974   
Mobile sites and software consulting: http://www.starling-software.com
