Mailing List Archive
Re: [tlug] search for fulltext-searchengine
- Date: Mon, 19 May 2008 17:30:07 +0900
- From: Curt Sampson <cjs@example.com>
- Subject: Re: [tlug] search for fulltext-searchengine
- References: <20080519024400.GA4769@fluxcoil.net>
- User-agent: Mutt/1.5.17 (2007-11-01)
On 2008-05-19 04:44 +0200 (Mon), Christian Horn wrote:

> namazu.org looks nice for a fulltext-searchengine.

We (Starling Software) recently built a new full-text search engine for a web site that searched a relatively small (probably under a million words) corpus of Japanese text. We used namazu for this, so I thought I'd share a few experiences with you.

First, the serious pain in the you-know-what for us is that namazu is designed only to read from files. The code for the index generator is a rather nasty, intertwined mess, and we ended up exporting all the data from an RDBMS into a file hierarchy with files named for primary keys from the database, running the generator, and then post-processing the results to generate the particular fields we needed to be returned from the search query.

For you, however, the standard namazu way of building the search index might actually work, since it can use different parsers for different kinds of files, based on MIME type, so you can probably get it to index PDF, Word, etc. files without too serious an amount of work.

We also customized the Ruby version of the query engine to do our own queries, pulling out only the particular data we needed. That was a lot easier than trying to deal with the index generator, but the code was still pretty poorly structured, and we ended up learning it right down to the index format and replacing many parts of the query code completely.

Namazu at this point appears to easily support only EUC-JP. We'll probably be looking at doing a UTF-8 version of it at some point, but I'm not looking forward to that.

So basically, if you want to use exactly what namazu comes with, you may merely have to deal with some sysadmin pain of setting it up. If it doesn't work quite as you want, you could well have a somewhat nasty programming task in front of you.
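For anyone facing the same files-only restriction, the export step described above (one file per database row, named for its primary key) could be sketched roughly as follows. This is only an illustration in Python, not the code we actually ran. SQLite stands in for the RDBMS, and the table name `documents`, the columns `id` and `body`, and the helper names `export_rows` and `key_from_path` are all hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def export_rows(conn, out_dir, encoding="euc_jp"):
    """Write each row's text to <out_dir>/<primary key>.txt and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Hypothetical schema: a "documents" table with "id" and "body" columns.
    for doc_id, body in conn.execute("SELECT id, body FROM documents ORDER BY id"):
        path = out / f"{doc_id}.txt"
        # EUC-JP because that is what namazu appeared to handle easily;
        # characters that EUC-JP cannot represent are replaced, not fatal.
        path.write_text(body, encoding=encoding, errors="replace")
        written.append(path)
    return written


def key_from_path(path):
    """Recover the primary key from a file name reported by the search."""
    return int(Path(path).stem)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO documents VALUES (1, '全文検索のテスト')")
    with tempfile.TemporaryDirectory() as tmp:
        for p in export_rows(conn, tmp):
            print(p.name, key_from_path(p))
```

The index generator would then be run over the output directory, and `key_from_path` shows how a file name in the search results can be mapped back to a database row during post-processing.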
We might be interested in collaborating with someone working on a better version of Namazu; in particular, one that's much more modular. But you'd want to bring either a bunch of programming man-hours or some money to the table; we're not very interested in doing most of it all by ourselves at this point.

One thing that would really help would be documentation of the namazu file formats and of the algorithms used to build the index and do lookups. We have some documentation for some of this, if anybody wants it.

cjs
-- 
Curt Sampson <cjs@example.com> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-software.com
- References:
- [tlug] search for fulltext-searchengine
- From: Christian Horn