Mailing List ArchiveSupport open source code!
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]Re: tlug: namazu
- To: tlug@example.com
- Subject: Re: tlug: namazu
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 7 Feb 2000 16:58:15 +0900 (JST)
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <Pine.LNX.3.95.1000207141226.11974A-100000@example.com>
- References: <14494.19709.87713.38966@example.com><Pine.LNX.3.95.1000207141226.11974A-100000@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
>>>>> "Selva" == Selva Nair <selva@example.com> writes: Selva> Yeah, inclusion of data for partial matches could be Selva> why the number of key words is that huge. Oh, yeah, Frank said something about kakasi. I bet a lot of that bulk comes from yomi, which are indexed along with the kanji. Selva> Of course. That's why I thought it can't be 386K, but, Selva> IMHO, 386M is a bloated index. Almost 450 bytes per Selva> keyword, it is ! A lot of applications I've used have several times the text content in the indicies. Often stripping a binary will halve its size ;-) See the calculation below. Selva> But why to store phrases. Strorage of a reference each to Selva> the preceding and following words should be enough to Selva> construct phrase info. No? This only works for phrases <= 3 words. Remember, every word is likely to be included in multiple phrases. Selva> Info like file names shouldn't take much space at all, as Selva> 2000 odd filenames can be indexed by a two byte word. I Selva> would buy 100 bytes per keyword, but not 400. This is true. I would say that a very compact representation might have 8 bytes (two unsigned ints) per reference, one an index into a table of file names and one a byte index into the file itself. (If you expect lots of multiple references in files you could get more compact by keeping lists of byte indicies for each file.) It seems unlikely that there would be an average of 50 references per key, but it could be. Eg, if you indexed every (2-byte) character that way, you'd end up with an index 4X as big as the source text, plus a couple of extra kB for the individual characters themselves. Also, if you index by line number as well as byte count, you could imagine a 12-byte-per-reference format. Selva> Or am I underestimating the information content of an Selva> index? Or is it that tightly stored indices wont give good Selva> performance at the search time? I don't see why this would be a problem. The index should be hierarchically structured, so you search a b-tree or so, pulling in 100kB blocks of nodes as needed. One-time reading in a 100kB block containing the path name table (50 bytes per path is probably about 6-7 directories deep) is cheap in that context. Negligible for multiple searches. Since it's an array, all the cost is in reading it in from disk. -- University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091 _________________ _________________ _________________ _________________ What are those straight lines for? "XEmacs rules." -------------------------------------------------------------------- Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae Next Technical Meeting: March 11 (Sat) 13:00 Temple University Japan * Topic: TBD -------------------------------------------------------------------- more info: http://www.tlug.gr.jp Sponsor: Global Online Japan
- References:
- Re: tlug: namazu
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Re: tlug: namazu
- From: Selva Nair <selva@example.com>
Home | Main Index | Thread Index
- Prev by Date: RE: tlug: namazu
- Next by Date: tlug: Big uptime
- Prev by thread: Re: tlug: namazu
- Next by thread: RE: tlug: namazu
- Index(es):
Home Page Mailing List Linux and Japan TLUG Members Links