Mailing List Archive

Support open source code!


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: tlug: namazu



>>>>> "Selva" == Selva Nair <selva@example.com> writes:

    Selva>     Yeah, inclusion of data for partial matches could be
    Selva> why the number of key words is that huge.

Oh, yeah, Frank said something about kakasi.  I bet a lot of that bulk
comes from yomi, which are indexed along with the kanji.

    Selva> Of course. That's why I thought it can't be 386K, but,
    Selva> IMHO, 386M is a bloated index. Almost 450 bytes per
    Selva> keyword, it is !

A lot of applications I've used have several times the text content in
the indicies.  Often stripping a binary will halve its size ;-)  See
the calculation below.

    Selva> But why to store phrases. Strorage of a reference each to
    Selva> the preceding and following words should be enough to
    Selva> construct phrase info. No?

This only works for phrases <= 3 words.  Remember, every word is
likely to be included in multiple phrases.

    Selva> Info like file names shouldn't take much space at all, as
    Selva> 2000 odd filenames can be indexed by a two byte word. I
    Selva> would buy 100 bytes per keyword, but not 400.

This is true.  I would say that a very compact representation might
have 8 bytes (two unsigned ints) per reference, one an index into a
table of file names and one a byte index into the file itself.  (If
you expect lots of multiple references in files you could get more
compact by keeping lists of byte indicies for each file.)  It seems
unlikely that there would be an average of 50 references per key, but
it could be.  Eg, if you indexed every (2-byte) character that way,
you'd end up with an index 4X as big as the source text, plus a
couple of extra kB for the individual characters themselves.

Also, if you index by line number as well as byte count, you could
imagine a 12-byte-per-reference format.

    Selva> Or am I underestimating the information content of an
    Selva> index? Or is it that tightly stored indices wont give good
    Selva> performance at the search time?

I don't see why this would be a problem.  The index should be
hierarchically structured, so you search a b-tree or so, pulling in
100kB blocks of nodes as needed.  One-time reading in a 100kB block
containing the path name table (50 bytes per path is probably about
6-7 directories deep) is cheap in that context.  Negligible for
multiple searches.  Since it's an array, all the cost is in reading it
in from disk.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."
--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan


Home | Main Index | Thread Index

Home Page Mailing List Linux and Japan TLUG Members Links