Mailing List Archive

Support open source code!


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: tlug: namazu



>>>>> "Selva" == Selva Nair <selva@example.com> writes:

    Selva> On Mon, 7 Feb 2000, Tony Laszlo wrote:

    >> Namazu finished indexing 2,644 files (111M).

    >> It took 43.4 hours and created 368M of index

    Selva> Was thinking of installing namazu, but this sounds scary.
    Selva> But wait a minute, a 368M index from a 111M source? Can't
    Selva> be 386K either as you have over 836K keywords. Some
    Selva> misprint somewhere?

No.  A dictionary with 50,000 words at 10 bytes/word should fill
maye 0.5MB?  Uh-uh, a dictionary with no definitions is useless.

The index should be able to compress things somewhat, so simply
mentioning a path and a byte index for each reference isn't going to
take quite as much space as a definition would, but still the keyword
itself is the smallest part of the reference.  And many keywords will
have several references.

    >> files with 836,439 keywords.

    Selva> Wow ! Didn't know there could be so many keywords, let
    Selva> alone *key*words in this whole world :)

Actually, they're key phrases, and they probably also include
information that allows partial matches or fuzzy matches, and multiple
criteria.  No?

    Selva> Does the time scale exponentially with data volume?

Algorithmically, it shouldn't, with reasonable data structures the
file reading should be a single linear-time pass, and constructing the
index is probably n log n.  But from Tony's travails you should recall
that Namazu very quickly outstripped his memory space; I would bet the
process started doing serious page thrashing at that point.  I rather
doubt that the namazu programmers bothered to review Vol 3 of Knuth
before starting this project.[1]  "Get More RAM!" solves all external
search problems by turning them into internal search.  :-)


Footnotes: 
[1]  For a serious attack on this problem, what you probably want to
do is have a data structure that limits the "top" of the index
structure to about 1/2 of available memory, then pages in "leaf nodes"
from files.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."
--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan


Home | Main Index | Thread Index

Home Page Mailing List Linux and Japan TLUG Members Links