Mailing List Archive


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlug] Are ordered hashes useful?



Josh Glover writes:
 > On 15 February 2011 17:47, Marty Pauley <marty.pauley@example.com> wrote:
 > 
 > > Hashes don't cope well with a large amount of data (unless it's known
 > > in advance).

That will come as news to Mr. Torvalds (vice git author).  In part it
depends on whether the hash value of the key (ie, its bucket) needs to
be invariant.  In git it does; but git also has an awful lot of
buckets, and has no trouble that I've heard of in dealing with large
amounts of data. :-)

However, there is also the strategy of rehashing with a larger number
of buckets, and using a hash function which is parametrized by the
number of buckets.  XEmacs uses this strategy, and I've never heard of
an application that benefits measurably from using anything but the
default initial size for time performance reasons.  There are a few
applications which for some reason use lots of small hashtables, and
those benefit space-wise from specifying an appropriate initial size
smaller than the default (which is actualy pretty small, 29 IIRC).

I forget the details, so I may be misremembering, but I believe that
under practical conditions this gives constant amortized time for all
operations (ie, it's bounded above including the time spent on
rehashing when the hash table is reallocated with a larger size).  I
know it's got better asymptotic behavior than a binary search tree,
but I don't know if it is that much better in practice.

 > Just for the benefit of the list:
 > 
 > A "hash" (associative array / dictionary) is usually implemented as an
 > array of b number of buckets, with each bucket corresponding to one
 > output of the hashing function.
[snip]
 > There are more sophisticated implementations I'm sure, but that is how
 > it works in theory.

Yes, there are many alternative implementations.  In particular,
somebody already mentioned using a tree instead of a list to store
multiple objects in a bucket, although I don't know of applications
where that is actually useful.  There is also double hashing.





Home | Main Index | Thread Index

Home Page Mailing List Linux and Japan TLUG Members Links