Mailing List Archive



Re: [tlug] Database frontend in Linux



Edward Middleton writes:
 > Josh Glover wrote:
 > > 2009/5/30 Christian Horn <chorn@example.com>
 > >> On Sat, May 23, 2009 at 06:05:13PM +0900, Raedwolf Summoner wrote:

 > >>> Pardon my greenhorn status, Christian, but I'm afraid I don't
 > >>> understand the difference [between a search engine and a
 > >>> database].
 > >>>       
 > >> A database would move or copy the data like soundfiles inside of it,
 > >> making the data harder to backup etc.
 > >
 > > Yes, this is it in a nutshell. Databases, especially relational ones,
 > > are great for storing data that is related somehow. Search engines are
 > > better at dealing with data that is not itself related, but with
 > > related *metadata*.
 > 
 > I think structured vs unstructured data is the major difference. 

A search engine is a *type* of database, or perhaps better put, a
front-end for adding unstructured data to a database.

The big difference between Ask Sam and a web search engine is that the
actual documents are stored internally by Ask Sam, whereas a
conventional search engine just stores URLs.
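A minimal sketch of that distinction (hypothetical code, not Ask Sam's or any real engine's): both variants keep the same inverted index of term -> URLs, but only the repository-style store keeps the documents themselves.

```python
# Sketch of the distinction (hypothetical code, not Ask Sam's): both
# variants keep an inverted index of term -> URLs, but only the
# repository-style store keeps the documents themselves.
from collections import defaultdict

class TinyIndex:
    def __init__(self, store_documents=False):
        self.store_documents = store_documents
        self.index = defaultdict(set)  # term -> set of URLs
        self.documents = {}            # URL -> full text, if stored

    def add(self, url, text):
        for term in text.lower().split():
            self.index[term].add(url)
        if self.store_documents:
            self.documents[url] = text  # Ask-Sam style: keep the document

    def search(self, term):
        return sorted(self.index[term.lower()])

engine = TinyIndex()                    # conventional engine: URLs only
repo = TinyIndex(store_documents=True)  # document repository
for idx in (engine, repo):
    idx.add("http://example.com/a", "linux database frontend")
print(engine.search("database"))  # ['http://example.com/a']
print(engine.documents)           # {} -- only the index to back up
print(repo.documents)             # the full text is retained
```

Either way the search side looks identical; the storage decision is what makes the data easy or hard to back up, as Christian noted.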

Of course, the commercial search engines like Google and Amazon long
ago started caching the result documents, and more recently both have
dropped all pretense of merely caching (which is probably fair use
under U.S. copyright law).  Instead they are creating KWIC-indexed[1]
document repositories (which is also a pretty good description of Ask
Sam, as I understand Ask Sam).

 > Databases are better at finding things like "the title of songs on album
 > x".  A search engine is better at finding "all things related to x".

I tend to disagree.  A search engine (as currently visible at Google
and friends) is an automatically updated KWIC database.  Search
engines are generally pretty good now at suggesting corrections for
typos, but AFAICT none of them track synonyms.  Surely that's the
obvious first step toward "all things related".
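A KWIC index is simple enough to sketch.  This toy version (my own sketch, not Xapian's or any real engine's code) rotates each line so the keyword leads, roughly what the classic permuted-index tools did:

```python
# Toy KWIC (KeyWord In Context) index -- a sketch, not any real
# engine's implementation.  Each line is rotated so the keyword comes
# first, then the entries are sorted by keyword.
def kwic(lines, stopwords=frozenset({"a", "an", "the", "of", "in"})):
    entries = []
    for line in lines:
        words = line.split()
        for i, word in enumerate(words):
            if word.lower() in stopwords:
                continue  # don't index noise words
            rotated = words[i:] + words[:i]  # keyword first, context after
            entries.append((word.lower(), " ".join(rotated)))
    return sorted(entries)

for key, context in kwic(["The quick brown fox", "A brown database"]):
    print(f"{key:10} | {context}")
```

Synonym tracking would be one more lookup table on top of this (mapping each query term to its synonym set before consulting the index), which is why its absence from the big engines is surprising.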

 > [snip]
 > > But there is another problem that is harder to solve, and that is
 > > relevance. PageRank (Google's algorithm for determining which results
 > > bubble up to the top for any given search) is all about relevance. [1]

You want the semantic web, I guess.  But that can be spammed, too; in
fact it is likely *easier* to spam, since evaluations are part of the
"semantic" content of the links.

 > The problem with page rank is that it doesn't solve the difficult
 > problem of finding relevance,

It does solve that problem, assuming honest linking.  This makes a lot
of sense in academic publication (where it's called "citation
indexing" rather than "page rank"), because it's expensive to Google
bomb (the editors pick the documents containing the "links", so you
have to construct a rather interesting document to be able to add your
links to the database).  The problem you're referring to is that
"fuzaketeru" (bogus) documents get indexed, too, so link data can be
spoofed.  But that's not the fault of the page rank algorithm, per se;
that's a problem for input filtering.
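The honest-linking assumption is easy to see in a toy version of the algorithm.  This sketch (a simplified power iteration, not Google's actual code) ranks whatever link graph it is fed, spoofed or not:

```python
# Toy PageRank: repeated redistribution of rank over the link graph
# with damping.  A simplified sketch, not Google's actual algorithm;
# assumes every page has at least one outgoing link.
def pagerank(links, damping=0.85, iterations=50):
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new = {page: (1.0 - damping) / n for page in links}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share  # each link passes on some rank
        rank = new
    return rank

# The algorithm trusts whatever links it is given; a spoofed link farm
# would earn rank just the same, so filtering has to happen upstream.
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "a", the most-linked-to page
```

Nothing in the iteration distinguishes an honest citation from a spoofed one, which is exactly why the problem belongs to input filtering rather than to the ranking step.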



Footnotes: 
[1]  KWIC = key word in context (like WAIS).  Here I'm using it in a
more general sense to include fuzzy algorithms such as those used by
Xapian, as well.

