Mailing List Archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [tlug] Database frontend in Linux

2009/5/30 Christian Horn <>:

> On Sat, May 23, 2009 at 06:05:13PM +0900, Raedwolf Summoner wrote:
>> Pardon my greenhorn status, Christian, but I'm afraid I don't
>> understand the difference [between a search engine and a
>> database].
> A database would move or copy the data like soundfiles inside of it,
> making the data harder to backup etc.

Yes, this is it in a nutshell. Databases, especially relational ones,
are great for storing data that is related somehow. Search engines are
better at dealing with data that is not itself related, but with
related *metadata*.

To give you a simple example, think of doing a Google search on the
phrase "tlug". I know of at least two Linux Users Groups that are
called TLUG (Tokyo and Townsend, AU), and I assume that those four
letters will match countless other documents as well. If these
documents are stored in a relational database, you could have them
related by name, and that would make sense, in that limited dimension.
But what if you add the word "linux" to your phrase? Now only LUGs
will match (in a perfect world). So in your database, you now need to
add some way to relate documents by interest.

Let's actually perform the search on Google, and see how many
documents are returned:

Around 65,700 documents are returned, which is quite manageable for a
good database. But let's start adding keywords like crazy to our
search phrase:!turnbull

Seven documents! Sweet! But now we have around five discrete pieces of
metadata about documents in our query: name, operating system, city,
software, Linux distribution, and author.

Imagine that you have millions of documents in your database. Each
piece of metadata is likely to be in a table which is indexed by value
and has a key back to the unique document ID. But just returning
results for queries across five pieces of metadata will require a
six-way join. As you expand the available metadata, not only do you
need to (probably) create a new table for it, your join gets bigger
and bigger. Database engines are simply not optimised for this kind of
query. They *can* be optimised for this; think data warehouses, which
are basically a sort of search engine for business intelligence.

But there is another problem that is harder to solve, and that is
relevance. PageRank (Google's algorithm for determining which results
bubble up to the top for any given search) is all about relevance. [1]
It cares a lot about how popular a document is, which is determined by
static analysis such as building massive graphs that show how well
linked-to a document is, and feedback loops that ensure that documents
that are clicked on a lot for a given search term move up the result
list. This is why I don't *have* to do anything more than the
following search to get stuff about the Tokyo Linux Users Group:

Can a relational database do that? Not really, as relational databases
are not designed to implement feedback loops, and the query analyser
is unlikely to build a plan flexible enough to handle, say five
discrete query patterns.

A search engine, on the other hand, is designed to build an index of
metadata, which is typically kept quite separate from the data it
describes. This has the advantage of affording more query flexibility
at a lower cost, as well as making it fairly easy to build feedback
mechanisms into the indexing.

Anyway, this is probably more detail than most people care about, so
I'll stop here unless prompted to continue. ;)


[1] I don't work for Google, so all of this is speculation based on
the information they've publicaly released about PageRank. And don't
read too much into this about how Amazon's search engine works;
product search and web search are two different beasties (I'd like to
believe that web search is easier, of course). ;)

Home | Main Index | Thread Index

Home Page Mailing List Linux and Japan TLUG Members Links