Mailing List Archive
tlug.jp Mailing List tlug archive tlug Mailing List Archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]Re: [tlug] Database frontend in Linux
- Date: Sun, 31 May 2009 13:03:32 +0100
- From: Josh Glover <jmglov@example.com>
- Subject: Re: [tlug] Database frontend in Linux
- References: <mailman.1.1243047601.6031.tlug@example.com> <BAY108-W32CB26AA19B9A7AE47072DA2570@example.com> <20090530102723.GA7204@example.com>
2009/5/30 Christian Horn <chorn@example.com>: > On Sat, May 23, 2009 at 06:05:13PM +0900, Raedwolf Summoner wrote: > >> Pardon my greenhorn status, Christian, but I'm afraid I don't >> understand the difference [between a search engine and a >> database]. > > A database would move or copy the data like soundfiles inside of it, > making the data harder to backup etc. Yes, this is it in a nutshell. Databases, especially relational ones, are great for storing data that is related somehow. Search engines are better at dealing with data that is not itself related, but with related *metadata*. To give you a simple example, think of doing a Google search on the phrase "tlug". I know of at least two Linux Users Groups that are called TLUG (Tokyo and Townsend, AU), and I assume that those four letters will match countless other documents as well. If these documents are stored in a relational database, you could have them related by name, and that would make sense, in that limited dimension. But what if you add the word "linux" to your phrase? Now only LUGs will match (in a perfect world). So in your database, you now need to add some way to relate documents by interest. Let's actually perform the search on Google, and see how many documents are returned: http://www.google.com/search?q=tlug+linux Around 65,700 documents are returned, which is quite manageable for a good database. But let's start adding keywords like crazy to our search phrase: http://www.google.com/search?q=tlug+linux+tokyo+xorg+gentoo+glover+!turnbull Seven documents! Sweet! But now we have around five discrete pieces of metadata about documents in our query: name, operating system, city, software, Linux distribution, and author. Imagine that you have millions of documents in your database. Each piece of metadata is likely to be in a table which is indexed by value and has a key back to the unique document ID. But just returning results for queries across five pieces of metadata will require a six-way join. As you expand the available metadata, not only do you need to (probably) create a new table for it, your join gets bigger and bigger. Database engines are simply not optimised for this kind of query. They *can* be optimised for this; think data warehouses, which are basically a sort of search engine for business intelligence. But there is another problem that is harder to solve, and that is relevance. PageRank (Google's algorithm for determining which results bubble up to the top for any given search) is all about relevance. [1] It cares a lot about how popular a document is, which is determined by static analysis such as building massive graphs that show how well linked-to a document is, and feedback loops that ensure that documents that are clicked on a lot for a given search term move up the result list. This is why I don't *have* to do anything more than the following search to get stuff about the Tokyo Linux Users Group: http://www.google.com/search?q=tlug Can a relational database do that? Not really, as relational databases are not designed to implement feedback loops, and the query analyser is unlikely to build a plan flexible enough to handle, say five discrete query patterns. A search engine, on the other hand, is designed to build an index of metadata, which is typically kept quite separate from the data it describes. This has the advantage of affording more query flexibility at a lower cost, as well as making it fairly easy to build feedback mechanisms into the indexing. Anyway, this is probably more detail than most people care about, so I'll stop here unless prompted to continue. ;) Cheers, Josh [1] I don't work for Google, so all of this is speculation based on the information they've publicaly released about PageRank. And don't read too much into this about how Amazon's search engine works; product search and web search are two different beasties (I'd like to believe that web search is easier, of course). ;)
- References:
- Re: [tlug] Database frontend in Linux
- From: Raedwolf Summoner
- Re: [tlug] Database frontend in Linux
- From: Christian Horn
Home | Main Index | Thread Index
- Prev by Date: Re: [tlug] Android in Japan at last?
- Next by Date: Re: [tlug] Amarok 2 sucks
- Previous by thread: Re: [tlug] Database frontend in Linux
- Next by thread: Re: [tlug] Database frontend in Linux
- Index(es):
Home Page Mailing List Linux and Japan TLUG Members Links