Mailing List Archive



Re: [tlug] Database frontend in Linux



Stephen J. Turnbull wrote:
> Edward Middleton writes:
>  > Josh Glover wrote:
>  > > 2009/5/30 Christian Horn <chorn@example.com>
>  > >> On Sat, May 23, 2009 at 06:05:13PM +0900, Raedwolf Summoner wrote:
>
>  > >>> Pardon my greenhorn status, Christian, but I'm afraid I don't
>  > >>> understand the difference [between a search engine and a
>  > >>> database].
>  > >>>       
>  > >> A database would move or copy the data like soundfiles inside of it,
>  > >> making the data harder to backup etc.
>  > >
>  > > Yes, this is it in a nutshell. Databases, especially relational ones,
>  > > are great for storing data that is related somehow. Search engines are
>  > > better at dealing with data that is not itself related, but with
>  > > related *metadata*.
>  > 
>  > I think structured vs unstructured data is the major difference. 
>
> A search engine is a *type* of database, or perhaps a better way to
> put it, it is a front-end for adding unstructured data to a database.
>   

The difference is that a database deals directly with structured, finite
data (closed world assumption[1]).  A search engine generates structured
data in the form of statistics about the content of documents and
queries that statistical data.  As a result, it can only make factual
statements about the statistics (and only statistics for documents it
knows about); it can't make factual claims about the source documents.
In situations like the web, where you have an open world[2] and
unstructured data, this is the best you can do, but being able to make
factual claims is obviously more powerful.
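
To make the "statistics about documents" point concrete, here is a minimal
sketch of an inverted index in Python.  The names (build_index, search) and
the sample documents are made up for illustration; real search engines are
far more elaborate.  The point is only that the query runs against token
counts, not against the documents themselves.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Map each token to a Counter of {doc_id: occurrence count}."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token][doc_id] += 1
    return index


def search(index, term):
    """Return ids of documents containing the exact token, most frequent first."""
    postings = index.get(term.lower(), Counter())
    return [doc_id for doc_id, _ in postings.most_common()]


# Hypothetical sample data, not taken from any real archive.
documents = {
    "a": "tlug meets in tokyo and tlug has a mailing list",
    "b": "the word tlug is a four letter string",
    "c": "a linux users group in tokyo",
}

index = build_index(documents)
print(search(index, "tlug"))   # documents that contain the token "tlug"
print(search(index, "linux"))  # documents that contain the token "linux"
```

Document "c" describes a Linux users group in Tokyo but never contains the
token "tlug", so a query for "tlug" cannot find it.  The index can only
answer questions about the tokens it has counted.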

>  > Databases are better at finding things like "the title of songs on album
>  > x".  A search engine is better at finding "all things related to x".
>
> I tend to disagree.  A search engine (as currently visible at Google
> and friends) is an automatically updated KWIC database.  They
> generally are pretty good now at suggesting typos, but AFAICT none of
> them track synonyms.  Surely that's the obvious first step for "all
> things related".
>   

x being a hash key like the word "tlug", i.e. any possible usage of that
four-letter string of characters, not the user's intended meaning of x.

>  > [snip]
>  > > But there is another problem that is harder to solve, and that is
>  > > relevance. PageRank (Google's algorithm for determining which results
>  > > bubble up to the top for any given search) is all about relevance. [1]
>
> You want semantic web, I guess.  But that can be spammed, too, in fact
> it is likely *easier* to spam that, since evaluations are part of the
> "semantic" part of links.
>   

I would argue that the semantic web makes it harder for spammers, because
they are faced with the problem of making their content specific enough
to trick the search engine into thinking it is relevant while
conversely needing to make more false representations in order to get
access to a sufficient number of users.  Or put another way, they need
to tell more lies.  Obviously, the more lies they have to tell, the
greater the risk of them contradicting themselves and being caught out.

>  > The problem with page rank is that it doesn't solve the difficult
>  > problem of finding relevance,
>
> It does solve that problem assuming honest linking.  This makes a lot
> of sense in academic publication (where it's called "citation
> indexing" rather than "page rank"), because it's expensive to Google
> bomb (the editors pick the documents containing the "links", so you
> have to construct a rather interesting document to be able to add
> your links to the database).

Well, your assumption is that popularity is equivalent to relevance,
i.e. an interesting[3] document is more relevant.  Conversely, lack of
popularity equates to irrelevance, because an unpopular article will
rank low in PageRank.
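
As a rough illustration, here is a simplified power-iteration version of
PageRank in Python.  This is a textbook-style sketch, not Google's actual
implementation; the function name, the damping value of 0.85, and the
citation graph are all assumptions chosen for the example.  Note that the
score depends only on who cites whom, never on what a document says.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict of {page: [pages it cites]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank among the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page citing nothing spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: "obscure" cites others but is cited by nobody.
citations = {
    "survey": ["classic"],
    "followup": ["classic", "survey"],
    "obscure": ["classic"],
    "classic": [],
}

ranks = pagerank(citations)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this graph "obscure" and "followup" end up with the lowest score purely
because nothing cites them, whatever their content might be.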

Edward

1. http://en.wikipedia.org/wiki/Closed_world_assumption
2. http://en.wikipedia.org/wiki/Open_world_assumption
3. as determined by the number of times it is cited.

