Mailing List Archive



Re: [tlug] Clustering algorithms



Hello Dave,

I am not an expert, but here are some ideas. Take them for what they
are worth.

It all starts with the data you collect. Let's say that for each
session you remember the products a user has seen and the products
they bought. We will define a session as "repeated visits from the
same IP terminated by a 30-minute pause".
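To make that session definition concrete, here is a rough sketch in Python. The function name assign_sessions and the (timestamp, ip) input format are my own assumptions for illustration, not something from your setup; the only thing taken from the definition above is the 30-minute pause.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # the 30-minute pause from the definition


def assign_sessions(hits):
    """Attach a session counter to each hit.

    `hits` is a list of (datetime, ip) tuples (a hypothetical input format).
    Returns (datetime, ip, session_id) tuples sorted by timestamp.
    """
    last_seen = {}  # ip -> (timestamp of the last hit, current session id)
    next_id = 0
    result = []
    for ts, ip in sorted(hits):
        previous = last_seen.get(ip)
        if previous is None or ts - previous[0] > SESSION_GAP:
            # New visitor, or the pause was too long: start a new session.
            next_id += 1
            session_id = next_id
        else:
            session_id = previous[1]
        last_seen[ip] = (ts, session_id)
        result.append((ts, ip, session_id))
    return result
```

The IP is only needed while splitting the log into sessions; after that the counter alone can be stored.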

Suppose you have a table with columns TIMESTAMP, SESSION_ID, ITEM and
ACTION, where SESSION_ID is a unique identifier for the session (say a
counter) and ACTION is an enumerated type with values like
'view', 'add', 'remove', 'buy', etc. Note that I am not suggesting that
you store the user's IP or account details. If you do keep such
information, please, please make sure that you put special access
restrictions on these tables, use hashing where appropriate and purge
as much as possible to offline backup storage.
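As a sketch of what such a table could look like, here it is in SQLite via Python's sqlite3 module. The table name "events", the column types and the sample row are assumptions of mine, and the CHECK constraint only lists the four actions named above, so extend it if you track more.

```python
import sqlite3


def create_events_table(conn):
    """Create the (hypothetical) events table with the four columns above."""
    conn.execute("""
        CREATE TABLE events (
            timestamp  TEXT    NOT NULL,
            session_id INTEGER NOT NULL,
            item       TEXT    NOT NULL,
            action     TEXT    NOT NULL
                       CHECK (action IN ('view', 'add', 'remove', 'buy'))
        )
    """)


# In-memory database, just for trying things out.
conn = sqlite3.connect(":memory:")
create_events_table(conn)

# Made-up sample row: session 1 viewed a product called "kettle".
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    ("2024-01-01T10:00:00", 1, "kettle", "view"),
)
rows = conn.execute(
    "SELECT item, action FROM events WHERE session_id = ?", (1,)
).fetchall()
```

Note that there is no IP or account column here at all, which is the point of the paragraph above.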

One algorithm utilizing this model would be to create a graph with
products as nodes and weighted edges connecting products that appear
in the same session. Reading from the database, you increment the
weight of an edge each time both of its products show up in a session.
Finally, you go through all the nodes, keep only the top 10 edges for
each, and there is your list of recommended products. This graph can
be precalculated (say once a week) and kept in a database for querying.
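A minimal sketch of that graph in Python, assuming the rows have already been read from the database as (session_id, item) pairs. The names build_graph and recommend are made up for the example, and ties between equally weighted edges are broken alphabetically, which is an arbitrary choice.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(rows):
    """rows: iterable of (session_id, item). Returns {item: {other: weight}}."""
    sessions = defaultdict(set)
    for session_id, item in rows:
        sessions[session_id].add(item)

    graph = defaultdict(lambda: defaultdict(int))
    for items in sessions.values():
        # Every pair of distinct products in a session shares an edge.
        for a, b in combinations(sorted(items), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def recommend(graph, item, top_n=10):
    """Return the products behind the top_n heaviest edges of `item`."""
    neighbours = graph.get(item, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]
```

For the weekly precalculation you would run build_graph once and store the output of recommend for every product.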

You can tweak the algorithm by changing the weighting:

1. Depending on the distance - if product views are closer together on
the session timeline, use a heavier weight, or do not add an edge at
all if the product views are more than 3 pages apart.

2. Depending on the action - a view adds weight x1, an add x3 and a
purchase x10.

3. You can build a separate graph for each action and run more
complex queries.
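Points 1 and 2 could be combined roughly like this. How exactly the two factors are combined is my assumption: the edge gets the heavier of the two action weights divided by the distance between the events, the distance is counted in events rather than pages, and actions without a weight (such as 'remove') count as 0.

```python
from collections import defaultdict

ACTION_WEIGHT = {"view": 1, "add": 3, "buy": 10}  # multipliers from point 2
MAX_DISTANCE = 3                                   # cut-off from point 1


def weighted_graph(sessions):
    """sessions: {session_id: [(item, action), ...]} in visiting order."""
    graph = defaultdict(lambda: defaultdict(float))
    for events in sessions.values():
        for i, (item_a, action_a) in enumerate(events):
            for j in range(i + 1, len(events)):
                distance = j - i
                if distance > MAX_DISTANCE:
                    break  # too far apart, no edge
                item_b, action_b = events[j]
                if item_a == item_b:
                    continue  # no self-loops
                # Assumed combination rule: heavier action, scaled by distance.
                weight = max(ACTION_WEIGHT.get(action_a, 0),
                             ACTION_WEIGHT.get(action_b, 0)) / distance
                if weight > 0:
                    graph[item_a][item_b] += weight
                    graph[item_b][item_a] += weight
    return graph
```

For point 3 you would call this once per action with a filtered event list and keep the resulting graphs apart.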

Given that you don't have any objective criteria for correctness, it
shouldn't be too difficult to get something that would be good enough.

Another possible algorithm could be using Markov chains to put an
emphasis on the sequence of visiting. You can also investigate using
RDF + SPARQL for storing the data directly in graph form and querying
it in real time, or a Prolog-type language, where you can model the
graph in the language syntax itself and certain types of queries are
easier to implement.
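For the Markov chain idea, the simplest variant I can think of is a first-order chain: count how often product B is visited directly after product A and turn the counts into probabilities. This is only a sketch under that assumption, with a made-up function name.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """sequences: list of per-session product lists, in visiting order.

    Returns {current: {following: probability}} for a first-order chain.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every product with the one visited right after it.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1

    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {item: n / total for item, n in followers.items()}
    return probs
```

Unlike the graph above, this is directional: "B after A" and "A after B" are counted separately, so the recommendation depends on where the user is coming from.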

Cheers and good luck,
Dimitar

