Mailing List Archive



[tlug] Mail archiving question



Some background, then some questions.....

I foolishly volunteered to help set up a searchable
email archive for the Honyaku mailing list (a few
TLUGers are also on that list). My current task is to
extract the essential headers (From, Subject, Date, ...)
and the body of the email, convert them to UTF-8 and
store them as one file per email. I am working on a collection
of about 40,000 accumulated emails from the last 18 months.

My first thought was to pipe each email through metamail, as
this would unpack things like Base64 and quoted-printable.
I can usually work out the encoding from the MIME headers, so
converting to UTF-8 is not a big problem. Preliminary testing went
very well.
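
For comparison, the extract-and-convert step described above can also be
sketched without metamail, using Python's standard email package. This is
only a rough illustration under some assumptions: the function names
extract_fields and to_utf8_record are made up for this example, each email
is assumed to be available as raw bytes, and the MIME headers are assumed
to declare a charset that Python recognises.

```python
from email import message_from_bytes, policy


def extract_fields(raw_bytes):
    """Return (headers, body) for one raw email, both as Unicode strings.

    Hypothetical helper: with policy.default the parser decodes RFC 2047
    encoded headers, and get_content() undoes Base64 / quoted-printable
    and applies the charset declared in the MIME headers.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    headers = {name: str(msg.get(name, ""))
               for name in ("From", "Subject", "Date")}
    # Pick the text/plain body if there is one; ignore everything else.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    return headers, body


def to_utf8_record(headers, body):
    """Format the essential headers plus body as UTF-8 bytes for one file."""
    lines = [f"{name}: {value}" for name, value in headers.items()]
    return ("\n".join(lines) + "\n\n" + body).encode("utf-8")
```

A message whose declared charset is missing or wrong would still need
special handling, so this is not a complete replacement for checking the
headers by hand.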

The metamail approach has run into a snag with emails
containing HTML. It goes and invokes my default browser
(Firefox), which is not much use when I'm batch-processing.
In many cases I can get around this by detecting that there is
a second part to the email containing HTML and simply throwing it
away. In some cases, however, the HTML is in a Base64-encoded block,
so I'm not aware of it.
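
One possible way to drop HTML parts in a batch run, sketched here with
Python's standard email package rather than metamail: each MIME part
carries its own Content-Type header, which is separate from its
Content-Transfer-Encoding, so a Base64-encoded part can still be
recognised as text/html before anything is decoded. The function name
plain_text_parts is hypothetical, and the sketch assumes the sending
software labelled the HTML part correctly.

```python
from email import message_from_bytes, policy


def plain_text_parts(raw_bytes):
    """Return the decoded text/plain parts of one raw email.

    Hypothetical helper: text/html parts (and any other type) are
    skipped based on their declared Content-Type, whatever transfer
    encoding they use, so no external viewer is ever invoked.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    texts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container only; its children are visited by walk()
        if part.get_content_type() == "text/plain":
            texts.append(part.get_content())
    return texts
```

HTML that is mislabelled as text/plain would still slip through, so a
check of the decoded text may be needed as well.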

Another problem is the horrible Microsoft TNEF format, which
ignores email rules, doesn't have MIME information, etc.
Metamail simply throws in the towel on these.

Anyway, to get to my questions:

- has anyone done this sort of thing before and can suggest
perhaps an alternative approach?

- if I am sticking with metamail, is there any easy way
to get it to ignore html rather than hitting htmlview?

Cheers

Jim
-- 
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/

