Re: [tlug] how to tune reiser4 for millions of files?



On 2010-01-28 14:27 +0100 (Thu), Michal Hajek wrote:

> the analysis itself is not a problem at the moment. I believe that
> by rewriting the program one can compute the whole thing in an hour or
> so.  Especially I believe one could employ cuda and nvidia card to get
> even better result, since the thing is easily paralelizable (? not sure
> about the correct English word). 

You're correct; the word is "parallelizable." That said, if it takes an
hour to run, then unless you're running it many times a day it would
probably be a complete waste of time to do the extra work to use your
NVIDIA card for it. But that's another topic.

> My attention is more on the hw or system side of the problem. That is, 
> can I do something with the system (OS, hw..etc.) to speed things up?

Yes. First step: forget about what filesystems you're using: you're
attacking the difficult rather than the easy side of the problem.

Your best option is to fix whatever's writing the data to use a single
file, or a small number of files. 

By the way, you have so far neglected to give us one of the most
critical pieces of information here, which is the size of your data
set. Knowing that it's 7-million-odd "small text files," I'll guess
that they're, say, 1 KB each and you've got 7 GB of data. Things don't
change that much if they're 10 KB each and you have 70 GB, and I'm
guessing that if you can process the whole data set "in an hour," it's
not 700 GB, which would take more than twice that just to read from a
disk in a straight serial read.
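
(Back-of-the-envelope, assuming something like 80-100 MB/s of sustained
sequential throughput from a single spinning disk, which is only a
guess: 700 GB is roughly 7,000-9,000 seconds of pure reading, i.e. on
the order of two hours or more before you've computed anything.)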

(That 7 GB size, by the way, is what we classify in the database world
as either "small" or "trivial"; it fits into well under half of the
main memory in a modern $3000 low-end server.)

Actually, I lied, that's not the most critical piece: it's really your
access patterns (how you write and read the data) that are the issue.
For the smaller size (7 GB) it's probably about how fast you can load
it into main memory, and for the larger size (70 GB) you'll be getting
into disk access speed.

The single best thing you can do is change the program generating these
data to write everything to a single file, or a relatively small number
of files.
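
To make that concrete, here's a rough sketch (not your generator,
obviously; the length-prefixed record format is just something I made
up for illustration) of what "write everything to one file" looks like
in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append one record as "<length>\n<data>\n" -- an assumed format,
 * purely for illustration. */
static void append_record(FILE *out, const char *data, size_t len)
{
    fprintf(out, "%zu\n", len);
    fwrite(data, 1, len, out);
    fputc('\n', out);
}

int main(void)
{
    /* One file, opened once and appended to for the life of the run,
     * instead of millions of create/write/close cycles. */
    FILE *out = fopen("dataset.records", "a");
    if (out == NULL) {
        perror("dataset.records");
        return EXIT_FAILURE;
    }

    const char *msg = "one small measurement that used to be its own file";
    append_record(out, msg, strlen(msg));

    fclose(out);
    return EXIT_SUCCESS;
}

The point is that every record becomes an append to an already-open
file, which the C library and the kernel can batch into large
sequential writes, rather than a new inode and directory entry per
record.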

The second best thing is either to change your analyzer to read the
files in directory order (as I said before, with readdir()), if it
reads the files only once and you newfs that filesystem afterwards, or
to write an intermediate program that reads the files in directory
order and rewrites them (to a separate drive) as one or a few large
files in a more optimized format.
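
Again as a rough sketch, with both paths as placeholders, that
intermediate program is not much more than this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
    const char *srcdir = "/data/smallfiles";          /* placeholder */
    FILE *out = fopen("/otherdrive/combined.txt", "w"); /* placeholder */
    DIR *dir = opendir(srcdir);
    if (out == NULL || dir == NULL) {
        perror("setup");
        return EXIT_FAILURE;
    }

    struct dirent *ent;
    char path[4096];
    char buf[64 * 1024];
    /* readdir() hands entries back in directory order, which is the
     * whole trick: we avoid the name-sorted traversal a shell glob
     * would give us and let the disk read the files roughly in the
     * order they're laid out. */
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof path, "%s/%s", srcdir, ent->d_name);
        FILE *in = fopen(path, "r");
        if (in == NULL)
            continue;               /* skip anything we can't read */
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
        fclose(in);
    }

    closedir(dir);
    fclose(out);
    return EXIT_SUCCESS;
}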

If you're going to continue to play around with having lots of small
files, and you're in the 70 GB range rather than the 700 GB range, don't
bother mucking about with filesystems until you've put the whole thing
on an SSD.

Fourth, if you're going to persist in playing with filesystems here,
keep in mind that it no longer has anything to do with the performance
of your application; you're just doing it for personal pleasure. You're
in the position of the guy who wrote that wonderful 672-byte chess
program that would run on a 1K Timex Sinclair[1]. People trying to
improve on that these days are not doing so because they just want a
better chess program.

[1]: http://users.ox.ac.uk/~uzdm0006/scans/1kchess/

cjs
-- 
Curt Sampson         <cjs@example.com>         +81 90 7737 2974
             http://www.starling-software.com
The power of accurate observation is commonly called cynicism
by those who have not got it.    --George Bernard Shaw

