
Re: [tlug] Limits on file numbers in sort -m



黒鉄章 writes:

 >     A small precursor to consider is if the filename expansion
 >     (i.e. from *.interim to all the separate files) will exceed the
 >     size of ARG_MAX. On my system you'd be OK (it's ~2M, i.e. more
 >     than ~10k * 20 chars = ~200k)

There are shell limits as well; even if ARG_MAX is huge, Jim probably
wants to use xargs.
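
One caveat, though: if the file list overflows a single command line,
xargs will run "sort -m" more than once, and the concatenated outputs
would still need a final merge.  A minimal sketch of a way around the
limit, assuming GNU find and sort (--files0-from reads NUL-separated
file names; the path and -maxdepth here are only illustrative):

    # Avoid ARG_MAX entirely by never putting the names on a command line.
    find . -maxdepth 1 -name '*.interim' -print0 \
        | sort -m --files0-from=- > final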

 > > I'm gearing up for a merging of a very large number of
 > > sorted text files(*). Does anyone know if there is an upper
 > > limit on how many sorted files can be merged using something
 > > like: "sort -m *.interim > final".

I don't know about upper limits, but you might consider whether you'd
get much better performance from a multipass approach.

 > > Also, is it worth fiddling with the "--batch-size=NMERGE" option?

Pretty much what I had in mind.  Specifically, assuming 100-byte
lines, merging 10 files at a time means about 4GB per batch in the
first pass, comfortably fitting in your memory and allowing very
efficient I/O.  I'll bet that this is a big win (on the first pass
only).  On later passes the performance analysis is non-trivial, but
the I/O efficiency of having a big buffer for each file in the batch
may outweigh the cost of the additional passes.
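
For reference, a minimal sketch of that with GNU sort's built-in
batching (flag values are illustrative, and --batch-size is bounded
by the open-file-descriptor limit):

    # Let sort handle the multipass merge itself, 10 inputs per batch;
    # -T points temporary files at a scratch directory (illustrative path).
    sort -m --batch-size=10 -T /scratch *.interim > final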

Do you expect the output file to be ~= 40x10^9 lines!?  Or is some
uniquification going to be applied?  If so, I suspect that
interleaving merge and uniquification passes will be a lot faster.
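
If the duplicates are exact duplicate lines, a sketch of one way to
interleave the two, assuming GNU sort (-u keeps only the first of
each run of equal lines and can be combined with -m; the batch size
and output names are illustrative):

    # First pass: merge and dedupe each batch of 10 interim files.
    # Second pass: merge and dedupe the (far fewer) batch outputs.
    find . -name '*.interim' -print0 \
        | xargs -0 -n 10 sh -c 'sort -mu "$@" > "pass1.$$.out"' sh
    sort -mu pass1.*.out > final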

For a quad-core machine, see the --parallel option.  This is better
documented in the Info manual for coreutils than in the man page.
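
A sketch of how that flag might be combined with the merge (numbers
are illustrative; --parallel mainly speeds up the in-memory sort
phase, so its benefit for a pure -m merge may be modest):

    # Cap sort at 4 threads and a 2GB main buffer (both illustrative).
    sort -m --parallel=4 --buffer-size=2G --batch-size=10 *.interim > final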

Steve

