
Re: [tlug] Limits on file numbers in sort -m
黒鉄章 writes:
> A small preliminary point to consider is whether the filename
> expansion (i.e. from *.interim to all the separate files) will
> exceed ARG_MAX. On my system you'd be OK (it's ~2M, i.e. well
> above ~10k * 20 chars = ~200k).
There are shell limits as well; even if ARG_MAX is huge, Jim
probably wants to use xargs.
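Something along these lines would also sidestep the issue entirely
(untested sketch, assuming GNU coreutils; the find invocation is just
one way to feed the file list in on stdin):

    getconf ARG_MAX
    find . -maxdepth 1 -name '*.interim' -print0 \
        | sort -m --files0-from=- -o final

If xargs ever splits the list across several sort invocations, each
one produces its own separately merged output, so having sort read
the NUL-separated list via --files0-from avoids that wrinkle.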
> > I'm gearing up for a merging of a very large number of
> > sorted text files(*). Does anyone know if there is an upper
> > limit on how many sorted files can be merged using something
> > like: "sort -m *.interim > final".
I don't know about upper limits, but you might consider whether
you'd get much better performance from a multipass approach.
> > Also, is it worth fiddling with the "--batch-size=NMERGE" option?
Pretty much what I had in mind. Specifically, assuming 100-byte
lines, merging 10 files at a time means about 4GB per batch in the
first pass (~40x10^9 lines spread over ~10k files is ~4x10^6 lines,
i.e. ~400MB, per file), comfortably fitting in your memory and
allowing very efficient I/O. I'll bet that this is a big win on the
first pass, at least. On later passes the performance analysis is
non-trivial, but the I/O efficiency of having a big buffer for each
file in the batch may outweigh the cost of the additional passes.
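Concretely, something along these lines would let sort do the
batching itself (a rough sketch, assuming GNU sort; the batch size,
buffer size, and scratch directory are only illustrative):

    find . -maxdepth 1 -name '*.interim' -print0 \
        | sort -m --batch-size=10 -S 4G -T /big/scratch \
               --files0-from=- -o final

With ~10k inputs and --batch-size=10, sort merges ten files at a
time into temporary files under the -T directory and then merges
those results in further passes.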
Do you expect the output file to be ~= 40x10^9 lines!? Or is some
uniquification going to be applied? If so, I suspect that
interleaving merge and uniquification passes will be a lot faster
than deduplicating only at the end.
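If plain whole-line uniqueness is what's wanted, GNU sort can do the
deduplication during the merge itself, e.g. (sketch):

    sort -m -u --batch-size=10 *.interim > final

With -u, only the first of each run of equal lines is kept (equal on
the whole line by default, or on the sort key if one is given).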
To make use of the quad core, see sort's --parallel option. It's
better documented in the coreutils Info manual than in the man page.
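For reference, it's just a flag, e.g. (the thread count, buffer
size, and paths are illustrative):

    sort --parallel=4 -S 4G -T /big/scratch unsorted.txt -o sorted.txt

Note that --parallel mainly helps while sort has actual sorting to
do; a pure -m merge of already-sorted inputs is largely sequential
I/O, so it may not gain much there.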
Steve