
Re: [tlug] Limits on file numbers in sort -m
On 28 May 2014 13:55, Stephen J. Turnbull <stephen@example.com> wrote:
> �\��章 writes:
>
> > A small precursor to consider is whether the filename expansion
> > (i.e. from *.interim to all the separate files) will exceed
> > ARG_MAX. On my system you'd be OK (it's ~2 MB, i.e. more
> > than ~10k files * 20 chars = ~200 KB).
>
> There are shell limits as well; even if ARG_MAX is huge, Jim probably
> wants to use xargs.
I can't see where xargs would assist with "sort -m ...": the merge by
definition wants all the files at once, and xargs may split them across
several separate invocations.
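That said, for anyone who does hit the limit: you can check it, and GNU
sort can take the file list on stdin rather than the command line. A
sketch, assuming GNU coreutils (*.interim and "final" are the names
from my original post):

    # how many bytes the kernel allows for argv + environment
    getconf ARG_MAX

    # pass NUL-terminated names on stdin, so ARG_MAX never applies
    find . -maxdepth 1 -name '*.interim' -print0 |
        sort -m --files0-from=- > final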
> > > I'm gearing up for a merging of a very large number of
> > > sorted text files(*). Does anyone know if there is an upper
> > > limit on how many sorted files can be merged using something
> > > like: "sort -m *.interim > final".
>
> I don't know about upper limits, but you might consider whether you
> wouldn't get much better performance from a multipass approach.
>
> > > Also, is it worth fiddling with the "--batch-size=NMERGE" option?
>
> Pretty much what I had in mind. Specifically, assuming 100-byte
> lines, merging 10 files at a time means 4GB in the first pass,
> comfortably fitting in your memory and allowing very efficient I/O.
> I'll bet that this is a big win (on the first pass only). On later
> passes, the performance analysis is non-trivial, but the I/O
> efficiency of having a big buffer for each file in the batch may
> outweigh the additional passes.
But does "sort -m ..." pull everything into RAM? If I were
implementing it I'd have a heap of open input files and pop
the individual files as needed. Last night I did a test with
~150 files. I don't know how it went about it, but it only used
a moderate amount of RAM, so I expect it's doing a classical
file merge.
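If anyone wants to experiment with Stephen's multipass idea, GNU sort
will do the batching itself. A sketch, assuming GNU sort's --batch-size
and -S options and GNU time's -v flag (1024 is just a guess at a
sensible batch size):

    # merge at most 1024 inputs at once; sort runs the extra
    # passes itself, spilling intermediate runs to $TMPDIR
    /usr/bin/time -v \
        sort -m --batch-size=1024 -S 2G *.interim > final

"Maximum resident set size" in the time -v report shows how much RAM
the merge actually used.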
> Do you expect the output file to be ~= 40x10^9 lines!? Or is some
> uniquification going to be applied? If so, I suspect that
> interleaving merge and uniquification passes will be a lot faster.
Yes, I'll be doing uniquification, in which identical lines are counted
and tagged with their frequency, e.g.

    this
    this

will become

    this\t2
I can't get sort to do that, and rather than worry about adding a
multi-file merge to my uniquification utility, I'll do it in a separate
pass.
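For the record, the standard tools get close to it: after a full merge
all duplicates are adjacent, which is exactly what uniq -c counts. A
sketch of that separate pass, assuming GNU sed (\t in the replacement
is a GNU extension); it rewrites uniq's "count line" into "line\tcount":

    sort -m *.interim |
        uniq -c |
        sed -E 's/^ *([0-9]+) (.*)$/\2\t\1/' > final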
> For quad core, see the --parallel option. This is better documented
> in the Info manual for coreutils than in the man page.
I can't see that option in either the man page or the info/coreutils
manual. I do see that when I run sort (but not sort -m) it goes
parallel by default; "top" shows the processor load going to 300+%.
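For anyone searching the archives later: I gather --parallel only went
into GNU coreutils around version 8.6, so an older sort won't document
it. Assuming a recent enough coreutils, usage would look like this
(bigfile and sorted are just placeholder names):

    sort --version        # check; --parallel needs coreutils >= 8.6

    # cap the worker threads and give sort a decent buffer
    sort --parallel=4 -S 4G bigfile > sorted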
Cheers
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University