Re: [tlug] Limits on file numbers in sort -m



On 28 May 2014 13:55, Stephen J. Turnbull <stephen@example.com> wrote:
> …章 writes:
>
>  >     A small precursor to consider is whether the filename expansion
>  >     (i.e. from *.interim to all the separate files) will exceed the
>  >     size of ARG_MAX. On my system you'd be OK (it's ~2M, i.e. more
>  >     than ~10k * 20 chars = ~200k)
>
> There are shell limits as well, even if ARG_MAX is huge Jim probably
> wants to use xargs.

I can't see where xargs would assist with "sort -m ....", as by definition
it wants all the files at once.
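
For what it's worth, a quick way to check the expansion against that
limit (assuming a GNU/Linux userland; the byte count below includes
each name's trailing NUL, which is roughly how the kernel accounts
for argv):

    getconf ARG_MAX                  # the system limit (~2M above)
    printf '%s\0' *.interim | wc -c  # bytes the expanded glob needs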

>  > > I'm gearing up for a merging of a very large number of
>  > > sorted text files(*). Does anyone know if there is an upper
>  > > limit on how many sorted files can be merged using something
>  > > like: "sort -m *.interim > final".
>
> I don't know about upper limits, but you might consider whether you
> wouldn't get much better performance from a multipass approach.
>
>  > > Also, is it worth fiddling with the "--batch-size=NMERGE" option?
>
> Pretty much what I had in mind.  Specifically, assuming 100-byte
> lines, merging 10 files at a time means 4GB in the first pass,
> comfortably fitting in your memory and allowing very efficient I/O.
> I'll bet that this is a big win (on the first pass only).  On later
> passes, the performance analysis is non-trivial, but the I/O
> efficiency of having a big buffer for each file in the batch may
> outweigh the additional passes.

But does "sort -m ..." pull everything into RAM? If I were
implementing it I'd keep a heap over the open input files and
pull the next line from whichever file currently has the smallest
head line. Last night I did a test with ~150 files. I don't know
how it went about it, but it only used a moderate amount of RAM,
so I expect it's doing a classical file merge.
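
For the record, if the single big merge does hit a limit, GNU sort
can do the batching itself; something like this is one option (the
100 and the -T scratch directory are arbitrary choices of mine):

    sort -m --batch-size=100 -T /scratch *.interim > final

As I understand it, with more than 100 inputs it merges 100 at a
time via temporary files in the -T directory, which is effectively
the multipass scheme done internally.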

> Do you expect the output file to be ~= 40x10^9 lines!?  Or is some
> uniquification going to be applied?  If so, I suspect that
> interleaving merge and uniquification passes will be a lot faster.

Yes, I'll be uniquifying, in which identical lines are counted
and tagged with their frequency:

    this
    this

will become

    this\t2

I can't get sort to do that, and rather than worry about adding a
multi-file merge to my uniquification utility, I'll do it in a separate
pass.
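
In case it's useful, that separate pass can sit on the end of the
merge as a one-liner, assuming GNU uniq and sed (portable sed would
want a literal tab in place of \t):

    sort -m *.interim | uniq -c | sed -E 's/^ *([0-9]+) (.*)/\2\t\1/' > final

uniq -c only counts adjacent duplicates, but the merged input is
already sorted, so that's exactly what's wanted here.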

> For quad core, see the --parallel option.  This is better documented
> in the Info manual for coreutils than in the man page.

I can't see that option in either the man page or the coreutils
Info manual. I do see that when I run sort (but not sort -m) it
goes parallel by default; "top" shows the processor load going
to 300+%.
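
If memory serves, --parallel only arrived in coreutils 8.6 (late
2010), so it may simply be missing from an older installation;
worth a quick check:

    sort --version | head -n1   # --parallel needs coreutils >= 8.6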

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University

