tlug Mailing List Archive
Re: [tlug] Limits on file numbers in sort -m
- Date: Wed, 28 May 2014 14:59:21 +0900
- From: 黒鉄章 <akira.kurogane@example.com>
- Subject: Re: [tlug] Limits on file numbers in sort -m
- References: <CABHGxq7jYkDDLkF8uzzNK8WeU+37t1wgpVhk6VD2HQKyEi7wBw@mail.gmail.com> <CAJMSLH618MfmhL9ufAOfLXxw52i4STpF8dsc_+xe-2GRB3JM8g@mail.gmail.com> <87bnui8sky.fsf@uwakimon.sk.tsukuba.ac.jp> <CABHGxq4NEBMVR8jndiEvcgsGkc_B0f-qcrs2sFjqaAdWH3n9sw@mail.gmail.com>
For each input file the sort process has open, the OS will buffer a memory
page or two (at least). 4k is the usual memory page size, I believe, so
4k * 10k files = ~40M, which doesn't sound bad at all. But if read-ahead
buffering is putting a lot more than a couple of pages per file in memory,
the total will be that many times larger. Actually I would expect that to
be happening, but I'd have faith in the OS limiting itself to avoid using
swap.

Regarding the count of occurrences, you could pipe the "sort -m ..." output
into "uniq -c". I've always been annoyed by the format of uniq (a
space-padded, fixed-width count as the first column), but if you can live
with that you'll get to what you want more quickly. uniq consumes its input
as it arrives, so it's not the case that all of the output of sort has to
stay in memory for as long as the process is running. Also, if duplicates
are common, the final output file saved to disk will be usefully smaller.
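Untested, but something along these lines is what I have in mind. The awk
step is only one way of turning uniq's padded count column into a trailing
tab-separated count; the *.interim glob and "final" are from your original
command:

  # merge the pre-sorted files, collapse duplicate lines, then move the
  # count to the end of each line as a tab-separated field
  sort -m *.interim | uniq -c \
    | awk '{ n = $1; sub(/^[ \t]*[0-9]+ /, ""); print $0 "\t" n }' > final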
Cheers,
Akira

On Wed, May 28, 2014 at 2:21 PM, Jim Breen <jimbreen@example.com> wrote:

On 28 May 2014 13:55, Stephen J. Turnbull <stephen@example.com> wrote:
> 黒鉄章 writes:
>
> > A small precursor to consider is if the filename expansion
> > (i.e. from *.interim to all the separate files) will exceed the
> > size of ARG_MAX. On my system you'd be OK (it's ~2M, i.e. more
> > than ~10k * 20 chars = ~200k)
>
> There are shell limits as well, even if ARG_MAX is huge Jim probably
> wants to use xargs.

I can't see where xargs would assist with "sort -m ...", as by definition
it wants all the files at once.
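(If the expanded argument list ever did get near ARG_MAX, then if memory
serves GNU sort can also take the file list on stdin instead of the command
line, which would sidestep the limit without xargs. A rough, untested
sketch, assuming the interim files sit in the current directory:

  # check the limit, then hand sort -m a NUL-separated list of file names
  getconf ARG_MAX
  find . -maxdepth 1 -name '*.interim' -print0 | sort -m --files0-from=- > final

But with ~200k of arguments against a ~2M limit, the plain glob should be
fine.)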
> > > I'm gearing up for a merging of a very large number of
> > > sorted text files(*). Does anyone know if there is an upper
> > > limit on how many sorted files can be merged using something
> > > like: "sort -m *.interim > final".
>
> I don't know about upper limits, but you might consider whether you
> wouldn't get much better performance from a multipass approach.
>
> > > Also, is it worth fiddling with the "--batch-size=NMERGE" option?
>
> Pretty much what I had in mind. Specifically, assuming 100-byte
> lines, merging 10 files at a time means 4GB in the first pass,
> comfortably fitting in your memory and allowing very efficient I/O.
> I'll bet that this is a big win (on the first pass only). On later
> passes, the performance analysis is non-trivial, but the I/O
> efficiency of having a big buffer for each file in the batch may
> outweigh the additional passes.

But does "sort -m ..." pull everything into RAM? If I were implementing it
I'd have a heap of open input files and pop lines from the individual files
as needed. Last night I did a test with ~150 files. I don't know how it
went about it, but it only used a moderate amount of RAM, so I expect it's
doing a classical file merge.
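(If the real ~10k-file run did blow out memory, one fallback would be to
make the batching explicit rather than doing manual passes. A rough sketch,
untested at that scale, with /big/tmp standing in for any partition with
enough scratch space:

  # merge at most 1000 inputs at a time; sort writes the intermediate
  # merge results to temporary files and then merges those
  sort -m --batch-size=1000 -T /big/tmp *.interim > final

That would basically be the multipass approach, but with sort managing the
passes itself.)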

> Do you expect the output file to be ~= 40x10^9 lines!? Or is some
> uniquification going to be applied? If so, I suspect that
> interleaving merge and uniquification passes will be a lot faster.

Yes, I'll be doing the uniquification, in which identical lines are counted
and tagged with their frequency.

this
this

will become

this\t2

I can't get sort to do that, and rather than worry about adding a
multi-file merge to my uniquification utility, I'll do it in a separate
pass.

> For quad core, see the --parallel option. This is better documented
> in the Info manual for coreutils than in the man page.

I can't see that option in either the man page or the info/coreutils
manual. I see when I run sort (but not sort -m) that it goes parallel by
default. "top" shows the processor load going to 300+%.
Cheers
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
--
To unsubscribe from this mailing list,
please see the instructions at http://lists.tlug.jp/list.html
The TLUG mailing list is hosted by ASAHI Net, provider of mobile and
fixed broadband Internet services to individuals and corporations.
Visit ASAHI Net's English-language Web page: http://asahi-net.jp/en/