tlug Mailing List Archive
Re: [tlug] Limits on file numbers in sort -m
- Date: Wed, 28 May 2014 14:59:21 +0900
- From: 黒鉄章 <akira.kurogane@example.com>
- Subject: Re: [tlug] Limits on file numbers in sort -m
- References: <CABHGxq7jYkDDLkF8uzzNK8WeU+37t1wgpVhk6VD2HQKyEi7wBw@mail.gmail.com> <CAJMSLH618MfmhL9ufAOfLXxw52i4STpF8dsc_+xe-2GRB3JM8g@mail.gmail.com> <87bnui8sky.fsf@uwakimon.sk.tsukuba.ac.jp> <CABHGxq4NEBMVR8jndiEvcgsGkc_B0f-qcrs2sFjqaAdWH3n9sw@mail.gmail.com>
For each input file the sort process has open, the OS will buffer a memory
page or two (at least). 4k is the usual memory page size, I believe, so
4k * 10k files = ~40M, which doesn't sound bad at all. But if read-ahead
buffering is putting a lot more than a couple of pages per file in memory,
the total will be that many times larger. Actually I would expect that to
be happening, but I'd have faith in the OS limiting itself to avoid using
swap.

Regarding the count of occurrences, you could pipe the "sort -m ..." output
into "uniq -c". I've always been annoyed by the format of uniq (a
space-padded, fixed-width count as the first column), but if you can live
with that you'll get to what you want more quickly. uniq consumes its input
as it arrives, so it's not the case that all of the output of sort has to
stay in memory for as long as the process is running. Also, if duplicates
are common, the final output file saved to disk will be usefully smaller.
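Untested, but something along these lines is what I have in mind. The awk
step is only one way of turning uniq's padded count column into a trailing
tab-separated count; the *.interim glob and "final" are from your original
command:

  # merge the pre-sorted files, collapse duplicate lines, then move the
  # count to the end of each line as a tab-separated field
  sort -m *.interim | uniq -c \
    | awk '{ n = $1; sub(/^[ \t]*[0-9]+ /, ""); print $0 "\t" n }' > final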
Cheers,
Akira

On Wed, May 28, 2014 at 2:21 PM, Jim Breen <jimbreen@example.com> wrote:

On 28 May 2014 13:55, Stephen J. Turnbull <stephen@example.com> wrote:
> 黒鉄章 writes:
>
> > A small precursor to consider is if the filename expansion
> > (i.e. from *.interim to all the separate files) will exceed the
> > size of ARG_MAX. On my system you'd be OK (it's ~2M, i.e. more
> > than ~10k * 20 chars = ~200k)
>
> There are shell limits as well, even if ARG_MAX is huge Jim probably
> wants to use xargs.

I can't see where xargs would assist with "sort -m ...", as by definition
it wants all the files at once.
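(If the expanded argument list ever did get near ARG_MAX, then if memory
serves GNU sort can also take the file list on stdin instead of the command
line, which would sidestep the limit without xargs. A rough, untested
sketch, assuming the interim files sit in the current directory:

  # check the limit, then hand sort -m a NUL-separated list of file names
  getconf ARG_MAX
  find . -maxdepth 1 -name '*.interim' -print0 | sort -m --files0-from=- > final

But with ~200k of arguments against a ~2M limit, the plain glob should be
fine.)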
> > > I'm gearing up for a merging of a very large number of
> > > sorted text files(*). Does anyone know if there is an upper
> > > limit on how many sorted files can be merged using something
> > > like: "sort -m *.interim > final".
>
> I don't know about upper limits, but you might consider whether you
> wouldn't get much better performance from a multipass approach.
>
> > > Also, is it worth fiddling with the "--batch-size=NMERGE" option?
>
> Pretty much what I had in mind. Specifically, assuming 100-byte
> lines, merging 10 files at a time means 4GB in the first pass,
> comfortably fitting in your memory and allowing very efficient I/O.
> I'll bet that this is a big win (on the first pass only). On later
> passes, the performance analysis is non-trivial, but the I/O
> efficiency of having a big buffer for each file in the batch may
> outweigh the additional passes.

But does "sort -m ..." pull everything into RAM? If I were implementing it
I'd have a heap of open input files and pop lines from the individual files
as needed. Last night I did a test with ~150 files. I don't know how it
went about it, but it only used a moderate amount of RAM, so I expect it's
doing a classical file merge.
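(If the real ~10k-file run did blow out memory, one fallback would be to
make the batching explicit rather than doing manual passes. A rough sketch,
untested at that scale, with /big/tmp standing in for any partition with
enough scratch space:

  # merge at most 1000 inputs at a time; sort writes the intermediate
  # merge results to temporary files and then merges those
  sort -m --batch-size=1000 -T /big/tmp *.interim > final

That would basically be the multipass approach, but with sort managing the
passes itself.)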

> Do you expect the output file to be ~= 40x10^9 lines!? Or is some
> uniquification going to be applied? If so, I suspect that
> interleaving merge and uniquification passes will be a lot faster.

Yes, I'll be doing the uniquification, in which identical lines are counted
and tagged with their frequency.

this
this

will become

this\t2

I can't get sort to do that, and rather than worry about adding a
multi-file merge to my uniquification utility, I'll do it in a separate
pass.

> For quad core, see the --parallel option. This is better documented
> in the Info manual for coreutils than in the man page.

I can't see that option in either the man page or the info/coreutils
manual. I see when I run sort (but not sort -m) that it goes parallel by
default. "top" shows the processor load going to 300+%.
Cheers
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
--
To unsubscribe from this mailing list,
please see the instructions at http://lists.tlug.jp/list.html
The TLUG mailing list is hosted by ASAHI Net, provider of mobile and
fixed broadband Internet services to individuals and corporations.
Visit ASAHI Net's English-language Web page: http://asahi-net.jp/en/