
Re: [tlug] Limits on file numbers in sort -m



On 28 May 2014 15:59, 黒鉄章 <akira.kurogane@example.com> wrote:
> For each input file the sort process has open, the OS will buffer at least
> a memory page or two. 4 KB is the usual memory page size, I believe.
>
> 4 KB * 10k files = 40 MB, which doesn't sound bad at all. But if read-ahead
> buffering is putting a lot more than a couple of pages per file in memory,
> the total will be that many times larger. Actually, I would expect that to
> be happening, but I would have faith in the OS limiting itself to avoid
> using swap.

That's pretty much my understanding of it. It would be ultimate silliness to
have read-only input pages end up replicated in swap.
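
For what it's worth, here is a rough way to sanity-check those numbers on a
Linux box (just a sketch, assuming GNU coreutils sort; the input file names
are placeholders):

    getconf PAGESIZE              # typically 4096 bytes per page
    ulimit -n                     # per-process limit on open files
    # GNU sort can cap how many inputs it merges in one pass; beyond that
    # it spills to temporary files rather than holding everything open:
    sort -m --batch-size=1024 part-*.txt > merged.txt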

> Regarding the count of occurrences, you could pipe the "sort -m ...." into
> "uniq -c". I've always been annoyed by the format of uniq (a space-padded,
> fixed-width count as the first column), but if you can live with that you'll
> get to what you want more quickly. The pipe to uniq will consume its input
> buffer very quickly, so it's not the case that all of the output of sort
> must stay in memory for as long as the process is running. Also, if
> duplicates are common, your final output file saved to disk will be
> usefully smaller.
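
For concreteness, the suggested pipeline is just the following (assuming the
inputs are already sorted, as "sort -m" requires; file names are
placeholders):

    sort -m part-*.txt | uniq -c > counts.txt
    # uniq -c prints a space-padded count before each distinct line.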

In any case the output from "uniq -c" is not what I want, so since I'd need
to reformat it anyway, it's easier to use my own utility. It also gives me
the option of turning

this  3
this  4

into

this  7

which I can't do with "uniq -c".
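
For illustration only (this is just a sketch, not the utility mentioned
above): since the merged stream is sorted, that folding could also be done
with a streaming awk one-liner, assuming each line is simply a key followed
by a count:

    sort -m part-*.txt | awk '
        $1 != prev { if (NR > 1) print prev, total; prev = $1; total = 0 }
                   { total += $2 }
        END        { if (NR > 0) print prev, total }'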

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University

