Mailing List Archive



Re: [tlug] Limits on file numbers in sort -m



On 2014-05-30 23:22, Bruno Raoult wrote:
> Also, and if I understood, the only issue is the number of possible opened
> files per process (especially
> per user, and that you are not able to change the system limits for a
> specific user (=this is not your
> machine)), right?

Thank you for pointing that out again, as it reminded me of practical
issues. ;)

In my implementation, however, each [nmerge] process has at most 6 files
open at any given time, so it does not run into such limits.

To avoid any ARG_MAX issues, I just added a -r flag that enables
recursion.  A command such as `nmerge -r -o output.txt input` will recurse
through the input directory, merging all files found.

https://github.com/TravisCardwell/merge/commit/22747404a7b23ad40c267131940b27b9f5e33c32
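As a rough illustration of why recursion sidesteps ARG_MAX (this is a Python sketch, not the Go code in the commit above): the file names are discovered by walking the directory inside the program, so they never have to fit on a command line. The function name `find_input_files` is made up for this example.

```python
import os


def find_input_files(root):
    """Yield the path of every file found under root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place makes os.walk visit subdirectories deterministically.
        dirnames.sort()
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Because this is a generator, the caller can consume the paths one at a time without ever building a long argument list.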

> Or you have disk space limit, so you can't do simple things, such as using
> double space from original files.

Note that if you want to keep the original files, then ~triple the space
is required.

>  - either on #files opened (sort | uniq will work)

I am not convinced that this is an issue, as the --batch-size option of
sort can be used to specify how many files are opened at once.
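The general idea behind limiting the batch size can be sketched in Python as follows. This is only an illustration of merging in rounds, and it is not claimed to be how sort implements the option. It assumes the inputs are pre-sorted text files in which every line ends with a newline; the names `merge_in_batches` and `_merge_files` are invented for the example.

```python
import contextlib
import heapq
import os
import tempfile


def _merge_files(paths, out):
    """Merge the pre-sorted files in paths into the open file object out."""
    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in paths]
        out.writelines(heapq.merge(*files))


def merge_in_batches(paths, out_path, batch_size=16):
    """Merge pre-sorted files, opening at most batch_size inputs at once."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    paths = list(paths)
    temps = []
    # Each round replaces every group of batch_size files with one merged file.
    while len(paths) > batch_size:
        next_round = []
        for i in range(0, len(paths), batch_size):
            fd, tmp = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as out:
                _merge_files(paths[i:i + batch_size], out)
            temps.append(tmp)
            next_round.append(tmp)
        paths = next_round
    with open(out_path, "w") as out:
        _merge_files(paths, out)
    for tmp in temps:
        os.remove(tmp)
```

At any moment, at most batch_size inputs plus one output are open, at the cost of extra passes over the data and temporary disk space.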

> - either on disk space (no need to use uniq at temp stage, only at last one)

Reducing during the merge is only beneficial if the number of reductions
is significant.  If it is not, then either approach takes O(N log M) time,
where N is the total number of lines and M is the number of (roughly
equal-sized, pre-sorted) input files.
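For illustration, a merge that reduces as it goes can be sketched in Python with the standard library (again, not the nmerge code; `merge_and_count` is a name invented here). `heapq.merge` keeps one entry per input in a heap, which is where the log M factor comes from, and `itertools.groupby` collapses each run of equal lines into a single (line, count) pair.

```python
import heapq
from itertools import groupby


def merge_and_count(sorted_inputs):
    """Merge pre-sorted iterables of lines and yield (line, count) pairs."""
    merged = heapq.merge(*sorted_inputs)
    # Equal lines are adjacent in the merged stream, so groupby sees each
    # distinct line exactly once.
    for line, group in groupby(merged):
        yield line, sum(1 for _ in group)
```

For example, `list(merge_and_count([["a", "b"], ["a", "c"]]))` gives `[("a", 2), ("b", 1), ("c", 1)]`.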

> - if you don't care CPU processing limit, a small script and a small DB can
> do everything (sqlite, etc...). It could be
> expensive, but not so much, if the insert script makes a "+1" to a given
> key on insert, and real-time update disabled.
> You will not even need to keep the original files sorted.

The time requirement for this is O(N log N).  You no longer need to keep
the input files sorted, but you gain nothing if they are already sorted.

I usually use databases for tasks such as this one, btw. :)

> You will get your
> output (key/number) immediately.

A select (on an indexed column) is O(log N), not immediate (O(1)).
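A minimal sketch of the database approach, using Python's sqlite3 module with an in-memory database. The table name `counts`, its columns, and the function names are assumptions made for this example. The primary key gives the key column an index, so each update, insert, or select is a tree lookup rather than a constant-time operation.

```python
import sqlite3


def count_lines(lines):
    """Count occurrences of each line; the input need not be sorted."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE counts (key TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    with conn:  # one transaction for all rows, committed on exit
        for line in lines:
            cur = conn.execute(
                "UPDATE counts SET n = n + 1 WHERE key = ?", (line,)
            )
            if cur.rowcount == 0:  # key not seen before
                conn.execute(
                    "INSERT INTO counts (key, n) VALUES (?, 1)", (line,)
                )
    return conn


def lookup(conn, key):
    """Return the count stored for key, or 0 if the key is absent."""
    row = conn.execute(
        "SELECT n FROM counts WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else 0
```

Doing all the writes in a single transaction is one way to approximate the "real-time update disabled" suggestion in the quoted text.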

> Instead of creating a new program, I would try everything else first, to be
> sure nothing is possible with existing commands...
>
> Just to avoid maintenance.

With tasks such as this one, which can likely be solved using a few
standard Unix utilities, I totally agree.  (My implementation was an
exercise in Go parallelization.)

In general, however, I find that shell scripts require more maintenance
than scripts/programs that are written in a more capable language.

Cheers,

Travis

