
Re: [tlug] Limits on file numbers in sort -m



On Sun, Jun 1, 2014 at 8:14 AM, Travis Cardwell <travis.cardwell@example.com> wrote:
> I am not convinced that this is an issue, as --batch-size can be used to
> specify how many files are opened at once.

I am not sure I understand the algorithm in this case (merge). Let's say the maximum number of inputs is 3, and one of them is closed (a new file opened). There is no way to compare its lines to the old ones without starting a new search (besides keeping a lot in memory, where we can hit limits, which is surely not what people want for a simple merge). Or maybe sort just opens and closes files all the time.
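
My guess (an assumption on my part, I have not checked the coreutils source) is that sort merges at most --batch-size files at a time into a temporary file, then merges those intermediates the same way, so it never needs to re-compare against a closed input. A rough Python sketch of that strategy:

    import heapq
    import tempfile

    BATCH = 3  # stand-in for --batch-size

    def merge_batch(paths):
        """Merge one batch of sorted files into a sorted temporary file."""
        out = tempfile.NamedTemporaryFile(mode="w", delete=False)
        files = [open(p) for p in paths]
        try:
            out.writelines(heapq.merge(*files))  # k-way merge, k <= BATCH
        finally:
            for f in files:
                f.close()
            out.close()
        return out.name

    def merge_all(paths):
        """Merge batches repeatedly until a single sorted file remains."""
        while len(paths) > 1:
            paths = [merge_batch(paths[i:i + BATCH])
                     for i in range(0, len(paths), BATCH)]
        return paths[0]  # this sketch leaves intermediate temp files behind

Each pass keeps at most BATCH files open, at the cost of rewriting the data once per level of merging.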
 
>> - if you don't care about the CPU processing limit, a small script
>> and a small DB can do everything (sqlite, etc.). It could be
>> expensive, but not so much, if the insert script makes a "+1" to a
>> given key on insert, with real-time updates disabled. You will not
>> even need to keep the original files sorted.

> The time requirement for this is O(N log N). You no longer need to
> keep the input files sorted, but you gain nothing if they are already
> sorted.

You are assuming a particular DB algorithm, which you should not do "a priori" (a tree is not a hash, etc.).
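
To make the "+1 to a given key on insert" idea concrete, here is a minimal sqlite3 sketch (file and table names are placeholders, and the UPSERT syntax needs SQLite 3.24 or later):

    import sqlite3

    con = sqlite3.connect("counts.db")  # placeholder DB file
    con.execute(
        "CREATE TABLE IF NOT EXISTS counts (key TEXT PRIMARY KEY, n INTEGER)")

    with con:  # a single transaction, i.e. no per-row "real-time" commits
        for name in ("a.txt", "b.txt"):  # placeholder inputs, sorted or not
            with open(name) as f:
                con.executemany(
                    "INSERT INTO counts VALUES (?, 1) "
                    "ON CONFLICT(key) DO UPDATE SET n = n + 1",
                    ((line.rstrip("\n"),) for line in f))

    for key, n in con.execute("SELECT key, n FROM counts ORDER BY key"):
        print(key, n)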
 

> I usually use databases for tasks such as this one, btw. :)

Well. That makes at least two of us.
 
>> You will get your output (key/number) immediately.

> A select (on an indexed column) is O(log N), not immediate (O(1)).

A perfect hash would give O(1), given a finite set of keys and no constraint on the DB size. Again, your O(log N) only holds for a specific kind of DB (a tree-based index, not a hash).
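
For comparison, an in-memory hash table gives amortized O(1) per update, which is what a hash-indexed store approximates. A tiny Python sketch of the same key/count job (file names are placeholders):

    from collections import Counter

    counts = Counter()  # hash table: amortized O(1) per lookup/update
    for name in ("a.txt", "b.txt"):  # placeholder input files
        with open(name) as f:
            for line in f:
                counts[line.rstrip("\n")] += 1

    for key, n in counts.items():
        print(key, n)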
 

> In general, however, I find that shell scripts require more
> maintenance than scripts/programs that are written in a more capable
> language.

I don't know: if a shell script can do the task, I don't see how Java or C or C++ would be easier to maintain...

br.

--
2 + 2 = 5, for very large values of 2.

