Mailing List Archive



Re: [tlug] Limits on file numbers in sort -m



On Fri, May 30, 2014 at 3:14 AM, Jim Breen <jimbreen@example.com> wrote:
On 30 May 2014 10:43, Travis Cardwell <travis.cardwell@example.com> wrote:

> The `sort -m` command does not sum counts, which is why Jim said that he
> will need to use external software to do so.

Exactly, and since I'm aggregating counts, I can't use "uniq -c".

I'm building an n-gram corpus from a large text corpus. So as I
work through the text, I'm collecting things like the 4-gram:

これ は 何 です

As I'm merging and counting I'll have interim files such as;

file-n: これ は 何 です 19

file-m: これ は 何 です 27

OK. I had not understood that you need to keep the interim counts (maybe for some other use).
I thought you needed only the final result, with uniq run at the final stage. If you don't need them, I am
not sure why the intermediate uniq is necessary (given that disk space is not an issue).
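
For reference, summing the interim counts during a merge could be sketched as below. This is only a rough illustration, not a tested tool: it assumes each interim file has one "n-gram count" line per n-gram, sorted by n-gram in plain code-point order (what `LC_ALL=C sort` gives for UTF-8), and the names `parse` and `merge_counts` are made up for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def parse(line):
    # Split "これ は 何 です 19" into ("これ は 何 です", 19).
    key, count = line.rstrip("\n").rsplit(" ", 1)
    return key, int(count)


def merge_counts(*streams):
    """Merge key-sorted streams of "n-gram count" lines, summing counts per n-gram."""
    parsed = (map(parse, stream) for stream in streams)
    # heapq.merge reads the inputs lazily, so only one line per stream is held in memory.
    merged = heapq.merge(*parsed, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


# Stand-ins for two interim files; real code would pass open file objects instead.
file_n = ["これ は 何 です 19\n"]
file_m = ["これ は 何 です 27\n", "それ は 何 です 3\n"]

for key, total in merge_counts(file_n, file_m):
    print(key, total)
```

This still opens every input at once, so with a low open-file limit the merge would have to be done in batches, writing each batch to a new interim file.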

Also, if I understood correctly, the only issue is the number of open files allowed per process (especially
per user), and you are not able to change the system limits for a specific user (this is not your
machine), right?
Or do you have a disk space limit, so you can't do simple things such as using twice the space of the original files?
Or do you have a processing time limit?
If you can relax just one of these limits, solutions are possible:
- on the number of open files: sort | uniq will work
- on disk space: no need to use uniq at the intermediate stages, only at the last one
- on CPU time: a small script and a small DB (sqlite, etc.) can do everything. It could be
expensive, but not too much so if the insert script adds "+1" to a given key on insert and real-time updates are disabled.
You would not even need to keep the original files sorted, and your output (key/number) would be available immediately.
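
To make the DB option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `ngrams`, the column names and the helper `add_counts` are all invented for the example, and I am reading "real-time update disabled" as "commit once per batch rather than once per row", which is only my interpretation.

```python
import sqlite3


def add_counts(conn, pairs):
    """Add each (ngram, count) pair to the running total stored for that n-gram."""
    pairs = list(pairs)
    with conn:  # a single transaction (one commit) for the whole batch
        # Create a zero row for any n-gram not seen before.
        conn.executemany(
            "INSERT OR IGNORE INTO ngrams (ngram, total) VALUES (?, 0)",
            ((ngram,) for ngram, _ in pairs),
        )
        # Then add the new counts to the stored totals.
        conn.executemany(
            "UPDATE ngrams SET total = total + ? WHERE ngram = ?",
            ((count, ngram) for ngram, count in pairs),
        )


# An in-memory DB for illustration; a real run would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngrams (ngram TEXT PRIMARY KEY, total INTEGER NOT NULL)")

add_counts(conn, [("これ は 何 です", 19)])
add_counts(conn, [("それ は 何 です", 3), ("これ は 何 です", 27)])

for ngram, total in conn.execute("SELECT ngram, total FROM ngrams ORDER BY ngram"):
    print(ngram, total)
```

The input order does not matter here, since the primary key does the matching, and the sorted key/number list comes out of the final SELECT.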

Before writing a new program, I would try everything else first, to be sure that nothing can be done with existing commands...

Just to avoid maintenance.

br.

--
2 + 2 = 5, for very large values of 2.
