Mailing List Archive



Re: [tlug] Limits on file numbers in sort -m



On 30 May 2014 10:43, Travis Cardwell <travis.cardwell@example.com> wrote:

> The `sort -m` command does not sum counts, which is why Jim said that he
> will need to use external software to do so.

Exactly, and since I'm aggregating counts, I can't use "uniq -c".

I'm building an n-gram corpus from a large text corpus. So as I
work through the text, I'm collecting things like the 4-gram:

これ は 何 です

As I'm merging and counting, I'll have interim files such as:

file-n: これ は 何 です 19

file-m: これ は 何 です 27

leading ultimately to:

file-x: これ は 何 です 46

"sort -m" is the thing to use for merging the presorted
initial and intermediate files, but I still need my own utility
to aggregate them, because it has to sum the interim counts for
identical n-grams rather than just count repeated lines.
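
As a rough illustration of that aggregation step, here is a minimal
Python sketch (not the actual utility). It assumes each line has the
form "n-gram count" with the count as the last space-separated field,
and that every input is already sorted by n-gram in code point order
(what "LC_ALL=C sort" gives for UTF-8 text). The names split_line and
merge_and_sum are made up for this example.

```python
import heapq
from itertools import groupby


def split_line(line):
    """Split 'w1 w2 ... wn count' into (ngram, count)."""
    ngram, count = line.rstrip("\n").rsplit(" ", 1)
    return ngram, int(count)


def merge_and_sum(*sources):
    """Merge presorted sources and sum the counts of identical n-grams.

    Each source is an iterable of lines (e.g. an open file) that is
    assumed to be sorted by n-gram already.
    """
    parsed = (map(split_line, src) for src in sources)
    # Lazy k-way merge on the n-gram key, similar in spirit to `sort -m`.
    merged = heapq.merge(*parsed, key=lambda rec: rec[0])
    # Identical n-grams are now adjacent, so they can be summed in one pass.
    for ngram, group in groupby(merged, key=lambda rec: rec[0]):
        yield ngram, sum(count for _, count in group)


if __name__ == "__main__":
    file_n = ["これ は 何 です 19\n"]
    file_m = ["これ は 何 です 27\n"]
    for ngram, total in merge_and_sum(file_n, file_m):
        print(ngram, total)  # prints: これ は 何 です 46
```

The same summing pass would also work on the output of "sort -m"
read from standard input; the merge is only done in Python here to
keep the example self-contained.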

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University

