Re: [tlug] Limits on file numbers in sort -m



On 2014-05-30 05:21, Bruno Raoult wrote:
> So "uniq *" was able to read files, but "sort -m *" was not, right?
> And a "uniq | sort | uniq" is not possible???
> 
> I am stupid, I don't understand the issue at all :-(, and I would like
> to understand clearly, with output of commands if possible...

I can be very specific using types...  A strongly-typed sort command would
take a list of orderable elements and return a list of the same (but in
sorted order):

sort :: Ord a => [a] -> [a]

A strongly-typed uniq command (as used) would take a (sorted) list of
elements which can be compared for equality and return a list of elements
with associated counts:

uniq :: Eq a => [a] -> [(a, Int)]

In a strongly-typed shell, `uniq | sort` (`sort . uniq` in function
composition syntax) would have type:

(sort . uniq) :: (Eq a, Ord a) => [a] -> [(a, Int)]

`uniq | sort | uniq` would therefore have type:

(uniq . sort . uniq) :: (Eq a, Ord a) => [a] -> [((a, Int), Int)]

As you can see from the return value ([((a, Int), Int)]), the result is a
list of element+count pairs (from the first uniq) with associated counts
(from the second uniq).  Our shell is not strongly-typed, but the result
is essentially the same when passing around strings: the second uniq counts
the count lines instead of summing the counts, so the pipeline does not meet
the requirements. [1]

What is needed is a command that sums the counts of equal elements when
merging.  In the style of a merge sort:

merge :: Ord a => [(a, Int)] -> [(a, Int)] -> [(a, Int)]

The `sort -m` command does not sum counts, which is why Jim said that he
will need to use external software to do so.
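
As a rough sketch of what such count-summing could look like with standard
tools (not necessarily what Jim ended up using): assuming every input file
holds `uniq -c` output, one "count word" pair per line with no whitespace
inside the words, awk can add up the counts per word.  The name merge_counts
is made up for this example.

```shell
# merge_counts FILE...: sum the counts of equal words across files of
# `uniq -c` output.  Assumes lines of the form "<count> <word>".
# awk keeps one total per distinct word in memory; its for-in order is
# unspecified, so the result is sorted by word afterwards.
merge_counts() {
    awk '{ total[$2] += $1 }
         END { for (w in total) print total[w], w }' "$@" |
        sort -k2
}
```

For instance, `merge_counts words.12 words.34 > words.1234` would give one
line per word with the summed count.  Unlike a true streaming merge, this
holds every distinct word in memory, so it only suits data that fits there.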

Cheers,

Travis

[1] Check the output of the following commands:

$ sort -R /usr/share/dict/words | head -n 30000 | sort > words.1
$ sort -R /usr/share/dict/words | head -n 30000 | sort > words.2
$ sort -R /usr/share/dict/words | head -n 30000 | sort > words.3
$ sort -R /usr/share/dict/words | head -n 30000 | sort > words.4

$ sort -m words.1 words.2 | uniq -c > words.12
$ sort -m words.3 words.4 | uniq -c > words.34

$ sort -m words.12 words.34 | uniq -c > words.1234
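
The effect is easier to see on a tiny input.  A minimal sketch, assuming a
POSIX shell and `uniq -c`; the function names and the sample words are made
up for this illustration:

```shell
# count_chunk WORD...: emit "count word" lines for already-sorted words,
# standing in for one pre-counted file such as words.12.
count_chunk() {
    printf '%s\n' "$@" | uniq -c
}

# recount: run two pre-counted chunks through `sort | uniq -c` again.
# The second uniq sees whole "count word" lines, so it counts those
# lines; it never adds the numbers together.
recount() {
    { count_chunk apple apple pear; count_chunk apple pear pear; } | sort | uniq -c
}

recount
```

Every output line carries an outer count of 1 in front of the original count
and word: "apple" shows up twice (once with 2, once with 1) instead of once
with a total of 3.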

