tlug Mailing List Archive
- Date: Fri, 18 Aug 2006 10:00:58 +0900
- From: <stephen@example.com>
- Subject: [tlug] How to figure out which files are duplicates . . . . . . . . . (was Re: Seeking recommendations for file consolidation)
- References: <44E44C72.4050509@example.com> <20060817135036.716b39e1.jep200404@example.com>
Jim writes:

 > > I know that there's the command line "diff" command.  But my
 > > understanding of that is that it compares two individual files
 > > and you can learn how different they are.

 > Yup.  You don't care what the differences are, only that the files
 > are exactly the same or not.
 >
 > cmp is the command for that.

diff -q --text is hardly any less efficient than cmp, and diff has an
-r switch (but that probably doesn't help Dave terribly much, since
there are still going to be a lot of pairs of directories).

 > Timestamps on files might be misleading.
 > People often unnecessarily 'touch' files.
 > Which is the latest version of some library file?

This is why I recommend git; that way he can save the versions,
automatically "link" dupes together, and only have the heuristically
latest visible.  If for some reason the heuristically latest seems
unsatisfactory, he can trivially recover "older" versions of the file
with `git checkout'.

 > BTW, to compare between _directories_, you might have to do
 > something like:
 >
 >     find /mnt/cd1 -type f -exec md5sum {} \; | sort >cd1.md5sum.sort
 >     find /mnt/cd2 -type f -exec md5sum {} \; | sort >cd2.md5sum.sort
 >
 > Then I hack a script to compare cd1.md5sum.sort and cd2.md5sum.sort,
 > and take whatever action is appropriate.  As a hedge, I've usually
 > decided to rename the duplicates as originalname.duplicate, so as
 > not to prematurely burn bridges.  After I'm satisfied that the
 > right duplicates have been identified, then I

Why do all this hacking on a per-case basis when most of it has
already been done by Linus and Junio?  With a much stronger hash
function?

 > Instead of rm'ing duplicates, you could hard or soft link them
 > (with ln), so that the extra space is freed.  Of course, the
 > different information _about_ the file (such as date or
 > permissions) is lost.

git has no such problem.

 > > Can anyone recommend something suitable?
 >
 > Just keep the CDs in a shoe box until the contents are irrelevant,
 > then dispose of them.

Mostly they already are; Dave doesn't know which, and that's what he
wants to find out.

N.B.  A shoe box's worth of space in Japan is enough to be worth
budgeting.  Ask Josh, who's in pain because he needs to buy CDs to
rip, and has no place to put the used media.
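
For concreteness, the -r point above amounts to something like the
following (using Jim's hypothetical mount points):

    # Report only *whether* corresponding files differ, recursing into
    # subdirectories; files present on only one CD are also reported.
    diff -rq --text /mnt/cd1 /mnt/cd2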
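
And a minimal sketch (not in Jim's post) of the comparison script he
alludes to, assuming the two sorted lists above and md5sum's usual
"HASH  PATH" output; filenames containing whitespace would need more
care:

    # join matches on the first field (the md5 hash), so any hash
    # present in both sorted lists marks a pair of duplicates.
    join cd1.md5sum.sort cd2.md5sum.sort |
    while read -r sum file1 file2; do
        printf 'duplicate: %s == %s\n' "$file1" "$file2"
        # once the trees are copied to writable disk, this is where one
        # would rename the second copy to "$file2.duplicate" (or ln it),
        # per Jim's hedge about not prematurely burning bridges.
    done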
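
The git workflow I'm recommending would look roughly like this,
assuming the CD contents have already been copied somewhere writable;
the repository name and commit messages are purely illustrative:

    # git stores identical file contents only once, whatever the path
    # or timestamp, so importing each CD as a commit "links" the dupes
    # and leaves only the most recently imported version visible.
    git init consolidated && cd consolidated
    cp -a /mnt/cd1/. . && git add -A && git commit -m 'import cd1'
    cp -a /mnt/cd2/. . && git add -A && git commit -m 'import cd2'
    # an "older" version of any file remains trivially recoverable:
    git checkout HEAD^ -- path/to/some/file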