tlug Mailing List Archive
- Date: Fri, 18 Aug 2006 10:00:58 +0900
- From: <stephen@example.com>
- Subject: [tlug] How to figure out which files are duplicates . . . . . . . . . (was Re: Seeking recommendations for file consolidation)
- References: <44E44C72.4050509@example.com> <20060817135036.716b39e1.jep200404@example.com>
Jim writes:

 > > I know that there's the command line "diff" command.  But my
 > > understanding of that is that it compares two individual files
 > > and you can learn how different they are.

 > Yup.  You don't care what the differences are, only that the files
 > are exactly the same or not.
 >
 > cmp is the command for that.

diff -q --text is hardly any less efficient than cmp, and diff has an
-r switch (but that probably doesn't help Dave terribly much, since
there are still going to be a lot of pairs of directories).

 > Timestamps on files might be misleading.
 > People often unnecessarily 'touch' files.
 > Which is the latest version of some library file?

This is why I recommend git; that way he can save the versions,
automatically "link" dupes together, and only have the heuristically
latest visible.  If for some reason the heuristically latest seems
unsatisfactory, he can trivially recover "older" versions of the file
with `git checkout'.

 > BTW, to compare between _directories_, you might have to do
 > something like:
 >
 >     find /mnt/cd1 -type f -exec md5sum {} \; | sort >cd1.md5sum.sort
 >     find /mnt/cd2 -type f -exec md5sum {} \; | sort >cd2.md5sum.sort
 >
 > Then I hack a script to compare cd1.md5sum.sort and cd2.md5sum.sort,
 > and take whatever action is appropriate.  As a hedge, I've usually
 > decided to rename the duplicates as originalname.duplicate, so as
 > not to prematurely burn bridges.  After I'm satisfied that the
 > right duplicates have been identified, then I

Why do all this hacking on a per-case basis when most of it has
already been done by Linus and Junio?  With a much stronger hash
function?

 > Instead of rm'ing duplicates, you could hard or soft link them
 > (with ln), so that the extra space is freed.  Of course, the
 > different information _about_ the file (such as date or
 > permissions) is lost.

git has no such problem.

 > > Can anyone recommend something suitable?
 >
 > Just keep the CDs in a shoe box until the contents are irrelevant,
 > then dispose of them.

Mostly they already are; Dave doesn't know which, and that's what he
wants to find out.

N.B.  A shoe box's worth of space in Japan is enough to be worth
budgeting.  Ask Josh, who's in pain because he needs to buy CDs to
rip, and has no place to put the used media.
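
For concreteness, the -r point above amounts to something like the
following (using Jim's hypothetical mount points):

    # Report only *whether* corresponding files differ, recursing into
    # subdirectories; files present on only one CD are also reported.
    diff -rq --text /mnt/cd1 /mnt/cd2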
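
And a minimal sketch (not in Jim's post) of the comparison script he
alludes to, assuming the two sorted lists above and md5sum's usual
"HASH  PATH" output; filenames containing whitespace would need more
care:

    # join matches on the first field (the md5 hash), so any hash
    # present in both sorted lists marks a pair of duplicates.
    join cd1.md5sum.sort cd2.md5sum.sort |
    while read -r sum file1 file2; do
        printf 'duplicate: %s == %s\n' "$file1" "$file2"
        # once the trees are copied to writable disk, this is where one
        # would rename the second copy to "$file2.duplicate" (or ln it),
        # per Jim's hedge about not prematurely burning bridges.
    done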
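
The git workflow I'm recommending would look roughly like this,
assuming the CD contents have already been copied somewhere writable;
the repository name and commit messages are purely illustrative:

    # git stores identical file contents only once, whatever the path
    # or timestamp, so importing each CD as a commit "links" the dupes
    # and leaves only the most recently imported version visible.
    git init consolidated && cd consolidated
    cp -a /mnt/cd1/. . && git add -A && git commit -m 'import cd1'
    cp -a /mnt/cd2/. . && git add -A && git commit -m 'import cd2'
    # an "older" version of any file remains trivially recoverable:
    git checkout HEAD^ -- path/to/some/file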