[tlug] How to figure out which files are duplicates (was Re: Seeking recommendations for file consolidation)
- Date: Thu, 17 Aug 2006 13:50:36 -0400
- From: Jim <jep200404@example.com>
- Subject: [tlug] How to figure out which files are duplicates . . . . . . . . . (was Re: Seeking recommendations for file consolidation)
- References: <44E44C72.4050509@example.com>
Just keep the CDs in a shoe^H^H^H^Hsandalwood box until the contents are irrelevant, then dispose of them.

Dave M G wrote:

> I have a whole bunch of back up CD-ROMs from the last few years. Mostly
> from my days as a Windows user.
> Now that my current computer has many gigabytes of free space,

Ahh, nature abhors a vacuum. The junk expands to fill the available space.

> I'm copying the contents of all the CD-ROMs to a directory on my hard drive.

If you created the CD-ROMs (i.e., CD-Rs) with just your own data files, then copying the _contents_ should not be any big problem. Even so, there are little differences between Windows and Linux CDs. Check out the comments about Joliet in man mkisofs and on Google. It's safer to copy the whole disc as an ISO image and then mount that image. If you are copying the contents of CD-ROMs that you did not make, particularly if they are bootable, it is safer to copy the CD-ROMs as intact whole _images_, instead of copying the contents.

> ... I hoped to find a way I can weed out duplicates and be
> left with one set of just the most recent versions of
> unique files.
> I know that there's the command line "diff" command. But my
> understanding of that is that it compares two individual files and you
> can learn how different they are.

Yup. You don't care what the differences are, only whether the files are exactly the same or not. cmp is the command for that.

> That seems a little different than the
> more global comparison between multiple directories that I'm looking for.

Yup. Even cmp only compares two files at a time. If the files can have different names, you have to do a huge number of comparisons. For n files, doing O(n^2) pairwise comparisons takes too long. Instead,

    find /mnt/cdrom -type f -exec md5sum {} \; | sort

will let you know which files are identical, but how will you know which is "newest"? Timestamps on files might be misleading. People often unnecessarily 'touch' files. Which is the latest version of some library file?
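To make the find-md5sum-sort idea above concrete, here is a minimal, self-contained sketch. It builds a throwaway directory with one duplicated file (the real run would point find at /mnt/cdrom or wherever you copied the discs), then prints only the groups of files whose checksums repeat. It assumes GNU uniq, whose -w and --all-repeated options compare just the first 32 characters of each line, i.e. the MD5 hash.

```shell
#!/bin/sh
# Sketch only: a.txt/b.txt/c.txt and the temp directory are made-up
# stand-ins for real CD contents.
set -e
dir=$(mktemp -d)
echo "hello" > "$dir/a.txt"
echo "hello" > "$dir/b.txt"   # same contents as a.txt
echo "world" > "$dir/c.txt"   # unique

# Sort by checksum, then keep only lines whose first 32 characters
# (the MD5 hash) repeat: those are the duplicate groups (GNU uniq).
find "$dir" -type f -exec md5sum {} \; |
  sort |
  uniq -w32 --all-repeated=separate
```

Only a.txt and b.txt appear in the output; c.txt, being unique, is filtered out.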
You'll want to incorporate into sort your judgement about which file is "newest". Maybe you'll have to write your own sort. Maybe you'll have to write some code to generate a monotonic number for the "newness" of a file, inject that between the md5sum and the filename in *.md5sum, then run sort.

BTW, to compare between _directories_, you might have to do something like:

    find /mnt/cd1 -type f -exec md5sum {} \; | sort >cd1.md5sum.sort
    find /mnt/cd2 -type f -exec md5sum {} \; | sort >cd2.md5sum.sort

Then I hack a script to compare cd1.md5sum.sort and cd2.md5sum.sort, and take whatever action is appropriate. As a hedge, I've usually decided to rename the duplicates as originalname.duplicate, so as not to prematurely burn bridges. After I'm satisfied that the right duplicates have been identified, then I run

    find directory -iname '*.duplicate' -exec rm -f {} \;

(having made sure that no files originally had *.duplicate filenames before I started renaming the duplicate files).

Instead of rm'ing duplicates, you could hard or soft link them (with ln), so that the extra space is freed. Of course, the different information _about_ the file (such as date or permissions) is lost.

> Can anyone recommend something suitable?

Just keep the CDs in a shoe box until the contents are irrelevant, then dispose of them.
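The "hack a script" step above could look something like this self-contained sketch. The cd1/cd2 directories and their files are hypothetical stand-ins for two mounted discs; the script renames any file on cd2 whose checksum already appears on cd1, following the .duplicate hedge rather than deleting anything. Filenames starting with whitespace are not handled.

```shell
#!/bin/sh
# Sketch only: mock "CD" directories stand in for /mnt/cd1 and /mnt/cd2.
set -e
work=$(mktemp -d); cd "$work"
mkdir cd1 cd2
echo "same contents" > cd1/report.txt
echo "same contents" > cd2/report-copy.txt   # duplicate across discs
echo "only on cd2"   > cd2/unique.txt

find cd1 -type f -exec md5sum {} \; | sort > cd1.md5sum.sort
find cd2 -type f -exec md5sum {} \; | sort > cd2.md5sum.sort

# Collect the hashes present on cd1 (first 32 characters of each line).
cut -c1-32 cd1.md5sum.sort | sort -u > cd1.hashes

# Rename, rather than delete, any cd2 file whose hash is already on cd1.
while read -r hash file; do
    if grep -qx "$hash" cd1.hashes; then
        mv -- "$file" "$file.duplicate"
    fi
done < cd2.md5sum.sort

ls cd2   # report-copy.txt is now flagged as .duplicate; unique.txt is untouched
```

Once you are satisfied the right files were flagged, the find ... -iname '*.duplicate' -exec rm command above removes them for good.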
- References:
- [tlug] Seeking recommendations for file consolidation
- From: Dave M G