
[tlug] How to figure out which files are duplicates . . . . . . . . . (was Re: Seeking recommendations for file consolidation)



Just keep the CDs in a shoe^H^H^H^Hsandalwood box until the 
contents are irrelevant, then dispose of them. 

Dave M G wrote:

> I have a whole bunch of back up CD-ROMs from the last few years. Mostly 
> from my days as a Windows user.

> Now that my current computer has many gigabytes of free space, 

Ahh, nature abhors a vacuum. The junk expands to fill the available space. 

> I'm copying the contents of all the CD-ROMS to a directory on my hard drive. 

If you created the CD-ROMs (i.e., CD-Rs) with just your own data files, 
then copying the _contents_ should not be any big problem. 
Even so, there are little differences between Windows and Linux CDs. 
Check out the comments about Joliet in man mkisofs and google. 
It's safer to copy the whole disk as an ISO image and then 
mount that image. 

If you are copying the contents of CD-ROMs that you did not make, 
particularly if they are bootable, it is safer to copy them 
as intact whole _images_ instead of copying the contents. 
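Something like this, for example (untested sketch; the device name 
/dev/cdrom and the output filename are assumptions, adjust for your drive): 

   dd if=/dev/cdrom of=backup1.iso bs=2048
   mkdir -p /mnt/iso
   mount -o loop backup1.iso /mnt/iso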

> ... I hoped to find a way I can weed out duplicates and be 
> left with one set of just the most recent versions of 
> unique files.

> I know that there's the command line "diff" command. But my 
> understanding of that is that it compares two individual files and you 
> can learn how different they are. 

Yup. You don't care what the differences are, 
only whether the files are exactly the same or not. 

   cmp is the command for that. 
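For example (the filenames are just placeholders; cmp exits 0 
only when the two files are byte-for-byte identical): 

   cmp -s fileA fileB && echo same || echo different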

> That seems a little different than the 
> more global comparison between multiple directories that I'm looking for.

Yup. Even cmp only compares two files at a time. 
If the files can have different names, 
you have to do a huge number of comparisons. 
For n files, that's O(n^2) comparisons, which takes too long. 

   find /mnt/cdrom -type f -exec md5sum {} \; | sort

will let you know which files are identical, 
but how will you know which is "newest"? 
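To print only the entries whose checksums repeat, something like this 
should do it (assuming GNU uniq, and the standard md5sum output of a 
32-character hash followed by the filename): 

   find /mnt/cdrom -type f -exec md5sum {} \; | sort \
       | uniq -w32 --all-repeated=separate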

Timestamps on files might be misleading. 
People often unnecessarily 'touch' files. 
Which is the latest version of some library file? 

You'll want to incorporate your judgement about which file is "newest" 
into the sort. Maybe you'll have to write your own sort. 
Maybe you'll have to write some code to generate a monotonic number 
for the "newness" of a file, and inject that between the md5sum 
and filename in *.md5sum, then run sort. 
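One possible sketch, using the file's mtime as that "newness" number 
(a crude proxy, as noted above; stat -c %Y is GNU-specific): 

   find /mnt/cdrom -type f -exec md5sum {} \; |
   while read sum file; do
       printf '%s %s %s\n' "$sum" "$(stat -c %Y "$file")" "$file"
   done | sort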

BTW, to compare between _directories_, you might have to do something like: 

   find /mnt/cd1 -type f -exec md5sum {} \; | sort >cd1.md5sum.sort
   find /mnt/cd2 -type f -exec md5sum {} \; | sort >cd2.md5sum.sort

Then I hack a script to compare cd1.md5sum.sort and cd2.md5sum.sort, 
and take whatever action is appropriate. As a hedge, I've usually 
decided to rename the duplicates as originalname.duplicate, so as 
not to prematurely burn bridges. After I'm satisfied that the 
right duplicates have been identified, then I

   find directory -iname '*.duplicate' -exec rm -f {} \;

(having made sure that no files originally had *.duplicate filenames, 
before I started renaming the duplicate files)
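A minimal sketch of that kind of script (it assumes the listings point 
at writable copies on the hard drive rather than the read-only CD mounts, 
and that the cd1 side is the one being kept): 

   # checksums that exist on cd1
   cut -d' ' -f1 cd1.md5sum.sort > cd1.sums
   # rename any cd2 file whose checksum also appears on cd1
   while read sum file; do
       grep -q "^$sum\$" cd1.sums && mv "$file" "$file.duplicate"
   done < cd2.md5sum.sort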

Instead of rm'ing duplicates, you could hard or soft link them 
(with ln), so that the extra space is freed. Of course, the 
different information _about_ the file (such as date or permissions) 
is lost. 
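For example (hypothetical paths; a hard link only works within 
one filesystem): 

   ln -f keep/report.txt copies/report.txt       # hard link over the duplicate
   ln -sf ../keep/report.txt copies/report.txt   # or a symlink instead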

> Can anyone recommend something suitable?

Just keep the CDs in a shoe box until the contents are irrelevant, 
then dispose of them. 


