Mailing List Archive



Re: [tlug] Poll: OpenOffice or LibreOffice?



On Sun, May 18, 2014 at 2:35 PM, Darren Cook <darren@example.com> wrote:
>>> Git hashes each file (that exists now, or has existed in the past), and
>>> creates one file per hash code.
>>
>> To be clear: there is no copy of the actual files, right?
>
> Unless I've misunderstood, there is a copy. (It might be zipped, in
> which case my below numbers are off.) Otherwise when you delete your
> file, going back in the git history wouldn't be able to recover it.
>
>> Practical example: someone has a disk 90% full of music, pics, and
>> video. No space
>> anywhere else. Will git need another disk, just to find dups?
>
> Git is the wrong tool for that (IMHO). Go straight to md5 hashes.
>
>>> (So it can detect duplicates in the directory tree; but you could
>>> achieve the same by just writing a script to run md5sum on every file.)
>>
>> This was my initial question (and my solution). I just wondered if git
>> could do the same
>> with 2 lines instead of my 100 :)
>
> I bet someone with more bash skills than me could make it a one-liner.
> Something like:
>   find . | xargs md5sum | sort | uniq

That line won't work as written, but that is beside the point. It is
the idea I started with: compute a checksum of every file, then look at
the ones that are not unique. Since checksumming is *very* expensive, I
went on to store the checksum date along with the checksum in a text
file. I also wanted to be able to add an external directory (that is,
files outside the initial directory tree), and to find identical file
names (which can sometimes help to find dups; especially true for
pictures, at least for me).
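
A minimal sketch of that idea, in Python rather than as the actual
script (which is not shown here). It walks one or more directory trees,
so an external directory is just another root, and it reuses a stored
checksum when the file's modification time has not changed. The function
names, the JSON cache format, and the use of mtime as the "checksum
date" are all assumptions made for illustration.

```python
import hashlib
import json
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_cache(cache_file):
    """Load the {path: [mtime, md5]} cache, or start empty if it is missing."""
    if not os.path.exists(cache_file):
        return {}
    with open(cache_file, encoding="utf-8") as handle:
        return json.load(handle)


def find_dups(roots, cache_file):
    """Group files under all roots by MD5; return groups of two or more."""
    cache = load_cache(cache_file)
    fresh = {}
    by_md5 = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.abspath(os.path.join(dirpath, name))
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                mtime = os.path.getmtime(path)
                cached = cache.get(path)
                if cached and cached[0] == mtime:
                    md5 = cached[1]      # unchanged since last run
                else:
                    md5 = md5_of(path)   # new or modified: rehash
                fresh[path] = [mtime, md5]
                by_md5[md5].append(path)
    # Rewrite the cache so entries for deleted files disappear.
    with open(cache_file, "w", encoding="utf-8") as handle:
        json.dump(fresh, handle)
    return {md5: sorted(paths)
            for md5, paths in by_md5.items() if len(paths) > 1}
```

Called as find_dups(["Pictures", "/mnt/external"], "md5cache.json")
(both paths are placeholders), it returns a dict mapping each repeated
checksum to the sorted list of files that share it. Only the first run
pays the full hashing cost; later runs rehash only new or modified files.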

The last version uses a small DB, where I expected to add more
information, such as some metadata, so that I could (maybe) detect
basic changes such as rotations. In that case, the checksum would cover
the "data" part only (excluding meta-information). I never finished it,
so it is now only a simple checksum database, but one that is very easy
to update and search (by filename, md5, etc.).
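
For illustration, a simple checksum database of this kind could look
roughly like the sketch below, using SQLite from Python. The table
layout, column names, and function names are assumptions, not the schema
of the real script; the md5 values would come from whatever hashing step
feeds the database.

```python
import sqlite3

# Hypothetical schema: one row per file, indexed for md5 and name lookups.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path  TEXT PRIMARY KEY,
    name  TEXT NOT NULL,
    mtime REAL NOT NULL,
    md5   TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS files_md5 ON files (md5);
CREATE INDEX IF NOT EXISTS files_name ON files (name);
"""


def open_db(db_path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def upsert(conn, path, name, mtime, md5):
    """Insert a file record, replacing any older record for the same path."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, name, mtime, md5) "
        "VALUES (?, ?, ?, ?)",
        (path, name, mtime, md5),
    )


def dups_by(conn, column):
    """Return (value, path) rows whose md5 or name occurs more than once."""
    if column not in ("md5", "name"):
        raise ValueError("column must be 'md5' or 'name'")
    # The column name is checked against a whitelist above, so it is safe
    # to interpolate into the query text.
    query = (
        f"SELECT {column}, path FROM files WHERE {column} IN "
        f"(SELECT {column} FROM files GROUP BY {column} "
        f"HAVING COUNT(*) > 1) ORDER BY {column}, path"
    )
    return conn.execute(query).fetchall()
```

dups_by(conn, "md5") lists files with identical content, and
dups_by(conn, "name") lists files that merely share a name. Extra
columns for metadata could be added later without changing the queries.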

> 112MB in total, and .git is 26MB of that. I see roughly 30MB is being
> excluded by .gitignore. So:
>    56MB real files gives a 26MB git directory.
>
> Much less than by 2.5 ratio. This could well be compression. Or me
> misunderstanding how git works.

Thanks. This was my guess. So git is not appropriate if all I want is
a dup search.
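
The ratio above fits how git stores loose objects, as far as I
understand it: each file's content is kept as a zlib-compressed blob,
named by the SHA-1 of a short "blob <size>" header plus the content. So
identical files share one object, but a compressed copy still lands in
.git. A rough sketch (the function name is made up, and packfiles are
ignored):

```python
import hashlib
import zlib


def git_blob(data):
    """Return (object id, compressed bytes) as git would for a loose blob."""
    # The header is "blob", a space, the size in decimal, and a NUL byte.
    store = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)
```

Two files with the same bytes give the same object id, which is the
duplicate detection part; the compressed bytes are the extra disk space
that a nearly full disk cannot spare.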

I will go on with my own script. I could send it to the list, but it
is really not ready yet, and I am not sure it would be of any interest
to anybody here... at least in its current form.

br.

-- 
2 + 2 = 5, for very large values of 2.

