Mailing List Archive



Re: [tlug] "How to"



Bruno Raoult <braoult@example.com> writes:
 > On Mon, May 12, 2014 at 5:35 AM, Stephen J. Turnbull <stephen@example.com> wrote:

 >>> 1- You have 10,000 files, and you want to find
 >>> duplicates. Sometimes, 1 file changes, or you add/remove one, so
 >>> you want to find the changes quickly (let's say daily). How?
 >>
 >> git init; git add .; git commit; while true; do git status; sleep 86400;
 >> done

 > I am not sure I understand (or maybe my question was not
 > clear). Let's say you have ./a/b/c/d/file1 and ./a/b/z/file2 in the
 > tree. They are byte-for-byte identical. My question was how to
 > find them.

For two files in the same directory that have the same content but
different names, 

    git ls-tree HEAD | sort -k 3 | uniq -D -w 52

(untested; probably requires GNU uniq).  Handling recursion is
(recursively ;-) left as an exercise for the reader.
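
One possible answer to that exercise, as a rough sketch rather than
anything tested against a large tree: `git ls-tree -r` walks the whole
commit and prints one line per blob, and identical content always gets
the same object ID.  The helper name `dup_blobs` is made up for
illustration, and the sketch assumes paths contain no tabs or newlines.

```shell
# dup_blobs: hypothetical helper (not part of git).
# Reads `git ls-tree -r` output on stdin, i.e. lines of the form
#   <mode> <type> <object-id><TAB><path>
# and prints "<object-id>  <path>" for every object ID that occurs
# more than once, sorted so that duplicates sit next to each other.
dup_blobs() {
    awk -F '\t' '
        {
            split($1, meta, " ")          # meta[3] is the object ID
            id = meta[3]
            count[id]++
            paths[id] = paths[id] id "  " $2 "\n"
        }
        END {
            for (id in count)
                if (count[id] > 1)
                    printf "%s", paths[id]
        }
    ' | sort
}
```

Usage would be along the lines of `git ls-tree -r HEAD | dup_blobs`.
Grouping in awk instead of `uniq -D -w` avoids the GNU-only options,
at the cost of holding the whole listing in memory.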

 > So we extracted the data, piped it, and saved it in a file. Then?
 > What about the next day, when you want to refresh?

   git ls-files --modified | xargs metadata-extractor-and-updater

If you need to do this in real time, it's a difficult problem.  If you
only need to do it occasionally, this is *exactly* the problem that
Linus designed git to solve (except that Linus also needs to store the
content; a modified git that never actually stores blobs would
probably save you a lot of space!)
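
A slightly fuller sketch of that daily refresh, with the usual caveats:
`refresh_changed` is a made-up name, and the updater command is
whatever you pass in (the `metadata-extractor-and-updater` above is a
placeholder, not a real tool).  It adds `--others` so that new files
are seen too, uses NUL-separated names so spaces survive, and commits
afterwards so that the next run only reports later changes.  It
assumes an xargs that supports `-0` (GNU and BSD both do).

```shell
# refresh_changed: hypothetical wrapper; "$@" is the updater command.
# Note: git counts an unstaged deletion as a modification, so the
# updater may be handed paths that no longer exist.
refresh_changed() {
    list=$(mktemp) || return 1
    # Tracked files that changed, plus untracked files that are not ignored.
    git ls-files -z --modified --others --exclude-standard > "$list"
    if [ -s "$list" ]; then
        # Run the updater, then record the new state as the baseline.
        xargs -0 "$@" < "$list" &&
            git add -A &&
            git commit -q -m "snapshot $(date +%Y-%m-%d)"
    fi
    status=$?
    rm -f "$list"
    return "$status"
}
```

For example, `refresh_changed metadata-extractor-and-updater` once a
day from cron.  When nothing has changed, the updater is not run at all.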

Of course if (like Kalin) you're dealing with terabytes, this is still
way slow (even if you can compare bytes on the order of once per CPU
cycle, you're still talking about thousands of seconds).  You really
need to be able to ensure that files aren't changed behind your back,
and some special handling for files >10GB would be needed.  But for
people dealing with files on the order of a CD or less, git should do
the job quickly enough.
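
To put a number on "thousands of seconds", here is the back-of-envelope
arithmetic.  The figures are assumptions picked for illustration, not
measurements: 4 TB of data, a 3 GHz clock, and one byte compared per
cycle with disk I/O ignored (in practice the disk would be slower).

```shell
# Assumed figures for illustration only: 4 TB of data, a 3 GHz clock,
# and one byte compared per CPU cycle.
bytes=$((4 * 1000 * 1000 * 1000 * 1000))
cycles_per_second=$((3 * 1000 * 1000 * 1000))
seconds=$((bytes / cycles_per_second))
echo "$seconds seconds"    # prints: 1333 seconds
```

That is over twenty minutes for a single pass, before any hashing or
disk latency is counted.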


