Mailing List Archive



Re: [tlug] "How to"



On Mon, May 12, 2014 at 10:52 AM, Stephen J. Turnbull <stephen@example.com> wrote:
Bruno Raoult <braoult@example.com> writes:
 > On Mon, May 12, 2014 at 5:35 AM, Stephen J. Turnbull <stephen@example.com> wrote:

 >>> 1- You have 10,000 files, and you want to find
 >>> duplicates. Sometimes, 1 file changes, or you add/remove one, so
 >>> you want to find the changes quickly (let's say daily). How?
 >
 > git init; git add .; git commit; while true; do git status; sleep 86400;
 > done

 > I am not sure I understand (or maybe my question was not
 > clear). Let's say you have ./a/b/c/d/file1 and ./a/b/z/file2 in the
 > tree. They are byte-identical files. My question was how to find them.

For two files in the same directory that have the same content but
different names,

    git ls-tree HEAD | sort -k 3 | uniq -D -f 2 -w 41

(untested; probably requires GNU uniq).  To handle recursion is
(recursively ;-) left as an exercise for the reader.
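For the record, a recursive variant looks doable with `git ls-tree -r`, which walks the whole tree (untested beyond a toy repo, and still GNU-uniq-specific):

```shell
# Recursive duplicate scan: git ls-tree -r lists every blob in the tree,
# so sorting on the blob hash and letting GNU uniq compare only that
# field groups byte-identical files anywhere in the repository.
git ls-tree -r HEAD \
    | sort -k 3 \
    | uniq -D -f 2 -w 41   # skip mode and type; compare blank + 40-char hash
```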

If the files are in the same dir, why use git?
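Without git, a plain checksum pipeline would already answer it. A minimal sketch, assuming GNU coreutils (sha256sum, and uniq with -D/-w):

```shell
# Hash every regular file under the current tree, sort so identical
# hashes end up adjacent, then print every line whose first 64
# characters (the SHA-256 field) collide: the byte-identical duplicates.
find . -type f -exec sha256sum {} + \
    | sort \
    | uniq -D -w 64
```

Any group it prints shares content regardless of name or directory, so it covers the ./a/b/c/d/file1 vs ./a/b/z/file2 case directly.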
 
 > So we extracted the data, piped it, and saved it in a file. Then? What
 > about the next day, when you want to refresh?

   git ls-files --modified | xargs metadata-extractor-and-updater

If you need to do this in real time, it's a difficult problem.

This was not in my initial question.

Of course if (like Kalin) you're dealing with terabytes, this is still
way slow (even if you can compare bytes on the order of once per CPU
cycle, you're still talking about thousands of seconds). You really
need to be able to ensure that files aren't changed behind your back,
and some special handling for files >10GB would be needed.  But for
people dealing with files on the order of a CD or less, git should do
the job quickly enough.

Changes "behind the back" are not an issue. You just want to find dups from
time to time. The second question, about metadata, is in fact the same.

You offered a solution (which I did not test) using git. I am sure readers will
propose alternatives, and that was the point of the question: which solution
would be best for such a requirement?

Let me put it another way: you have your 10,000 pictures. You plug in your
phone/camera and, as you are not sure whether the pics were already imported,
and you don't want to overwrite anything, you import them into
"Pictures/new-yyyy-mm-dd". After that, you want to find the possible new dups
(I already wrote that the first scan is a special case, therefore already done).
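Concretely, the import-time check could look like this sketch, where the paths and the index file name are only illustrative:

```shell
#!/bin/sh
# After importing into Pictures/new-yyyy-mm-dd, hash only the new
# directory and compare against an index built once from the existing
# collection (the "first scan", which is the special case already done).
PICS="$HOME/Pictures"
NEW="$PICS/new-$(date +%Y-%m-%d)"
INDEX="$PICS/.hashes"   # produced by the initial full scan:
# find "$PICS" -type f -exec sha256sum {} + | sort > "$INDEX"

# Any new file whose hash already appears in the index is a duplicate
# of something previously imported.
find "$NEW" -type f -exec sha256sum {} + \
    | while read -r hash path; do
          grep -q "^$hash " "$INDEX" && echo "dup: $path"
      done
```

Only the new directory gets hashed each time, so the daily cost stays proportional to the import, not to the 10,000-file collection.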

br.


--
To unsubscribe from this mailing list,
please see the instructions at http://lists.tlug.jp/list.html

The TLUG mailing list is hosted by ASAHI Net, provider of mobile and
fixed broadband Internet services to individuals and corporations.
Visit ASAHI Net's English-language Web page: http://asahi-net.jp/en/



--
2 + 2 = 5, for very large values of 2.
