


Re: [tlug] "How to"



On Mon, May 12, 2014 at 4:25 PM, Stephen J. Turnbull <stephen@example.com> wrote:
 > > You offered a solution (that I did not test) using git. I am sure
 > > readers will propose alternatives. And this was the target of the
 > > question: which solution would be best for such a requirement?
 >
 > "Best"?  Depends on lots of things.

My idea (besides what we discussed at the last meeting) was to collect a few
solutions; git is one, but only one. Once we have a few, we can try all of
them and discuss.

"Good"?  Sure -- git is a highly optimized application for tracking
and comparing the contents of files.  I happen to know a bit about
extracting the information you want from a git object database.  git
would be a lot more reliable than coding the algorithms myself.

So, let's compare the performance of your solution against others. I won't
say anything about git itself; I don't know how it works internally. However,
I believe there is no magic there (comparing even two files takes real work,
so finding two identical files among 10,000 cannot be as easy as just running
"git").
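
For reference (correct me if I am wrong): identical file contents produce
identical git blob IDs, which is presumably what a git-based solution relies
on. You can see this with the plumbing command git hash-object; the file names
below are just examples:

$ git hash-object song.mp3 copy-of-song.mp3
(two identical lines of output = the two files have the same content)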

The point was to be fast (my initial question). I have a directory tree with
4,000+ files that I could use for testing different solutions. If you can
provide me a full script, I will be happy to run it and report the results
back to the list, together with the other proposals.
I suggest the following syntax for everybody (a baseline sketch follows the
option list):
$ the-perfect-script-to-find-dups [-c] [-d db] [-x ext] [-s size] dir
with:
-c: create or initialize the DB (if your solution uses one).
-d DB: database name, if your solution uses one. Default should be
    $HOME/the-perfect-script-to-find-dup.DB. It could be a directory, if your
    solution implies a directory.
-s size: the minimum size for a file to be considered (the reason for this is
    that we don't want to consider small files). Default should be zero (all
    files).
-x ext: consider only files with the "ext" extension. I suggest ext to be
    case-insensitive (mp3 = MP3). Default should be no filter (any extension).
dir: the directory where we want to find the duplicates.
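
As a baseline for comparison, here is a minimal sketch of that interface using
plain GNU find/coreutils (no DB, no git; the option handling is only
illustrative):

#!/bin/sh
# find-dups-sketch: baseline duplicate finder, no database.
#   -s SIZE   only consider files larger than SIZE bytes
#   -x EXT    only consider files with extension EXT (case-insensitive)
size=0
pattern='*'
while getopts "s:x:" opt; do
    case "$opt" in
        s) size="$OPTARG" ;;
        x) pattern="*.$OPTARG" ;;
    esac
done
shift $((OPTIND - 1))
dir="${1:-.}"

# Checksum every candidate file, sort by checksum, and print the groups of
# lines whose first 32 characters (the md5 hash) repeat -- the duplicates.
find "$dir" -type f -size +"${size}c" -iname "$pattern" -exec md5sum {} + \
    | sort | uniq -w32 --all-repeated=separate

Since this re-hashes the whole tree on every run, it would only be a fair
candidate for the initial round; DB-based solutions should beat it on the
"normal" run.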

My test will be (example commands below):
- run your script initially, and time it.
- copy a file somewhere in the subtree, run the script again, and time it.
  Also check that the new duplicate was found.
- other tests: rename a file to an already existing name, move the old one to
  a new name or directory, etc.
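
Concretely, the timing runs would look something like this (the paths and the
DB location are only placeholders):

$ time the-perfect-script-to-find-dups -c -d /tmp/dups.DB ~/testdir   # initial run
$ cp ~/testdir/a/song.mp3 ~/testdir/b/song-copy.mp3                   # plant a duplicate
$ time the-perfect-script-to-find-dups -d /tmp/dups.DB ~/testdir      # normal run: must report the copy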

Please let me know if you have more tests in mind.

The target is to have the fastest reliable way to find the duplicates. The
initial round is not so important: if your choice is to use a DB, only the
timing of the "normal" run matters.
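
To illustrate why the normal run can be fast with a DB, here is one possible
(purely illustrative) shape for it: cache the md5sum output in the DB and
re-hash only files modified since the DB was last rewritten. Deleted files are
not pruned and renames count as new files; it is only a sketch.

#!/bin/sh
# Illustrative DB-backed "normal" run.
db="${HOME}/the-perfect-script-to-find-dup.DB"
dir="${1:-.}"

# First run: make the DB look infinitely old so everything gets hashed.
[ -f "$db" ] || touch -d '1970-01-01' "$db"

# Hash only files newer than the DB, keep the newest entry per path
# (the path starts at column 35 of md5sum output), then report every
# checksum that appears more than once.
find "$dir" -type f -newer "$db" -exec md5sum {} + > "$db.new"
cat "$db.new" "$db" | awk '!seen[substr($0, 35)]++' > "$db.tmp"
mv "$db.tmp" "$db"
rm -f "$db.new"
sort "$db" | uniq -w32 --all-repeated=separate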


 > > Let me say it another way:
 >
 > What makes you think I didn't understand the first time?

Nothing. I just made this remark after seeing the few answers we got; I
thought I had not been clear.

br.


--
2 + 2 = 5, for very large values of 2.

