invictius

Program for weeding out duplicate files?


I'm in the process of downloading every single wad on all the servers here: http://camoyoshi.floorchan.org/master/

So far the total size is 260 GB (!!!) across 26,000 files, with only a few hundred left to go. However, I want to get rid of duplicates: the download manager has been set to skip dupes, but some remain, and of course there will be dupes in my /idgames mirror. Can anyone recommend a file manager that will delete duplicates? I think Norton had something ages ago that did the job, though it wasn't freeware.


I've done it with PHP and Flash. Just make it keep a log of files. When there's a dupe... don't download.
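The log idea above could look roughly like this in Python. This is only a sketch: the function names and the one-name-per-line log format are made up for illustration, and it keys on the file name alone, so two different wads that share a name would be treated as the same file.

```python
from pathlib import Path


def load_log(log_path):
    """Return the set of names already recorded in the log (empty if no log yet)."""
    path = Path(log_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def should_download(name, log_path):
    """Return True and record the name if it is new; return False for a repeat."""
    key = name.lower()  # treat HANGAR.WAD and hangar.wad as the same name
    if key in load_log(log_path):
        return False
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(key + "\n")
    return True
```

A downloader would call `should_download(filename, "seen.log")` before fetching each file and skip the ones that return False.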


I'm guessing you're running Windows; you don't say in your original post.

If you were on Linux or a Mac, I would say fdupes:

http://code.google.com/p/fdupes/

Debian/Ubuntu has a package of it. It can delete the duplicate files it finds or, in builds that include the hardlink option, replace them with hardlinks (basically a UNIX-y pointer for files), saving you a bunch of space.
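For anyone curious what "replace the duplicate with a hardlink" means in practice, here is a rough Python sketch of the idea. It is not how fdupes itself is implemented; the function names are invented for this example, it assumes everything under `root` sits on one filesystem that supports hardlinks, and it should only ever be tried on a copy of your data.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large wads don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace byte-identical files under root with hardlinks to the first copy seen.

    Returns the number of files that were replaced.
    """
    first_seen = {}  # content digest -> path of the copy we keep
    replaced = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already linked to it
            tmp = path + ".linktmp"
            os.link(keeper, tmp)   # create the new link under a temporary name
            os.replace(tmp, path)  # then swap it over the duplicate
            replaced += 1
    return replaced
```

After this runs, both names still exist, but they point at the same data on disk, so the space is only used once.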

There's also teh googles and wikipaedias

https://www.google.com/search?q=fdupes%20windows
http://en.wikipedia.org/wiki/List_of_duplicate_file_finders

For what it's worth, /idgames currently holds 34418 files, at around 32.56G of space.


I've been trying to think of a smart way to archive wads with repeat names. I think I accidentally pasted over a bunch of them when I downloaded the idgames archive, and I don't really know how to keep different wads with identical names without renaming them or having a weird network of folders in my wad directory. Wads like HANGAR.WAD, BASE.WAD, CASTLE.WAD, HELL.WAD, etc. Common names like that.


I keep my archive mirror separated from the other wads I've collected, though that still hasn't eliminated the need to rename the odd file or few.

40oz said:

I don't really know how to keep different wads with identical names without renaming them or having a weird network of folders in my wad directory. Wads like HANGAR.WAD, BASE.WAD, CASTLE.WAD, HELL.WAD, etc. Common names like that.

This is slightly related to me, in the sense that I do the same thing, but have not reached that point with WADs so much as I have with the files I organize at work.

At this point, why not append the author's name/last name to the wad file? My logic would say that if you now have wads stacking on your hard drive by the same exact names it wouldn't be a bad idea to include the author's first/last name in the wad file. That way, even if you have CASTLE.WAD (not named because you got it last year) and then CASTLE_FUENTES.WAD, you know which one was done by Mr. Fuentes because you included his name in the actual filename. Denoting which wads belong to which creators helps narrow down the field, and in my mind is a very good way to do so (at least with WAD files and clients sharing similar titles :P)

Granted, it's not a surefire way to organize, but it should hopefully help. If author names are no good, then look for something else that will help specifically identify a WAD other than its 8-character name. It still requires you to rename them when you first put them on your computer, but 5-10 seconds of file name changing is a simple price to pay for organization.
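The renaming scheme above (CASTLE.WAD becoming CASTLE_FUENTES.WAD) is easy to script. A minimal Python sketch follows; the helper names and the `_2`, `_3` fallback for names that are still taken are my own assumptions, not something from the posts.

```python
from pathlib import Path


def name_with_author(filename, author):
    """Build NAME_AUTHOR.EXT from a wad filename and an author's name."""
    p = Path(filename)
    tag = "".join(ch for ch in author.upper() if ch.isalnum())
    if not tag:
        return p.name  # no usable author name, keep the original
    return f"{p.stem}_{tag}{p.suffix}"


def unique_target(directory, filename):
    """Return a path in directory that does not exist yet, adding _2, _3, ... if needed."""
    directory = Path(directory)
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    n = 2
    while candidate.exists():
        candidate = directory / f"{stem}_{n}{suffix}"
        n += 1
    return candidate
```

For example, `unique_target("wads", name_with_author("CASTLE.WAD", "Fuentes"))` gives a destination path that will not paste over an existing file.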


This is one of the things I do every day.

Before I used a file duplicate checking program, I had over 1 million files of just demos. With demos and wads being posted and re-posted so many times, a duplicate file manager is a must for any file collector.

Unfortunately, solving this issue is not as easy as it would first appear, since filenames can be descriptive, zipfiles can have different compression types and levels, and there is more than one popular archive type.
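To illustrate the compression problem: two zips holding the same wad can differ byte for byte if they were packed with different settings, so hashing the zip file itself misses them. One possible workaround, sketched here in Python with a made-up function name, is to hash the uncompressed members instead. It only handles zip files and ignores timestamps and other metadata.

```python
import hashlib
import zipfile


def zip_content_signature(zip_path):
    """Return a digest built from member names and their uncompressed bytes.

    Two zips with the same files inside get the same signature even when
    they were packed with different compression methods or levels.
    """
    entries = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            data_hash = hashlib.sha256(zf.read(info)).hexdigest()
            entries.append((info.filename.lower(), data_hash))
    overall = hashlib.sha256()
    for name, data_hash in sorted(entries):  # sort so member order doesn't matter
        overall.update(f"{name}\0{data_hash}\n".encode("utf-8"))
    return overall.hexdigest()
```

Zips with equal signatures are candidates for being the same release, though it is still worth a manual look before deleting anything.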

I use a combination of FastDuplicateFileFinder, LookDisk, and Linux scripts using md5 hashes.
FDFF is probably the first basic tool you are looking for. It purely matches whole files for duplicates. You can choose the directories you wish to compare.
LookDisk works the problem in a different way. It does the same thing, but you can choose all the files in a specific directory to be moved/deleted (instead of picking files individually).
LD can also look inside rar/zip 1.0/2.0 files for duplicates.

As someone who has done this for many years, I recommend:
1) Have a main master folder that you compare everything new against.
2) Don't 'work' on your master copy; only work on a copy of your data until you are sure it is good.
3) Segregate your data into smaller chunks, e.g. ZDaemon demos, Doom wads, resource wads, etc.
4) Use 7-Zip to unzip large collections of miscellaneous data (misc data, not the idgames archive!), as it can auto-rename files as they are extracted to prevent overwriting.
5) Use WinRAR to rescue broken zipfiles.
6) Use WinRAR to mass re-zip files (so they are all zip 2.0 and compressed with the same method, which keeps them recheckable later as duplicates).
7) rom-zipper and batchtoolkit both have useful features, but also have limitations that can cause data loss, so be careful. I use simple Linux scripts to do things like zip files together (like av.txt and av.wad into av.zip).
8) BeyondCompare (not freeware) or TreeComp can compare entire directory trees. (This is useful for multiple copies of the Compet-N archive or the idgames archive.)
9) Back up your data!!!
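Point 1 (compare everything new against a master folder) combined with md5 hashes might be sketched like this in Python. This is not the poster's actual script; the function names and folder arguments are placeholders. It only reports what it finds and never moves or deletes anything, in keeping with point 2.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def split_new_files(master_dir, incoming_dir):
    """Sort incoming files into (new, duplicates) by comparing content hashes.

    A file counts as a duplicate if its content already exists anywhere in
    master_dir, or earlier in the incoming batch itself.
    """
    known = {md5_of(p) for p in Path(master_dir).rglob("*") if p.is_file()}
    new, duplicates = [], []
    for p in sorted(Path(incoming_dir).rglob("*")):
        if not p.is_file():
            continue
        digest = md5_of(p)
        if digest in known:
            duplicates.append(p)
        else:
            new.append(p)
            known.add(digest)  # so a second copy in the batch is flagged
    return new, duplicates
```

The `new` list is what would get copied into the master folder once you are happy with it; the `duplicates` list is what can be reviewed and discarded.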

All Windows programs suffer from Windows file-handling limitations (especially Windows 7/8), which can cause data loss. You've been warned.

good luck

