Friday, July 31, 2009

Simple Duplicate File Checker

Since it's harder these days to find an application that does what you need than to write one yourself, here's a small Java program that checks the files in two directories and prints the absolute path names of every pair of files (one from each directory) that have the same size. Setting up an alias in your favourite shell makes the app quicker to invoke and gives it nicer syntax. I use:
alias checkDuplicateSizes="java -cp ~/scripts DuplicateSizeChecker"


The path names of the possibly duplicated files are surrounded by single quotes (') and separated by a space, which makes them ideal for use with cmp. You can either copy and paste file pairs of interest by hand, or set up a script to cmp all of the file pairs that DuplicateSizeChecker finds. The "non-interesting" lines (which make the output more human readable when it is not used in a script) are easy to grep out with something like the following:
checkDuplicateSizes /tmp/a /tmp/b | grep "'/"


Finding identical files is then pretty easy with a little script (for which I also have an alias called checkDuplicates):
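#!/bin/bash

# Keep only the lines that hold a quoted pair of absolute paths.
java -cp ~/scripts DuplicateSizeChecker "$1" "$2" | grep "'/" |
while read -r filePair; do
    # eval re-parses the single quotes, so each path becomes one argument.
    # Note: this breaks on file names that themselves contain a single quote.
    eval "cmp -s $filePair"
    same=$?

    #echo "$filePair"
    #echo "$same"

    # cmp -s exits with 0 when the two files are byte-for-byte identical
    if [ "$same" == "0" ]; then
        echo
        echo The following files are identical
        # Put each path of the pair on its own line
        echo "$filePair" | sed "s/\\' \\'/\\'\\`echo -e '\n\r'`\\'/g"
    fi
done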
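
If you would rather not shell out to cmp, the same byte-for-byte check can be sketched in Java. This is only a rough sketch under my own assumptions: the class name SameContentSketch and its sameContent methods are made up for illustration and are not part of DuplicateSizeChecker.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of DuplicateSizeChecker): a rough Java
// counterpart to "cmp -s" for two files.
class SameContentSketch {

    // Returns true when both streams yield exactly the same bytes.
    static boolean sameContent(InputStream a, InputStream b) {
        try {
            int x;
            do {
                x = a.read();
                int y = b.read();
                if (x != y) {
                    return false; // differing byte, or one stream ended early
                }
            } while (x != -1);
            return true; // both streams ended at the same point
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compares two files; a size mismatch is rejected without reading them.
    static boolean sameContent(File f1, File f2) {
        if (f1.length() != f2.length()) {
            return false;
        }
        try (InputStream a = new BufferedInputStream(new FileInputStream(f1));
             InputStream b = new BufferedInputStream(new FileInputStream(f2))) {
            return sameContent(a, b);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Please specify two files");
            return;
        }
        System.out.println(sameContent(new File(args[0]), new File(args[1])));
    }
}
```

Reading through buffered streams keeps memory use flat even for large files, which is why the sketch does not simply load both files into byte arrays.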



That's it. This has helped me solve my problems (for now) and I hope it helps you too. I haven't included any options such as suppressing certain output, more verbose output, different formatting, or recursing through subdirectories, because I wanted to get this done quickly and because I am trying to be a little more YAGNI. Feel free to take, use, adapt or do whatever you like with this code (be respectful and reasonable, and leave a comment if it helped you out somehow - especially if you modify the code to do something smarter).

Here's some sample output:
$ checkDuplicateSizes /tmp/one /tmp/two

The following files have the same length (0 B)
'/tmp/one/one empty file' '/tmp/two/twoemptyfile'

The following files have the same length (22 B)
'/tmp/one/onesame' '/tmp/two/twosame'

The following files have the same length (15 B)
'/tmp/one/onesamesize' '/tmp/two/twosamesize'
$ checkDuplicates /tmp/one /tmp/two

The following files are identical
'/tmp/one/one empty file'
'/tmp/two/twoemptyfile'

The following files are identical
'/tmp/one/onesame'
'/tmp/two/twosame'




Here's the meat:
import java.io.File;

public class DuplicateSizeChecker {
    public static void main(String[] args){
        if(args.length < 2){
            System.err.println("Please specify two different directories as the first two arguments");
            return;
        }

        File folder1 = new File(args[0]);
        File folder2 = new File(args[1]);

        if(!folder1.isDirectory() || !folder2.isDirectory() || folder1.equals(folder2)){
            System.err.println("Please specify two different directories as the first two arguments");
        }
        else{
            if(args.length > 2){
                System.out.println("More than two arguments supplied; only the first two are necessary; subsequent ones will be ignored");
            }

            int size1 = folder1.list().length;
            int size2 = folder2.list().length;

            // The directory with more entries drives the outer loop;
            // this only affects the order in which pairs are printed.
            if(size1 > size2) {
                for(File f1 : folder1.listFiles()){
                    for(File f2 : folder2.listFiles()){
                        // Only regular files are compared; subdirectories are skipped
                        if(f1.isFile() && f2.isFile() && f1.length() == f2.length()){
                            printFileInfo(f1, f2);
                        }
                    }
                }
            }
            else{
                for(File f2 : folder2.listFiles()){
                    for(File f1 : folder1.listFiles()){
                        if(f1.isFile() && f2.isFile() && f1.length() == f2.length()){
                            printFileInfo(f1, f2);
                        }
                    }
                }
            }
        }
    }

    // Prints a blank line, a header with the shared length, then the quoted pair of paths
    private static void printFileInfo(File f1, File f2){
        System.out.println();
        System.out.println("The following files have the same length (" + f1.length() + " B)");
        System.out.println("'" + f1.getAbsolutePath() + "' '" + f2.getAbsolutePath() + "'");
        //System.out.println("\"" + f1.getAbsolutePath() + "\" \"" + f2.getAbsolutePath() + "\"");
    }

}
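
The nested loops above compare every file in one directory against every file in the other. For large directories, one possible alternative is to index the first directory by file size and then look up each file from the second. The following is a rough sketch of that idea and not what the program above does: SizeIndexSketch, sameSizePairs and sizes are hypothetical names, and it assumes Java 8 or later.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical alternative (not the original program): match files by size
// through a size index instead of comparing every pair.
class SizeIndexSketch {

    // Takes path -> size for each directory and returns every
    // {pathFromFirst, pathFromSecond} pair whose sizes are equal.
    static List<String[]> sameSizePairs(Map<String, Long> first, Map<String, Long> second) {
        // Index the first directory: size -> all paths with that size
        Map<Long, List<String>> bySize = new HashMap<>();
        for (Map.Entry<String, Long> e : first.entrySet()) {
            bySize.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Look up each file of the second directory by its size
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, Long> e : second.entrySet()) {
            List<String> matches = bySize.getOrDefault(e.getValue(), Collections.emptyList());
            for (String match : matches) {
                pairs.add(new String[] {match, e.getKey()});
            }
        }
        return pairs;
    }

    // Collects absolute path -> length for the regular files directly inside dir
    static Map<String, Long> sizes(File dir) {
        Map<String, Long> result = new HashMap<>();
        File[] entries = dir.listFiles(); // null if dir cannot be listed
        if (entries != null) {
            for (File f : entries) {
                if (f.isFile()) {
                    result.put(f.getAbsolutePath(), f.length());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Please specify two different directories as the first two arguments");
            return;
        }
        List<String[]> pairs = sameSizePairs(sizes(new File(args[0])), sizes(new File(args[1])));
        for (String[] p : pairs) {
            System.out.println("'" + p[0] + "' '" + p[1] + "'");
        }
    }
}
```

The sketch prints only the quoted path pairs, so its output could be piped straight into the cmp loop without the grep step. The order of the pairs is not guaranteed, since it depends on HashMap iteration order.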