Friday, July 31, 2009

Simple Duplicate File Checker

Since it's harder to find an application that does what you need these days than it is to write one yourself, here's a small Java program that checks the files in two directories and prints out the absolute path names of any two files (one from each directory) that are of the same size. It does this for all same-size pairs. Setting up an alias in your favourite shell can help you use the app much faster and with nicer syntax. I use:
alias checkDuplicateSizes="java -cp ~/scripts DuplicateSizeChecker"


The path names of the possibly duplicated files are surrounded by single quotes (') and separated by a space, which makes them ideal for use with cmp. You can either copy and paste file pairs of interest by hand, or set up a script to cmp all of the files that DuplicateSizeChecker finds. The "non-interesting" lines (which make the output more human readable when not used in a script) are easy to filter out with grep by keeping only the lines that contain a quoted absolute path:
checkDuplicateSizes /tmp/a /tmp/b | grep "'/"


Finding identical files is then pretty easy with a little script (for which I also have an alias called checkDuplicates):
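#!/bin/bash

# Compare the contents of every same-size pair reported by DuplicateSizeChecker.
java -cp ~/scripts DuplicateSizeChecker "$1" "$2" | grep "'/" |
while IFS= read -r filePair; do
    # Each line holds two single-quoted paths; eval lets the shell parse those quotes.
    # Caveat: a path that itself contains a single quote will break this.
    if eval "cmp -s $filePair"; then
        echo
        echo "The following files are identical"
        # Print each path on its own line
        printf '%s\n' "$filePair" | sed "s/' '/'\\
'/g"
    fi
done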
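
If you would rather stay in Java for the comparison step too, something like the following sketch could stand in for cmp -s. The class name FileContentComparer and the method sameContent are made up for illustration; they are not part of DuplicateSizeChecker.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileContentComparer {

    /** Returns true when both files hold exactly the same bytes. */
    public static boolean sameContent(Path a, Path b) throws IOException {
        // Files of different sizes can never be identical, so skip reading them.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false;
                }
            }
            // inA is exhausted; the files match only if inB is exhausted too.
            return inB.read() == -1;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java FileContentComparer <file1> <file2>");
            return;
        }
        boolean same = sameContent(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "identical" : "different");
    }
}
```

Reading byte by byte through a BufferedInputStream keeps memory use flat even for large files, at the cost of being slower than a block-wise comparison.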



That's it. This has helped me solve my problems (for now) and I hope it helps you too. I haven't included any options like suppressing certain output, having more verbose output or different formatting, recursing through subdirectories, etc., because I wanted to get this done quickly and because I am trying to be a little more YAGNI. Feel free to take, use, adapt or do whatever you like with this code (be respectful and reasonable, and leave a comment if it helped you out somehow, especially if you modify the code to do something smarter).

Here's some sample output:
$ checkDuplicateSizes /tmp/one /tmp/two

The following files have the same length (0 B)
'/tmp/one/one empty file' '/tmp/two/twoemptyfile'

The following files have the same length (22 B)
'/tmp/one/onesame' '/tmp/two/twosame'

The following files have the same length (15 B)
'/tmp/one/onesamesize' '/tmp/two/twosamesize'
$ checkDuplicates /tmp/one /tmp/two

The following files are identical
'/tmp/one/one empty file'
'/tmp/two/twoemptyfile'

The following files are identical
'/tmp/one/onesame'
'/tmp/two/twosame'




Here's the meat:
import java.io.File;

public class DuplicateSizeChecker {
    public static void main(String[] args){
        if(args.length < 2){
            System.err.println("Please specify two different directories as the first two arguments");
            return;
        }

        File folder1 = new File(args[0]);
        File folder2 = new File(args[1]);

        if(!folder1.isDirectory() || !folder2.isDirectory() || folder1.equals(folder2)){
            System.err.println("Please specify two different directories as the first two arguments");
        }
        else{
            if(args.length > 2){
                System.out.println("More than two arguments supplied; only the first two are necessary; subsequent ones will be ignored");
            }

            // List each directory once instead of re-listing inside the loops
            File[] files1 = folder1.listFiles();
            File[] files2 = folder2.listFiles();

            // listFiles() returns null if the directory cannot be read
            if(files1 == null || files2 == null){
                System.err.println("Could not read the contents of one of the directories");
                return;
            }

            // The larger directory drives the outer loop
            if(files1.length > files2.length) {
                for(File f1 : files1){
                    for(File f2 : files2){
                        if(f1.isFile() && f2.isFile() && f1.length() == f2.length()){
                            printFileInfo(f1, f2);
                        }
                    }
                }
            }
            else{
                for(File f2 : files2){
                    for(File f1 : files1){
                        if(f1.isFile() && f2.isFile() && f1.length() == f2.length()){
                            printFileInfo(f1, f2);
                        }
                    }
                }
            }
        }
    }

    // Prints one same-size pair: a header line, then both quoted absolute paths
    private static void printFileInfo(File f1, File f2){
        System.out.println();
        System.out.println("The following files have the same length (" + f1.length() + " B)");
        System.out.println("'" + f1.getAbsolutePath() + "' '" + f2.getAbsolutePath() + "'");
        //System.out.println("\"" + f1.getAbsolutePath() + "\" \"" + f2.getAbsolutePath() + "\"");
    }

}
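
The nested loops above look at every possible pair, which is fine for small directories. For larger ones, a different technique may be worth trying: index one directory by file length in a map, then look up each file of the other directory. This is only a sketch of that alternative, not what DuplicateSizeChecker does; the names SizeIndexSketch, indexBySize and sameSizePairs are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizeIndexSketch {

    /** Maps each file length to the regular files in dir that have that length. */
    static Map<Long, List<File>> indexBySize(File dir) {
        Map<Long, List<File>> bySize = new HashMap<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return bySize; // not a directory, or it could not be read
        }
        for (File f : entries) {
            if (f.isFile()) {
                bySize.computeIfAbsent(f.length(), k -> new ArrayList<>()).add(f);
            }
        }
        return bySize;
    }

    /** Returns every pair (file from dir1, file from dir2) whose lengths are equal. */
    static List<File[]> sameSizePairs(File dir1, File dir2) {
        Map<Long, List<File>> index = indexBySize(dir2);
        List<File[]> pairs = new ArrayList<>();
        File[] entries = dir1.listFiles();
        if (entries == null) {
            return pairs;
        }
        for (File f1 : entries) {
            if (!f1.isFile()) {
                continue;
            }
            // Only files of exactly the same length are candidates
            for (File f2 : index.getOrDefault(f1.length(), Collections.emptyList())) {
                pairs.add(new File[] {f1, f2});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: java SizeIndexSketch <dir1> <dir2>");
            return;
        }
        for (File[] pair : sameSizePairs(new File(args[0]), new File(args[1]))) {
            System.out.println("'" + pair[0].getAbsolutePath() + "' '" + pair[1].getAbsolutePath() + "'");
        }
    }
}
```

The output keeps the quoted-pair format, so it should still work with the grep and cmp steps described earlier, although the order of the pairs may differ.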

Tuesday, June 2, 2009

Download Link Script

Sometimes you might see a URL (in plain, non-linked text) for something you would like to download. You can't right click and then "Save Link As..." (or similar, depending on your browser) because, well, it isn't a link. So oftentimes I find myself creating these quick, little, one-time HTML files to create a download link for myself. How about making a script to do this work instead? Yup, seems like a good idea. Here it is:


# filename is based on URL, but first replace special characters in the URL with dots
filename="`echo "$1" | tr -s " ~!@#$%^&*()+=[]{}\\|;':<>?,./" "."`"

# put the file in the /tmp directory (or any directory of your choosing), as defined below
tempdir="/tmp"
filepath="$tempdir/$filename.htm"

# create the quick HTML required for the download link
link="<html><head><title>Download link for $1</title></head><body><a href=\"$1\">$1</a></body></html>"

# write the download link HTML to the temporary file
echo "$link" > "$filepath"

# launch the file with the link ("open" is the macOS launcher; on Linux, xdg-open does the same job)
open "$filepath"



And you're done!
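
For comparison, here is roughly how the same idea might look in Java. It is a sketch under a couple of assumptions: the class and method names (DownloadLinkPage, fileNameFor, linkHtml, writeLinkPage) are invented for this example, the file name is built by replacing everything except letters, digits, '-' and '_' rather than by the exact character list used in the script, and the URL is HTML-escaped before being written, which the script does not do. Opening the page in a browser is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DownloadLinkPage {

    /** Turns a URL into a file name by collapsing runs of special characters into one dot. */
    static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9_-]+", ".");
    }

    /** Escapes the characters that would otherwise break the HTML or the href attribute. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds the one-link HTML page for the given URL. */
    static String linkHtml(String url) {
        String safe = escapeHtml(url);
        return "<html><head><title>Download link for " + safe + "</title></head>"
                + "<body><a href=\"" + safe + "\">" + safe + "</a></body></html>";
    }

    /** Writes the page into dir and returns the path of the new file. */
    static Path writeLinkPage(String url, Path dir) throws IOException {
        Path page = dir.resolve(fileNameFor(url) + ".htm");
        Files.write(page, linkHtml(url).getBytes(StandardCharsets.UTF_8));
        return page;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java DownloadLinkPage <url>");
            return;
        }
        Path tempDir = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println(writeLinkPage(args[0], tempDir));
    }
}
```

The program prints the path of the generated file, so it can be handed to open (or a similar launcher) from the shell.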

Monday, May 18, 2009

Open Source Projects

I started some projects on Google Code last year and have recently picked up work on them again. You may want to check them out - there is already a perfectly usable (and extremely useful, in my opinion) Java library for checking object state during runtime, checking passed method parameters and exception chaining. This is particularly useful for things like dependency injection and can help you painlessly throw exceptions early (a good practice) - sometimes as early as a constructor call. Fewer meaningless NullPointerExceptions and fewer exceptions with no message will ultimately result in faster development and more stable production code.
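
To give a flavour of what "throwing early" means, here is a minimal sketch that uses only the standard library. It is not the API of the library mentioned above; FailFastSketch, ReportService and parsePort are hypothetical names chosen for the example.

```java
import java.util.Objects;

public class FailFastSketch {

    /** Hypothetical service that receives its collaborator through constructor injection. */
    static final class ReportService {
        private final Runnable printer;
        private final int copies;

        ReportService(Runnable printer, int copies) {
            // Fail in the constructor, with a message, instead of much later with a bare NullPointerException
            this.printer = Objects.requireNonNull(printer, "printer must not be null");
            if (copies < 1) {
                throw new IllegalArgumentException("copies must be at least 1, but was " + copies);
            }
            this.copies = copies;
        }

        void run() {
            for (int i = 0; i < copies; i++) {
                printer.run();
            }
        }
    }

    /** Exception chaining: the original cause travels along with the more meaningful exception. */
    static int parsePort(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("port is not a number: '" + text + "'", e);
        }
    }

    public static void main(String[] args) {
        try {
            new ReportService(null, 1);
        } catch (NullPointerException e) {
            System.out.println("Rejected at construction: " + e.getMessage());
        }
        try {
            parsePort("eighty");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage() + " (caused by " + e.getCause().getClass().getSimpleName() + ")");
        }
    }
}
```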

Another project currently being worked on quite actively is meant to create an abstraction of a source code repository (version control system) with the goal of allowing developers to take better advantage of the repository. Here is the "Overview" blurb as of today:

The goals of this project are to allow teams to use an SVN repository to develop more efficiently by allowing for:

  • easy separation of concerns for various files (modular area use)

  • trunk stability (code sandboxing)

  • integration with an arbitrary tasklist, bug tracker or other applicable project management tool



All three features should be usable individually in case a team wants to use one, but not the other(s). They will be implemented in a library, which should then have a number of interfaces built on top of it, including (but not necessarily limited to) a CLI, Eclipse plug-in and a NetBeans plug-in.


Another library planned for the future deals with Tagging. Currently, tagging is flat and inflexible. My idea is to introduce hierarchy and some other interesting concepts/features into Tagging.

Take a look!
http://code.google.com/p/generic-libraries/