SubNucleon

Simple Duplicate File Checker

2009-07-31T09:23:00.004+02:00

Since it's harder to find an application that does what you need these days than it is to write one yourself, here's a small Java program that checks the files in two directories and prints out the absolute path names of any two files (one from each directory) that are of the same size. It does this for all same-size pairs. Setting up an alias in your favourite shell can help you use the app much faster and with nicer syntax. I use:

alias checkDuplicateSizes="java -cp ~/scripts DuplicateSizeChecker"

The path names of the possibly duplicated files are surrounded by single quotes (') and separated by a space, which makes them ideal for use with cmp. You can either copy and paste file pairs of interest by hand, or set up a script to cmp all of the files that DuplicateSizeChecker finds. The other lines make the output more readable for humans but get in the way in a script; you can easily grep them out using something like the following:

checkDuplicateSizes /tmp/a /tmp/b | grep "'/"

Finding identical files is then pretty easy with a little script (for which I also have an alias called checkDuplicates):

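#!/bin/bash

# List the same-size pairs, keep only the lines that hold the quoted
# path pairs, and byte-compare each pair with cmp.
java -cp ~/scripts DuplicateSizeChecker "$1" "$2" | grep "'/" |
while read -r filePair; do
        # eval lets the shell parse the single-quoted paths (which may
        # contain spaces) into the positional parameters $1 and $2
        eval "set -- $filePair"

        # cmp -s prints nothing and exits with status 0 if the files are identical
        if cmp -s "$1" "$2"; then
                echo
                echo "The following files are identical"
                printf "'%s'\n" "$1" "$2"
        fi
done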
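
If you would rather stay in Java for the comparison step, here is a minimal sketch of the decision that cmp -s makes for each pair: do the two files hold exactly the same bytes? The class SameContent and its method names are made up for illustration; they are not part of DuplicateSizeChecker.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of the original program): a Java stand-in for `cmp -s`.
final class SameContent {

        // Returns true when both streams deliver exactly the same bytes.
        static boolean sameBytes(InputStream a, InputStream b) {
                try {
                        int x;
                        do {
                                x = a.read();
                                if (x != b.read()) {
                                        return false; // a byte differs, or one stream ended early
                                }
                        } while (x != -1);
                        return true; // both streams ended at the same position
                } catch (IOException e) {
                        throw new UncheckedIOException(e);
                }
        }

        // Compares two files; the cheap length check comes first, as in DuplicateSizeChecker.
        static boolean sameFile(File f1, File f2) {
                if (f1.length() != f2.length()) {
                        return false;
                }
                try (InputStream a = new BufferedInputStream(new FileInputStream(f1));
                     InputStream b = new BufferedInputStream(new FileInputStream(f2))) {
                        return sameBytes(a, b);
                } catch (IOException e) {
                        throw new UncheckedIOException(e);
                }
        }

        public static void main(String[] args) {
                if (args.length < 2) {
                        System.err.println("Usage: java SameContent <file1> <file2>");
                        return;
                }
                boolean same = sameFile(new File(args[0]), new File(args[1]));
                System.out.println(same ? "identical" : "different");
        }
}
```

Reading through buffered streams keeps memory use flat even for large files, at the price of one read call per byte on the buffer.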

That's it. This has helped me solve my problems (for now) and I hope it helps you too. I haven't included any options like suppressing certain output, having more verbose output or different formatting, recursing through subdirectories, etc. because I wanted to get this done quickly and because I am trying to be a little more YAGNI. Feel free to take, use, adapt or do whatever you like with this code (be respectful and reasonable, and leave a comment if it helped you out somehow - especially if you modify the code to do something smarter).

Here's some sample output:

$ checkDuplicateSizes /tmp/one /tmp/two

The following files have the same length (0 B)
'/tmp/one/one empty file' '/tmp/two/twoemptyfile'

The following files have the same length (22 B)
'/tmp/one/onesame' '/tmp/two/twosame'

The following files have the same length (15 B)
'/tmp/one/onesamesize' '/tmp/two/twosamesize'
$ checkDuplicates /tmp/one /tmp/two

The following files are identical
'/tmp/one/one empty file'
'/tmp/two/twoemptyfile'

The following files are identical
'/tmp/one/onesame'
'/tmp/two/twosame'

Here's the meat:

import java.io.File;

public class DuplicateSizeChecker {
        public static void main(String[] args){
                if(args.length < 2){
                        System.err.println("Please specify two different directories as the first two arguments");
                        return;
                }


                File folder1 = new File(args[0]);
                File folder2 = new File(args[1]);

                if(!folder1.isDirectory() || !folder2.isDirectory() || folder1.equals(folder2)){
                        System.err.println("Please specify two different directories as the first two arguments");
                }
                else{
                        if(args.length > 2){
                                System.out.println("More than two arguments supplied; only the first two are necessary; subsequent ones will be ignored");
                        }

                        int size1 = folder1.list().length;
                        int size2 = folder2.list().length;

                        if(size1 > size2) {
                                for(File f1 : folder1.listFiles()){
                                        for(File f2 : folder2.listFiles()){
                                                if(f1.isFile() && f2.isFile() && f1.length() == f2.length()){
                                                        printFileInfo(f1, f2);
                                                }
                                        }
                                }
                        }
                        else{
                                for(File f2 : folder2.listFiles()){
                                        for(File f1 : folder1.listFiles()){
                                                if(f1.isFile() && f2.isFile() && f1.length() == f2.length()){
                                                        printFileInfo(f1, f2);
                                                }
                                        }
                                }
                        }
                }
        }

        private static void printFileInfo(File f1, File f2){
                System.out.println();
                System.out.println("The following files have the same length (" + f1.length() + " B)");
                System.out.println("'" + f1.getAbsolutePath() + "' '" + f2.getAbsolutePath() + "'");
                //System.out.println("\"" + f1.getAbsolutePath() + "\" \"" + f2.getAbsolutePath() + "\"");
        }

}
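
The nested loops above compare every file in one directory with every file in the other, which is perfectly fine for small folders. For large directories, one possible variation is to index the first directory by file size and look up each file of the second directory in that index. This is only a sketch: the class SizeIndex and the methods sizesOf and sameSizePairs are hypothetical names, not part of the program above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical variation on DuplicateSizeChecker: index one directory by size
// instead of comparing every file with every other file.
final class SizeIndex {

        // Maps the absolute path of each regular file in dir to its length in bytes.
        static Map<String, Long> sizesOf(File dir) {
                Map<String, Long> sizes = new TreeMap<>();
                File[] entries = dir.listFiles(); // null if dir cannot be read
                if (entries != null) {
                        for (File f : entries) {
                                if (f.isFile()) {
                                        sizes.put(f.getAbsolutePath(), f.length());
                                }
                        }
                }
                return sizes;
        }

        // Returns one {path1, path2} pair for every two files that have the same size.
        static List<String[]> sameSizePairs(Map<String, Long> sizes1, Map<String, Long> sizes2) {
                // Group the paths of the first directory by their size
                Map<Long, List<String>> bySize = new HashMap<>();
                for (Map.Entry<String, Long> e : sizes1.entrySet()) {
                        bySize.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
                }
                // Look up every file of the second directory by its size
                List<String[]> pairs = new ArrayList<>();
                for (Map.Entry<String, Long> e : sizes2.entrySet()) {
                        for (String path1 : bySize.getOrDefault(e.getValue(), new ArrayList<>())) {
                                pairs.add(new String[] {path1, e.getKey()});
                        }
                }
                return pairs;
        }

        public static void main(String[] args) {
                if (args.length < 2) {
                        System.err.println("Usage: java SizeIndex <dir1> <dir2>");
                        return;
                }
                List<String[]> pairs = sameSizePairs(sizesOf(new File(args[0])), sizesOf(new File(args[1])));
                for (String[] p : pairs) {
                        // Same quoting as DuplicateSizeChecker, so the grep and cmp steps still apply
                        System.out.println("'" + p[0] + "' '" + p[1] + "'");
                }
        }
}
```

The output order differs from the original program, but the set of pairs should be the same.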

Download Link Script

2009-06-02T12:10:00.003+02:00

Sometimes you might see a URL (in plain, non-linked text) for something you would like to download. You can't right click and then "Save Link As..." (or similar, depending on your browser) because, well, it isn't a link. So oftentimes I find myself creating these quick, little, one-time HTML files to create a download link for myself. How about making a script to do this work instead? Yup, seems like a good idea. Here it is:

# filename is based on URL, but first replace special characters in the URL with dots
filename="`echo "$1" | tr -s " ~!@#$%^&*()+=[]{}\\|;':<>?,./" "."`"

# put the file in the /tmp directory (or any directory of your choosing), as defined below
tempdir="/tmp"
filepath="$tempdir/$filename.htm"

# create the quick HTML required for the download link
link="<html><head><title>Download link for $1</title></head><body><a href=\"$1\">$1</a></body></html>"

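# write the download link HTML to the temporary file
# (the quotes keep the shell from splitting or expanding the HTML and the path)
echo "$link" > "$filepath"

# launch the file with the link (open is the Mac OS X launcher; on Linux, xdg-open does the same job)
open "$filepath"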

And you're done!
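
For comparison, here is a rough sketch of the same idea in Java. The class DownloadLink and its methods are hypothetical names. It also differs from the script in two ways that are my own assumptions: it replaces every character that is not a letter, digit, '-' or '_' with a dot (instead of a fixed list of special characters), and it HTML-escapes the URL before putting it into the page. Opening the resulting file is left to you (for example with open, as above).

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical Java counterpart of the download link script.
final class DownloadLink {

        // Collapses every run of characters other than letters, digits, '-' and '_' into one dot.
        static String fileNameFor(String url) {
                return url.replaceAll("[^A-Za-z0-9_-]+", ".");
        }

        // Escapes the characters that are special in HTML text and attribute values.
        static String escape(String s) {
                return s.replace("&", "&amp;").replace("<", "&lt;")
                        .replace(">", "&gt;").replace("\"", "&quot;");
        }

        // Builds the quick HTML page that contains the download link.
        static String linkHtml(String url) {
                String u = escape(url);
                return "<html><head><title>Download link for " + u + "</title></head>"
                        + "<body><a href=\"" + u + "\">" + u + "</a></body></html>";
        }

        // Writes the page to <dir>/<fileNameFor(url)>.htm and returns the created file.
        static File writeLinkFile(String url, File dir) {
                File target = new File(dir, fileNameFor(url) + ".htm");
                try {
                        Files.write(target.toPath(), linkHtml(url).getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                        throw new UncheckedIOException(e);
                }
                return target;
        }

        public static void main(String[] args) {
                if (args.length < 1) {
                        System.err.println("Usage: java DownloadLink <url>");
                        return;
                }
                File tempDir = new File(System.getProperty("java.io.tmpdir"));
                System.out.println(writeLinkFile(args[0], tempDir).getAbsolutePath());
        }
}
```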

Open Source Projects

2009-05-18T02:24:00.004+02:00

I started some projects on Google Code last year and have recently picked up work on them again. You may want to check them out - there is already a perfectly usable (and extremely useful, in my opinion) Java library for checking object state during runtime, checking passed method parameters and exception chaining. This is particularly useful for things like dependency injection and can help you painlessly throw exceptions early (a good practice) - sometimes as early as a constructor call. Fewer meaningless NullPointerExceptions and fewer exceptions with no message will ultimately result in faster development and more stable production code.

Another project currently being worked on quite actively is meant to create an abstraction of a source code repository (version control system) with the goal of allowing developers to take better advantage of the repository. Here is the "Overview" blurb as of today:

The goals of this project are to allow teams to use an SVN repository to develop more efficiently by allowing for:

easy separation of concerns for various files (modular area use)

trunk stability (code sandboxing)

integration with an arbitrary tasklist, bug tracker or other applicable project management tool

All three features should be usable individually in case a team wants to use one, but not the other(s). They will be implemented in a library, which should then have a number of interfaces built on top of it, including (but not necessarily limited to) a CLI, Eclipse plug-in and a NetBeans plug-in.

Another library planned for the future deals with Tagging. Currently, tagging is flat and inflexible. My idea is to introduce hierarchy and some other interesting concepts/features into Tagging.

Take a look!
http://code.google.com/p/generic-libraries/

Remember The Milk on your iPod

2008-12-23T16:50:00.010+01:00

A little while ago, I decided I really wanted an easy way to get my Remember The Milk (RTM) tasks onto my iPod. I always have my iPod with me, so this way I have my tasks even when I am away from my laptop.

For those of you with an iPhone, this doesn't really apply since you always have your tasks available via the iPhone-optimized RTM webapp (unless you're a non-pro user). For those of you with an iPod Touch, this certainly still applied when I originally created it in October due to the non-permanent nature of the device's internet connection. However, with the release of the offline-capable, native iPhone/iPod Touch RTM app (available from the app store), my approach applies to a slightly smaller audience. Nevertheless, non-pro users and owners of "regular" iPods will still definitely find this useful.

In any case, my approach was the following:

Pull RTM tasks in ATOM format from a particular list or smart list to a temporary file
Transform the ATOM XML to a plain text format
Copy the plain text tasks file to the notes directory on my iPod so I can view them on my device

In addition, I wanted to make this convenient, easy and unintrusive so I would actually do it, which yielded the following three requirements:

Run the above three tasks in sequence with no intervention (i.e. a script)
Run the script nicely, i.e.
- The terminal window appears in a visually pleasing way, or does not appear
- The terminal window closes after the script is run
Run the script automatically when the iPod is connected

Retrieving the Tasks
Depending on your platform, retrieving the tasks requires a slightly different utility. On Mac OS X, I retrieve the tasks using curl. On many Linux distros, it can be done using wget (which is how I did it on Ubuntu when I first started working on this).

Using curl:
curl --silent --url "$TaskListURL" --output "$TempRawTaskFilePath"

Using wget:
wget --quiet --no-check-certificate "$TaskListURL" -O "$TempRawTaskFilePath"

Since the ultimate goal was to run these apps from a script, I specified the options that suppress output (--silent and --quiet). On my Ubuntu machine, wget needed the --no-check-certificate option to retrieve tasks via an HTTPS URL (which is what you should use). Be aware that this option switches off verification of the server's certificate, so only use it if wget cannot validate the certificate otherwise. Both apps (of course) require the URL to retrieve, which is held in the variable TaskListURL above. The retrieved tasks are stored in the file whose path is in the variable TempRawTaskFilePath.

The URL of the ATOM feed containing the tasks in a desired list can be obtained as follows:

Go to Remember the Milk
Log in, if not already logged in
Go to your Tasks
Select the desired list (or smart list)
Make sure no tasks are selected
Click the Atom link in the List tab of the floating right panel, and copy the URL from your browser's address bar. If you use Firefox, right click on the Atom link and select "Copy Link Location" from the context menu instead, since Firefox will otherwise try to add the feed to your bookmarks using its internal feed reader

Transforming the ATOM feed
Apache has a standards-compliant XSLT processor called Xalan. On Ubuntu, I used the C++ version of Xalan. On OS X, I use the Java version. Either way, it is easy enough to use from the command line once you have the ATOM data and an XSL file specifying how it should be transformed.

Using the Java version:
java -jar xalan.jar -text -in "$TempRawTaskFilePath" -xsl atom2plain.xsl -out "$TempTransformedTaskFilePath"

Using the C++ version:
xalan -in "$TempRawTaskFilePath" -xsl atom2plain.xsl -out "$TempTransformedTaskFilePath"

Playing around with the XSL took some time to get the desired plain text output. It is included in the download at the bottom.
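
To give an idea of what such a transformation looks like, here is a small sketch that runs an XSLT stylesheet from Java through the standard javax.xml.transform API. The stylesheet in it is NOT the one from the download: it is a deliberately tiny stand-in that prints one "- title" line per Atom entry, and it assumes the task name sits in each entry's title element. The class name AtomToText and the sample feed are made up as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: turns an Atom feed into plain text with XSLT.
final class AtomToText {

        // Illustrative stylesheet (an assumption, not the author's): one "- title" line per entry.
        static final String STYLESHEET =
                "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
                + " xmlns:atom='http://www.w3.org/2005/Atom'>"
                + "<xsl:output method='text'/>"
                + "<xsl:template match='/'>"
                + "<xsl:for-each select='atom:feed/atom:entry'>"
                + "<xsl:text>- </xsl:text>"
                + "<xsl:value-of select='atom:title'/>"
                + "<xsl:text>&#10;</xsl:text>"
                + "</xsl:for-each>"
                + "</xsl:template>"
                + "</xsl:stylesheet>";

        // Applies STYLESHEET to the given Atom document and returns the text output.
        static String toPlainText(String atomXml) {
                try {
                        Transformer transformer = TransformerFactory.newInstance()
                                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
                        StringWriter out = new StringWriter();
                        transformer.transform(new StreamSource(new StringReader(atomXml)),
                                new StreamResult(out));
                        return out.toString();
                } catch (TransformerException e) {
                        throw new IllegalStateException("Transformation failed", e);
                }
        }

        public static void main(String[] args) {
                // Made-up sample feed with two tasks
                String feed = "<feed xmlns='http://www.w3.org/2005/Atom'>"
                        + "<entry><title>Buy milk</title></entry>"
                        + "<entry><title>Call the bank</title></entry>"
                        + "</feed>";
                System.out.print(toPlainText(feed));
        }
}
```

The important detail is the atom namespace prefix: Atom elements live in the http://www.w3.org/2005/Atom namespace, so a stylesheet that selects plain feed/entry without a prefix matches nothing.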

Running the tasks from a script
The above steps were easy enough to capture in a bash script. I then pulled out all of the generic parts of the script into a separate script so I can easily repeat the task for a number of task lists (or smart lists). Finally, I created a second "runner script" to run the generic script with parameters such that I get the tasks that I want copied to my iPod.

The two bash scripts are included in the download at the bottom.

Running the script nicely
In order to have the runner script run in an unintrusive manner, I created a Terminal configuration (on OS X) that executes the script in a Terminal window that:

starts minimized (i.e. only shows up in the dock)
shows up mostly transparent and quite small if de-iconified
exits when the script finishes its tasks

This Terminal configuration is also included in the download at the bottom.

Running the script automatically
Finally, in order to fully automate the process, I looked for a utility that would run the script whenever the iPod was connected to my computer. The app I came across is Do Something When (DSW). It does exactly what I wanted - whenever the iPod is connected, DSW detects it and runs the Terminal configuration (which in turn runs the script).

The goods
Here is a ZIP archive containing:

The XSL file (rtm2ipod.xsl)
The generic bash script (rtm2ipod.sh)
The runner bash script (run_rtm2ipod.sh)
The Terminal configuration (term_run_rtm2ipod.term)

Further work
It would be nice if ways were found to do the following things (and posted in the comments) so that the entire flow described in this post can be realized on Linux and Windows machines as well as Mac OS X (since I have only described and created the full flow for OS X):

Create a batch equivalent of the bash scripts (for Windows systems)
Create the equivalent of a "Terminal configuration" - something that achieves the goals of running the script in a visually pleasing, auto-closing window that starts off minimized/iconified (for Linux and Windows systems)
Find the equivalent of Do Something When so the script can be run automatically when an iPod is connected (for Linux and Windows systems)

If you have good, concrete hints on how to achieve the above, or time to actually create the artifacts, please take the time to write a comment. As always, feel free to let me know if you found this post useful. Thanks!

First Post

2008-11-17T23:06:00.000+01:00

Over the years I have found I often email myself information about:

cool things I find
important or useful information
mini-projects I put together in my spare time
ideas for mini-projects I should put together in my spare time
etc

I figure it's high time I find a better way of organizing this information. Additionally, most of the stuff I save tends to be fruit(s) of many hours of research and experimentation labour; there's no good reason this information shouldn't be readily available for other people to find and use.

Often I find exactly the information I'm looking for easily and quickly. Many times it's on someone's blog. This is exactly the information I will not be blogging about. My aim is for the information I put up to be interesting, useful and original. If you know of another place on the web with similar content to one of my posts (and there's no link already), chances are I couldn't easily find it, so feel free to bring it to my attention and (do the world a favour and) paste the URL.

Needless to say, comments are welcome (this is Web 2.0 stuff after all). Rudeness, unnecessary ranting and so forth are not. If your comment contains excessive and unnecessary profanity, inappropriate content or other attributes that cause me to look at it and think something along the lines of "wow, this person is..." followed by "immature," "a bigot," "a real jerk" or anything similar, chances are it will not make it past my filter.

I will try to make sure my posts are useful, interesting and time-saving. Please make sure your comments are tailored with the same spirit.