Evidently it’s a good idea to test your Hadoop MapReduce functions on a small subset of data with Hadoop running in standalone mode. If you are new to Hadoop and feeling your way, like I am, this makes perfect sense, as you get to practice with the map and reduce functions without having to worry about setting up a cluster of nodes. It also gives you the opportunity to send all sorts of stuff to stdout, so you can find out what’s in all the Hadoop API classes; ReflectionToStringBuilder is your friend in this case.

One thing you have to do before invoking Hadoop, though, is set the classpath so that it can find your newly compiled classes. This is pretty trivial if you don’t use any third-party libraries:

# Assuming you are setting this from the same folder that you
# are building your code in with Maven...
export HADOOP_CLASSPATH=./target/classes

When you start adding third-party libraries, however, it’s not as simple. If you choose wisely, they may already be included in the Hadoop installation, for example Apache Commons Lang 2.x. If, like me, you’ve moved on to Apache Commons Lang 3.x, then you have to include the JAR on the HADOOP_CLASSPATH so that it can be picked up and used. If you are using a lot of third-party libs, you would be a fool to try and manage this by hand.

If you are using Maven as your build tool, then you can use the Maven Dependency Plugin to copy all your third-party JARs to a suitable location for inclusion on the classpath. Just make sure you have included and excluded the correct dependency scopes, otherwise you’ll end up with a bucket full of JARs that you don’t need in your chosen location.
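As a rough sketch of that copy step, the plugin's copy-dependencies goal can be run straight from the command line. The function name copy_third_party_jars and the target/libs folder are my own choices for illustration, and I'm assuming that runtime is the scope you want (it brings in compile and runtime dependencies and leaves out test and provided ones); adjust to suit your own POM.

```shell
# Hypothetical helper: copy the project's third-party JARs into target/libs.
# Assumes Maven is on the PATH and this is run from the project root.
copy_third_party_jars() {
    # includeScope=runtime selects compile and runtime dependencies only,
    # so test libraries and "provided" ones are left behind.
    mvn dependency:copy-dependencies \
        -DoutputDirectory=target/libs \
        -DincludeScope=runtime
}
```

Calling copy_third_party_jars after a build should leave the JARs sitting next to target/classes. If you would rather not remember the flags, the same settings can live in the plugin's configuration in the POM instead.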

Then it’s just a case of modifying the classpath to also point to the folder that contains all your third party JARs and away you go.

# Assuming you are setting this from the same folder that you
# are building your code in with Maven, and have put all your
# 3rd party JARs in target/libs...
export HADOOP_CLASSPATH=./target/classes:./target/libs/*

I have to confess that when I realised I needed to create a classpath with all the third-party JARs on it, I wondered if I could do some bash scripting to iterate over the folder and produce a classpath that way. Glad I Googled first, as I’d totally forgotten that you can use a wildcard on a classpath; there’s not really much call for that kind of thing when writing webapps…
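For the curious, the loop I nearly wrote would have looked something like the sketch below. The function name build_classpath is made up for illustration, and this is just one way of doing it, not something Hadoop requires.

```shell
# Hypothetical helper: join every JAR in a folder into a colon-separated
# classpath string. The body runs in a subshell, via the ( ) form, so its
# variables don't leak into the calling shell.
build_classpath() (
    dir=$1
    cp=""
    for jar in "$dir"/*.jar; do
        # If the folder has no JARs the pattern stays unexpanded; skip it.
        [ -e "$jar" ] || continue
        # Only put a colon in front when cp already holds an entry.
        cp="${cp:+$cp:}$jar"
    done
    printf '%s\n' "$cp"
)
```

You would then use it along the lines of export HADOOP_CLASSPATH=./target/classes:$(build_classpath ./target/libs). The wildcard gets you the same set of JARs with a single short entry, which is why I didn't bother.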
