Evidently it’s a good idea to test your Hadoop MapReduce functions on a small subset of data with Hadoop running in standalone mode. If you are new to Hadoop and feeling your way, like I am, this makes perfect sense, as you get to practice with the map and reduce functions without having to worry about setting up a cluster of nodes. It also gives you the opportunity to send all sorts of stuff to stdout, so you can find out what’s in all the Hadoop API classes; ReflectionToStringBuilder is your friend in this case.

One thing you have to do before invoking Hadoop, though, is set the classpath so that it can find your newly compiled classes. This is pretty trivial if you don't use any third-party libraries:

# Assuming you are setting this from the same folder as you
# are building your code with Maven...
export HADOOP_CLASSPATH=./target/classes
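With the classpath set, you can then invoke your job class directly with the `hadoop` command. A minimal sketch; `com.example.WordCount` and the input/output paths are made-up names for illustration, not from a real project:

```shell
# Build the classes, then point Hadoop at them. In standalone mode
# everything runs in a single local JVM against the local filesystem.
mvn compile
export HADOOP_CLASSPATH=./target/classes

# Run your job class directly; com.example.WordCount and the paths
# are hypothetical -- substitute your own job class and test data.
hadoop com.example.WordCount input/sample.txt output
```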

When you start adding third-party libraries, however, it's not as simple. If you choose wisely, they may already be included in the Hadoop installation, for example Apache Commons Lang 2.x. If, like me, you've moved on to Apache Commons Lang 3.x, then you have to include the JAR on the HADOOP_CLASSPATH so that it can be picked up and used. If you are using a lot of third-party libs, you would be a fool to try to manage this by hand.

If you are using Maven as your build tool, then you can use the Maven Dependency Plugin to copy all your third-party JARs to a suitable location for inclusion on the classpath. Just make sure you have included and excluded the correct dependency scopes, otherwise you'll end up with a bucket full of JARs you don't need in your chosen location.
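A minimal sketch of that plugin configuration; binding to the package phase and copying into target/libs are my assumptions, not requirements:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-dependencies</id>
      <phase>package</phase>
      <goals>
        <goal>copy-dependencies</goal>
      </goals>
      <configuration>
        <!-- target/libs is an arbitrary choice; any folder you can
             reference from HADOOP_CLASSPATH will do. -->
        <outputDirectory>${project.build.directory}/libs</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```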

Then it's just a case of modifying the classpath to also point to the folder that contains all your third-party JARs, and away you go.

# Assuming you are setting this from the same folder as you
# are building your code with Maven and have put all your
# 3rd party JARs in target/libs...
export HADOOP_CLASSPATH=./target/classes:./target/libs/*

I have to confess that when I realised I needed a classpath with all the third-party JARs on it, I wondered if I could do some bash scripting to iterate over the folder and produce a classpath that way. Glad I did a Google search first, as I'd totally forgotten about using a wildcard on a classpath: since Java 6, an entry ending in * matches every JAR in that directory, and the JVM does the expansion itself. There's not really much call for that kind of thing when writing webapps…

Copying The Right Dependencies With The Maven Dependency Plugin

I've been playing with Hadoop recently and ran into an issue with the Maven dependency plugin copying the JAR files from all the scopes into my lib folder. No problem, I thought: you can exclude scoped dependencies with the excludeScope configuration parameter, so I set that to provided, but this still left the test dependencies being copied. As you can't set two excludeScope elements, and the one element you can set only takes a single scope, this is a bit of an issue.

It turns out that if you want to exclude dependencies from both the test and provided scopes, you need to exclude the provided scope and include the runtime scope. So your plugin snippet becomes something like:


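A sketch of what that might look like, assuming the copy-dependencies goal bound to the package phase and a target/libs output directory (both my assumptions); the two scope elements are the important part:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-dependencies</id>
      <phase>package</phase>
      <goals>
        <goal>copy-dependencies</goal>
      </goals>
      <configuration>
        <outputDirectory>${project.build.directory}/libs</outputDirectory>
        <!-- Exclude provided-scope dependencies (e.g. the Hadoop JARs)... -->
        <excludeScope>provided</excludeScope>
        <!-- ...and include only up to runtime scope, which also drops
             test-scope dependencies like JUnit. -->
        <includeScope>runtime</includeScope>
      </configuration>
    </execution>
  </executions>
</plugin>
```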
This means that your lib folder isn't polluted with test JARs like JUnit, Hamcrest and Mockito, and, more importantly, isn't full of all the Hadoop dependencies either. Which means that your Hadoop standalone-mode classpath for testing out those MapReduce jobs isn't full of unnecessary clutter.