11 April 2013

Feed Hadoop with a Large Amount of Images for Parallel Processing

We are investigating how to use Hadoop to process a large number of images in parallel with our image Toolbox in the NeCTAR cloud. One technical requirement is to work out how to feed Hadoop a set of binary image files.

Hadoop was originally developed for text mining, so its whole design is based on <key, value> pairs as input and output. Thanks to its rich set of software harnesses, examples, etc., it is straightforward to use for text mining, but not for binary data such as images. This blog discusses two approaches to this requirement:

1. Using Hadoop SequenceFile


A SequenceFile is a flat file consisting of binary <key, value> pairs, which can be used directly as Hadoop input.

Here is an example of generating a SequenceFile from a set of image files in Java.
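The following is a minimal sketch using the classic SequenceFile.Writer API; the class name ImageToSequenceFile and the argument handling are illustrative assumptions, not the exact original code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImageToSequenceFile {

    public static void main(String[] args) throws IOException {
        String imageDir = args[0];   // local directory holding the image files
        String outputUri = args[1];  // destination, e.g. an hdfs:// path for the SequenceFile

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(outputUri), conf);

        // key = image file name (Text), value = raw image bytes (BytesWritable)
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(outputUri), Text.class, BytesWritable.class);
        try {
            for (File image : new File(imageDir).listFiles()) {
                byte[] bytes = new byte[(int) image.length()];
                FileInputStream in = new FileInputStream(image);
                try {
                    IOUtils.readFully(in, bytes, 0, bytes.length);
                } finally {
                    in.close();
                }
                // append one <file name, image bytes> record per image
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}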


Here is the execution of the above code:
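A hypothetical invocation (the jar name and paths are placeholders):

hadoop jar image-tools.jar ImageToSequenceFile /data/images hdfs://namenode:9000/user/me/images.seq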


The generated file can then be fed to Hadoop as <key, value> pairs for parallel processing, as sketched below.
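On the consuming side, the job declares SequenceFileInputFormat so that each map() call receives one file name and one image. This is a sketch under that assumption; the actual image-processing step is left as a placeholder:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageMapper extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    protected void map(Text filename, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // the valid image data is the first getLength() bytes of getBytes()
        byte[] image = new byte[value.getLength()];
        System.arraycopy(value.getBytes(), 0, image, 0, value.getLength());

        // ... run the actual image processing on 'image' here ...
        context.write(filename, new Text("processed " + image.length + " bytes"));
    }
}
// In the driver: job.setInputFormatClass(SequenceFileInputFormat.class);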

2. Using HDFS (Hadoop Distributed File System) URIs


Instead of feeding Hadoop the data contents directly, we can feed it a file list: a text file where each line is the HDFS URI of one file, which becomes the value of a <key, value> pair. Each mapper then uses its assigned <key, value> pair to load the corresponding data (e.g. an image) from HDFS and process it, as sketched below.
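A sketch of such a mapper, assuming the default TextInputFormat (the key is the byte offset of the line, the value is one HDFS URI per line); the class name UriListMapper is a placeholder:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UriListMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text uri, Context context)
            throws IOException, InterruptedException {
        // each input line is one HDFS URI, e.g. hdfs://namenode:9000/images/a.jpg
        Configuration conf = context.getConfiguration();
        Path path = new Path(uri.toString().trim());
        FileSystem fs = path.getFileSystem(conf);

        // load the whole file from HDFS into memory
        byte[] bytes = new byte[(int) fs.getFileStatus(path).getLen()];
        FSDataInputStream in = fs.open(path);
        try {
            in.readFully(0, bytes);
        } finally {
            IOUtils.closeStream(in);
        }

        // 'bytes' now holds the image; hand it to the image-processing code
        context.write(new Text(path.getName()), new Text("loaded " + bytes.length + " bytes"));
    }
}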



Here is a simple comparison between the above two approaches: