Hadoop: an overview including what it is and how it works (conceptually).

Hadoop (an overiew)

Please note that most of this document is paraphrased or directly copied from the Hadoop wiki. The goal here is to bring the important concepts into one single place to more easily get your head around Hadoop and what you can do with it. Written on August 15, 2007 by Kate Rhodes ( masukomi @ masukomi.org ).

What is it?
What was that MapReduce thing again?
What is it good for?
Who is it good for?
What are it's requirements

What is it?

Hadoop is an open source Java implementation of Google's MapReduce algorithm along with an infrastructure to support distributing it over multiple machines. This includes it's own filesystem ( HDFS Hadoop Distributed File System based on the Google File System) which is specifically tailored for dealing with large files. When thinking about Hadoop it's important to keep in mind that the infrastructure it has is a huge part of it. Implementing MapReduce is simple. Implementing a system that can intelligently manage the distribution of processing and your files, and breaking those files down into more manageable chunks for processing in an efficient way is not.

The short short version is that Hadoop takes two user specified functions, one to apply to all your "records" and one to summarize the out put of the previous operations. It then spreads those instructions over a cluster of computers which then apply them to chunks of data, that you have hopefully told it to replicate across a cross-section of the cluster.

It is written with large clusters of computers in mind and is built around the following assumptions:
* Hardware will fail.
* Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.
* Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
* Applications need a write-once-read-many access model. In other words, process log files with it, don't edit them. In fact, one a file is in HDFS you read it, replicate it, or delete it. Period. But, there's nothing to prevent you from creating a new file with the output of a MapReduce run and using that as input to another MapReduce or just as something to retrieve later.
* Moving Computation is Cheaper than Moving Data. A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system.
* The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
* Portability is important. It's written in Java and the HDFS is easily portable between platforms so you can install it on windows, mac, linux, or anything else Java will run under. However, Hadoop is pretty much written with Linux in mind so your mileage may vary, although there appears to have been a fair amount of consideration given to windows users with cygwin installed.

HDFS breaks files down into blocks which can be replicated across it's network (how many times it's replicated it determined by your application and can be specified on a per file basis). This is one of the most important performance features and, according to the docs "...is a feature that needs a lot of tuning and experience." You really don't want to have 50 machines all trying to pull from a 1TB file on a single data node, at the same time, but you also don't want to have it replicate a 1TB file out to 50 machines. So, it's a balancing act. Because of this Hadoop understands the concepts of racks of computers because it's generally less network bandwidth if you stay within your rack. HDFS does not automatically rebalance replicated data when new nodes are added. However, newly created files will likely have their blocks placed on the new nodes. There are several way to rebalance the cluster manually.

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack.

Hadoop installations are broken into three types.
* The NameNode acts as the HDFS master, managing all decisions regarding data replication.
* The JobTracker manages the MapReduce work. It "...is the central location for submitting and tracking MR jobs in a network environment."
* Slaves, which do the grunt work.
If you wanted to, for testing maybe, you could run everything on the same box fairly easily.

You don't have to write your MapReduces in Java but it would probably be easier if you did. HadoopStreaming permits any shell command to be used as a map or reduce function, and Hadoop is also developing C and C++ APIs and a SWIG-compatible pipes API.

As for scalability, the largest current cluster of Hadoop boxes is at Yahoo! and is over 5k boxes but they have noticed some issues once they passed 1000. It is not clear what exactly those "issues" are but, as Hadoop is open source, all of the bugs are managed where people can see them, and hopefully help to fix them.

What was that MapReduce thing again?
From Google's paper on MapReduce:
"MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. "

"...We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all of the values that shared the same key, in order to contribute the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance."

The "map" and "reduce" functions we're talking about are your standard functional programming concepts (from Lisp, Haskell, Scheme, et. all) which can be found in a variety of more common languages like Python and Ruby.

An example from Google's paper on MapReduce:

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
   //key: document name
   //value: document contents
   for each word w in value:
      EmitIntermediate(w, "1");

reduce(String key, Iterator values):
   //key: a word
   //values: a list of counts
   int result = 0;
   for each v in values:
      result += ParseInt(v);
   Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all the counts emitted for a particular word.

For more information on how Map and Reduce operations are carried out in Hadoop read this (after you read the Google paper).

What is it good for?
Conceptually MapReduce is good for taking large sets of the same type of data (logs being a prime example) and applying a simple algorithm to each of those data items (maybe calculating what minute of the day a log record was made), and then doing something more complex, and useful, with the extracted key/value pairs (count up all the hits in each minute for a graph).

Who is it good for?
It's good for anyone with large data sets that need processing and lots of computers to spread that processing over. The key to all this working though, is the ability to break your computations down into conceptually small pieces because MapReduce is very simplistic. However, Stephen Wolfram has repeatedly shown that some amazingly complex things can be accomplished with very simple programs, and that they have unlimited application in everyday systems. So, mostly, leveraging Hadoop requires a different approach to programming. Instead of complex functions you write collections of simple ones that build upon the work of each other.

What are its requirements?
Java 1.5 or higher (although you'll probably have an easier time of it on Linux)
rsync (to manage remote installations)
Disk space (enough to hold your source data and it's replications)
RAM (you can never have enough ram)
A cluster of computers.

If you don't have a spare cluster of computers you can always use Amazon's EC2, although they currently limit you to 20 simultaneous instances. There are already EC2 images pre-configured with Hadoop, so all you'd have to do is boot them up, configure them (there are docs and scripts for doing this) and be willing to pay for the run time (not prohibitive). You'll also want to upload your data to S3 because there's no charge for bandwidth between EC2 and S3 and it's much faster than serving from your box to EC2. Of course, this may not be an option with proprietary data.