Today, in the second post in a series on big data and data mining, I’ll be discussing MapReduce, a strategy for handling large amounts of data quickly by exploiting the power of many computers working in parallel.
The original implementation of MapReduce, along with the name, came out of Google. MapReduce originally referred to the proprietary technology Google used to handle the huge quantities of data generated by crawling the World Wide Web. As the ideas behind the technique became better known, other implementations, such as Hadoop, have been created and are available to the world at large.
MapReduce works by first reading in the input and breaking it into smaller parts, and distributes these to the computers that will actually be performing the data manipulation. These machines perform the desired computation, called the comparison function, and pass the results back to the master computer. This step is known as the Map step.
After the data has been mapped to the compute machines, and the calculations are complete, the Reduce step is carried out. This involves taking the information returned by the compute nodes and converting it into the output, which is essentially the answer to the problem that was to be solved.
Since the problem is broken up into many sub-problems, the problem sizes are very scalable. A small problem might be run on a few computers, while a very large problem could be run on thousands. Due to the way the problems are split up, the computers running may be anywhere from the same room (or cluster), or in datacenters spread across the globe.
An excellent example of how this works would be a search for a specific word in a huge collection of documents. This sort of situation is the problem that created MapReduce in the first place, web search. In this situation, the Map step might be to break the collection of documents being searched into roughly equal sized pieces that are then distributed to each worker computer available. On those workers, the documents would be searched for the term. Once the workers were done, returning a list of documents that contained the term, the Reduce step would then take these lists and sort them to display, perhaps based on relevance to the topic. The end result is a lot of data is condensed down to just the relevant bits.
The nature of the MapReduce approach does limit its application to certain types of problems. For example, the algorithm doesn’t handle communication between compute nodes very well, and so it is not well suited to problems that require knowing what’s happening in other computations.
MapReduce has faced criticism for being relatively unsuited to problems that involve a lot of relational data. For such purposes, relational database management systems are usually better.
In the next article in this series, I’ll look at the best known implementation of MapReduce, Hadoop.