Understanding Map - Reduce
Monday 03 December 2012
Quite a few people seem to be intimidated by the concept of Map-Reduce. As it turns out Map-Reduce is actually quite simple and straightforward when you get to understand the basic principle.
The basic Map-Reduce consists of two steps. I guess you are not going to be very surprised when I tell you that these steps are called Map and Reduce.
The Map process gets the raw data as input and discard anything in the data we are not interested in, basically each input has a corresponding output so we end up with the same number of simplified data elements. Simple right?
And the Reduce process takes the output of the Map process and combines/discards data to produce the result. Again pretty simple right? There is one catch with the Reduce though. It should output exactly the same data structure as it took as its input. The point being is you can run the same Reduce process multiple times if you would like to. I will come back to why you might want to do so.
A simple example.
The following code loads a few orders and does a Map reduce job on them.
The Orders data structure looks like this:
As you can see we have an order with a customer and a number of order lines. Each order line contains the books title, the quantity and the amount it was sold for. During the Map process we just extract the data we are interested in. In this case the books title and the price it was sold for. We store that in the following structure:
The Map Process
So the Map function takes a collection of Orders as input and returns a collection of ReduceTo items.
In fact you could easily rewrite this into a LINQ query like this:
And in fact that is all there is to a Map process, it is just a LINQ select clause. And when you look at it that way it becomes a lot simpler and easier to understand :-)
The Reduce process
With the Map done we can focus on the Reduce part. The important part is that the Reduce takes the result of the Map function both as input and output. As long as we take this into account the function is quite simple to write.
Pretty simple and if we think about it it is really just a grouping. So just as with the Map process we could easily rewrite this as a LINQ group by query like this:
So where the Map process is just a LINQ select the Reduce process is just a LINQ group by with the additional collection that the input type is also the output type. Makes things a lot simpler right :-)
Running the sample code produces output like this.
Advantages of Map-Reduce
We have seen Map-Reduce is just another way of doing a LINQ group by query so what is the big deal? One of the biggest advantage of Map-Reduce is the fact that you can parallelize them. You can do so over multiple threads inside a single process or if you want to scale this out over many machines. The key to this is being able to repeat the Reduce step multiple times.
Suppose we want to analyze a lot of sales data and compute the sales amount per article. We could partition the data into as many partitions as we want. Lets assume that we split the sales orders into three partitions. For each group of orders we could map the sales into a collection of article and sales amount pair where each article could be repeated multiple times. After this is done the Reduce step groups the data so we have a single entry for each article with the total sales amount for that partition. Now when that has been done for each partition we combine the three reduced sets into a single set. This is much like the set after the first Map process and contains duplicate entries for each article so all we need to do is run the same reduce process over this much smaller set of data to get the final result. So in reality it is just divide and conquer where we divide the original data into as many partitions as we want to. In other words, a really easy system for scale out scenarios :-)
This is quite easy to do now with the same Map and Reduce functions we used before. The code below loads 25 different sets of orders and runs the Map and Reduce jobs on each of them. Once that is done it combines the different output sets into one new input and runs another Reduce over that to get the end result.
To combine the different results I created an simple overload of the Reduce function that just combines multiple inputs and calls the original Reduce function like this:
Running this will produce output like below. The white Map/Reduce output is different reduced sets of output from the 25 tasks and the yellow output the the second Reduce done with the output from the first 25 Reduce calls.
Map-Reduce is just another way of doing a LINQ group by but one that is very scalable because of a few simple rules. And once you understand that there is no longer a reason to be intimidated by Map-Reduce and start embracing its power.