Map-Reduce and Data Parallelism
Some Machine Learning problems are simply too big to run on one machine. Sometimes you have so much data (for instance, 100 million training examples) that you would not want to push it all through a single computer, irrespective of what algorithm you are using. To address this, a different approach to large-scale Machine Learning, known as the “Map-Reduce” approach, was developed by Jeffrey Dean and Sanjay Ghemawat. With Map-Reduce we can scale learning algorithms to problems far larger than is possible with Batch Gradient Descent or Stochastic Gradient Descent on a single machine.
How it works:
Let us suppose that we have 10 million (10M) examples in our training set and we want to fit a linear regression model or a logistic regression model. Our Batch Gradient Descent learning rule gives us:
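$$
\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad (\text{simultaneously for every } j)
$$

where $h_\theta$ is the hypothesis (linear or logistic), $\alpha$ is the learning rate, and $m = 10{,}000{,}000$.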
The computationally expensive step in this rule is the sum from i = 1 through 10M, i.e. over all of the training examples.
This is where Map-Reduce comes into play. Let us denote the training set by its (x, y) pairs, with m = 10M.
In Map-Reduce we split the training set into a convenient number of subsets. Assume that we have 10 computers in the lab to run in parallel on the training set, so we split the data into 10 subsets. The datasets now look like this:
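$$
\begin{aligned}
\text{Machine 1: } &\ (x^{(1)}, y^{(1)}),\ \ldots,\ (x^{(1{,}000{,}000)}, y^{(1{,}000{,}000)})\\
\text{Machine 2: } &\ (x^{(1{,}000{,}001)}, y^{(1{,}000{,}001)}),\ \ldots,\ (x^{(2{,}000{,}000)}, y^{(2{,}000{,}000)})\\
&\ \ \vdots\\
\text{Machine 10: } &\ (x^{(9{,}000{,}001)}, y^{(9{,}000{,}001)}),\ \ldots,\ (x^{(10{,}000{,}000)}, y^{(10{,}000{,}000)})
\end{aligned}
$$

(one contiguous block of 1M examples per machine; any roughly even split works just as well).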
Each subset has 1M examples and goes to one of the 10 machines. Each of these 10 machines now runs the summation in the Batch Gradient Descent learning rule over its respective 1M examples.
The first machine, holding the first 1M examples, computes a variable equal to the gradient summation over those first 1M examples.
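Calling this variable, say, $temp_j^{(1)}$ (the superscript just labels the machine):

$$
temp_j^{(1)} = \sum_{i=1}^{1{,}000{,}000}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$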
Similarly, we take the second subset of the data and send it to the second machine, which uses training examples (1M + 1) through 2M and computes the analogous variable.
The remaining 8 machines use the remaining 8 subsets of the training set. Hence each machine now sums over 1M examples instead of 10M, does only 1/10th of the work, and so would presumably be almost 10 times faster.
After all the machines have computed their respective partial gradient sums, the results are sent to a centralized master server.
The combined equation is:
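$$
\theta_j := \theta_j - \alpha\,\frac{1}{10{,}000{,}000}\left(temp_j^{(1)} + temp_j^{(2)} + \cdots + temp_j^{(10)}\right) \qquad (\text{simultaneously for every } j)
$$

(writing the ten per-machine partial sums in the $temp_j^{(k)}$ notation used above).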
This is exactly the same computation as Batch Gradient Descent: the centralized master server takes each of the results from the 10 machines, adds them up, and performs the parameter update.
To generalize, suppose we want to split the work among “b” machines. Then the general picture is as follows:
We split the training set as evenly as possible into “b” subsets and send them to “b” different computers. Each of the “b” computers computes a summation over only (1/b)th of the dataset. The results from each computer are then sent to a centralized master server, where they are combined.
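To make the picture concrete, here is a minimal single-process Python/NumPy sketch of this split / compute / combine structure for linear regression. The array sizes, function names, and toy data are illustrative choices; in a real cluster each call to partial_gradient would run on one of the “b” machines rather than in a loop.

```python
import numpy as np

def partial_gradient(theta, X_k, y_k):
    """Map step: un-averaged gradient sum over one subset of the data."""
    return X_k.T @ (X_k @ theta - y_k)

def mapreduce_gradient_step(theta, X, y, b, alpha):
    """One Batch Gradient Descent update computed as b partial sums."""
    m = X.shape[0]
    X_chunks = np.array_split(X, b)      # split the training set into b subsets
    y_chunks = np.array_split(y, b)
    # In a real cluster, each call below would run on a different machine.
    partials = [partial_gradient(theta, Xk, yk) for Xk, yk in zip(X_chunks, y_chunks)]
    return theta - alpha * sum(partials) / m   # reduce + update on the master server

# Toy data standing in for the large training set.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(10_000)

theta = np.zeros(3)
for _ in range(200):
    theta = mapreduce_gradient_step(theta, X, y, b=10, alpha=0.5)
print(theta)   # approaches [1.0, -2.0, 0.5]
```

Note that the only thing the master needs back from each machine is the small vector returned by partial_gradient, not the data subset itself.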
So, with Batch Gradient Descent the bulk of the work was computing the sum from i = 1 through m. Now, since each of the “b” computers does just (1/b)th of that work in parallel, we can potentially get a speed-up of “b” times; in practice, because of network latency and the cost of sending data back and forth, the speed-up is somewhat less than “b” times.
Now, we must check whether our learning algorithm can be expressed as a summation over the training set. It turns out that many learning algorithms, including linear regression and logistic regression, can be expressed as computing sums of functions over the training set. Whenever the bulk of the work is such a sum, Map-Reduce works well for scaling the algorithm to very large datasets: we take the learning algorithm and express it in terms of computing sums of functions over the training set. The Map-Reduce technique can also be used to parallelize other learning algorithms, such as the advanced optimization algorithms like conjugate gradient or L-BFGS.
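In symbols: whenever the quantity the optimizer needs (the cost and/or its gradient) is a sum of per-example terms, it splits into “b” independent partial sums, and each machine computes one of them:

$$
\nabla_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\nabla_\theta\,\mathrm{cost}\!\left(\theta,\,(x^{(i)}, y^{(i)})\right) = \frac{1}{m}\sum_{k=1}^{b}\ \sum_{i\,\in\,\text{subset }k}\nabla_\theta\,\mathrm{cost}\!\left(\theta,\,(x^{(i)}, y^{(i)})\right)
$$

The master then only has to add “b” vectors together, which is cheap compared with the inner sums.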
We have talked about network latencies; how can they be minimized? Map-Reduce can also be applied on a single computer with multiple CPUs, or a CPU with multiple computing cores.
Say we have a computer with 4 computing cores. We can split the training set, send the subsets to different cores within the single computer, and divide the workload.
Each core carries out the sum over one quarter of the training set, and the individual results are combined to get the summation over the entire training set. Because we are parallelizing within a single machine rather than across multiple machines, network latency becomes much less of an issue.
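As a sketch of the multi-core case, the example below (using Python’s standard multiprocessing module; the helper name and toy data are again illustrative) computes the four partial sums on four worker processes, one per core, and combines them locally, so nothing has to cross a network.

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: un-averaged gradient sum for one quarter of the training set."""
    theta, X_k, y_k = args
    return X_k.T @ (X_k @ theta - y_k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8_000, 3))          # toy stand-in for a large training set
    y = X @ np.array([2.0, -1.0, 0.5])

    theta = np.zeros(3)
    alpha, m, cores = 0.5, X.shape[0], 4

    # Split the data into 4 subsets, one per computing core.
    chunks = list(zip([theta] * cores,
                      np.array_split(X, cores),
                      np.array_split(y, cores)))
    with Pool(processes=cores) as pool:
        partials = pool.map(partial_gradient, chunks)   # runs on 4 worker processes

    theta = theta - alpha * sum(partials) / m           # combine locally, no network involved
    print(theta)
```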
Conclusion:
So the Map-Reduce approach to parallelizing, by splitting the data across multiple machines, speeds up the learning algorithm considerably and is very useful for handling very large datasets. Today there are many open-source implementations of Map-Reduce; a widely used one is the open-source system Hadoop, with which we can parallelize learning algorithms and get them to run on far larger datasets than is possible using a single machine.