MapReduce in MongoDB

In this post, we are going to explain how to write a simple MapReduce job. Before we start coding, we need to understand what a MapReduce job is from a high-level point of view.

A Map operation emits a value for each document under a specified key, so that the values can be grouped by that key. For instance, this example might be used by Amazon to display how well an author has performed in general (we are using the dataset from the Aggregation example):
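A minimal sketch of such a map function follows. The field names (`title`, `author`, `rating`), the sample documents, and the `{ titles: [...] }` value shape are assumptions for illustration, not the original dataset. The `emit` function is a stand-in for the one MongoDB provides, so that the sketch also runs outside the database:

```javascript
// Hypothetical sample documents; the field names are assumptions.
const books = [
  { title: "Book B", author: "Alice", rating: 5 },
  { title: "Book A", author: "Alice", rating: 5 },
  { title: "Book C", author: "Bob", rating: 3 },
];

// Stand-in for MongoDB's emit(): it just records each key/value pair.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// MongoDB calls the map function once per document, with `this` bound to
// the document. The composite key is made of author and rating.
function map() {
  emit({ author: this.author, rating: this.rating }, { titles: [this.title] });
}

// Simulate what the database does: run map against every document.
books.forEach(function (doc) {
  map.call(doc);
});
console.log(JSON.stringify(emitted, null, 2));
```

Inside MongoDB only the `map` function itself is needed; the rest is scaffolding for trying it out locally.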

This Map function runs over the documents in the collection and groups them by author and rating value (we are using a composite key). The array that we pass to the Reduce function has the following data structure:
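To make that structure concrete, here is a small sketch of the grouping step that happens between map and reduce. The emitted pairs are hypothetical (assumed `author`, `rating`, and `titles` fields), and the grouping loop only imitates what MongoDB does internally:

```javascript
// Hypothetical pairs, as a map function keyed on author and rating might emit them.
const emitted = [
  { key: { author: "Alice", rating: 5 }, value: { titles: ["Book B"] } },
  { key: { author: "Alice", rating: 5 }, value: { titles: ["Book A"] } },
  { key: { author: "Bob", rating: 3 }, value: { titles: ["Book C"] } },
];

// Collect all values that share the same composite key into one array.
const grouped = new Map();
for (const pair of emitted) {
  const id = JSON.stringify(pair.key); // serialize the key so it can be compared
  if (!grouped.has(id)) {
    grouped.set(id, { key: pair.key, values: [] });
  }
  grouped.get(id).values.push(pair.value);
}

// Each entry is what one call to reduce receives: a key and an array of values.
for (const group of grouped.values()) {
  console.log(JSON.stringify(group.key), "->", JSON.stringify(group.values));
}
```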

A Reduce operation takes a key and an array of values, and uses those values to return an answer to the specific problem. In our case, let’s define the reduce function as follows:
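A minimal sketch of a reduce function in that spirit, assuming each value has the hypothetical shape `{ titles: [...] }`. It returns the same shape it receives, so it still works if MongoDB feeds the output of one reduce call into another:

```javascript
// Reduce: merge the title lists of all values for a key and sort them.
// The { titles: [...] } shape is an assumption made for this sketch.
function reduce(key, values) {
  var titles = [];
  values.forEach(function (v) {
    titles = titles.concat(v.titles);
  });
  return { titles: titles.sort() };
}

// Local usage with hypothetical data for one composite key.
const result = reduce(
  { author: "Alice", rating: 5 },
  [{ titles: ["Book B"] }, { titles: ["Book A"] }]
);
console.log(JSON.stringify(result)); // prints {"titles":["Book A","Book B"]}
```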

As we can see, we are just sorting the results of the given array.

We execute the MapReduce job as follows:
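A sketch of how the job could be launched from the mongo shell. The collection name `books`, the wrapper `runBooksRatingsJob`, and the map and reduce bodies are assumptions for illustration; `db.collection.mapReduce` and its `out` option are the shell's own API:

```javascript
// Hypothetical map and reduce functions, as they might be defined in a shell session.
function map() {
  emit({ author: this.author, rating: this.rating }, { titles: [this.title] });
}

function reduce(key, values) {
  var titles = [];
  values.forEach(function (v) {
    titles = titles.concat(v.titles);
  });
  return { titles: titles.sort() };
}

// `db` is the mongo shell's database handle; "books" is an assumed
// collection name. The output is written to the "books_ratings" collection.
function runBooksRatingsJob(db) {
  return db.books.mapReduce(map, reduce, { out: "books_ratings" });
}
```

In the shell, calling `runBooksRatingsJob(db)` and then `db.books_ratings.find()` would show one document per composite key.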

This command creates a new collection called “books_ratings” where we save the output of the performed job.

All in all, we have just written our first MapReduce job, and it was not as difficult as we thought. My recommendation when writing MapReduce jobs is to write and test them in JavaScript first; afterwards, you can safely port them to Java or another language.

The implementation in Java is pretty straightforward:

As we can see, we write the JavaScript function as a plain String. The reason is that it is going to be executed by the JavaScript engine (V8).

At first sight, we might think this is an advantage, since we did not have to code the function in Java. Nonetheless, in spite of the awesome performance of the V8 JavaScript engine, executing JavaScript code is not as fast as running plain Java. Furthermore, MongoDB has used V8 as its JavaScript engine since v2.4, but it does not use all the power V8 provides: V8 allows multiple threads to execute, while a MapReduce job runs in a single thread. Therefore, if you try to execute more than one MapReduce job “at the same time”, the second job will be queued until the first one finishes.

For the reasons given above, and following the recommendations given at the MongoDB Conference in Stockholm, you should use MapReduce jobs only in a few selected cases, since most (if not all) of what you can do in a MapReduce job can also be done with the Aggregation Framework (explained in another post on this blog). Therefore, my recommendation is to use MapReduce jobs only where the dataset is not that big (since the performance is not as good as executing Java code) and, of course, to avoid them for real-time operations.

I hope that everything is clear and simple enough so that you can continue reading the official docs (which are really good, BTW).

Have fun coding!!!

This entry was published on May 9, 2013 at 5:57 pm.
