MapReduce in MongoDB

In this post, we will look at performing MapReduce operations on JSON documents stored in MongoDB. We will generate dummy data using dummy-json, a Node package, and use mongojs, another Node package, to run MapReduce jobs on that data from our Node application.

For a quick sneak peek, take a look at this runnable (click on the run button).

You can find the complete code here.

What is MongoDB?

MongoDB is a NoSQL database. Unlike MySQL, MSSQL or Oracle, a MongoDB database has collections instead of tables, and each collection holds documents instead of rows. Best of all, the documents are JSON-like (MongoDB stores them internally as BSON, a binary form of JSON). You can learn more about MongoDB here.

You can install MongoDB locally from here.

If you have never worked with MongoDB before, the following commands will help you navigate around and perform basic operations:

Command | Result
mongod | Starts the MongoDB service
mongo | Opens the MongoDB shell (run it in a new terminal while mongod is running)
show dbs | Lists the databases
use <<database name>> | Switches to that database
show collections | Lists the collections in the current database
db.collectionName.find() | Shows all the documents in that collection
db.collectionName.findOne() | Shows the first document
db.collectionName.find().pretty() | Pretty-prints the JSON data in the console
db.collectionName.insert({key : value}) | Inserts a new document
db.collectionName.update({condition : value}, {$set : {key:value}}, {upsert : true}) | Updates a document matching the condition and sets the given value. If upsert is true, a new document is created when no document matches the condition
db.collectionName.remove({}) | Removes all the documents in that collection
db.collectionName.remove({key : value}) | Removes the documents matching the condition

You can learn more about MongoDB here.

What is MapReduce?

It is essential to understand how a MapReduce job works. Without that clarity, you may not get the output you expect when running these jobs.

From mongodb.org:

Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. For map-reduce operations, MongoDB provides the mapReduce database command.

In very simple terms, the mapReduce command takes 2 primary inputs, the mapper function and the reducer function.

A mapper starts off by reading a collection of data and building a map with only the fields we wish to process, grouping the values into one array per key. Each key and its array of values is then fed into a reducer, which processes the values.

Example: let's say that we have the following data

Name | Price
foo | 9
foo | 12
bar | 8
baz | 3
baz | 5

We want to sum the price of all the items with the same name. We will run this data through a mapper and then a reducer to achieve the result.

When we ask a Mapper to process the above data without any conditions, it will generate the following result

Key | Value
foo | [9,12]
bar | [8]
baz | [3,5]

That is, it has grouped together all the data that share the same key, in our case the name. These results are then sent to the reducer.

In the reducer, we get the first row from the above table. We iterate through all its values and add them up, which gives the sum for the first row. Next, the reducer receives the second row and does the same thing, and so on until all the rows are processed.

The final output would be

Name | Total
foo | 21
bar | 8
baz | 8

So now you can understand why a Mapper is called a Mapper (because, it will create a map of data) & why a Reducer is called a Reducer (because it will reduce the data that the mapper has generated to a more simplified form).
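
To make the two phases concrete, here is a minimal in-memory sketch in plain JavaScript. It does not touch MongoDB at all; the `items` array and the `mapPhase` and `reducePhase` functions are hypothetical names used only for illustration, with the prices taken from the tables above.

```javascript
// Hypothetical sample data matching the mapper output shown above.
const items = [
  { name: 'foo', price: 9 },
  { name: 'foo', price: 12 },
  { name: 'bar', price: 8 },
  { name: 'baz', price: 3 },
  { name: 'baz', price: 5 }
];

// Map phase: keep only the fields we care about and group the values by key.
function mapPhase(docs) {
  const grouped = {};
  docs.forEach(function (doc) {
    if (!grouped[doc.name]) grouped[doc.name] = [];
    grouped[doc.name].push(doc.price);
  });
  return grouped;
}

// Reduce phase: collapse each key's array of values into a single total.
function reducePhase(grouped) {
  const totals = {};
  Object.keys(grouped).forEach(function (key) {
    totals[key] = grouped[key].reduce(function (sum, v) { return sum + v; }, 0);
  });
  return totals;
}

const groupedItems = mapPhase(items);
const itemTotals = reducePhase(groupedItems);

console.log(groupedItems); // → { foo: [ 9, 12 ], bar: [ 8 ], baz: [ 3, 5 ] }
console.log(itemTotals);   // → { foo: 21, bar: 8, baz: 8 }
```

In MongoDB the grouping step is done by the server between the map and reduce calls, so you only write the per-document and per-key logic yourself.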

If you run a couple of examples, you will get an idea of how this works. You can read more about MongoDB MapReduce here.

Set Up a Project

As you have seen earlier, we can run queries directly in the mongo shell and see the output. But for these examples, to keep things more tutorial-ish, we will build a Node project and run the commands from there.

Mongojs

We will be using mongojs (a Node package for interacting with MongoDB) to write our MapReduce commands and execute them. You can run the same code in the mongo shell directly and see the same results. You can read more about mongojs here.

Dummy-json

We will use dummy-json (a Node utility that allows you to generate random JSON data using Handlebars templates) to set up a few thousand sample JSON documents. You can find more information on dummy-json here. Then we will run MapReduce commands on top of it to generate some meaningful results.

So let's get started.

First, you need Node.js installed. You can find details here. Next, create a new folder named mongoDBMapReduce and open a terminal/prompt there. Now we will create a package.json to store our project details. Run

npm init

and fill in the prompts as shown (or with whatever you like)

[Screenshot: npm init prompts filled in]

Next, we will add the project dependencies. Run

npm i mongojs --save-dev

npm i dummy-json --save-dev

This will take care of installing dependencies and adding them to our package.json.

Generate Dummy Data

Next, we are going to generate dummy data using the dummy-json module. Create a new file named dataGen.js at the root of the project. We will keep the data generation logic in a separate file. In future, if you need to add more data, you can run this file.

Copy the below contents to dataGen.js

Lines 1 to 4, we include all the required packages.

Line 2, we connect to a database named mapReduceDB and, inside it, a collection named sourceData. Both are created if they do not already exist.

Lines 6 to 23 are a Handlebars helper. You can learn more about helpers on the dummy-json page.

Lines 27 and 28, we read a schema.hbs file (which we will create in a moment) and then parse it to generate the sample JSON.

Line 32, we clean up the old data before dumping new data. Comment this out in case you want to append to the existing collection.

Line 36, we insert the generated data into the DB.
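
As a rough illustration of what a custom helper like the one described above might look like, here is a minimal sketch of a `hobbies` helper. The hobby combinations are an assumption inferred from the hobby names that appear in Example 3, and `HOBBY_SETS` is a hypothetical name; this is not the original helper.

```javascript
// Assumed hobby combinations. The names match those seen in Example 3,
// but the exact lists used by the original helper are not known.
const HOBBY_SETS = [
  ['Acrobatics', 'Meditation', 'Music'],
  ['Acrobatics', 'Photography', 'Papier-Mache'],
  ['Papier-Mache']
];

// Hypothetical helper: returns one randomly chosen hobby list
// as a comma-separated string.
function hobbies() {
  const pick = HOBBY_SETS[Math.floor(Math.random() * HOBBY_SETS.length)];
  return pick.join(',');
}

// Custom helpers are usually collected in one object, which is then handed
// to dummy-json together with the template (check the dummy-json docs for
// the exact call in your version).
const helpers = { hobbies: hobbies };

console.log(helpers.hobbies()); // one of the three comma-separated lists
```

A template field such as "hobbies" : "{{hobbies}}" would then receive one of these strings per generated document.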

Next, create a new file named schema.hbs at the root of the project. This will consist of the schema that will constitute one JSON document. Paste the below contents into it

Do notice that on line 2, we are going to generate 9999 documents. That is it, we are all set to generate some data.

Open a new terminal/prompt and run

mongod

This will start the MongoDB service. Now back to our other terminal/prompt, run

node dataGen.js

The result should be

[Screenshot: output of node dataGen.js]

Kill the node program by pressing ctrl + c.

To verify, you can open a new terminal/prompt, run the mongo command and check the data. You should see

[Screenshot: generated documents listed in the mongo shell]

Make sense of the data

Okay, we have dumped in 9999 documents of user data. Let's try and make sense of it.

Example 1 : Get the count of Males and Females

First, create a new file named example1.js at the root of the project. We will write a MapReduce job, to fetch the count of Males and Females.

Mapper Logic

The only thing we expect from the mapper is that it extracts the gender as the key and emits 1 as the value. The value is 1 because every user counts once, as either a male or a female. So the output of the mapper will be

Key | Value
Male | [1,1,1,1,1,1,1,1,...]
Female | [1,1,1,1,1,1,1,1,1,...]

Reducer Logic

In the reducer, we get the above 2 rows. All we need to do is sum up all the values in a row, which gives the count for that gender. The final output of the reducer would be

Key | Value
Male | 5031
Female | 4968

Code

Now, let's write some code to achieve this. In example1.js, first we will require all our dependencies.

Do notice that on line 2, the first argument is the name of the database and the second argument is an array of the collections that we are going to query. example1_results is the collection that our job will generate to hold the result.

Next, let's add the mapper and reducer functions

On line 2, this refers to the current document, and this.gender will be either male or female, which becomes our key. emit() pushes the data to a temporary hash table that stores the mapper results.

On line 5, we simply add up all the values for a gender.

Finally, add the logic to execute the mapReduce

On line 5, we set the output collection name and on line 9, we will fetch the results from example1_results collection and display it.
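
For reference, here is a minimal sketch of a mapper and reducer for this example. It is an illustration under assumptions, not the original listing, so its line numbers will not match the description above. To keep it runnable without a database, `runLocally` is a hypothetical in-memory stand-in for the server, and the `sample` documents are made up.

```javascript
// Mapper: MongoDB calls this once per document, with `this` bound to the document.
const mapper = function () {
  emit(this.gender, 1);
};

// Reducer: receives a key and the array of values emitted for that key.
const reducer = function (gender, counts) {
  var total = 0;
  counts.forEach(function (c) { total += c; });
  return total;
};

// Hypothetical in-memory stand-in for the server, for illustration only.
function runLocally(docs, map, reduce) {
  const buckets = {};
  // Provide the global emit() that the mapper expects.
  globalThis.emit = function (key, value) {
    (buckets[key] = buckets[key] || []).push(value);
  };
  docs.forEach(function (doc) { map.call(doc); });
  delete globalThis.emit;
  const out = {};
  Object.keys(buckets).forEach(function (key) {
    out[key] = reduce(key, buckets[key]);
  });
  return out;
}

// Made-up sample documents.
const sample = [
  { name: 'A', gender: 'Male' },
  { name: 'B', gender: 'Female' },
  { name: 'C', gender: 'Male' },
  { name: 'D', gender: 'Female' },
  { name: 'E', gender: 'Male' }
];

const genderCounts = runLocally(sample, mapper, reducer);
console.log(genderCounts); // → { Male: 3, Female: 2 }
```

Against a real database, the same two functions would be handed to the collection's mapReduce call together with an out option naming example1_results, as described above.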

Back in the terminal/prompt, run

node example1.js

and the result will be

[Screenshot: output of node example1.js]

My counts may not match yours, but the sum of males and females should be 9999 (Duh!).

Mongo Shell code

If you want to run the above in the mongo shell, you can do so by pasting the following into the shell prompt

And you should see

[Screenshot: result in the mongo shell]

Simple, right?

Example 2 : Get the Eldest and Youngest Person in each gender

For this example, create a new file named example2.js at the root of the project. Here, we will group all the users by gender and pull out the eldest and youngest in each gender. This is a bit more complex than the earlier example.

Mapper Logic

In the mapper, we will return the gender as the key and an object as the value. The object will hold the user's age and name. The age is used for the calculation, whereas the name is only for display purposes.

Key | Value
Male | [{age : 9, name : John}, {}, {}, {}, ...]
Female | [{age : 19, name : Rita}, {}, {}, {}, ...]

Reducer Logic

Our reducer will be a bit more complex than in the last example. Here we compare the ages of all the users of a gender and keep track of the eldest and the youngest. The final result should look something like

Key | Value
Male | {'min':{'name':'Haydee Milligan','age':1},'max':{'name':'Darrell Sprowl','age':99}}
Female | {'min':{'name':'Cory Hollis','age':1},'max':{'name':'Shea Mercer','age':99}}

Code

Now, open example2.js and paste the below code.

On line 6, we build an object and send it as part of the value. On lines 13 to 18, we iterate through all the value objects: if the current object's age is greater than the maximum seen so far, we update res.max, and similarly, if it is less than the minimum, we update res.min. Finally, on line 27, we push the result set into a new collection named example2_results.
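
Here is a minimal sketch of the min/max idea, again not the original listing. It deviates from the description in one deliberate way: the mapper emits a value already shaped like the reducer's result ({min, max}), because MongoDB may call the reducer more than once for the same key and feed earlier results back in. The `runLocally` harness and the `people` documents (names and ages) are hypothetical.

```javascript
// Mapper: emit a value in the same shape the reducer returns.
const mapper = function () {
  var person = { name: this.name, age: this.age };
  emit(this.gender, { min: person, max: person });
};

// Reducer: keep the youngest and eldest seen across all values for a key.
const reducer = function (gender, values) {
  var res = { min: values[0].min, max: values[0].max };
  for (var i = 1; i < values.length; i++) {
    if (values[i].min.age < res.min.age) res.min = values[i].min;
    if (values[i].max.age > res.max.age) res.max = values[i].max;
  }
  return res;
};

// Hypothetical in-memory stand-in for the server, for illustration only.
function runLocally(docs, map, reduce) {
  const buckets = {};
  globalThis.emit = function (key, value) {
    (buckets[key] = buckets[key] || []).push(value);
  };
  docs.forEach(function (doc) { map.call(doc); });
  delete globalThis.emit;
  const out = {};
  Object.keys(buckets).forEach(function (key) {
    out[key] = reduce(key, buckets[key]);
  });
  return out;
}

// Made-up sample documents.
const people = [
  { name: 'John', age: 9, gender: 'Male' },
  { name: 'Sam', age: 40, gender: 'Male' },
  { name: 'Darrell', age: 99, gender: 'Male' },
  { name: 'Rita', age: 19, gender: 'Female' },
  { name: 'Shea', age: 70, gender: 'Female' }
];

const extremes = runLocally(people, mapper, reducer);
console.log(JSON.stringify(extremes.Male));
// → {"min":{"name":"John","age":9},"max":{"name":"Darrell","age":99}}
```

Because the emitted value and the reduced value share one shape, reducing a partial result together with further values gives the same answer as reducing everything at once.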

To run this example, go back to the terminal/prompt and run

node example2.js

And you should see something like

[Screenshot: output of node example2.js]

Since our dataset is large, there is a high chance that every age from 1 to 99 is used, so the youngest and eldest come out as 1 and 99. You can change the age range on line 9 of schema.hbs and then re-run dataGen.js. Now, you can run the above example again and check the values.

Example 3 : Count the number of users in each hobby

In our final example, we will see how many users share each hobby. For that, let's first create a new file named example3.js at the root of the project. Data for one user would be

[Screenshot: a sample user document]

As you can see, every user has a list of hobbies separated by commas. We will find out how many users have Acrobatics as a hobby, and so on.

Mapper Logic

Our mapper is a bit more complex for this scenario. We will emit a new key value pair for each hobby of a user. This way, we emit a count of 1 for each hobby per user. By the end of the mapper, we will end up with something like

Key | Value
Acrobatics | [1,1,1,1,1,1,...]
Meditation | [1,1,1,1,1,1,...]
Music | [1,1,1,1,1,1,...]
Photography | [1,1,1,1,1,1,...]
Papier-Mache | [1,1,1,1,1,1,...]

Reducer Logic

Here, we simply add up the values for each hobby. And finally we will have

Key | Value
Acrobatics | 6641
Meditation | 3338
Music | 3338
Photography | 3303
Papier-Mache | 6661

Code

Do notice that on lines 7 to 9, we iterate through each hobby and emit one count for it. Lines 13 to 18 can be replaced with a simple Array.sum(values), but this is another way of doing the same. Finally, we run the job and the result would be

[Screenshot: output of node example3.js]

Note that the output collection gets overwritten when you run a new job that writes to an existing collection.
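
The per-hobby counting can be sketched as follows. This is an illustration rather than the original listing; it assumes that hobbies is stored as a comma-separated string (optionally wrapped in square brackets), and `runLocally` and the `users` documents are hypothetical.

```javascript
// Mapper: emit one count per hobby of the current user.
const mapper = function () {
  // Assumes hobbies looks like "Acrobatics,Meditation,Music".
  var list = this.hobbies.replace(/[\[\]]/g, '').split(',');
  for (var i = 0; i < list.length; i++) {
    emit(list[i].trim(), 1);
  }
};

// Reducer: add up the counts emitted for one hobby.
const reducer = function (hobby, counts) {
  var total = 0;
  for (var i = 0; i < counts.length; i++) {
    total += counts[i];
  }
  return total;
};

// Hypothetical in-memory stand-in for the server, for illustration only.
function runLocally(docs, map, reduce) {
  const buckets = {};
  globalThis.emit = function (key, value) {
    (buckets[key] = buckets[key] || []).push(value);
  };
  docs.forEach(function (doc) { map.call(doc); });
  delete globalThis.emit;
  const out = {};
  Object.keys(buckets).forEach(function (key) {
    out[key] = reduce(key, buckets[key]);
  });
  return out;
}

// Made-up sample documents.
const users = [
  { name: 'A', hobbies: 'Acrobatics,Meditation,Music' },
  { name: 'B', hobbies: 'Acrobatics,Photography,Papier-Mache' },
  { name: 'C', hobbies: 'Papier-Mache' }
];

const hobbyCounts = runLocally(users, mapper, reducer);
console.log(hobbyCounts.Acrobatics);      // → 2
console.log(hobbyCounts['Papier-Mache']); // → 2
```

A user with three hobbies contributes three emits, which is why the per-hobby totals add up to more than the number of users.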

So, this is how we can run MapReduce jobs in MongoDB. But do remember that sometimes a simple query can get the job done.


Thanks for reading! Do comment.
@arvindr21

  • KristinaGronis .

    Seems like the tutorial is broken?
    After running node dataGen.js I get this error in my terminal. Double checked that everything in both Schema.hbs and dataGen.js is exactly as written… What is wrong?

    Begin Parsing >>

    /Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/helpers/helper-missing.js:19
    throw new _exception2['default']('Missing helper: "' + arguments[arguments.length - 1].name + '"');
    ^

    Error: Missing helper: "number"
    at Object.<anonymous> (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/helpers/helper-missing.js:19:13)
    at eval (eval at createFunctionContext (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/compiler/javascript-compiler.js:254:23), <anonymous>:10:70)
    at Object.prog [as fn] (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/runtime.js:219:12)
    at Object.helpers.repeat (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/dummy-json/lib/helpers.js:142:27)
    at Object.eval (eval at createFunctionContext (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/compiler/javascript-compiler.js:254:23), <anonymous>:6:89)
    at main (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/runtime.js:173:32)
    at ret (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/runtime.js:176:12)
    at ret (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/handlebars/dist/cjs/handlebars/compiler/compiler.js:525:21)
    at Object.dummyjson.parse (/Users/krgpth/www/mean-apps/mongoDBMapReduce/node_modules/dummy-json/index.js:26:38)
    at Object.<anonymous> (/Users/krgpth/www/mean-apps/mongoDBMapReduce/dataGen.js:21:24)

    • Knowapp

      As a temporary solution, I removed all of the handlebars template items with “number” as a requirement. I assume I will figure out these features as I go through the tutorial and be able to add them back in and solve the problem shortly.

      Example schema.hbs:

      [
      {{#repeat 9999}}
      {
      "id": {{index}},
      "name": "{{firstName}} {{lastName}}",
      "email": "{{email}}",
      "work": "{{company}}",
      "dob" : "{{dob}}",
      "age": {{number 1 99}}, <-- DELETE THIS LINE
      "gender" : "{{gender}}",
      "salary" : {{number 999 99999}}, <-- DELETE THIS LINE
      "hobbies" : "{{hobbies}}"
      }
      {{/repeat}}
      ]

      • Bharani Amarnath

        Yes, the code used in this tutorial is a bit incomplete, but with a few modifications it works well on version 3.0.

        You might try this for data generation -->

        var mongojs = require('mongojs');
        var db = mongojs('mapreducedb',['sourceData']);
        var fs = require('fs');
        var dummyjson = require('dummy-json');

        var helpers = {
          name: function(){
            var firstnames = ["Alan","Bradley","Chris","Donald","Frank"];
            var lastnames = ["Goodman","Stewart","Harper","Jackson","Smith"];
            var rfn = firstnames[Math.floor(Math.random()*firstnames.length)];
            var rln = lastnames[Math.floor(Math.random()*lastnames.length)];
            var rn = rfn + " " + rln;
            return rn;
          },
          age: function(){
            var max = 99;
            var min = 18;
            return Math.floor(Math.random()*(max-min+1)+min);
          },
          gender: function(){
            return "" + Math.random() > 0.5 ? 'male' : 'female';
          },
          dob: function(){
            var start = new Date(1950,0,1),
            end = new Date();
            return new Date(start.getTime() + Math.random() * (end.getTime() - start.getTime()));
          },
          hobbies: function(){
            var hobbiesList = [];
            hobbiesList[0] = [];
            hobbiesList[0][0] = ["Reading","Music","Movies"];
            hobbiesList[0][1] = ["Reading","Gardening","Writing"];
            hobbiesList[0][2] = ["Reading"];
            return hobbiesList[0][Math.floor(Math.random()*hobbiesList[0].length)];
          }
        };

        console.log("Begin parsing");

        var template = fs.readFileSync('schema.hbs',{encoding:'utf-8'});
        var result = dummyjson.parse(template,{helpers: helpers});

        console.log("Begin database insert");

        db.sourceData.remove(function(argument){
          console.log("DB cleanup complete");
        });
        db.sourceData.insert(JSON.parse(result),function(err,docs){
          console.log("DB insert complete");
        });

        And modify the template hbs to something like this -->

        [
        {{#repeat 9999}}
        {
        "name":"{{name}}",
        "age":{{age}},
        "gender": "{{gender}}",
        "dob": "{{dob}}",
        "hobbies": "{{hobbies}}"
        }
        {{/repeat}}
        ]

  • Peter Boot

    What is the correct format for template.hbs ?

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      Correct format as in?

  • Logic Town

    Do you know if it’s possible to invoke a function provided by an npm module inside the mapper or reducer functions?

    For instance

    var mymod = require('mymod');

    var mapper = function(){
      emit(this.key, mymod.apply(this));
    }

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      I have not tried but it should be possible. Did you try it?

  • Logic Town

    I reckon the functions such as

    db.example3_results.find(function (err, docs) {
    if(err) console.log(err);
    console.log(docs);
    });

    should be passed as a callback to the mapReduce command, otherwise they will be invoked before the job completes.

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      I am not sure if the mongojs API has a callback for mapReduce(); if it does, then that is the place to put it. But I think mapReduce() runs in a blocking way here. Not sure though.

  • Harshavardhan Reddy

    Hey Aravind,
    I want to do with multiple collections.
    Can you please help with that?
    Thank you.

  • acveer

    Aravind,
    Excellent tutorial. Well-written. It helped me get started with mapReduce. I have tried your examples with the MongoDB aggregation framework and thought these might help someone. My intention here is to share what I have learned and to seek expertise on the aggregation framework.
    The queries below can be run in the mongo shell to get the same results with the aggregation framework.
    Example1 :
    db.sourceData.aggregate( {$group: {_id: "$gender", total: {$sum: 1}}});

    Example2 was a little complicated for me, still learning/trying. If you know how to achieve this, please help.

    Example3:

    The dummy-json template + custom helper logic above is returning hobbies like below:
    "hobbies" : "[Acrobatics,Meditation,Music]"
    which was not a valid array for using $unwind in the aggregation framework. So I changed the template (schema.hbs) for hobbies from
    "hobbies" : "{{hobbies}}"
    to
    "hobbies" : ["{{hobbies}}"]
    which resulted in a valid array, but with all the hobbies comma separated (not ideal, but will work for the example)
    "hobbies" : [ "Painting,Cooking,Reading"]

    so the below aggregate query:
    db.sourceData5.aggregate([
    {$unwind: "$hobbies"},
    {$group: {_id:"$hobbies", total: {$sum: 1}}}
    ])
    resulted in:
    { "_id" : "Papier-Mache", "total" : 3313 }
    { "_id" : "Acrobatics,Meditation,Music", "total" : 3370 }
    { "_id" : "Acrobatics,Photography,Papier-Mache", "total" : 3316 }

    Do you know how to generate a valid sub-array with dummy-JSON like below:
    "hobbies" : [ "Painting","Cooking","Reading"]

    It took me sometime, but realized slowly how powerful and handy mongodb aggregation framework is.
    Thank you again for such an awesome effort.

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      Hello acveer. Thanks! And Thanks for sharing your solutions.

      I guess you can tweak the dataGen.js to give you the array you want from inside the hobby(). Let me know if it works.

      A few days ago I was planning to do a comparison between AF and MR, never got to it though. Did you try anything in that space?

      Thanks.

      • acveer

        Arvind,

        I have not done enough MR or AF for a detailed comparison, but here is what I understood so far.

        1. AF is much faster than MR.
        2. Both are targeted to be used in batch operations in background mode (not real-time). (so convenience/control ).
        3. Both have respective limitations in terms of document size, max documents in process (100MB) and sharding support.
        4. MongoDB 2.6 has more handy features added for AF.

        I could not figure out how to insert a hobbies array by changing the template or the helpers function. The array needs to look like "hobbies" : [ "Painting","Cooking","Reading"]. So I just inserted an empty hobbies array from datagen.js and wrote another script to update the hobbies array separately.

        Will post solution for example 2, once I make progress.

        • http://thejackalofjavascript.com/ Arvind Ravulavaru

          Thanks for sharing your findings. I will also take a look at hobbies array and get back.

  • rajkumar

    Hi, I have JSON like this.

    "marks":{
    "sem1" :{
    "mark1":10,
    "total":100
    },
    "sem2":{
    "mark2":20,
    "total":200
    },
    "sem3":{
    "mark2":30,
    "total":300
    }
    }

    I need a result like

    mark total sem

    10 100 sem1
    20 200 sem2
    30 300 sem3

    How can I achieve the above format using a MongoDB query? A query that works with Jaspersoft would be very useful.

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      I have not worked on Jaspersoft, so I am not sure how the query needs to be written. If you want to return an object and compare a value, refer to example 2 in the post. Thanks.

  • syd

    and thanks for your quick response.

  • syd

    can you explain a little briefly please, as i am doing a project to calculate the time taken by mongoDB and hadoop in mapReduce algorithm when they store or retrieve different types of bulk data.

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      I have not dug deeper into the algorithms, so I am not aware of it. You can look for more info here : https://github.com/mongodb

      • syd

        Thank you

  • syd

    what are the different techniques to find the time taken by my query to fetch output in mapReduce

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      AFAIK, there is no official benchmarking tool for MongoDB. You can track the time taken from within the node application. Try this : http://blog.nodejs.org/2012/04/25/profiling-node-js

      Or you can try

      This should give a fair idea.

  • sppericat

    This is nice, but why don't you use the aggregation framework instead?

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      Thanks sppericat,

      No argument there. You can use an aggregation framework to do the same. I wanted to throw some light on aggregating data using MapReduce.

      IMO, MapReduce is a verbose version of the aggregation framework & is also an alternative. The only key difference I see between MapReduce & the Aggregation Framework is the built-in pipeline operators like $geoNear, $sum, $gte etc. In a MapReduce paradigm, you end up writing these on your own.

      Thanks,
      Arvind.