MapReduce in MongoDB

In this post, we will take a look at performing MapReduce operations on JSON documents present in MongoDB. We will generate dummy data using dummy-json, a Node package, and use mongojs, another Node package, to run MapReduce jobs on that data from our Node application.

For a quick sneak peek, take a look at this runnable (click on the run button).


You can find the complete code here.

What is MongoDB?

MongoDB is a NoSQL database. Unlike MySQL, MSSQL, or Oracle databases, it has collections instead of tables, and documents inside collections instead of rows in a table. And best of all, all the documents are stored as JSON. You can know more about MongoDB here.

You can install mongoDB locally from here.

If you have never worked with MongoDB before, the following commands will help you navigate around and perform basic operations; a short example shell session follows the table.

Command Result
mongod  will start the MongoDB service
mongo  will step you inside the MongoDB shell (when run in a new terminal while Mongod is running)
show dbs  will show the list of databases
use <<database name>>  will step you inside the database
show collections  will show the list of collections once you are inside the database
db.collectionName.find()  will show all the documents in that collection
db.collectionName.findOne()  will show the first document
db.collectionName.find().pretty()  will pretty print the JSON data in console
db.collectionName.insert({key : value})  will insert a new record
db.collectionName.update({condition : value}, {$set : {key:value}}, {upsert : true})  will update a record with the given condition & sets the required value. If upsert is true a new document will be created if no documents with matching condition are found
db.collectionName.remove({})  will remove all the documents in that collection
db.collectionName.remove({key : value})  will remove the documents matching the condition
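
For instance, a quick session in the mongo shell could look like this (the test database and the users collection here are just placeholders for illustration):

use test
db.users.insert({ name : "John", age : 30 })
db.users.find().pretty()
db.users.update({ name : "John" }, { $set : { age : 31 } }, { upsert : true })
db.users.remove({ name : "John" })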

You can learn more about MongoDB here.

What is MapReduce?

It is essential that you understand how a MapReduce job works. Without this clarity, you may not achieve the output you expect when running these jobs.

From Mongodb.org

Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. For map-reduce operations, MongoDB provides the mapReduce database command.

In very simple terms, the mapReduce command takes two primary inputs: the mapper function and the reducer function.

A Mapper starts off by reading a collection of data and building a map with only the fields we wish to process, grouping the values into one array per key. These key-value pairs are then fed into a Reducer, which processes the values.

Ex: Let’s say that we have the following data
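
The original listing is not reproduced here, but judging from the mapper output below, the documents would look roughly like this:

{ "name" : "foo", "price" : 9 }
{ "name" : "foo", "price" : 12 }
{ "name" : "bar", "price" : 8 }
{ "name" : "baz", "price" : 3 }
{ "name" : "baz", "price" : 5 }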

And we want to total the prices of all the items with the same name. We will run this data through a Mapper and then a Reducer to achieve the result.

When we ask a Mapper to process the above data without any conditions, it will generate the following result

Key Value
foo [9,12]
bar [8]
baz [3,5]

That is, it has grouped together all the data that share the same key, in our case the name. These results are then sent to the Reducer.

Now, in the reducer, we get the first row from the above table. We iterate through all its values and add them up; this is the sum for the first row. Next, the reducer receives the second row and does the same thing, until all the rows are completed.

The final output would be

Name Total
foo 21
bar 8
baz 8

So now you can understand why a Mapper is called a Mapper (because it creates a map of the data) and why a Reducer is called a Reducer (because it reduces the data that the mapper has generated into a more simplified form).
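
To make this concrete, here is roughly what the two functions would look like in the mongo shell for the items example above (the items collection and the item_totals output name are only for illustration):

var map = function () {
  // emit the item's name as the key and its price as the value
  emit(this.name, this.price);
};

var reduce = function (key, values) {
  // sum up all the prices grouped under one name
  return Array.sum(values);
};

db.items.mapReduce(map, reduce, { out : "item_totals" });
db.item_totals.find();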

If you run a couple of examples, you will get an idea of how this works. You can read more about MongoDB MapReduce here.

Set Up a Project

As you have seen earlier, we can run queries directly in the mongo shell and see the output. But for these examples, to keep things more tutorial-ish, we will build a Node project and then run the commands.

Mongojs

We will be using mongojs (a Node package for interacting with MongoDB) to write and execute our MapReduce commands. You can run the same code in the mongo shell directly and see the same results. You can read more about mongojs here.

Dummy-json

We will use dummy-json (a Node utility that allows you to generate random JSON data using Handlebars templates) to set up a few thousand sample JSON documents. You can find more information on dummy-json here. Then we will run MapReduce commands on top of it to generate some meaningful results.

So let's get started.

First, you need Node.js installed. You can find details here. Next, create a new folder named mongoDBMapReduce and open a terminal/prompt inside it. Now we will create a package.json to store our project details. Run

npm init

and fill in the prompts (or use whatever values you like).

Next, we will add the project dependencies. Run

npm i mongojs --save-dev

npm i dummy-json --save-dev

This will take care of installing dependencies and adding them to our package.json.

Generate Dummy Data

Next, we are going to generate dummy data using the dummy-json module. Create a new file named dataGen.js at the root of the project. We will keep the data generation logic in a separate file. In future, if you need to add more data, you can run this file.

Copy the below contents to dataGen.js
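
The original gist is not included here, so below is a minimal reconstruction of dataGen.js. The gender and hobbies helpers, the field names, and the dummy-json options object are assumptions based on the examples that follow, so treat this as a sketch rather than the exact original.

var fs = require('fs');
var dummyjson = require('dummy-json');
var mongojs = require('mongojs');

// database: mapReduceDB, collection: sourceData
var db = mongojs('mapReduceDB', ['sourceData']);

// custom Handlebars helpers used by schema.hbs (hypothetical implementations)
var helpers = {
  gender: function () {
    return Math.random() > 0.5 ? 'male' : 'female';
  },
  hobbies: function () {
    var all = ['Acrobatics', 'Meditation', 'Music', 'Photography', 'Papier-Mache'];
    // pick a random number of hobbies and join them with commas
    return all.slice(0, Math.floor(Math.random() * all.length) + 1).join(', ');
  }
};

// read the Handlebars template and generate the sample JSON
var template = fs.readFileSync('schema.hbs', 'utf8');
var data = JSON.parse(dummyjson.parse(template, { helpers: helpers }));

// clean up the old data before dumping the new data
// (comment this out in case you want to append to the existing collection)
db.sourceData.remove({}, function () {
  // insert the generated data into the DB
  db.sourceData.insert(data, function (err, docs) {
    if (err) console.log(err);
    else console.log('Inserted ' + data.length + ' documents');
    db.close();
  });
});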

First, we include all the required packages.

Next, we connect to a database named mapReduceDB and, inside it, a collection named sourceData; both will be created if they do not exist already.

Then come the custom Handlebars helpers. You can know more about those on the dummy-json page.

After that, we read the schema.hbs file (which we will create in a moment) and parse it to generate the sample JSON.

Before inserting, we clean up the old data. Comment this out in case you want to append to the existing collection.

Finally, we insert the generated data into the DB.

Next, create a new file named schema.hbs at the root of the project. This will consist of the schema that will constitute one JSON document. Paste the below contents into it
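
Again, the original template is not reproduced here. Below is a sketch of what schema.hbs could look like, assuming the field names used by the later examples (name, gender, age, hobbies), dummy-json's repeat, firstName, lastName and int helpers, and the custom gender and hobbies helpers registered in dataGen.js above.

[
  {{#repeat 9999}}
  {
    "name": "{{firstName}} {{lastName}}",
    "gender": "{{gender}}",
    "age": {{int 1 99}},
    "hobbies": "{{hobbies}}"
  }
  {{/repeat}}
]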

Do notice the {{#repeat 9999}} block: we are going to generate 9999 documents. That is it, we are all set to generate some data.

Open a new terminal/prompt and run

mongod

This will start the MongoDB service. Now back to our other terminal/prompt, run

node dataGen.js 

Once the script finishes inserting the generated documents, kill the node program by pressing ctrl + c.

To verify, you can open up a new terminal/prompt, run the mongo command, switch to the mapReduceDB database and check the documents in the sourceData collection with db.sourceData.find().pretty().

Make sense of the data

Okay, we have dumped in 9999 documents of user data. Let's try and make sense of it.

Example 1 : Get the count of Males and Females

First, create a new file named example1.js at the root of the project. We will write a MapReduce job, to fetch the count of Males and Females.

Mapper Logic

The only thing we expect from the Mapper is that it will extract the gender as the key and emit 1 as the value: one, because every user is either a male or a female. So the output of the mapper will be

Key Value
Male [1,1,1,1,1,1,1,1…..]
Female [1,1,1,1,1,1,1,1,1,….]

Reducer Logic

In the reducer, we get the above 2 rows. All we need to do is sum up all the values for a row, which gives us the count for that gender. The final output of the Reducer would be

Key Value
Male 5031
Female 4968

Code

Now, let's write some code to achieve this. In example1.js, first we will require all our dependencies.
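
A minimal sketch of the dependencies (the original gist is not reproduced here, and the variable names are assumptions):

var mongojs = require('mongojs');
var db = mongojs('mapReduceDB', ['sourceData', 'example1_results']);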

Do notice that in the mongojs() call, the first argument is the name of the database and the second argument is an array of the collections we are going to query. example1_results is the collection we are going to generate with our results.

Next, let's add the mapper and reducer functions
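
Something along these lines (a sketch; the reducer uses the Array.sum() helper that MongoDB makes available to server-side map-reduce functions):

var mapper = function () {
  // `this` is the current document; emit the gender as key and 1 as value
  emit(this.gender, 1);
};

var reducer = function (key, values) {
  // add up all the 1s emitted for this gender
  return Array.sum(values);
};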

Inside the mapper, this is populated with the current document, and this.gender will be either male or female, which is our key. emit() pushes the data into a temporary hash table that stores the mapper results.

Inside the reducer, we simply add up all the values for a gender.

Finally, add the logic to execute the mapReduce
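
Roughly as follows, using mongojs's collection-level mapReduce (again a sketch, not the original listing):

db.sourceData.mapReduce(mapper, reducer, { out: 'example1_results' }, function (err, res) {
  if (err) console.log(err);
  // fetch the results from the example1_results collection and display them
  db.example1_results.find(function (err, docs) {
    console.log(docs);
    db.close();
  });
});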

In the options object we set the output collection name, and in the callback we fetch the results from the example1_results collection and display them.

Back to terminal/prompt and run

node example1.js

and the result will be the male and female counts, something like the table above.

My count may not match with yours, but the sum of males and females should be 9999 (Duh!).

Mongo Shell code

If you want to run the above in the mongo shell, you can do so by pasting the following into the terminal/prompt
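
A shell equivalent would look something like this (a sketch of the same job):

use mapReduceDB
db.sourceData.mapReduce(
  function () { emit(this.gender, 1); },
  function (key, values) { return Array.sum(values); },
  { out: 'example1_results' }
);
db.example1_results.find();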

And you should see the same result. Simple, right?

Example 2 : Get the Eldest and Youngest Person in each gender

For this example, create a new file named example2.js at the root of the project. Here, we will group all the users based on gender and pull out the eldest and youngest in each gender. This is a bit more complex than the earlier example.

Mapper Logic

In the mapper, we will return the gender as the key and an object as the value. The object will hold the user's age and name. The age will be used for the calculation, whereas the name is only for display purposes.

Key Value
Male [{age : 9, name : John}, {}, {} ,{}…]
Female [{age : 19, name : Rita}, {}, {} ,{}…]

Reducer Logic

Our reducer will be a bit more complex than in the last example. Here we will check all the ages corresponding to a gender and keep track of the eldest and the youngest. The final result should look something like

Key Value
Male {'min':{'name':'Haydee Milligan','age':1},'max':{'name':'Darrell Sprowl','age':99}}
Female {'min':{'name':'Cory Hollis','age':1},'max':{'name':'Shea Mercer','age':99}}

Code

Now, open example2.js and paste the below code.
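
The original listing is not reproduced here; the sketch below follows the logic described next. Note that the mapper emits the value as a { min, max } pair so the reducer returns the same shape it receives, which keeps the result correct when MongoDB re-reduces partial results.

var mongojs = require('mongojs');
var db = mongojs('mapReduceDB', ['sourceData', 'example2_results']);

var mapper = function () {
  // build an object holding the user's age and name and send it as the value
  var person = { age: this.age, name: this.name };
  emit(this.gender, { min: person, max: person });
};

var reducer = function (key, values) {
  var res = { min: values[0].min, max: values[0].max };
  // iterate through all the values, keeping the eldest and the youngest
  for (var i = 1; i < values.length; i++) {
    if (values[i].max.age > res.max.age) {
      res.max = values[i].max;
    }
    if (values[i].min.age < res.min.age) {
      res.min = values[i].min;
    }
  }
  return res;
};

db.sourceData.mapReduce(mapper, reducer, { out: 'example2_results' }, function (err, res) {
  if (err) console.log(err);
  // fetch and display the results
  db.example2_results.find(function (err, docs) {
    console.log(JSON.stringify(docs));
    db.close();
  });
});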

In the mapper, we build an object and send it as part of the value. In the reducer, we iterate through all the value objects and check whether the current object's age is greater (or smaller) than what we have seen so far, updating res.max and, similarly, res.min. Finally, we push the result set into a new collection named example2_results.

To run this example, back to terminal/prompt and run

node example2.js 

And you should see something like the min/max results in the table above. Since our dataset is huge, there is a high chance that every age from 1 to 99 appears, so both genders end up with a minimum of 1 and a maximum of 99. You can narrow the age range in schema.hbs, re-run dataGen.js, and then run the above example again and check the values.

Example 3 : Count the number of users in each hobby

In our final example, we will see how many users have similar hobbies. For that, let's first create a new file named example3.js at the root of the project. Every user document has a hobbies field containing a list of hobbies separated by commas. We will find out how many users have Acrobatics as a hobby, and so on.

Mapper Logic

Our mapper is a bit more complex for this scenario. We will emit a new key-value pair for each hobby of a user. This way, we fire one count for each hobby per user. By the end of the mapper, we will end up with something like

Key Value
Acrobatics [1,1,1,1,1,1,….]
Meditation [1,1,1,1,1,1,….]
Music [1,1,1,1,1,1,….]
Photography [1,1,1,1,1,1,….]
Papier-Mache [1,1,1,1,1,1,….]

Reducer Logic

Here, we simply add up the values for each hobby. And finally we will have

Key Value
Acrobatics  6641
Meditation  3338
Music  3338
Photography  3303
Papier-Mache  6661

Code
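
The original gist is not included; a sketch of example3.js could look like this (the example3_results output collection name is an assumption):

var mongojs = require('mongojs');
var db = mongojs('mapReduceDB', ['sourceData', 'example3_results']);

var mapper = function () {
  // split the comma-separated hobbies and emit one count per hobby
  var hobbies = this.hobbies.split(',');
  for (var i = 0; i < hobbies.length; i++) {
    emit(hobbies[i].trim(), 1);
  }
};

var reducer = function (key, values) {
  // this loop could be replaced with a simple Array.sum(values)
  var count = 0;
  for (var i = 0; i < values.length; i++) {
    count += values[i];
  }
  return count;
};

db.sourceData.mapReduce(mapper, reducer, { out: 'example3_results' }, function (err, res) {
  if (err) console.log(err);
  db.example3_results.find(function (err, docs) {
    console.log(docs);
    db.close();
  });
});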

Do notice that in the mapper we iterate through each hobby and emit one count for it. The loop in the reducer could be replaced with a simple Array.sum(values), but this is another way of doing the same thing. Finally, we run the job, and the result will look something like the hobby counts in the table above.

If you did notice, the output collection gets overwritten when you run a new MapReduce job that writes to an existing collection.

So, this is how we can run MapReduce jobs in MongoDB. But do remember that sometimes a simple query can get the job done.


Thanks for reading! Do comment.
@arvindr21