Map-reduce is perhaps the most versatile of the aggregation operations that MongoDB supports.
Map-Reduce is a popular programming model that originated at Google for processing and aggregating large volumes of data in parallel. A detailed discussion of Map-Reduce is beyond the scope of this article, but in essence it is a multi-step aggregation process. The two most important steps are the map stage (process each document and emit results) and the reduce stage (collate the results emitted during the map stage).
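To make the two stages concrete, here is a minimal in-memory sketch (plain Node.js, not MongoDB code) using the classic word-count example. The `mapReduce` harness, the sample documents, and the emit-as-parameter convention are all illustrative assumptions; in MongoDB the map function calls a global `emit` instead.

```javascript
// Illustrative in-memory sketch of the two Map-Reduce stages (not MongoDB's API).
// In MongoDB, map functions call a global emit(); here emit is passed in explicitly.
function mapReduce(docs, mapFn, reduceFn) {
  const emitted = new Map(); // key -> array of emitted values
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  docs.forEach(doc => mapFn.call(doc, emit)); // map stage: process each document
  const results = [];
  for (const [key, values] of emitted) {      // reduce stage: collate per key
    results.push({
      _id: key,
      // Like MongoDB, skip reduce when a key was emitted only once.
      value: values.length > 1 ? reduceFn(key, values) : values[0],
    });
  }
  return results;
}

// Classic word count: map emits (word, 1); reduce sums the ones.
const docs = [{ words: ["map", "reduce"] }, { words: ["map"] }];
const counts = mapReduce(
  docs,
  function (emit) { this.words.forEach(w => emit(w, 1)); },
  (key, values) => values.reduce((a, b) => a + b, 0)
);
console.log(counts); // [ { _id: 'map', value: 2 }, { _id: 'reduce', value: 1 } ]
```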
MongoDB supports three kinds of aggregation operations: Map-Reduce, the aggregation pipeline, and single-purpose aggregation commands. You can use this MongoDB comparison document to see which fits your needs.
In my last post, we saw, with examples, how to run aggregation pipelines on secondaries. In this post, we will walk through running Map-Reduce jobs on MongoDB secondary replicas.
MongoDB Map-Reduce
MongoDB supports running Map-Reduce jobs on the database servers. This offers the flexibility to write complex aggregation tasks that aren’t as easily done via aggregation pipelines. MongoDB lets you write custom map and reduce functions in JavaScript that can be passed to the database via the mongo shell or any other client. On large and constantly growing data sets, one can even consider running incremental Map-Reduce jobs to avoid processing older data every time.
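Incremental Map-Reduce hinges on the reduce function being "re-reducible": re-applying it to previously reduced output plus newly reduced values must give the same answer as reducing everything at once. Summation has this property. A plain-JavaScript sketch (the key name and values are made up for illustration):

```javascript
// Sketch: why incremental Map-Reduce works. The reduce function must satisfy
// reduce(key, [reduce(key, A), reduce(key, B)]) === reduce(key, A.concat(B)).
const reduceFunc = (key, values) => values.reduce((a, b) => a + b, 0);

const oldValues = [100, 50, 25]; // txn values already processed in a prior run
const newValues = [60, 40];      // txn values from new documents only

// Full recomputation over all the data:
const full = reduceFunc("cust_1", oldValues.concat(newValues)); // 275

// Incremental: reduce only the new data, then re-reduce with the stored result:
const stored = reduceFunc("cust_1", oldValues); // kept from the prior run
const incremental = reduceFunc("cust_1", [stored, reduceFunc("cust_1", newValues)]);

console.log(full === incremental); // true: same answer without reprocessing old docs
```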
Historically, the map and the reduce methods used to be executed in a single-threaded context. However, that limitation was removed in version 2.4.
Why run Map-Reduce jobs on the secondary?
Like other aggregation jobs, Map-Reduce is a resource-intensive ‘batch’ job, so it is a good fit for running on read-only replicas. The caveats in doing so are:
1) It should be acceptable to use slightly stale data. Alternatively, you can tweak the write concern to ensure replicas are always in sync with the primary; this second option assumes that taking a hit on write performance is acceptable.
2) The output of the Map-Reduce job shouldn’t be written to another collection within the database but rather be returned to the application (i.e., no writes to the database).
Let’s look at how to do this via examples, both from the mongo shell and the Java driver.
Map-Reduce on Replica Sets
Data Set
For illustration, we will use a rather simple data set: a daily transaction record dump from a retailer. A sample entry looks like:
RS-replica-0:PRIMARY> use test
switched to db test
RS-replica-0:PRIMARY> show tables
txns
RS-replica-0:PRIMARY> db.txns.findOne()
{
"_id" : ObjectId("584a3b71cdc1cb061957289b"),
"custid" : "cust_66",
"txnval" : 100,
"items" : [{"sku": sku1", "qty": 1, "pr": 100}, ...],
...
}
In our examples, we will calculate the total expenditure of a given customer on that day. Thus, given our schema, the map and reduce methods will look like:
var mapFunction = function() { emit(this.custid, this.txnval); } // Emit the custid and txn value from each record
var reduceFunction = function(key, values) { return Array.sum(values); } // Sum all the txn values for a given custid
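The pair can be sanity-checked outside the database with plain Node.js. Note that `Array.sum` is a mongo-shell helper, so a plain-JS equivalent is substituted below; the sample documents and the grouping harness are illustrative assumptions, standing in for what the server does between the map and reduce stages.

```javascript
// Sanity-check the map/reduce pair outside MongoDB (plain Node.js).
// Array.sum exists only in the mongo shell; use a reduce() equivalent here.
const mapFunction = function (emit) { emit(this.custid, this.txnval); };
const reduceFunction = (key, values) => values.reduce((a, b) => a + b, 0);

const txns = [
  { custid: "cust_66", txnval: 100 },
  { custid: "cust_66", txnval: 40 },
  { custid: "cust_7",  txnval: 25 },
];

// Group emitted values by key, as the server does between map and reduce.
const groups = {};
txns.forEach(doc =>
  mapFunction.call(doc, (k, v) => (groups[k] = groups[k] || []).push(v))
);
const results = Object.entries(groups).map(([k, vals]) =>
  ({ _id: k, value: reduceFunction(k, vals) })
);
console.log(results);
// [ { _id: 'cust_66', value: 140 }, { _id: 'cust_7', value: 25 } ]
```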
With our schema established, let’s look at Map-Reduce in action.
MongoDB Shell
To ensure that a Map-Reduce job is executed on the secondary, the read preference should be set to secondary. As we said above, for a Map-Reduce job to run on a secondary, the output of the result must be inline (in fact, that is the only out value allowed on secondaries). Let’s see how it works.
$ mongo -u admin -p pwd --authenticationDatabase admin --host RS-replica-0/server-1.servers.example.com:27017,server-2.servers.example.com:27017
MongoDB shell version: 3.2.10
connecting to: RS-replica-0/server-1.servers.example.com:27017,server-2.servers.example.com:27017/test
2016-12-09T08:15:19.347+0000 I NETWORK [thread1] Starting new replica set monitor for server-1.servers.example.com:27017,server-2.servers.example.com:27017
2016-12-09T08:15:19.349+0000 I NETWORK [ReplicaSetMonitorWatcher] starting
RS-replica-0:PRIMARY> db.setSlaveOk()
RS-replica-0:PRIMARY> db.getMongo().setReadPref('secondary')
RS-replica-0:PRIMARY> db.getMongo().getReadPrefMode()
secondary
RS-replica-0:PRIMARY> var mapFunc = function() { emit(this.custid, this.txnval); }
RS-replica-0:PRIMARY> var reduceFunc = function(key, values) { return Array.sum(values); }
RS-replica-0:PRIMARY> db.txns.mapReduce(mapFunc, reduceFunc, {out: { inline: 1 }})
{
"results" : [
{
"_id" : "cust_0",
"value" : 72734
},
{
"_id" : "cust_1",
"value" : 67737
},
...
],
"timeMillis" : 215,
"counts" : {
"input" : 10000,
"emit" : 10000,
"reduce" : 909,
"output" : 101
},
"ok" : 1
}
A peek at the logs on the secondary confirms that the job indeed ran on the secondary.
...
2016-12-09T08:17:24.842+0000 D COMMAND [conn344] mr ns: test.txns
2016-12-09T08:17:24.843+0000 I COMMAND [conn344] command test.$cmd command: listCollections { listCollections: 1, filter: { name: "txns" }, cursor: {} } keyUpdates:0 writeConflicts:0 numYields:0 reslen:150 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 1, R: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 0ms
2016-12-09T08:17:24.865+0000 I COMMAND [conn344] query test.system.js planSummary: EOF ntoreturn:0 ntoskip:0 keysExamined:0 docsExamined:0 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 2, R: 1 } }, Collec