
MongoDB Performance: Running MongoDB Map-Reduce Operations On Secondaries


Map-reduce is perhaps the most versatile of the aggregation operations that MongoDB supports.

Map-Reduce is a popular programming model that originated at Google for processing and aggregating large volumes of data in parallel. A detailed discussion of Map-Reduce is out of the scope of this article, but essentially it is a multi-step aggregation process. The two most important steps are the map stage (process each document and emit results) and the reduce stage (collate the results emitted during the map stage).

MongoDB supports three kinds of aggregation operations: Map-Reduce, the aggregation pipeline, and single-purpose aggregation commands. You can use this MongoDB comparison document to see which fits your needs.

In my last post, we saw, with examples, how to run aggregation pipelines on secondaries. In this post, we will walk through running Map-Reduce jobs on MongoDB secondary replicas.

MongoDB Map-Reduce

MongoDB supports running Map-Reduce jobs on the database servers. This offers the flexibility to write complex aggregation tasks that aren’t as easily done via aggregation pipelines. MongoDB lets you write custom map and reduce functions in JavaScript that can be passed to the database via the Mongo shell or any other client. On large and constantly growing data sets, one can even consider running incremental Map-Reduce jobs to avoid reprocessing older data every time.
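As a minimal sketch of that incremental pattern (using the customer-spend example described later in this post; the ts field, the lastRunTimestamp variable, and the custTotals output collection are assumptions for illustration), the idea is to restrict the map phase to new documents with a query and fold the new results into an existing collection via out: { reduce: ... }. Note that because this variant writes to a collection, it has to run on the primary, unlike the inline examples shown below.

// Hypothetical incremental Map-Reduce run from the Mongo shell
var mapFunc = function() { emit(this.custid, this.txnval); };
var reduceFunc = function(key, values) { return Array.sum(values); };
db.txns.mapReduce(mapFunc, reduceFunc, {
    query: { ts: { $gt: lastRunTimestamp } }, // only map documents newer than the last run (ts is an assumed field)
    out: { reduce: "custTotals" }             // re-reduce the new results into the existing custTotals collection
});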

Historically, the map and reduce methods were executed in a single-threaded context. However, that limitation was removed in version 2.4.

Why run Map-Reduce jobs on the Secondary?

Like other aggregation jobs, Map-Reduce is a resource-intensive ‘batch’ job, so it is a good fit for running on read-only replicas. The caveats in doing so are:

1) It should be acceptable to use slightly stale data. Alternatively, you can tweak the write concern to ensure replicas are always in sync with the primary (see the sketch after this list). This second option assumes that taking a hit on write performance is acceptable.

2) The output of the Map-Reduce job shouldn’t be written to another collection within the database but rather be returned to the application (i.e. no writes to the database).
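To illustrate the first caveat, here is a minimal sketch, assuming a three-member replica set, of tightening the write concern so that a write is acknowledged only after it has reached every member (the document shown is just a placeholder):

// Assumes a 3-member replica set: w: 3 waits for the write to replicate to all members,
// and wtimeout keeps the write from blocking indefinitely if a member is down.
db.txns.insert(
    { custid: "cust_66", txnval: 100 },
    { writeConcern: { w: 3, wtimeout: 5000 } }
)

The trade-off, as noted above, is write latency: every insert now waits on replication before it is acknowledged.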

Let’s look at how to do this via examples, both from the mongo shell and the Java driver.

Map-Reduce on Replica Sets

Data Set

For illustration, we will use a rather simple data set: A daily transaction record dump from a retailer. A sample entry looks like:

RS-replica-0:PRIMARY> use test
switched to db test
RS-replica-0:PRIMARY> show tables
txns
RS-replica-0:PRIMARY> db.txns.findOne()
{
"_id" : ObjectId("584a3b71cdc1cb061957289b"),
"custid" : "cust_66",
"txnval" : 100,
"items" : [{"sku": sku1", "qty": 1, "pr": 100}, ...],
...
}

In our examples, we will calculate the total expenditure of a given customer on that day. Thus, given our schema, the map and reduce methods will look like:

var mapFunction = function() { emit(this.custid, this.txnval); } // Emit the custid and txn value from each record
var reduceFunction = function(key, values) { return Array.sum(values); } // Sum all the txn values for a given custid
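As an aside, the same per-customer total can be expressed with the aggregation pipeline covered in the last post; a rough equivalent, shown here only for comparison, would be:

// Group transactions by custid and sum txnval; the output documents have the same _id/value shape
db.txns.aggregate([
    { $group: { _id: "$custid", value: { $sum: "$txnval" } } }
])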

With our schema established, let’s look at Map-Reduce in action.

MongoDB Shell

In order to ensure that a Map-Reduce job is executed on the secondary, the read preference should be set to secondary. As we said above, for a Map-Reduce job to run on a secondary, the output of the result must be inline (in fact, that’s the only out value allowed on secondaries). Let’s see how it works.

$ mongo -u admin -p pwd --authenticationDatabase admin --host RS-replica-0/server-1.servers.example.com:27017,server-2.servers.example.com:27017
MongoDB shell version: 3.2.10
connecting to: RS-replica-0/server-1.servers.example.com:27017,server-2.servers.example.com:27017/test
2016-12-09T08:15:19.347+0000 I NETWORK [thread1] Starting new replica set monitor for server-1.servers.example.com:27017,server-2.servers.example.com:27017
2016-12-09T08:15:19.349+0000 I NETWORK [ReplicaSetMonitorWatcher] starting
RS-replica-0:PRIMARY> db.setSlaveOk()
RS-replica-0:PRIMARY> db.getMongo().setReadPref('secondary')
RS-replica-0:PRIMARY> db.getMongo().getReadPrefMode()
secondary
RS-replica-0:PRIMARY> var mapFunc = function() { emit(this.custid, this.txnval); }
RS-replica-0:PRIMARY> var reduceFunc = function(key, values) { return Array.sum(values); }
RS-replica-0:PRIMARY> db.txns.mapReduce(mapFunc, reduceFunc, {out: { inline: 1 }})
{
"results" : [
{
"_id" : "cust_0",
"value" : 72734
},
{
"_id" : "cust_1",
"value" : 67737
},
...
],
"timeMillis" : 215,
"counts" : {
"input" : 10000,
"emit" : 10000,
"reduce" : 909,
"output" : 101
},
"ok" : 1
}

A peek at the logs on the secondary confirms that the job indeed ran on the secondary.

...
2016-12-09T08:17:24.842+0000 D COMMAND [conn344] mr ns: test.txns
2016-12-09T08:17:24.843+0000 I COMMAND [conn344] command test.$cmd command: listCollections { listCollections: 1, filter: { name: "txns" }, cursor: {} } keyUpdates:0 writeConflicts:0 numYields:0 reslen:150 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 1, R: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 0ms
2016-12-09T08:17:24.865+0000 I COMMAND [conn344] query test.system.js planSummary: EOF ntoreturn:0 ntoskip:0 keysExamined:0 docsExamined:0 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 2, R: 1 } }, Collec
