July 1, 2016 by Tarun Saxena
MongoDB is an efficient, reliable and fast database for applications that generate data in a schema-free manner. For high availability of data in MongoDB, we use MongoDB replica sets. We often face issues in managing and syncing the cluster which prevent us from getting the full benefit of Mongo replication. In some cases we end up losing data, or the application starts suffering in terms of failovers. Sometimes we also face issues in restarting the whole cluster after an improper shutdown. This blog is a reference for how to troubleshoot MongoDB replicas in some common situations.
Use Case:
Restarting MongoDB replicas after you have turned off the cluster for some time (in a QA or load environment). Generally in a load environment we don't want the replicas to be up and running all the time, as they are hardly used by the application; they are only needed for load testing from time to time. Below are the tips you can use to make your shutdowns and startups clean while managing MongoDB replicas.
The most important step when you are shutting down your cluster is to shut Mongo down cleanly, leaving nothing in the journal files. You can use the --shutdown option of the mongod process to do this. Shutdown command:
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --shutdown
You can also shut down a member from its mongo shell:
db.shutdownServer()
Turn off the secondary members first and then the primary member. After both secondary members are turned off, the primary will step down to secondary on its own, as no member is available for an election.
To restart the cluster, start each MongoDB process one by one after checking all the permissions (e.g. on the log files and the MongoDB data directory). The process should not be painful if the shutdown was done properly. Let every member perform its syncing process, and don't start the next mongod until the existing ones have reached the "waiting for connections" state. The cluster should then come up without any problems. If a member does not come up in the SECONDARY state because it is unable to catch up with the primary, and the primary keeps overwriting its oplog, then you can perform one of the two steps below:
a) Delete the whole data directory (after taking a backup) and let the member perform an initial sync (an automatic feature of Mongo). This will take time if the data is huge. A sketch of this option follows after step (b).
b) Take a point-in-time snapshot of the volume or file system from another member, replace the data directory of the affected mongo with it, and start the mongod process. Note that the backup files must be recent enough that the member can still catch up from the oplog. This process is faster than the first one.
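A minimal sketch of option (a), reusing the binary path, dbpath, port and replica set name that appear elsewhere in this post; any extra options from your real deployment (key file, bind IP, config file, etc.) must be kept as well, so treat this only as an outline to run on the affected secondary:
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --shutdown    # stop the lagging member cleanly
mv mongo_data mongo_data.bak    # keep the old files as a backup instead of deleting them outright
mkdir mongo_data
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --replSet prod-mongo-replset --logpath mongo.log --fork
# started with an empty data directory, the member performs an automatic initial sync from the primary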
Why Lag Is Harmful:
Replication lag is the difference between the time an operation is performed on the primary member and the time the same operation is applied on a secondary member. A large replication lag means that the primary's data is not being replicated to the secondary members quickly enough, which is an alarming situation: if the primary goes down at that point, it is a big problem for the application. If the replication lag becomes excessive, the secondary members need to be re-synced with the primary, and during the re-sync they are unavailable to the application. Read operations on the cluster may also return inconsistent (stale) data.
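For example, if the application routes reads to secondary members, a lagging secondary can serve stale data. A small mongo shell illustration (the orders collection and the _id value here are hypothetical):
// allow this connection to read from secondaries
db.getMongo().setReadPref("secondaryPreferred")
// a lagging secondary may return an older version of this document than the primary holds
db.orders.find({ _id: 123 })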
Why Lag Occurs:
Some of the reasons for increased replication lag are:
- The secondary members are running out of capacity in terms of memory, I/O or CPU.
- Huge write operations on the primary leave the secondary members unable to replay the oplog fast enough.
- Index building on the primary can block all operations on the secondaries for that time (see the sketch after this list).
- Locking a secondary member for some important reason, such as a backup. During that time the secondary does not apply any writes, which increases the lag.
- One or more secondaries are dead and have not come up for a long time.
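For the index-building case, one common mitigation is to build the index in the background on the primary, since background builds on the primary are also performed in the background on the secondaries in recent MongoDB versions. A sketch in the mongo shell (the collection and field names are hypothetical):
// a background build does not block other operations for the whole duration of the build
db.users.createIndex({ email: 1 }, { background: true })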
Keeping An Eye On The Replication Lag:
There are two basic MongoDB commands for viewing replication statistics in MongoDB:
rs.printReplicationInfo()
prod-mongo-replset:PRIMARY> rs.printReplicationInfo()
configured oplog size: 10240.003829956055MB
log length start to end: 258803secs (71.89hrs)
oplog first event time: Mon Jun 27 2016 18:16:36 GMT+0000 (UTC)
oplog last event time: Thu Jun 30 2016 18:09:59 GMT+0000 (UTC)
now: Thu Jun 30 2016 18:09:59 GMT+0000 (UTC)
It shows the configured/default oplog size which the replica set is currently using. The date range shows the times of the first and last MongoDB operations stored in the oplog. The duration (71.89 hours) shows that the oplog can hold roughly 72 hours of operations, so a secondary that is unavailable for longer than about three days will no longer be able to catch up from the oplog. The oplog size should therefore be set according to the longest downtime you expect for your secondary members.
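The oplog size can be set with the --oplogSize option (in MB) when a member is started; a sketch reusing the binary path and replica set name from this post. Note that on MongoDB 3.2 this option only takes effect when the member's data files are first created, it does not resize an already-initialised oplog:
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --replSet prod-mongo-replset --oplogSize 10240 --logpath mongo.log --fork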
rs.printSlaveReplicationInfo()
prod-mongo-replset:PRIMARY> rs.printSlaveReplicationInfo()
source: mongo2.xxxxxx:27017
syncedTo: Thu Jun 30 2016 18:09:48 GMT+0000 (UTC)
1 secs (0 hrs) behind the primary
source: mongo3.xxxxxx:27017
syncedTo: Thu Jun 30 2016 18:09:49 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
It shows the replication delay of each secondary member with respect to the primary.
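If you want to compute the lag yourself (for an alerting script, for example), a small mongo shell sketch based on rs.status(), which is described next, could look like the following; it assumes a primary is currently present in the set:
var s = rs.status();
// find the primary and compare every secondary's last applied optime against it
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
        print(m.name + " is " + (primary.optimeDate - m.optimeDate) / 1000 + " secs behind the primary");
    }
});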
rs.status()
It shows the in-depth details of the replica set. We can gather enough information about the replication cluster from these commands.
prod-mongo-replset:SECONDARY> rs.status();
{
"set" : "prod-mongo-replset",
"date" : ISODate("2016-06-30T18:17:54.725Z"),
"myState" : 2,
"term" : NumberLong(-1),
"syncingTo" : "mongo1.xxxxxx:27017",
"heartbeatIntervalMillis" : NumberLong(2000),
"members" : [
{
"_id" : 0,
"name" : "mongo1.xxxxxx:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 6732256,
"optime" : Timestamp(1467310674, 562),
"optimeDate" : ISODate("2016-06-30T18:17:54Z"),
"lastHeartbeat" : ISODate("2016-06-30T18:17:54.455Z"),
"lastHeartbeatRecv" : ISODate("2016-06-30T18:17:54.596Z"),
"pingMs" : NumberLong(1),
"electionTime" : Timestamp(1460578491, 1),
"electionDate" : ISODate("2016-04-13T20:14:51Z"),
"configVersion" : 9
},
{
"_id" : 1,
"name" : "mongo2.xxxxxx:27017",
"health"