July 1, 2016 by Tarun Saxena
MongoDB is an efficient, reliable and fast database for applications that generate data in a schema-free manner. For high availability of data in MongoDB, we use MongoDB replica sets. We often face issues in managing and syncing the cluster which prevent us from getting the full benefit of Mongo replication. In some cases we end up losing data, or the application starts suffering in terms of failovers. Sometimes we also face issues in restarting the whole cluster after an improper shutdown. This blog is a reference for how to troubleshoot MongoDB replicas in some common situations.
Use Case:
Restarting MongoDB replicas after you have turned off the cluster for some time (in a QA or load environment). Generally in a load environment we don't want the replicas to be up and running all the time, as they are hardly used by the application; they are only needed for load testing from time to time. Below are the tips you can use to make your shutdowns and startups clean while managing MongoDB replicas.
The most important step when you are shutting down your cluster is to shut Mongo down cleanly, leaving nothing in the journal files. You can use the --shutdown option of the mongod process to do this. Shutdown command:
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --shutdown
You can also shut down a member from its mongo shell:
db.shutdownServer()
Turn off the secondary members first and then the primary member. After both secondary members are turned off, the primary will step down to secondary on its own, as no member is available for an election.
To restart the cluster, start each MongoDB process one by one after checking all the permissions (e.g. on the log files and the MongoDB data directory). The process should not be painful if the shutdown was done properly. Let every member perform its syncing process, and don't start the next mongod until the existing ones have reached the "waiting for connections" state. The cluster should then come up without any problems. If a member does not come up in the SECONDARY state because it is unable to catch up with the primary, and the primary keeps overwriting its oplog, then you can perform one of the two steps below:
a) Delete the whole data directory (after taking a backup) and let the member perform an initial sync (an automatic feature of Mongo). This will take time if the data is huge. A sketch of this option follows after step (b).
b) Take a point-in-time snapshot of the volume or file system from another member, replace the data directory of the affected mongo with it, and start the mongod process. Note that the backup files must be recent enough that the member can still catch up from the oplog. This process is faster than the first one.
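A minimal sketch of option (a), reusing the binary path, dbpath, port and replica set name that appear elsewhere in this post; any extra options from your real deployment (key file, bind IP, config file, etc.) must be kept as well, so treat this only as an outline to run on the affected secondary:
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --shutdown    # stop the lagging member cleanly
mv mongo_data mongo_data.bak    # keep the old files as a backup instead of deleting them outright
mkdir mongo_data
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --replSet prod-mongo-replset --logpath mongo.log --fork
# started with an empty data directory, the member performs an automatic initial sync from the primary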
Why Lag Is Harmful:
Replication lag is the difference between the time an operation is performed on the primary member and the time the same operation is applied on a secondary member. A large replication lag means that the primary's data is not being replicated to the secondary members quickly enough, which is an alarming situation: if the primary goes down at that point, it is a big problem for the application. If the replication lag becomes excessive, the secondary members need to be re-synced with the primary, and during the re-sync they are unavailable to the application. Read operations on the cluster may also return inconsistent (stale) data.
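For example, if the application routes reads to secondary members, a lagging secondary can serve stale data. A small mongo shell illustration (the orders collection and the _id value here are hypothetical):
// allow this connection to read from secondaries
db.getMongo().setReadPref("secondaryPreferred")
// a lagging secondary may return an older version of this document than the primary holds
db.orders.find({ _id: 123 })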
Why Lag Occurs:
Some of the reasons for increased replication lag are:
- The secondary members are running out of capacity in terms of memory, I/O or CPU.
- Huge write operations on the primary leave the secondary members unable to replay the oplog fast enough.
- Index building on the primary can block all operations on the secondaries for that time (see the sketch after this list).
- Locking a secondary member for some important reason, such as a backup. During that time the secondary does not apply any writes, which increases the lag.
- One or more secondaries are dead and have not come up for a long time.
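For the index-building case, one common mitigation is to build the index in the background on the primary, since background builds on the primary are also performed in the background on the secondaries in recent MongoDB versions. A sketch in the mongo shell (the collection and field names are hypothetical):
// a background build does not block other operations for the whole duration of the build
db.users.createIndex({ email: 1 }, { background: true })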
Keeping An Eye On The Replication Lag:
There are two basic MongoDB commands for viewing replication statistics in MongoDB:
rs.printReplicationInfo()
prod-mongo-replset:PRIMARY> rs.printReplicationInfo()
configured oplog size: 10240.003829956055MB
log length start to end: 258803secs (71.89hrs)
oplog first event time: Mon Jun 27 2016 18:16:36 GMT+0000 (UTC)
oplog last event time: Thu Jun 30 2016 18:09:59 GMT+0000 (UTC)
now: Thu Jun 30 2016 18:09:59 GMT+0000 (UTC)
It shows the configured/default oplog size which the replica set is currently using. The date range shows the times of the first and last MongoDB operations stored in the oplog. The duration (71.89 hours) shows that the oplog can hold roughly 72 hours of operations, so a secondary that is unavailable for longer than about three days will no longer be able to catch up from the oplog. The oplog size should therefore be set according to the longest downtime you expect for your secondary members.
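The oplog size can be set with the --oplogSize option (in MB) when a member is started; a sketch reusing the binary path and replica set name from this post. Note that on MongoDB 3.2 this option only takes effect when the member's data files are first created, it does not resize an already-initialised oplog:
mongodb-linux-x86_64-ubuntu1404-3.2.7/bin/mongod --port 27017 --dbpath mongo_data --replSet prod-mongo-replset --oplogSize 10240 --logpath mongo.log --fork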
rs.printSlaveReplicationInfo()
prod-mongo-replset:PRIMARY> rs.printSlaveReplicationInfo()
source: mongo2.xxxxxx:27017
syncedTo: Thu Jun 30 2016 18:09:48 GMT+0000 (UTC)
1 secs (0 hrs) behind the primary
source: mongo3.xxxxxx:27017
syncedTo: Thu Jun 30 2016 18:09:49 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
It shows the replication delay of each secondary member with respect to the primary.
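If you want to compute the lag yourself (for an alerting script, for example), a small mongo shell sketch based on rs.status(), which is described next, could look like the following; it assumes a primary is currently present in the set:
var s = rs.status();
// find the primary and compare every secondary's last applied optime against it
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
        print(m.name + " is " + (primary.optimeDate - m.optimeDate) / 1000 + " secs behind the primary");
    }
});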
rs.status()
It shows the in-depth details of the replica set. We can gather enough information about the replication cluster from these commands.
prod-mongo-replset:SECONDARY> rs.status();
{
"set" : "prod-mongo-replset",
"date" : ISODate("2016-06-30T18:17:54.725Z"),
"myState" : 2,
"term" : NumberLong(-1),
"syncingTo" : "mongo1.xxxxxx:27017",
"heartbeatIntervalMillis" : NumberLong(2000),
"members" : [
{
"_id" : 0,
"name" : "mongo1.xxxxxx:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 6732256,
"optime" : Timestamp(1467310674, 562),
"optimeDate" : ISODate("2016-06-30T18:17:54Z"),
"lastHeartbeat" : ISODate("2016-06-30T18:17:54.455Z"),
"lastHeartbeatRecv" : ISODate("2016-06-30T18:17:54.596Z"),
"pingMs" : NumberLong(1),
"electionTime" : Timestamp(1460578491, 1),
"electionDate" : ISODate("2016-04-13T20:14:51Z"),
"configVersion" : 9
},
{
"_id" : 1,
"name" : "mongo2.xxxxxx:27017",
"health"