Monitoring MongoDB with Nagios

In this blog, we’ll discuss monitoring MongoDB with Nagios.

There is a significant amount of talk around graphing MongoDB metrics using tools like Prometheus, Datadog, New Relic, and Ops Manager from MongoDB Inc. However, I haven’t noticed much discussion of the question “What MongoDB alerts should I be setting up?”

While building out Percona’s remote DBA service for MongoDB, I looked at Prometheus’s AlertManager. After reviewing it, I’m not sure it’s quite ready to be used exclusively. We needed to decide quickly whether there were better Nagios checks on the market, or whether I needed to write my own.

In the end, we settled on a hybrid approach. There are some good frameworks, but we needed to create or tweak some of the checks needed for a “SEV 1” or “SEV 2” type issue (which are the most important to me). One of the most common problems for Ops, DevOps, DBA, and most engineering teams is alert spam. As such, I wanted to be very careful to alert only on things pointing to immediate danger or current outages. As a result, we have now added pmp-check-mongo.py to the GitHub repository for Percona Monitoring Plugins. Since we use Grafana and Prometheus for metrics and graphing, there are no accompanying Cacti templates. In the future, we’ll need to decide how this will change PMP over time. In the meantime, we wanted to make the tool available now and worry about some of the issues later on.

As part of this push, I want to give you some real-world examples of how you might use this tool. There are many options available to you, and Nagios is still a bit green with regard to making those options as user-friendly as our tools are.

Usage: pmp-check-mongo.py [options]

Options:
  -h, --help            show this help message and exit
  -H HOST, --host=HOST  The hostname you want to connect to
  -P PORT, --port=PORT  The port mongodb is running on
  -u USER, --user=USER  The username you want to login as
  -p PASSWD, --password=PASSWD
                        The password you want to use for that user
  -W WARNING, --warning=WARNING
                        The warning threshold you want to set
  -C CRITICAL, --critical=CRITICAL
                        The critical threshold you want to set
  -A ACTION, --action=ACTION
                        The action you want to take. Valid choices are
                        (check_connections, check_election, check_lock_pct,
                        check_repl_lag, check_flushing, check_total_indexes,
                        check_balance, check_queues, check_cannary_test,
                        check_have_primary, check_oplog, check_index_ratio,
                        check_connect) Default: check_connect
  -s SSL, --ssl=SSL     Connect using SSL
  -r REPLICASET, --replicaset=REPLICASET
                        Connect to replicaset
  -c COLLECTION, --collection=COLLECTION
                        Specify the collection in check_cannary_test
  -d DATABASE, --database=DATABASE
                        Specify the database in check_cannary_test
  -q QUERY, --query=QUERY
                        Specify the query, only used in check_cannary_test
  --statusfile=STATUS_FILENAME
                        File to current store state data in for delta checks
  --backup-statusfile=STATUS_FILENAME_BACKUP
                        File to previous store state data in for delta checks
  --max-stale=MAX_STALE
                        Age of status file to make new checks (seconds)
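To make the options above a little more concrete, here is a minimal Python sketch of how a wrapper might assemble a command line for the plugin and read back its result. It uses only flags that appear in the usage text. The helper names (build_command, run_check), the plugin path, and any host or threshold values are hypothetical, and the sketch assumes the plugin follows the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), which the usage text itself does not state.

```python
import subprocess

# Standard Nagios plugin exit codes (assumed to apply to this plugin).
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_command(action, host="localhost", port=27017, warning=None,
                  critical=None, user=None, password=None,
                  plugin="pmp-check-mongo.py"):
    """Build an argv list using only flags shown in the usage text.

    The plugin path and all default values here are illustrative assumptions.
    """
    cmd = [plugin, "-H", host, "-P", str(port), "-A", action]
    # Pass credentials only when both are supplied.
    if user is not None and password is not None:
        cmd += ["-u", user, "-p", password]
    if warning is not None:
        cmd += ["-W", str(warning)]
    if critical is not None:
        cmd += ["-C", str(critical)]
    return cmd


def run_check(cmd):
    """Run the plugin and translate its exit code into a Nagios state name."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    state = NAGIOS_STATES.get(result.returncode, "UNKNOWN")
    return state, result.stdout.strip()
```

For instance, build_command("check_connections", host="db1.example.com", warning=700, critical=900) yields the argument list for a connection-count check against a hypothetical host with made-up thresholds; run_check would then execute it, provided the plugin is on the PATH.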

There seems to be a huge amount going on here, but let’s break it down into a few categories:

Connection options
Actions
Action options
Status options

Hopefully, this takes some of the scariness out of the script above.

Connection options

Host / Port Number: Pretty simple. This is just the host you want to connect to and the TCP port it is listening on.

Username and Password: Like Host/Port, these are typical Mongo connection options. If you do not set both the username and the password, the tool will assume auth is disabled.

SSL: This mostly concerns the old SSL support in Mongo clients (which was a boolean). The tool needs updating to support the more modern SSL connection options. Treat this as a “deprecated” feature that might not work on newer versions.

ReplicaSet: A very particular option that is only used for a few checks, and that verifies the connection is a replica set connection. Using this option lets the tool automatically find a primary node for you, and it is helpful to some checks specifically around replication and high availability (HA):

  check_election
  check_repl_lag
  check_cannary_test
  check_have_primary
  check_oplog
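As an illustration of how these connection options fit together, the following Python sketch builds a MongoDB connection string from them. It is not taken from the plugin's source: the function name build_mongo_uri is hypothetical, and the only behavior borrowed from the text is that credentials are used only when both username and password are set. The replicaSet query parameter is the standard MongoDB connection string option for replica set connections.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, user=None, password=None, replicaset=None):
    """Assemble a mongodb:// URI from the connection options.

    Hypothetical helper for illustration; the plugin may connect differently.
    """
    # Without both username and password, assume auth is disabled.
    if user and password:
        # Percent-encode credentials so characters like '@' or ':' are safe.
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    else:
        credentials = ""
    uri = f"mongodb://{credentials}{host}:{port}/"
    # A replica set name lets the driver discover the current primary.
    if replicaset:
        uri += f"?replicaSet={replicaset}"
    return uri
```

A URI built this way could be handed to a driver such as pymongo's MongoClient, which, when given a replica set name, discovers the members and routes to the primary.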

Actions and what they mean

check_connections: Connections matter for memory usage, but beyond that you need to know if your typical connection count suddenly doubles. That indicates something unexpected happened in the application or database and caused everything to reconnect. It often takes up to 10 minutes for the old connections to go away.

check_election: This uses the status file options we will cover in a minute. It checks whether the primary from the last check differs from the primary found now, and alerts if so. This check should have a threshold of only one before it alarms (as an alert means an HA event occurred).

check_lock_pct: MMAP only. This engine takes a write lock on the whole collection or database, depending on the version. This is a crucial metric for determining whether MMAP writes are blocking reads, meaning you need to scale the DB layer in some way.

check_repl_lag: Checks the replication stream to understand how far a given node lags behind the primary. To accomplish this, it uses a fake record in the test DB to cause a write. Without this, a read-only system would look artificially lagged, as no new oplog entries get created.

check_flushing: A common issue with MongoDB is very long flush times, causing a system halt. This is caused by your disk subsystem not keeping up, so the DB has to wait on flushing to make sure writes get correctly journaled.

check_total_indexes: The more indexes you have, the more the planner has to work to determine which index is a good fit. A high index count also increases the risk that recovery from a failure will take a long time, because of the way a restore builds indexes and because MongoDB can only build one index at a time.

check_balance: While MongoDB should keep things in balance across a cluster, many things can happen: jumbo chunks, a disabled balancer, the balancer constantly attempting to move the same chunk but failing, and even adding or removing shards. This alert is for these cases, as an imbalance means some records might get served faster than others. It is based purely on the chunk count, which the MongoDB balancer also uses, and which is not necessarily the same as disk usage.

check_queues: No matter which engine you have selected, a sustained backlog of reads or writes indicates your DB layer is unable to keep up with demand. It is important in these cases to send an alert if the rate is maintained. You might notice this is also in our Prometheus exporter for graphing, as both trending and alerting are necessary to watch in a MongoDB system.

check_cannary_test: This is a typical query for the database and
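The check_election description above (compare the primary seen on the last run with the primary seen now, using a status file) can be sketched in Python as follows. This is an illustrative sketch, not the plugin's implementation: the JSON status file layout, the function signature, and the message wording are all assumptions. In the real tool the current primary would come from the replica set connection, whereas here it is simply passed in.

```python
import json
import os

# Nagios exit codes used by this sketch.
OK, CRITICAL = 0, 2


def check_election(current_primary, status_file):
    """Alert when the primary differs from the one recorded on the last run.

    The status file format (a JSON object with a "primary" key) is an
    assumption made for this example.
    """
    previous = None
    if os.path.exists(status_file):
        with open(status_file) as fh:
            previous = json.load(fh).get("primary")

    # Record the current primary so the next run can compare against it.
    with open(status_file, "w") as fh:
        json.dump({"primary": current_primary}, fh)

    if previous is not None and previous != current_primary:
        return CRITICAL, (
            f"CRITICAL - primary changed from {previous} to {current_primary}"
        )
    return OK, f"OK - primary is {current_primary}"
```

Because the new primary is written back on every run, the alert fires once per election and clears on the next check, which matches the idea of a threshold of one.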
