Monitoring MongoDB with Nagios

In this blog, we’ll discuss monitoring MongoDB with Nagios.

There is a significant amount of talk around graphing MongoDB metrics using things like Prometheus, Datadog, New Relic, and Ops Manager from MongoDB Inc. However, I haven’t noticed a lot of talk around “What MongoDB alerts should I be setting up?”

While building out Percona’s remote DBA service for MongoDB, I looked at Prometheus’s AlertManager. After reviewing it, I’m not sure it’s quite ready to be used exclusively. We needed to decide quickly whether there were better Nagios checks on the market, or whether I needed to write my own.

In the end, we settled on a hybrid approach. There are some good frameworks, but we needed to create or tweak some of the checks required for a “SEV 1” or “SEV 2” type issue (which are most important to me). One of the most common problems for Ops, DevOps, DBA and most engineering teams is alert spam. As such, I wanted to be very careful to only alert on the things pointing to immediate dangers or current outages. As a result, we have now added pmp-check-mongo.py to the GitHub repository for Percona Monitoring Plugins. Since we use Grafana and Prometheus for metrics and graphing, there are no accompanying Cacti information templates. In the future, we’ll need to decide how this will change PMP over time. In the meantime, we wanted to make the tool available now and worry about some of the issues later on.

As part of this push, I want to give you some real-world examples of how you might use this tool. There are many options available to you, and Nagios is still a bit green when it comes to making those options as user-friendly as our tools are.

    Usage: pmp-check-mongo.py [options]

    Options:
      -h, --help            show this help message and exit
      -H HOST, --host=HOST  The hostname you want to connect to
      -P PORT, --port=PORT  The port mongodb is running on
      -u USER, --user=USER  The username you want to login as
      -p PASSWD, --password=PASSWD
                            The password you want to use for that user
      -W WARNING, --warning=WARNING
                            The warning threshold you want to set
      -C CRITICAL, --critical=CRITICAL
                            The critical threshold you want to set
      -A ACTION, --action=ACTION
                            The action you want to take. Valid choices are
                            (check_connections, check_election,
                            check_lock_pct, check_repl_lag, check_flushing,
                            check_total_indexes, check_balance, check_queues,
                            check_cannary_test, check_have_primary,
                            check_oplog, check_index_ratio, check_connect)
                            Default: check_connect
      -s SSL, --ssl=SSL     Connect using SSL
      -r REPLICASET, --replicaset=REPLICASET
                            Connect to replicaset
      -c COLLECTION, --collection=COLLECTION
                            Specify the collection in check_cannary_test
      -d DATABASE, --database=DATABASE
                            Specify the database in check_cannary_test
      -q QUERY, --query=QUERY
                            Specify the query, only used in check_cannary_test
      --statusfile=STATUS_FILENAME
                            File to current store state data in for delta checks
      --backup-statusfile=STATUS_FILENAME_BACKUP
                            File to previous store state data in for delta checks
      --max-stale=MAX_STALE
                            Age of status file to make new checks (seconds)
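
To make the usage text a little more concrete, here is a minimal Python sketch of how a wrapper might assemble a command line from those flags and interpret the result. It only uses flags that appear in the help output above. The helper names (build_check_command, run_check), the default port 27017, and the assumption that the script is executable and on the PATH are mine for illustration, not part of the tool. The exit-code mapping follows the usual Nagios plugin convention.

```python
import subprocess

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_check_command(action, host="localhost", port=27017,
                        warning=None, critical=None,
                        script="pmp-check-mongo.py"):
    """Build an argv list using only flags shown in the usage text.

    The function name and defaults are hypothetical; adjust `script`
    to wherever the plugin is installed.
    """
    argv = [script, "-H", host, "-P", str(port), "-A", action]
    if warning is not None:
        argv += ["-W", str(warning)]
    if critical is not None:
        argv += ["-C", str(critical)]
    return argv


def run_check(argv):
    """Run the plugin and translate its exit code into a Nagios state name."""
    result = subprocess.run(argv, capture_output=True, text=True)
    state = NAGIOS_STATES.get(result.returncode, "UNKNOWN")
    return state, result.stdout.strip()
```

For example, build_check_command("check_connections", warning=500, critical=1000) produces the argument list for a connection-count check. The thresholds 500 and 1000 are placeholder values, not recommendations.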

There seems to be a huge amount going on here, but let’s break it down into a few categories:

Connection options
Actions
Action options
Status options

Hopefully, this takes some of the scariness out of the script above.

Connection options

Host / Port Number
Pretty simple: this is the host you want to connect to and the TCP port it is listening on.

Username and Password
Like Host/Port, these are typical Mongo connection options. If you do not set both the username and password, the tool assumes auth is disabled.

SSL
This is mostly around the old SSL support in Mongo clients (which was a boolean). This tool needs updating to support the more modern SSL connection options. Treat this as a “deprecated” feature that might not work on newer versions.

ReplicaSet
A very particular option that is only used for a few checks and verifies that the connection uses a replicaset connection. Using this option lets the tool automatically find a primary node for you, and is helpful to some checks specifically around replication and high availability (HA):

check_election
check_repl_lag
check_cannary_test
check_have_primary
check_oplog
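
As a rough illustration of how the connection options fit together, the sketch below turns them into a standard mongodb:// connection string. It mirrors the rule described above (credentials are used only when both username and password are set). The function name build_mongo_uri is hypothetical, and the real script may well pass these values to its MongoDB driver differently.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port=27017, user=None, password=None,
                    replicaset=None):
    """Return a mongodb:// URI from the connection options.

    Assumption for this sketch: if either the user or the password is
    missing, connect without credentials (auth treated as disabled).
    """
    auth = ""
    if user and password:
        # Percent-encode credentials so characters like '@' or ':' are safe.
        auth = f"{quote_plus(user)}:{quote_plus(password)}@"
    uri = f"mongodb://{auth}{host}:{port}/"
    if replicaset:
        # replicaSet is the standard URI option naming the replica set.
        uri += f"?replicaSet={replicaset}"
    return uri
```

Passing a replica set name is what allows a driver to discover the current primary on its own, which is why the replication and HA checks listed above benefit from it.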

Actions and what they mean

check_connections
Connection count relates to memory usage, but beyond that you need to know if your typical number of connections suddenly doubles. This indicates something unexpected happened in the application or database and caused everything to reconnect. It often takes up to 10 minutes for those old connections to go away.

check_election
This uses the status file options we will cover in a minute. It checks whether the primary from the last check differs from the primary found now. If so, it alerts. This check should only have a threshold of one before it alarms (as an alert means an HA event occurred).

check_lock_pct
MMAP only. This engine has a write lock on the whole collection or database, depending on the version. This is a crucial metric to determine if MMAP writes are blocking reads, meaning you need to scale the DB layer in some way.

check_repl_lag
Checks the replication stream to understand how far a given node lags behind the primary. To accomplish this, it uses a fake record in the test DB to cause a write. Without this, a read-only system would look artificially lagged, as no new oplog entries get created.

check_flushing
A common issue with MongoDB is very long flush times, causing a system halt. This is caused by your disk subsystem not keeping up, and the DB then having to wait on flushing to make sure writes get correctly journaled.

check_total_indexes
The more indexes you have, the more the planner has to work to determine which index is a good fit. A large number of indexes also increases the risk that recovery from a failure will take a long time, because of the way a restore builds indexes and because MongoDB can only build one index at a time.

check_balance
While MongoDB should keep things in balance across a cluster, many things can happen: jumbo chunks, the balancer being disabled, the balancer constantly attempting to move the same chunk but failing, and even adding or removing shards. This alert is for these cases, as an imbalance means some records might get served faster than others. It is purely based on the chunk count, which is what the MongoDB balancer itself uses and which is not necessarily the same as disk usage.

check_queues
No matter what engine you have selected, a sustained backlog of reads or writes indicates your DB layer is unable to keep up with demand. It is important in these cases to send an alert if the backlog persists. You might notice this is also in our Prometheus exporter for graphing, as both trending and alerting are necessary to watch in a MongoDB system.

check_cannary_test
This is a typical query for the database and
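
To illustrate the delta-style logic that check_election relies on, here is a small Python sketch that compares the current primary against the one recorded by the previous run. The JSON status file layout, the function name, and the choice to report a change as CRITICAL are assumptions for this example only. The actual plugin's status file format and messages may differ.

```python
import json
import os


def check_election(current_primary, statusfile):
    """Compare the current primary with the one saved by the previous run.

    Returns a (nagios_exit_code, message) tuple. The status file format
    used here (a JSON object with a "primary" key) is hypothetical.
    """
    previous_primary = None
    if os.path.exists(statusfile):
        with open(statusfile) as fh:
            previous_primary = json.load(fh).get("primary")

    # Persist the current state so the next run has something to compare to.
    with open(statusfile, "w") as fh:
        json.dump({"primary": current_primary}, fh)

    if previous_primary is None:
        return 0, "OK - no previous state, recorded %s" % current_primary
    if previous_primary != current_primary:
        return 2, "CRITICAL - primary changed from %s to %s" % (
            previous_primary, current_primary)
    return 0, "OK - primary is still %s" % current_primary
```

Because the new primary is written back on every run, the alert fires once per election and clears on the next check, which matches the idea of a threshold of one.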
