Adventures in Performance Debugging

As we’ve built CockroachDB, correctness has been our primary concern. But as we’ve drawn closer to our beta launch, we’ve had to start paying significantly more attention to performance. The design of CockroachDB always kept performance and scalability in mind, but when you start measuring performance, there are inevitably surprises. This is the story of the detection, investigation, and fix of just one performance bug.

First, a little context about CockroachDB for those new to the project. CockroachDB is a distributed SQL database which employs RocksDB to store data locally at each node. A basic performance test is to write random data into a table and measure how many operations per second can be achieved. This is exactly what the block_writer example program does. It uses a simple table schema:

CREATE TABLE IF NOT EXISTS blocks (
block_id BIGINT NOT NULL,
writer_id STRING NOT NULL,
block_num BIGINT NOT NULL,
raw_bytes BYTES NOT NULL,
PRIMARY KEY (block_id, writer_id, block_num)
)

And then spawns a number of workers to insert data into the table:

INSERT INTO blocks (block_id, writer_id, block_num, raw_bytes)
VALUES ($1, $2, $3, $4)

The block_id is randomly chosen and writer_id is uniquely assigned to each worker. The block_num field is monotonically increasing and ensures that there will never be duplicate rows inserted into the table. The effect is that we’re inserting random rows into the table and never experiencing contention. What could go wrong?
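
For concreteness, here is a minimal sketch, in Go (the language block_writer is written in), of what one such worker loop might look like. This is not the actual block_writer source; the connection URL, the payload size, and names like writeBlocks are illustrative only.

package main

import (
	"database/sql"
	"log"
	"math/rand"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks the Postgres wire protocol
)

// writeBlocks sketches a single block_writer-style worker: it inserts rows
// with a random block_id, its own writer_id, and a monotonically increasing
// block_num, so no two workers ever write the same primary key.
func writeBlocks(db *sql.DB, writerID string) error {
	const stmt = `INSERT INTO blocks (block_id, writer_id, block_num, raw_bytes)
	              VALUES ($1, $2, $3, $4)`
	payload := make([]byte, 256) // illustrative payload size
	for blockNum := int64(0); ; blockNum++ {
		rand.Read(payload)
		if _, err := db.Exec(stmt, rand.Int63(), writerID, blockNum, payload); err != nil {
			return err
		}
	}
}

func main() {
	// Assumes a local test cluster; adjust the URL for your own setup.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// A real run spawns many such workers; one is enough to show the shape.
	log.Fatal(writeBlocks(db, "worker-0"))
}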

1. The Bug: Rapid performance deterioration

A few weeks ago my colleague Matt Tracy ran the block_writer and discovered rapid performance deterioration:

1s: 1944.7/sec
2s: 1067.3/sec
3s: 788.8/sec
4s: 632.8/sec
5s: 551.5/sec
…
1m0s: 105.2/sec

Oh my, that isn’t good. Performance starts out at a reasonable 2000 ops/sec, but quickly falls by a factor of 20x. Matt suspected there was some scalability limitation with tables. He noted that once a table fell into this bad performance regime it stayed there. But if he dropped the blocks table and created a new one, performance reset only to degrade rapidly again.

Like any good engineer, Matt turned to CPU profiling to try to determine what was going on. Was there some loop with horrible algorithmic complexity based on the table size? Unfortunately, the profiles didn’t reveal any culprits. Most of the CPU time was being spent inside RocksDB, both during the good performance regime and the bad performance regime. The builtin Go profiling tools are quite good, but they are unable to cross the cgo boundary (RocksDB is written in C++).
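
For readers unfamiliar with Go’s tooling, a CPU profile like the one Matt collected can be captured with the standard runtime/pprof package. The sketch below is generic rather than the exact instrumentation used in CockroachDB, and runWorkload is a placeholder for the code under test.

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile covering the duration of the workload. Samples taken
	// while execution is inside cgo (e.g. RocksDB's C++ code) are attributed to
	// the cgo call site; the profiler cannot see the C++ stack frames themselves.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runWorkload()
}

// runWorkload stands in for the block_writer-style load being profiled.
func runWorkload() {}

Inspecting the result with go tool pprof cpu.prof shows where the samples landed; in this case most of the time sat inside the cgo calls into RocksDB, with no further C++ breakdown available.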

2. Snowball tracing to the rescue

Matt was a bit stumped for how to proceed at this point. Conveniently, another engineer, Tobias Schottdorf, was experimenting with adding “snowball” tracing to SQL queries. Unlike sampling-based profilers which periodically stop a program and determine what code is running, a tracing system records implicit or explicit events associated with a specific request. The support Tobias was adding was a new EXPLAIN (TRACE) mode. After the addition of some more tracing events, here is what Matt saw:

EXPLAIN (TRACE) INSERT INTO
blocks (block_id, writer_id, block_num, raw_bytes)
VALUES (1, 100, 1, '')

  92.947µs |  9 | node | execute
   3.653µs | 10 | node | executing
   3.129µs | 11 | node | BeginTransaction
   2.088µs | 12 | node | Transaction
4.573606ms | 13 | node | Transaction was not present
   9.721µs | 14 | node | checksum
     417ns | 15 | node | Got put buffer
   7.501µs | 16 | node | mvccGetMetadata
2.847048ms | 17 | node | mvccGetMetadata
     660ns | 18 | node | getMetadata
  12.128µs | 19 | node | Put internal
     352ns | 20 | node | Wrote transaction
   2.207µs | 21 | node | command complete
   5.902µs | 22 | node | executing
  30.517µs | 23 | node | Got put buffer

I’ve edited the output for clarity; the two millisecond-scale lines (Transaction was not present and the second mvccGetMetadata) are the ones that shed light on the problem. It should be clear that you can’t achieve 2000 ops/sec, or 1 op every 0.5 ms, if part of each operation takes >7ms. It is interesting that this time is being consumed in writing the transaction record at the start of a transaction.

Matt continued to add more instrumentation until the problem was narrowed down to a single RocksDB operation. At this point Matt tagged me in since I’ve had the most experience with our usage of RocksDB. I came onto the field swinging and wrote a micro-benchmark that duplicated the behavior of BeginTransaction and utterly failed to find any performance degradation. Hmm, curious. I decided to verify I could reproduce the block_writer performance degradation (trust, but verify) and, of course, the problem reproduced immediately. I also verified that checking to see if the transaction record was present at the start of a transaction was the time-consuming operation.
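
A Go micro-benchmark of this kind is typically built on the standard testing package. The sketch below shows the general shape only: it is not the benchmark I actually wrote, a hypothetical in-memory engine type stands in for our RocksDB wrapper, and the BeginTransaction-style work is reduced to a get-then-put of the transaction record.

package storage

import (
	"fmt"
	"testing"
)

// engine is a stand-in for the RocksDB-backed storage engine exercised by the
// real benchmark; it exists only to keep this sketch self-contained.
type engine map[string][]byte

func (e engine) get(key string) ([]byte, bool) {
	v, ok := e[key]
	return v, ok
}

func (e engine) put(key string, val []byte) {
	e[key] = val
}

// BenchmarkBeginTransaction mimics the shape of BeginTransaction: look up the
// transaction record and write it if it is not already present.
func BenchmarkBeginTransaction(b *testing.B) {
	e := engine{}
	for i := 0; i < b.N; i++ {
		key := fmt.Sprintf("txn-%d", i)
		if _, ok := e.get(key); !ok {
			e.put(key, []byte("txn-record"))
		}
	}
}

Benchmarks like this are run with go test -bench. The RocksDB-backed version showed no per-operation slowdown as records accumulated, which is what made the degradation in the full system so puzzling.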

3. RocksDB, CockroachDB, and MVCC

Now to provide a bit of background on CockroachDB’s MVCC (multi-version concurrency control)…
