One of my favorite concepts when thinking about instrumenting a system to understand capacity is what I call “time utilization”. This is the question of, for a given thread in some system, what fraction of its time is spent doing what kind of work?
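To put that as a (deliberately simple) formula rather than leave it abstract, here is a one-line sketch; the function name is mine, not anything standardized:

```python
def time_utilization(seconds_in_phase, window_seconds):
    """Fraction of an observation window a thread spent in one kind of work."""
    return seconds_in_phase / window_seconds

# e.g. 45 seconds of database time in a 60-second window -> 0.75
print(time_utilization(45, 60))  # 0.75
```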
Let’s consider this notion via the concrete example of a job that consumes request logs off of a queue and updates hit counts in a database:
```python
while True:
    req = queue.pop()
    route = compute_route(req)
    db.execute('''
        UPDATE hitcount
           SET hits = hits + 1
         WHERE route = ?
    ''', (route,))
    req.finish()
```

How should we instrument this? How can we monitor how it is performing, understand why it’s getting slow if it ever does get too slow, and make an informed decision about how many such processes we need to run in order to comfortably handle our load?
We’re making several network calls (to our queue and to our database) and doing some local computation (compute_route). A good first instinct is to instrument the network calls, since those are likely the slow parts:
```python
# ...
    with statsd.timer('hitcount.update'):
        db.execute('''
            UPDATE hitcount
               SET hits = hits + 1
             WHERE route = ?
        ''', (route,))
    with statsd.timer('hitcount.finish'):
        req.finish()
# ...
```

With most statsd implementations, this will give you histogram statistics on the latencies of your database call; perhaps it will report the median, p90, p95, and p99.
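To make “histogram statistics” concrete, here is a rough sketch of the kind of summary a statsd server computes from the raw timer samples it receives in each flush window. The exact metric names and percentiles vary by implementation; this is illustrative, not any particular server’s code:

```python
import statistics

def summarize(latencies_ms):
    """Rough nearest-rank percentile summary of one flush window's timer samples."""
    samples = sorted(latencies_ms)

    def pct(p):
        idx = min(len(samples) - 1, int(p / 100.0 * len(samples)))
        return samples[idx]

    return {
        "median": statistics.median(samples),
        "p90": pct(90),
        "p95": pct(95),
        "p99": pct(99),
    }
```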
We can definitely infer some knowledge about this process’s behavior from those statistics; if the p50 on hitcount.update is too high, our database is likely overloaded and we need to do something about that. But how should we interpret a slow p95 or p99? This is a background asynchronous job, so we don’t much care about the occasional slow operation, as long as the job keeps up in general. And how do we tell if we’re running near capacity? If the job is making multiple calls, how do we aggregate the histograms into one useful summary?
We can get a much cleaner picture if we realize that we don’t care about the absolute time spent in each component so much as we care about the fraction of all time spent in each phase of the component’s lifecycle. And we can instrument that by just counting the total time spent in each phase, and then configuring our frontend to normalize to a per-second count. That would look something like this (I’m assuming dogstatsd here, and a tag-based scheme):
```python
import time
from contextlib import contextmanager

# Assumes a configured dogstatsd client is available as `statsd`.
@contextmanager
def op(what):
    start = time.time()
    try:
        yield
    finally:
        # Count the total wall-clock seconds spent in this phase, tagged by phase name.
        statsd.increment('hitcount.total_s',
                         value=(time.time() - start),
                         tags=["op:" + what])

while True:
    with op('receive'):
        req = queue.pop()
    with op('compute_route'):
        route = compute_route(req)
    with op('update'):
        db.execute('''
            UPDATE hitcount
               SET hits = hits + 1
             WHERE route = ?
        ''', (route,))
    with op('finish'):
        req.finish()
```

We can now tell our metrics system to construct a stacked graph of this metric, stacked by tags, and normalized to show count-per-second:
[Figure: stacked graph of hitcount.total_s per second, broken down by the op tag.]
The total across the stack shows you the number of workers, since each worker performs one second-per-second of work (above, I’m running with four workers, so we’re close, but we’re losing a bit of time somewhere).
We can also immediately see that we’re spending most of our time talking to the database: 2.7s per second, or (dividing by the 4 workers) about 68% of our time.
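The back-of-the-envelope arithmetic is worth spelling out. With hypothetical per-op readings like the ones below (numbers chosen only to roughly match the description above, not real data), the stack total tells you how many worker-seconds per second you’re consuming, and dividing any one op by the worker count gives its share of total capacity:

```python
# Hypothetical per-second readings from the stacked graph (not real data):
seconds_per_second = {
    "receive": 0.6,
    "compute_route": 0.3,
    "update": 2.7,   # the database call
    "finish": 0.3,
}
num_workers = 4  # each worker can do at most 1 second of work per second

total = sum(seconds_per_second.values())                   # ~3.9 s/s, a bit under 4
db_fraction = seconds_per_second["update"] / num_workers   # 2.7 / 4 -> ~68%
print(f"busy: {total:.1f}s/s of {num_workers}, db share: {db_fraction:.0%}")
```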
The above image shows a system running at capacity. If we’re not at capacity, the graph gives us an easy way to eyeball how much spare capacity we have:
[Figure: the same stacked graph for a system running below capacity; the total falls short of the worker count.]
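And if you want a number rather than an eyeball estimate, the same data gives it to you directly. Here’s a minimal sketch of that headroom calculation (the function and variable names are mine, not part of any library):

```python
def headroom(busy_seconds_per_second, num_workers):
    """Fraction of the fleet's capacity that is still idle."""
    utilization = busy_seconds_per_second / num_workers
    return 1.0 - utilization

# e.g. 2.4 s/s of work across 4 workers -> 60% utilized, 40% headroom
print(headroom(2.4, 4))  # 0.4
```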