This week, developer tools company GitLab had a serious database outage.
The short story:
- An admin was trying to set up replication
- The site had an unusual load spike, causing replication to get behind
- While struggling with troubleshooting, the admin made a lot of changes
- After hours of work, the admin accidentally deleted the production database directory

You can read more about the details in GitLab’s outage timeline doc, which they heroically shared while they worked on the outage. Oh, and they streamed the whole thing live on YouTube with over 5,000 viewers.
There are so many amazing lessons to learn from this outage: transparency, accountability, processes, checklists, you name it. I’m not sure that you, dear reader, can actually put a lot of those lessons to use, though. After all, your company probably isn’t going to let you live stream your outages. (I do pledge to you that I’m gonna do my damnedest to do that with our own services, though.)
I want you to zoom in on one particular part: the backups.

After the above stuff happened, it was time to recover from backups. In the outage timeline doc, scroll down to the Problems Encountered section, and you’ll see 7 bullet points. GitLab used PostgreSQL, Linux, and Azure VMs, but I’m going to translate these into a language that you, the SQL Server user, can understand.

My new transparency heroes, up there with Stack Overflow
Their 7 layers of protection were:
- LVM snapshots taken every 24 hours
- Regular backups every 24 hours
- Disk snapshots in Azure
- Synchronization to staging
- Replication
- Backups to Amazon S3
- Backup failure alerts

Let’s turn this into SQL Server on Windows language.
1. OS volume snapshots

In Windows, these are VSS (shadow copy) snaps. They freeze SQL Server’s writes for a matter of seconds to get a consistent picture of the MDF/NDF/LDF files of all of your databases on that volume. (These are not database snapshots, which are also useful in some cases, but unrelated.)
VSS is a building block, and you don’t hear Windows admins using the term VSS by itself without also referring to a third-party backup product. These are usually the products you despise, like NetBackup, which use VSS to substitute for full backups. Depending on your vendor, you may or may not be able to apply additional point-in-time transaction log backups to them. If the product doesn’t have that capability, it usually resorts to doing VSS snaps every X minutes, so it looks like you have a full backup every X minutes that you can restore to, but no other point in time.
Because of that, they’re usually a last resort for SQL Server users where point-in-time recovery is required. (However, they’re better than nothing.)
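To make the contrast concrete, here’s a minimal sketch of the point-in-time restore that native log backups buy you. The database name, file paths, and timestamp below are invented purely for illustration.

```sql
-- Hypothetical point-in-time restore: lay down the full backup, roll the
-- logs forward, and stop just before the bad moment. Names/paths are made up.
RESTORE DATABASE GitLabDemo
    FROM DISK = N'D:\Backups\GitLabDemo_Full.bak'
    WITH NORECOVERY, REPLACE;

RESTORE LOG GitLabDemo
    FROM DISK = N'D:\Backups\GitLabDemo_Log_2300.trn'
    WITH NORECOVERY;

-- Stop right before the accidental delete hit:
RESTORE LOG GitLabDemo
    FROM DISK = N'D:\Backups\GitLabDemo_Log_2315.trn'
    WITH STOPAT = '2017-02-01 23:07:00', RECOVERY;
```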
2. Regular database backups

You’re familiar with native BACKUP DATABASE commands in SQL Server, and you probably know the difference between:
- Full backups, typically done daily or weekly
- Log backups, done every X minutes, and you should probably be doing them more often

These are usually a DBA’s first choice for recovery. However, you’re only as good as your last restore. (In GitLab’s case, their backups were failing silently.)
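If it’s been a while, here’s roughly what those two commands look like. The database name and paths are placeholders, and the WITH options are just sensible defaults, not gospel.

```sql
-- Full backup, typically scheduled daily or weekly. Placeholder names/paths.
BACKUP DATABASE GitLabDemo
    TO DISK = N'D:\Backups\GitLabDemo_Full.bak'
    WITH COMPRESSION, CHECKSUM, INIT;   -- COMPRESSION requires an edition that supports it

-- Log backup, scheduled every X minutes (requires the FULL recovery model):
BACKUP LOG GitLabDemo
    TO DISK = N'D:\Backups\GitLabDemo_Log.trn'
    WITH COMPRESSION, CHECKSUM, INIT;

-- CHECKSUM catches some corruption at backup time, but it is no substitute
-- for actually restoring the file somewhere and checking it.
```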
I adore transaction log shipping because it’s essentially testing my log backups all the time. Log shipping is easy to set up, nearly bulletproof, and works with all versions/editions of SQL Server. Don’t think of it as just your disaster recovery: it’s also verifying that you’ve got good backup files.
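For example, on a log shipping secondary, msdb keeps the copy/restore history, so a rough health check (assuming the standard log shipping monitor tables) might look like this:

```sql
-- Run on the log shipping secondary: how stale is each shipped database?
-- Assumes the standard log shipping monitor tables in msdb.
SELECT  primary_server,
        primary_database,
        secondary_database,
        last_copied_date,
        last_restored_date,
        last_restored_latency      -- roughly, minutes behind the primary
FROM    msdb.dbo.log_shipping_monitor_secondary
ORDER BY last_restored_date;
```

If last_restored_date stops moving, your log backups (or the restores of them) are broken, and you find out long before you need them.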
3. Disk snapshots in Azure

On premises, this is the equivalent of a SAN snapshot or a VMware snapshot. Depending on the implementation, either the entire VM gets snapshotted, or just the data/log drives.
This is a great insurance policy, and I hear some advanced SQL Server DBAs saying they do this before they undertake something dangerous like a SQL Server version upgrade or a massive schema change. However, rollback is all-or-nothing: if you revert the snapshot, you’re going to lose everything since the snapshot. (That’s why it makes sense for scheduled outages involving big changes with no simultaneous end user access.)
The usual problem with relying on volume snapshots as part of your normal recovery routine (not manual snaps) is that they’re done outside of the SQL Server DBA’s territory. The SAN admin usually controls when they happen, and who has access to them. If you’re going to rely on volume snapshots as part of your backup plan, you have to test those snaps.
In a perfect world, you build automation so that your snapshots are immediately made available to a second SQL Server, which then performs a CHECKDB on that snap. However, that costs licensing money plus personnel time, so I rarely see it done. Folks just assume their snaps are okay, but the SAN error emails aren’t sent to the DBAs.
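The presentation/mount automation depends entirely on your SAN tooling, but the verification step itself is plain T-SQL once the snapped database is attached on the second server. The database name here is hypothetical:

```sql
-- After the snapshot volume is mounted on a second SQL Server and the
-- database is attached (name is hypothetical), check it for corruption:
DBCC CHECKDB (N'GitLabDemo_SnapCheck')
    WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Log the output somewhere durable, and make sure failures alert the DBAs,
-- not just the SAN admins.
```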
4. Synchronization to staging

In GitLab’s case, they were pulling parts of the data to another environment. (GitLab users: I’m going to take some liberties here with the description.)
In your case, think about a periodic ETL process that takes data from production and pushes it into staging tables in a data warehouse. If the poop really hit the fan hard, you might be able to recover some of your most critical data that way.
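As a rough sketch, and assuming a warehouse staging table that still holds a recent copy of a critical production table (every object name here is made up), the salvage query isn’t fancy:

```sql
-- Salvage sketch: pull a recent copy of a critical table back from the
-- warehouse staging area. All object names are made up for illustration,
-- and this assumes both databases are reachable from one server.
INSERT INTO Production.dbo.Customers (CustomerID, Name, Email)
SELECT  s.CustomerID, s.Name, s.Email
FROM    Warehouse.dbo.stg_Customers AS s
WHERE   NOT EXISTS (SELECT 1
                    FROM   Production.dbo.Customers AS c
                    WHERE  c.CustomerID = s.CustomerID);
-- If CustomerID is an identity column, you'd also need SET IDENTITY_INSERT ON.
```

It’s ugly, it’s partial, and it only covers the tables your ETL happens to touch, but when the primary recovery options are gone, partial beats nothing.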
The DBAs in the audience might immediately give that method the finger, but keep in mind that we’re in a new age of DevOps here. If everybody’s a developer, then you can do more creative code-based approaches to recovery.
5. ReplicationOutside of SQL Server, it’s common to see replication used as a high availability and disaster recovery technology. Other platforms just echo their delete/update/insert operations to other servers.
GitLab wrote that their “replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented” and I can certainly relate to that. I’ve been in replication environments like that. (I built one like that when I took the Microsoft Certified Master exam, ha ha ho ho.)
In the SQL Server world, I see a lot of replication setups like that, so the thought of using replication for HA/DR usually prompts reactions of horror. It’s just a cultural thing: we’re more accustomed to using either direct copies of transaction log data, or direct copies of the data pages.
So when you read “replication” in GitLab’s post-mortem, think database mirroring or Always On Availability Groups. It’s not the same; it’s just what we would culturally use.
Just like your backups can fail, your AGs can fail. Replication breaks, service packs have surprises, all kinds of nastiness. Data replication in any form doesn’t make your job easier; it makes your job harder, and you have to staff up for it.
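So monitor them like you monitor backups. A starting-point health check, using the standard availability group DMVs (run it on any replica):

```sql
-- Quick look at AG database health, using the standard availability group DMVs.
SELECT  ag.name                         AS ag_name,
        ar.replica_server_name,
        DB_NAME(drs.database_id)        AS database_name,
        drs.synchronization_state_desc,
        drs.synchronization_health_desc,
        drs.log_send_queue_size,        -- KB of log not yet sent to the secondary
        drs.redo_queue_size             -- KB of log not yet redone on the secondary
FROM    sys.dm_hadr_database_replica_states AS drs
JOIN    sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id
JOIN    sys.availability_groups AS ag
        ON ag.group_id = drs.group_id
ORDER BY ag.name, ar.replica_server_name, database_name;
```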
6. Backups to Amazon S3

I’m a huge fan of cross-cloud backups because