
Graylog Architecture

Load balancer: load balancer for log input (syslog, Kafka, GELF, ...)
Graylog: log receiver and processor + web interface
ElasticSearch: log storage
MongoDB: configuration, user accounts and session storage

Costs Planning

Hardware requirements

Graylog: 4 cores, 8 GB memory (4 GB heap)
ElasticSearch: 8 cores, 60 GB memory (30 GB heap)
MongoDB: 1 core, 2 GB memory (whatever comes cheap)

AWS bill

+ $ 1656 elasticsearch instances (r3.2xlarge)
+ $  108 EBS optimized option
+ $ 1320 12 TB SSD EBS log storage
+ $  171 graylog instances (c4.xlarge)
+ $  100 mongodb instances (t2.small :D)
===========
= $ 3355
x 1.1 premium support
===========
= $ 3690 per month on AWS

GCE bill

+ $  760 elasticsearch instances (n1-highmem-8)
+ $ 2040 12 TB SSD log storage
+ $  201 graylog instances (n1-standard-4)
+ $   68 mongodb instances (g1-small :D)
===========
= $ 3069 per month on GCE
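The bills above are easy to sanity-check. A quick sketch, using the article's own per-item estimates (USD/month):

```python
# Back-of-the-envelope check of the monthly bills quoted above.
# All figures are the article's estimates, not current cloud pricing.
aws = {
    "elasticsearch (r3.2xlarge)": 1656,
    "EBS optimized option": 108,
    "12 TB SSD EBS log storage": 1320,
    "graylog (c4.xlarge)": 171,
    "mongodb (t2.small)": 100,
}
gce = {
    "elasticsearch (n1-highmem-8)": 760,
    "12 TB SSD log storage": 2040,
    "graylog (n1-standard-4)": 201,
    "mongodb (g1-small)": 68,
}

aws_subtotal = sum(aws.values())          # 3355
aws_total = aws_subtotal * 11 // 10       # +10% premium support -> 3690
gce_total = sum(gce.values())             # 3069

print(aws_total, gce_total)
# The headline "9% cheaper" compares GCE to AWS before the support option:
print(f"{100 * (1 - gce_total / aws_subtotal):.0f}% cheaper on GCE")
```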
GCE is 9% cheaper in total. Admire how the bare elasticsearch instances are 54% cheaper on GCE (ignoring the EBS-optimized flag and support options).
The gap narrows because SSD volumes are more expensive on GCE than on AWS ($0.17/GB vs $0.11/GB). This setup is a huge consumer of disk space, and the higher disk pricing eats part of the savings on instances.
Note: The GCE volume may deliver 3 times the IOPS and throughput of its AWS counterpart. You get what you pay for.
Capacity Planning

Performance (approximate)

1600 log/s average, over the day
5000 log/s sustained, during active hours
20000 log/s burst rate

Storage (as measured in production)

138,906,326 logs per day (averaged over the last 7 days)
2200 GB used, for 9 days of data
1800 bytes/log on average

Our current logs require 250 GB of space per day. 12 TB will allow for 36 days of log history (at 75% disk usage).
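The storage numbers above are internally consistent, which is easy to verify with a few lines of arithmetic:

```python
# Sanity-check of the storage figures measured in production.
logs_per_day = 138_906_326            # averaged over the last 7 days
bytes_per_log = 1800                  # measured average log size

gb_per_day = logs_per_day * bytes_per_log / 1e9
print(round(gb_per_day))              # ~250 GB of logs per day

disk_gb = 12 * 1000                   # 12 TB of SSD
usable_gb = disk_gb * 0.75            # keep disks at 75% usage
print(round(usable_gb / gb_per_day))  # ~36 days of history
```

The independent measurement (2200 GB for 9 days, i.e. ~244 GB/day) agrees with the 250 GB/day figure.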
We want 30 days of searchable logs. Job done!
Competitors

ELK

Dunno, never seen it, never used it. Probably a lot of the same.
Splunk Licensing

The Splunk licence is based on the volume ingested, in GB/day. Experience has taught us that we usually get what we pay for, so we're happy to pay for great expensive tools (note: ain't saying Splunk is awesome, don't know, never used it). But in the case of Splunk vs ELK vs Graylog, it's hard to justify the enormous cost against two free tools which are seemingly okay.
We experienced a DoS one afternoon, a few weeks after our initial small setup: 8000 log/s for a few hours while we were planning for 800 log/s.
A few weeks later, the volume suddenly went up from 800 log/s to 4000 log/s again. This time because debug logs and PostgreSQL performance logs were both turned on in production. One team was tracking a Heisenbug while another team felt like doing some performance analysis. They didn't bother to synchronise.
These unexpected events made two things clear. First, Graylog proved to be reliable and scalable during trial by fire. Second, log volumes are unpredictable and highly variable. Volume-based licensing is a highway to hell; we are so glad we never had to put up with it.
Judging by the information on the Splunk website, the license for our current setup would be in the order of $160k a year. OMFG!
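As a sanity check on that figure, the implied unit price can be back-computed from our ingestion volume. Note that the $640 per GB/day is our own arithmetic from the article's rough $160k estimate, not Splunk's published price list:

```python
# Hypothetical back-computation of what the quoted license implies.
annual_license = 160_000   # USD/year, order-of-magnitude read of Splunk pricing
ingest_gb_day = 250        # GB/day, from the capacity planning section

print(annual_license / ingest_gb_day)  # 640.0 USD per GB/day of ingestion, per year
```

At that rate, every unplanned 4x volume spike like the ones above would be a licensing incident, not just an operational one.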
How about the cloud solutions? SumoLogic? Scalyr?

One word: No.
Two words: Strong No.
The amount of sensitive information and private user data available in logs makes them the ultimate candidate for not being outsourced, at all, ever.
No amount of marketing from SumoLogic is gonna change that.
Note: We may be legally forbidden to send our log data to a third party, though it would take a lawyer to confirm or deny that for sure.
Log management explained

Feel free to read "Graylog" as "<other solution>". They're all very similar, with most of the same pros and cons.
What Graylog is good at

- debugging & postmortem
- security and activity analysis
- regulations

Good: debugging & postmortem

Logs allow you to dive into what happened, millisecond by millisecond. It's the first and last resort when it comes to debugging issues in production.
That’s the main reason logs are critical in production. We NEED the logs to debug issues and keep the site running.
Good: activity analysis

Logs give an overview of the activity and the traffic. For instance: where are most frontend requests coming from? Who connected to SSH recently?
Good: regulations

When we gotta have searchable logs and it's not negotiable, we gotta have searchable logs and it's not negotiable. #auditing
What Graylog is bad at

- (non trivial) analytics
- graphing and dashboards
- metrics (à la Graphite)
- alerting

Bad: (non trivial) analytics

Facts:
1) ElasticSearch cannot do joins nor processing (à la MapReduce)
2) Log fields have weak typing
3) [Many] applications send erroneous or shitty data (e.g. nginx)

Everyone knows that an HTTP status code is an integer. Well, not for nginx. It can log an upstream_status_code of '200' or '' or '503, 503, 503'. Searching nginx logs is tricky and statistics fail with NaN errors (Not a Number).
Elasticsearch itself has weak typing. It tries to detect field types automatically with variable success (i.e. systematic failure when receiving ambiguous data, defaulting to string type).
The only workaround is to write field pre/post processors to sanitize inputs, but it's cumbersome when there are unlimited applications and fields, each requiring a unique correction.
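To make the problem concrete, here is a minimal sketch of what one such sanitizer looks like for the nginx case above. The function name and the "keep the last code" policy are our own illustrative choices, not a Graylog built-in:

```python
import re

# Hypothetical pre-processor: coerce nginx's upstream status field
# ('200', '', '503, 503, 503', '-') into a clean integer, or None.
def sanitize_status(raw):
    """Return the last status code nginx reports, or None if unparseable."""
    if raw is None:
        return None
    codes = re.findall(r"\d{3}", str(raw))
    # On upstream retries nginx logs every attempt; keep the final one.
    return int(codes[-1]) if codes else None

print(sanitize_status("200"))            # 200
print(sanitize_status("503, 503, 503"))  # 503
print(sanitize_status("-"))              # None
```

Now multiply this by every quirky field of every application sending logs, and the cumbersomeness becomes apparent.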
In the end, the poor input data can break simple searches, and the inability to do joins prevents running complex queries at all.
It would be possible to do analytics by sanitizing log data daily and saving the result to BigQuery/RedShift, but it's too much effort. We'd be better off with a dedicated analytics solution, with a good data pipeline (i.e. NOT syslog).
Lesson learnt: Graylog doesn’t replace a full fledged analytics service.
Bad: graphing and dashboards

Graylog doesn't support many kinds of graphs. It's either "how-many-logs-per-minute" or "see-most-common-values-of-that-field" in the past X minutes. (There will be more graphs as the product matures, hopefully.) We could make dashboards, but we're lacking interesting graphs to put in them.
Edit: Graylog v2 is out; it adds automatic geolocation of IP addresses and a map visualization widget.
Bad: metrics and alerting

Graylog is not meant to handle metrics: it doesn't gather them, and the graphing and dashboard capabilities are too limited to make anything useful even if metrics were present. The alerting capability is [almost] non-existent.

Lesson learnt: Graylog does NOT substitute for a monitoring system. It is not in competition with Datadog and StatsD.
Special configuration