Posted by: ainmosni
On 08/02/2012 19:08
Categories: System Engineering

When I started at my current job, I was quite overwhelmed by the number of applications we had running and how they were all related in one way or another: application A depended on information from application B, which in turn needed application C to make sure that application D wasn't sending the wrong information to application E, and so on. Of course, this made the stack very hard to debug, as an error in application A could be caused by a fault in application B, C, D, E, any of their dependencies, or even a combination of all of these.

As you can imagine, all these applications generated an enormous amount of log data that was spread all over the place. Debugging any problem became as much a scavenger hunt as it was problem solving: you had to find all the relevant logs, find the time of the problem and correlate all that data manually. This was tedious, it made any root cause analysis very time consuming, and it left you with little motivation to embark on the mission in the first place.

Being very lazy when it comes to tedious tasks, and inventive when it comes to creating tools that can alleviate or even eliminate them, I dived right in and wrote a suite that parsed the logs and indexed them in a relational database so that I could quickly drill down into multiple logs at the same time. To make the data even more visible, I added optional graphing of customisable data and alerting on specific patterns. I was quite pleased with it, and I presented it to my manager.

My manager was quite impressed with what he saw, but he also informed me that another part of the company had been using Splunk to do something very similar to my suite, and that it did a lot more than my homegrown creation. That Splunk did more than my project didn't surprise me, as I had heard of it long before I started working here, but I was convinced that, even though I would probably never reach feature parity with Splunk, the money saved on licensing would justify continued development of my tooling. As the investment in Splunk had already been made, however, I was sent to implement it in my part of the company.

To be honest, I wasn't that impressed with Splunk at first: it was weird, alien and didn't seem to deliver functionality worthy of its price tag... at least, that's what I thought until I got a proper demonstration. Splunk didn't just act like a centralised web interface to "grep"; you could add a lot of intelligence in a flexible way, create indexes on the fly, link those indexes to corresponding fields in other logs, and then search for all logs that were related in a way you didn't even realise they were related five minutes ago.

The flexibility of Splunk might be the main thing that impressed me, but it didn't stop there. The search syntax is a cross between the "Google-like" syntax most of us are quite familiar with and the shell "pipe" syntax that UNIX administrators should feel more than at home with. Furthermore, it was a snap to turn searches into graphs and include those graphs in a dashboard that you can give to managers and first-line support engineers so that they can see how well everything is running and/or selling. One last feature I was quite pleased with was that you can turn a search into a trigger that calls a script or sends an email; this is ideal for alerting and could replace many of the custom monitoring scripts that I've written over the years.
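To give a rough idea of what that pipe syntax feels like, here is a minimal sketch of a Splunk search; the sourcetype and field names are illustrative assumptions, not taken from our actual setup. It looks for server errors in the last hour, counts them per host and sorts the busiest hosts to the top:

    sourcetype=access_combined status>=500 earliest=-1h
    | stats count by host
    | sort -count

Each pipe feeds the results of the previous command into the next, which is exactly the shell feel described above, and a search like this is also the kind of thing you can drop straight onto a dashboard or hang an alert off.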

So, after admitting that Splunk was a lot more powerful than what I had made myself, I started designing an implementation. For our intents and purposes, Splunk had two ways of indexing log data: sending syslog data to Splunk directly, or using a Splunk 'forwarder' which parsed the logs and sent them off to be indexed. Syslog seemed like the obvious choice, so we implemented that first, but we ran into a wall: Splunk's syslog input treated each line of a multiline log message as a separate event, which was unacceptable as it made searching through these (mostly critical error) messages next to impossible. After some research we found that it was possible to change Splunk's syslog input to accept these correctly, but that would make it useless for normal syslog entries.
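For what it's worth, Splunk's event breaking is tuned per sourcetype in props.conf; a minimal sketch of the kind of stanza involved might look like the following, where the sourcetype name and the timestamp regex are assumptions for illustration rather than our actual configuration:

    [app_multiline_log]
    SHOULD_LINEMERGE = true
    BREAK_ONLY_BEFORE = ^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}

The catch is the trade-off described above: such settings apply to everything arriving on that input, so bending the syslog input to merge multiline messages would have mangled ordinary single-line syslog traffic.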

I proceeded to test the forwarder, which turned out to be rather light and correctly parsed any logs we threw at it, so we chose to go with the forwarder. That left one problem to tackle: it was less than optimal to run an extra service on every server that contained interesting data we wished to index, and it also forced us to write the logs to a local disk while reading from it at the same time. An ideal solution would let us send logs directly to Splunk, so the disk wasn't wasting IOPS on logs that nobody would ever read there anyway. These factors made us reconsider syslog; Splunk's own syslog receiver was still not acceptable, but other syslog daemons did everything we needed. We finally decided to have all our servers log to a central 'buffer' machine and have Splunk index from that machine with its nice little forwarder. This had the additional benefit that if Splunk had any issues, or if raw logs were needed, we could still access them on this machine.
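As a rough sketch of that final setup, assuming rsyslog as the syslog daemon (the post doesn't name one) and with hostnames, paths and the index name made up for illustration: each application server ships its syslog over TCP to the central buffer box, and the Splunk forwarder on that box simply monitors the directory the syslog daemon writes into.

    # on each application server (rsyslog): forward everything over TCP
    *.*  @@logbuffer.example.com:514

    # on the buffer machine, in the forwarder's inputs.conf
    [monitor:///var/log/remote]
    sourcetype = syslog
    index = main

The application servers never talk to Splunk directly, and the raw files on the buffer machine stay available for plain old grep if Splunk ever misbehaves.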

In the end, I'm converted and would recommend Splunk to any company that can afford it. Beware though: the pricing is not cheap, and it's up to you whether it's worth the price. For me it has been.

Tags: log splunk sysops

Who is this guy anyway?

I'm Daniël Franke, a Python developer/System engineer hybrid working at Booking.com. I've been coding and administrating servers for most of my life and this is the corner of the internet where I can just rant and/or rave about anything I want without anyone stopping me. This blog will be mostly tech centred but it can go off-topic sometimes.
