Thu Nov 24 2016
At Made.com we're unabashed fans of the ELK stack, and I spend a decent amount of my time thinking about how we parse, ship, and store logs.
Currently, we use an ELK stack setup that looks like this:
Rsyslog receives logs from our Docker containers via the syslog Docker logging driver, and from the rest of the system via the journal. We tag the logs at this point (as application, http, or system logs) and normalise them all to the same JSON format.
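As a rough sketch, the receiving side of that rsyslog config looks something like this. The port, the tagging rule, and the field names are all illustrative, not our production values:

```
# Inputs: the journal for system logs, TCP syslog for the
# Docker syslog logging driver (the port is an assumption).
module(load="imjournal")
module(load="imtcp")
input(type="imtcp" port="10514")

# Tag logs by source; the programname test is illustrative.
if $programname startswith "docker" then
    set $!logtype = "application";
else
    set $!logtype = "system";

# Normalise everything to one JSON shape.
template(name="json_lines" type="list") {
    constant(value="{\"timestamp\":\"")
    property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")
    property(name="hostname" format="json")
    constant(value="\",\"tag\":\"")
    property(name="syslogtag" format="json")
    constant(value="\",\"message\":\"")
    property(name="msg" format="json")
    constant(value="\"}")
}
```

The `format="json"` option on each property handles escaping, so quotes and newlines in the message can't break the JSON document.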
We ship those JSON logs to an Elasticache Redis instance and consume them in Logstash. Finally, Logstash routes the logs to the correct Elasticsearch index based on their tags.
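On the consuming side, a minimal Logstash pipeline along these lines does the routing. The hosts, the Redis list key, and the index names are placeholders, and `logtype` stands in for whatever field carries the tag:

```
input {
  redis {
    host      => "our-elasticache-endpoint"   # placeholder
    data_type => "list"
    key       => "rsyslog"                    # assumed list key
    codec     => "json"
  }
}

output {
  # One index per tag; the index names are illustrative.
  if [logtype] == "application" {
    elasticsearch { hosts => ["es:9200"] index => "app-%{+YYYY.MM.dd}" }
  } else if [logtype] == "http" {
    elasticsearch { hosts => ["es:9200"] index => "http-%{+YYYY.MM.dd}" }
  } else {
    elasticsearch { hosts => ["es:9200"] index => "system-%{+YYYY.MM.dd}" }
  }
}
```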
This is more or less a best-practice setup for ELK, but Logstash is, honestly, my least favourite part of the stack. Recently I've had some conversations on the Rsyslog mailing list about whether we can replace Logstash entirely and just use Rsyslog.
Why might we want to do that?
To test whether that's feasible, I want to build a couple of prototype architectures. Firstly, I want to try our basic setup, but with Logstash replaced directly by Rsyslog.
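The piece that lets rsyslog stand in for Logstash here is the omelasticsearch output module. A sketch of what that might look like, assuming a template named json_lines that renders each event as a JSON document (an assumption, as are the server and index names):

```
module(load="omelasticsearch")

# Build a dated index name, Logstash-style.
template(name="es_index" type="string"
         string="logs-%$year%.%$month%.%$day%")

action(type="omelasticsearch"
       server="es-host"                  # placeholder
       serverport="9200"
       template="json_lines"             # assumed JSON template
       searchIndex="es_index"
       dynSearchIndex="on"               # treat searchIndex as a template
       bulkmode="on"                     # use the ES bulk API
       action.resumeretrycount="-1")     # retry indefinitely on failure
```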
Secondly, I'd like to try skipping Redis altogether and shipping directly to a central rsyslog server using RELP, the Reliable Event Logging Protocol. I suspect this will give better throughput and lower running costs, without any loss of reliability.
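The RELP leg of that would be omrelp on the shipping nodes and imrelp on the central server. A sketch, with placeholder host and port:

```
# On each shipping node: forward over RELP instead of
# pushing to Redis. RELP acknowledges each message, so
# nothing is lost if the central server goes away briefly.
module(load="omrelp")
action(type="omrelp" target="central-rsyslog" port="2514")

# On the central server: accept RELP connections.
module(load="imrelp")
input(type="imrelp" port="2514")
```

Pairing the omrelp action with a disk-assisted queue on the clients would cover longer central-server outages too.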
For all of this to really work well, I'm going to have to roll up my sleeves and write some code.
I'm going to open source all the prototypes, including Ansible playbooks and Dockerfiles, so that you can play along at home or in the cloud. If you've got any suggestions for other reference architectures or use cases you'd like to see from the REK stack, get in touch on the rsyslog mailing list [http://lists.adiscon.net/mailman/listinfo/rsyslog].