riemann
monitoring
elk
archive
by Bob
Fri Dec 30 2016
In a previous post I said I was interested in replacing Logstash with Rsyslog in our ELK stack. I've been working on one of the outstanding items from that project - the ability to send metrics to Riemann. The Made.com fork of Rsyslog now has an alpha-quality version of the Riemann output module. Although there's still some missing functionality for more advanced scenarios, we can already do interesting things with the module as it stands.
We can achieve this with the rsyslog impstats module and the riemann output module.
Firstly we need to configure rsyslog to record its statistics.
# We'll need to load the riemann module before we can use it
module(load="omriemann")
#impstats generates internal metrics for us on an interval
module(load="impstats"
# use a json format
format="cee"
# send metrics every 10 secs
interval="10"
# send a rate-of-change event, like a collectd counter.
resetCounters="on"
# pass these messages to the "stats" ruleset for processing
ruleset="stats")
module(load="mmjsonparse")
ruleset(name="stats") {
# parse the json message
action(name="parse-stats" type="mmjsonparse")
# pass it to riemann
action(name="riemann-output" type="omriemann"
# look for metrics in the root json object
subtree="!"
# and forward them to MYSERVER
server="MYSERVER")
}
Secondly, we need to add some config to Riemann.
; Listen for TCP connections
(let [host "0.0.0.0"]
(tcp-server {:host host}))
(streams
; ignore riemann's internal metrics
(where (not (tagged "riemann"))
; write all incoming events to stdout
prn
))
After we start both processes up, we should start seeing logs like this from Riemann.
{:service "parse-stats/processed", :state "ok", :metric 5 }
{:service "riemann-output/processed", :state "ok", :metric 5}
So how can we use this to trigger alerts? Firstly let's add some more config to rsyslog. We're going to listen to a TCP socket and we want to monitor the number of messages we receive.
module (load="imuxsock")
input (type="imuxsock" ruleset="sock" Socket="/run/rsyslog/imux.sock")
ruleset(name="sock") {
action(name="write-to-file" type="omfile" file="/var/log/tcp")
}
Now we can write some logs into rsyslog with the logger utility, and we should start seeing new metrics in Riemann.
{:service "imuxsock/submitted", :state "ok" :metric 2}
{:service "write-to-file/processed", :state "ok", :metric 2}
To raise an alert based on these metrics we're going to set some arbitrary thresholds, but we could do more sophisticated analysis to do anomaly detection. We'll set up the following rules:
We'll update our riemann config to look like this
(def email (mailer {:host "smtp.mydomain.com" :user "foo" :pass "bar" :from "riemann@mydomain.com"}))
(streams
(where (service "imuxsock/submitted")
; set up a moving time window over the last 5 mins
(moving-time-window 300; 300 secs == 5 mins
; we want the total number of messages received over the period
(smap folds/sum
; this helper maps threshold values to states
(pipe p (splitp > metric
10 (with :state "error" p) ; < 10 is an error
100 (with :state "ok" p) ; < 100 is ok
Integer/MAX_VALUE (with :state "error" p)); anything 100 to infinity is an error
; If the state changes (from, ok to error or the other direction)
(changed-state
; (email “sysops@mydomain.com”)
prn
))))))
After we update our config, and wait a few minutes, we should see an error logged by riemann.
{:service "imuxsock/submitted", :state "error", :metric 0}
We can test the resolution by running logger again, but this time using watch to run it every 5 seconds: watch -n5 logger "hello world". After another few minutes, riemann should log the ok event.
{:service "imuxsock/submitted", :state "ok", :metric 60}
Lastly, if we change our watch statement so that we call logger every second, we can trigger the other threshold.
{:service "imuxsock/submitted", :state "error", :metric 188}
We've only scratched the surface of Riemann's capabilities here, but we can already see how Rsyslog and Riemann can work together to give actionable alerts from our log data.