Oct 4th, 2019 - written by Kimserey
Splunk is a log aggregator, in the same vein as Elasticsearch with Kibana. When I started using Splunk I immediately saw its capabilities, but my usage was largely limited by my own knowledge of writing queries (which is still very low). Every now and then I would find myself needing to compose the same query I had written the week before, but having forgotten how to. So today we'll explore some nice Splunk functionalities.
The function I use the most is timechart. It provides a way to plot a time series where we can specify a span for the precision, an aggregation function for the events falling into each bucket, and a split clause to group events.
```
... | timechart span=5m p99(upstream_response_time)
```
This will get us the p99 of the upstream_response_time over 5-minute spans across all our events, useful for monitoring the overall latency of our service.
```
... | timechart span=5m p99(upstream_response_time) by host
```
Specifying a split clause by host will generate multiple time series, one per host, useful for monitoring the latency of specific instances and identifying potential issues specific to a particular host.
We can only specify a single split clause, but if we want to split on two fields, we can use eval, which creates a new property on the event that we can then use in our split clause.
```
...
| eval host_method=host+"@"+method
| timechart span=5m p99(upstream_response_time) by host_method
```
This will add a property host_method on each event, combining the host and the method, allowing a split on the combination.
Formatting the query over multiple lines is useful when we want to debug it, as we can comment out part of the query using the comment macro:
```
...
| eval host_method=host+"@"+method
  `comment("| timechart span=5m p99(upstream_response_time) by host_method")`
```
Eval can also be used to construct new properties using case:
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m count by stats_str
```
This will remove the events that match neither case (such as the 4xx status codes), tag the remaining events as OK or ERROR, then produce a timechart on them.
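As a hypothetical variant of my own (not from the original post), if we wanted to keep the 4xx responses instead of dropping them, case can tag them as well:

```
...
| eval stats_str=case(status like "2%", "OK", status like "4%", "WARN", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m count by stats_str
```

This would give us a third WARN series alongside OK and ERROR.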
Splunk limits the number of split values and puts the rest into an OTHER bucket. We can lift that limit by specifying limit=0:
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m limit=0 count by stats_str
```
The other aspect of timechart is that it produces a table of split values, indexed by the time. For example when we split by stats_str, we get a table with the first column as the time and the remaining columns as the split values, OK and ERROR. Knowing that, we can compute the overall availability of our service by using those columns in an eval:
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| timechart span=5m limit=0 count by stats_str
| eval success_rate = round((OK / (OK + ERROR)) * 100, 2)
| fields - ERROR OK
```
Once we generate the table with timechart, we use eval to compute the success rate, then fields - [fields] to remove the OK and ERROR columns from the table, leaving only the success rate, which we can then visualize directly.
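One caveat worth noting (my own addition, not from the original post): a bucket with no events at all has OK + ERROR equal to zero, so the division yields no result for that row. Appending a filldown carries the last known rate forward over such gaps:

```
...
| timechart span=5m limit=0 count by stats_str
| eval success_rate = round((OK / (OK + ERROR)) * 100, 2)
| fields - ERROR OK
| filldown success_rate
```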
Another useful functionality is filling empty values. fillnull can be used to fill missing values with a default; for example if values were missing in a bucket, we could use:
```
...
| timechart span=1m p99(upstream_response_time) as p99
| fillnull value=1000 p99
```
This will fill the null values with 1000. Alternatively we can use filldown, which reuses the previous value for the missing values:
```
...
| timechart span=1m p99(upstream_response_time) as p99
| filldown
```
Timechart can be seen as a shortcut for generating charts indexed by the time. Chart can be used to create charts where the row index isn't necessarily the time.
Just to understand how chart works, we will recreate the timechart from earlier. Chart allows us to construct a table indexed by the first property provided after the BY keyword:
```
[ BY <row-split> <column-split> ]
```
This means that the first property given will be the row split and the next one will be the column split.
Having that, we can combine it with bin, which gives us the possibility of replacing the _time value of each event with its bucket:
```
| bin _time span=10m
```
This will replace the _time property of each event by its respective bin with a span of 10 minutes; for example an event with a time of 8:23:24.227 AM will be changed to 8:20:00.000 AM, effectively making all events fit into bins.
We can then use chart to split the rows by the bins and specify the column split as the stats_str we created earlier:
```
...
| eval stats_str=case(status like "2%", "OK", status like "5%", "ERROR")
| search stats_str!=""
| bin _time span=10m
| chart count by _time stats_str
```
We end up with a table indexed by the 10-minute buckets, with one column per stats_str value.
This is essentially the same as:
```
...
| timechart span=10m count by stats_str
```
Another useful functionality is table, which allows us to display a table with selected fields:
```
...
| table _time, status, upstream_response_time
```
Although quite limited, table is very useful for displaying data in a readable way in a dashboard, removing all the noise from the events.
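As a small example of my own (not from the original post), table combines nicely with sort and head to show, say, the ten slowest requests:

```
...
| table _time, status, upstream_response_time
| sort -upstream_response_time
| head 10
```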
stats is used to group events and aggregate them. Using by, we can group the aggregation by specific fields; it also accepts multiple fields to group by, separated by commas.
```
...
| stats count, p99(upstream_response_time) as p99 by status, host, request
```
In comparison to chart, stats will use the aggregation functions as columns and index the rows by the split fields, ending up with one row per combination of status, host and request.
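A variation I find handy (my own addition, not from the original post) is sorting the stats output to surface the slowest combinations first:

```
...
| stats count, p99(upstream_response_time) as p99 by status, host, request
| sort -p99
```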
Today we looked at different Splunk displays. We started with timechart, exploring the different possibilities when combined with eval and search. We then moved on to chart and saw how we could replicate timechart with bin. We completed this post by looking into stats, which provides a way to apply aggregation functions on top of groupings of events. I hope you liked this post and I'll see you in the next one!