Advanced subsearches and transactions in Splunk: Tracing qmail deliveries
Yesterday, a customer called and asked if we could trace the delivery of a particular e-mail to one of their clients. We're using Splunk to collect all of our logs, including recent email metadata. So this question should be easy to answer... you might think.Brace yourself for Splunk subsearches and transactions.
In this post
The email in question was part of a larger mail processing job, and we're using qmail to process these mails. Yes, qmail – it works great when it comes to doing high-volume, outbound-only deliveries in short time.
The problem challenge is that qmail has an interesting way of logging in the current
log, which looks like this:
@400000005aa66b052a527324 new msg 33778541@400000005aa66b052a527edc info msg 33778541: bytes 7703 from <sender@host.tld> qp 21534 uid 64011@400000005aa66b052a840e5c starting delivery 7512293: msg 33778541 to remote recipient@host.tld@400000005aa66b060a418aac delivery 7512293: success: 176.34.178.125_accepted_message./Remote_host_said:_250_OK_id=1evM4J-0005W8-QC/@400000005aa66b060a419a4c end msg 33778541
I am not talking about the funny-looking tai64 timestamps, but rather the message and delivery ids. There are several problems with this:
- The message id is based on the Linux filesystem inode id for the mail file sitting in the queue. While being unique at a given time, multiple different mails will use the same message id over time.
- The delivery id is just a counter that increments with every message processed. It will start from scratch if you restart qmail, and so again, this id is not unique over a longer time.
- While you get the information that a particular delivery has been started for a given message id, all further information regarding the progress of this delivery is logged only with the delivery id, but does not show the message id again. That's probably due to the way the qmail architecture uses different processes for isolated tasks.
In order to get a comprehensive Splunk report for a given email address and to make it run in acceptable time, I had to learn about Splunk subsearches and transaction grouping.
Use a subsearch to narrow down relevant events
First, lets start with a simple Splunk search for the recipient address.
index=mail sourcetype=qmail_current recipient@host.tld
In particular, this will find the starting delivery
events for this address, like the third log line shown above. Having done our homework, Splunk extractions are set up in a way that we get the qmail_msg
and qmail_delivery
fields for this event.
Now, in order to get a complete report including delivery progress, we need to consider all log events that include either the appropriate message id or delivery id.
With a default Splunk subsearch, the outer search will get all events where every field returned from the subsearch matches.
This works because Splunk applies the format
command implicitly on subsearches. Try this:
index=mail sourcetype=qmail_current recipient@host.tld | fields qmail_msg qmail_delivery | format
This will return a single event with a field named search
and a value like
( ( qmail_delivery="8227046" AND qmail_msg="33565415" ) OR ( qmail_delivery="7947353" AND qmail_msg="33719121" ) OR ...)
Splunk will first execute the subsearch. Then, the value from this search
field is taken as a replacement for the subsearch part of the query. Finally, the resulting query is executed.
You can, in fact, put the format command in your subsearch yourself and use parameters to modify the resulting string. Let's do this and directly combine it with a subsearch:
index=mail sourcetype=qmail_current [ search index=mail sourcetype=qmail_current recipient@host.tld | fields qmail_msg qmail_delivery | format "" "" "OR" "" "OR" ""]
This search fetches all log events that either have a message id or a delivery id for any message or delivery ids that appears in context with the recipient address.
Group events as transactions
You can then use the transaction command to group events. As you can see from the log excerpt above, qmail transactions start with new msg
and end with end msg
. We need to pick the message id from the start event and also include all events that have this message id and occur before the end event.
Additionally, the delivery id makes up the transaction. The transaction command is smart enough to pick up any delivery id that appears together with the message id we're following, and it can use this delivery id to include further events, even if they lack the message id. The transaction documentation has an example for this.
Adding a maximum duration between single transaction events for added performance, this gives us
| transaction qmail_msg qmail_delivery startswith="new msg" endswith="end msg" maxpause=1h
Now the last step is to once again filter out transactions that do not contain our intended recipient. Those transactions may show up because, as I stated in the beginning, the ids used by qmail are not unique. So while the subsearch finds the right ids for our particular recipient, the outer search may produce too many intermediate results.
We remove those transactions by applying a final
| search recipient@host.tld
The result
Putting it all together, we get
index=mail sourcetype=qmail_current [ search index=mail sourcetype=qmail_current recipient@host.tld | fields qmail_msg qmail_delivery | format "" "" "OR" "" "OR" "" ] | transaction qmail_msg qmail_delivery startswith="new msg" endswith="end msg" maxpause=1h | search recipient@host.tld
Most Splunk queries like this seem to do magic. And once they work, you tend to forget how they work or why they probably produce correct results. In part, this is why I am writing this up as a note to my future self.
Also, because a query like this is hard to remember or to re-construct, I saved this search as a Splunk dashboard. That allows you to add a handy input box for the address et voilà, we have a simple-to-use interface for our next support call.