syslog parsing in rsyslog
Written by Rainer Gerhards
(2008-09-23)
We regularly receive messages asking why
rsyslog parses this or that message
incorrectly. Of course, it turns out that rsyslog does the right
thing, but the message sender does not. And also of course, this is not
even of the slightest help to the end user experiencing the problem ;).
So I thought I write this paper. It describes the problem source and
shows potential solutions (aha!).
Syslog Standardization
The syslog protocol has not been standardized until relatively
recently.The first document “smelling” a bit like a standard is
RFC 3164, which dates back to August
2001. The problem is that this document is no real standard. It has
assigned “informational” status by the IETF
which means it provides some hopefully useful information but does not
demand anything. It is impossible to “comply” to an informational
document. This, of course, doesn’t stop marketing guys from telling they
comply to RFC3164 and it also does not stop some techs to tell you “this
and that does not comply to RFC3164, so it is <anybody else but them>’s
fault”.
Then, there is RFC 3195, which
is a real standard. In it’s section 3 it makes (a somewhat questionable)
reference to (informational) RFC 3164 which may be interpreted in a way
that RFC3195 standardizes the format layed out in RFC 3164 by virtue of
referencing them. So RFC3195 seems to extend its standardization domain
to the concepts layed out in RFC 3164 (which is why I tend to find that
refrence questionable). In that sense, RFC3195 standardizes the format
informationally described in RFC3164, Section 4. But it demands it only
for the scope of RFC3195, which is syslog over BEEP - and NOT syslog
over UDP. So one may argue whether or not the RFC3164 format could be
considered a standard for any non-BEEP (including UDP) syslog, too. In
the strict view I tend to have, it does not. Refering to the RFC3195
context usually does not help, because there are virtually no RFC3195
implementations available (at this time, I would consider this RFC a
failure).
Now let’s for a short moment assume that RFC3195 would somehow be able
to demand RFC3164 format for non-BEEP syslog. So we could use RFC3164
format as a standard. But does that really help? Let’s cite RFC 3164,
right at the begining of section 4 (actually, this is the first
sentence):
The payload of any IP packet that has a UDP destination port of 514
MUST be treated as a syslog message.
Think a bit about it: this means that whatever is send to port 514 must be considered
a valid syslog message. No format at all is demanded. So if "this is junk" is sent to
UDP port 514 - voila, we have a valid message (interestingly, it is no longer a syslog
message if it is sent to port 515 ;)). You may now argue that I am overdoing. So let's
cite RFC 3164, Section 5.4, Example 2:
Example 2
Use the BFG!
While this is a valid message, it has extraordinarily little useful
information.
As you can see, RFC3164 explicitely states that no format at all is
required.
Now a side-note is due: all of this does not mean that the RFC3164
authors did not know what they were doing. No, right the contrary is
true: RFC3164 mission is to describe what has been seen in practice as
syslog messages and the conclusion is quite right that there is no
common understanding on the message format. This is also the reason why
RFC3164 is an informational document: it provides useful information,
but does not precisely specify anything.
After all of this bashing, I now have to admit that RFC3164 has some
format recommendations layed out in section 4. The format described has
quite some value in it and implementors recently try to follow it. This
format is usually meant when someone tells you that a software is
“RFC3164 compliant” or expects “RFC3164 compliant messages”. I also have
to admit that rsyslog also uses this format and, in the sense outlined
here, expects messages received to be “RFC3164 compliant” (knowingly
that such a beast does not exist - I am simply lying here ;)).
Please note that there is some relief of the situation in reach. There
is a new normative syslog RFC series upcoming, and it specifies a
standard message format. At the time of this writing, the main documents
are sitting in the RFC editor queue waiting for a transport mapping to
be completed. I personally expect them to be assigned RFC numbers in
2009.
Update: the numbers are now assigned and the base RFC is RFC 5424.
Practical Format Requirements
From a practical point of view, the message format expected (and
generated by default in legacy mode) is:
<PRI>TIMESTAMP SP HOST SP TAG MSG(Freetext)
SP is the ASCII “space” character and the definition of the rest of the
fields can be taken from RFC3164. Please note that there also is a lot
of confusion on what syntax and semantics the TAG actually has. This
format is called “legacy syslog” because it is not well specified (as
you know by now) and has been “inherited from the real world”.
Rsyslog offers two parsers: one for the upcoming RFC series and one for
legacy format. We concentrate on the later. That parser applies some
logic to detect missing hostnames, is able to handle various ways the
TIMESTAMP is typically malformed. In short it applies a lot of guesswork
in trying to figure out what a message really means. I am sure the
guessing algorithm can be improved, and I am always trying that when I
see new malformed messages (and there is an ample set of them...).
However, this finds its limits where it is not possible to differentiate
between two entities which could be either. For example, look at this
message:
<144>Tue Sep 23 11:40:01 taghost sample message
Does it contain a hostname? Mabye. The value “taghost” is a valid
hostname. Of course, it is also a valid tag. If it is a hostname, the
tag’s value is “sample” and the msg value is “message”. Or is the
hostname missing, the tag is “taghost” and msg is “sample message”? As a
human, I tend to say the later interpretation is correct. But that’s
hard to tell the message parser (and, no, I do not intend to apply
artificial intelligence just to guess what the hostname value is...).
One approach is to configure the parser so that it never expects
hostnames. This becomes problematic if you receive messages from
multiple devices. Over time, I may implement parser conditionals, but
this is not yet available and I am not really sure if it is needed
comlexity...
Things like this, happen. Even more scary formats happen in practice.
Even from mainstream vendors. For example, I was just asked about this
message (which, btw, finally made me write this article here):
"<130> [ERROR] iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
If you compare it with the format RFC3164 “suggests”, you’ll quickly
notice that the message is “a bit” malformed. Actually, even my human
intelligence is not sufficient to guess if there is a TAG or not (is
“[ERROR]” a tag or part of the message). I may not be the smartest guy,
but don’t expect me to program a parser that is smarter than me.
To the best of my konwledge, these vendor’s device’s syslog format can
be configured, so it would proabably be a good idea to include a
(sufficiently well-formed) timestamp, the sending hostname and (maybe?)
a tag to make this message well parseable. I will also once again take
this sample and see if we can apply some guesswork. For example, “[” can
not be part of a well-formed TIMESTAMP, so logic can conclude there is
not TIMESTAMP. Also, “[” can not be used inside a valid hostname, so
logic can conclude that the message contains no hostname. Even if I
implement this logic (which I will probably do), this is a partial
solution: it is impossible to guess if there is a tag or not
(honestly!). And, even worse, it is a solution only for those set of
messages that can be handled by the logic described. Now consider this
hypothetical message:
"<130> [ERROR] host.example.net 2008-09-23 11-40-22 PST iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
Obviously, it requires additional guesswork. If we iterate over all the
cases, we can very quickly see that it is impossible to guess everything
correct. In the example above we can not even surely tell if PST should
be a timezone or some other message property.
A potential solution is to generate a parser-table based parser, but
this requires considerable effort and also has quite some runtime
overhead. I try to avoid this for now (but I may do it, especially if
someone sponsors this work ;)). Side-note: if you want to be a bit
scared about potential formats, you may want to have a look at my paper
“On the Nature of Syslog
Data”.
Work-Around
The number one work-around is to configure your devices so that they
emit (sufficiently) well-formed messages. You should by now know what
these look like.
If that cure is not available, there are some things you can do in
rsyslog to handle the situation. First of all, be sure to read about
rsyslog.conf format and the
property replacer specifically. You need to
understand that everything is configured in rsyslog. And that the
message is parsed into properties. There are also properties available
which do not stem back directly to parsing. Most importantly, %fromhost%
property holds the name of the system rsyslog received the message from.
In non-relay cases, this can be used instead of hostname. In relay
cases, there is no cure other than to either fix the orginal sender or
at least one of the relays in front of the rsyslog instance in question.
Similarly, you can use %timegenerated% instead of %timereported%.
Timegenerated is the time the message hit rsyslog for the first time.
For non-relayed, locally connected peers, Timegenerated should be a very
close approximation of the actual time a message was formed at the
sender (depending, of course, on potential internal queueing inside the
sender). Also, you may use the %rawmsg% property together with the
several extraction modes the property replacer supports. Rawmsg contains
the message as it is received from the remote peer. In a sense, you can
implement a post-parser with this method.
To use these properties, you need to define your own templates and
assign them. Details can be found in the above-quoted documentation.
Just let’s do a quick example. Let’s say you have the horrible message
shown above and can not fix the sending device for some good reason. In
rsyslog.conf, you used to say:
Of course, things do not work out well with that ill-formed message. So
you decide to dump the rawmsg to the file and pull the remote host and
time of message generation from rsyslog’s internal properties (which,
btw, is clever, because otherwise there is no indication of these two
properties...). So you need to define a template for that and make sure
the template is used with your file logging action. This is how it may
look:
$template, MalfromedMsgFormater,"%timegenerated% %fromhost% %rawmsg:::drop-last-lf%\n"
*.* /var/log/somefile;MalformedMsgFormatter
This will make your log much nicer, but not look perfect. Experiment a
bit with the available properties and replacer extraction options to
fine-tune it to your needs.
The Ultimate Solution...
Is available with rsyslog 5.3.4 and above. Here, we can define so-called
custom parsers. These are plugin modules, written in C and adapted to a
specific message format need. The big plus of custom parsers is that
they offer excellent performance and unlimited possibilities - far
better than any work-around could do. Custom parsers can be bound to
specific rule sets (and thus listening)
ports with relative ease. The only con is that they must be written.
However, if you are lucky, a parser for your device may already exist.
If not, you can opt to write it yourself, what is not too hard if you
know some C. Alternatively, Adiscon can program one for you as part of
the rsyslog professional services
offering. In any case,
you should seriously consider custom parsers as an alternative if you
can not reconfigure your device to send decent message format.
Wrap-Up
Syslog message format is not sufficiently standardized. There exists a
weak “standard” format, which is used by a good number of
implementations. However, there exist many others, including mainstream
vendor implementations, which have a (sometimes horribly) different
format. Rsyslog tries to deal with anomalies but can not guess right in
all instances. If possible, the sender should be configured to submit
well-formed messages. If that is not possible, you can work around these
issues with rsyslog’s property replacer and template system. Or you can
use a suitable message parser or write one for your needs.
I hope this is a useful guide. You may also have a look at the rsyslog
troubleshooting guide for further help and places
where to ask questions.
Turning Lanes and Rsyslog Queues
If there is a single object absolutely vital to understanding the way
rsyslog works, this object is queues. Queues offer a variety of
services, including support for multithreading. While there is elaborate
in-depth documentation on the ins and outs of rsyslog
queues, some of the concepts are hard to grasp even for
experienced people. I think this is because rsyslog uses a very high
layer of abstraction which includes things that look quite unnatural,
like queues that do not actually queue...
With this document, I take a different approach: I will not describe
every specific detail of queue operation but hope to be able to provide
the core idea of how queues are used in rsyslog by using an analogy. I
will compare the rsyslog data flow with real-life traffic flowing at an
intersection.
But first let’s set the stage for the rsyslog part. The graphic below
describes the data flow inside rsyslog:
Note that there is a video
tutorial available on the
data flow. It is not perfect, but may aid in understanding this picture.
For our needs, the important fact to know is that messages enter rsyslog
on “the left side” (for example, via UDP), are being preprocessed, put
into the so-called main queue, taken off that queue, be filtered and be
placed into one or several action queues (depending on filter results).
They leave rsyslog on “the right side” where output modules (like the
file or database writer) consume them.
So there are always two stages where a message (conceptually) is
queued - first in the main queue and later on in n action specific
queues (with n being the number of actions that the message in
question needs to be processed by, what is being decided by the “Filter
Engine”). As such, a message will be in at least two queues during its
lifetime (with the exception of messages being discarded by the queue
itself, but for the purpose of this document, we will ignore that
possibility).
Also, it is vitally important to understand that each action has a
queue sitting in front of it. If you have dug into the details of
rsyslog configuration, you have probably seen that a queue mode can be
set for each action. And the default queue mode is the so-called “direct
mode”, in which “the queue does not actually enqueue data”. That sounds
silly, but is not. It is an important abstraction that helps keep the
code clean.
To understand this, we first need to look at who is the active
component. In our data flow, the active part always sits to the left of
the object. For example, the “Preprocessor” is being called by the
inputs and calls itself into the main message queue. That is, the queue
receiver is called, it is passive. One might think that the “Parser &
Filter Engine” is an active component that actively pulls messages from
the queue. This is wrong! Actually, it is the queue that has a pool of
worker threads, and these workers pull data from the queue and then call
the passively waiting Parser and Filter Engine with those messages. So
the main message queue is the active part, the Parser and Filter Engine
is passive.
Let’s now try an analogy analogy for this part: Think about a TV show.
The show is produced in some TV studio, from there sent (actively) to a
radio tower. The radio tower passively receives from the studio and then
actively sends out a signal, which is passively received by your TV set.
In our simplified view, we have the following picture:
The lower part of the picture lists the equivalent rsyslog entities, in
an abstracted way. Every queue has a producer (in the above sample the
input) and a consumer (in the above sample the Parser and Filter
Engine). Their active and passive functions are equivalent to the TV
entities that are listed on top of the rsyslog entity. For example, a
rsyslog consumer can never actively initiate reception of a message in
the same way a TV set cannot actively “initiate” a TV show - both can
only “handle” (display or process) what is sent to them.
Now let’s look at the action queues: here, the active part, the
producer, is the Parser and Filter Engine. The passive part is the
Action Processor. The later does any processing that is necessary to
call the output plugin, in particular it processes the template to
create the plugin calling parameters (either a string or vector of
arguments). From the action queue’s point of view, Action Processor and
Output form a single entity. Again, the TV set analogy holds. The Output
does not actively ask the queue for data, but rather passively waits
until the queue itself pushes some data to it.
Armed with this knowledge, we can now look at the way action queue modes
work. My analogy here is a junction, as shown below (note that the
colors in the pictures below are not related to the colors in the
pictures above!):
This is a very simple real-life traffic case: one road joins another. We
look at traffic on the straight road, here shown by blue and green
arrows. Traffic in the opposing direction is shown in blue. Traffic
flows without any delays as long as nobody takes turns. To be more
precise, if the opposing traffic takes a (right) turn, traffic still
continues to flow without delay. However, if a car in the red traffic
flow intends to do a (left, then) turn, the situation changes:
The turning car is represented by the green arrow. It cannot turn unless
there is a gap in the “blue traffic stream”. And as this car blocks the
roadway, the remaining traffic (now shown in red, which should indicate
the block condition), must wait until the “green” car has made its turn.
So a queue will build up on that lane, waiting for the turn to be
completed. Note that in the examples below I do not care that much about
the properties of the opposing traffic. That is, because its structure
is not really important for what I intend to show. Think about the blue
arrow as being a traffic stream that most of the time blocks
left-turners, but from time to time has a gap that is sufficiently large
for a left-turn to complete.
Our road network designers know that this may be unfortunate, and for
more important roads and junctions, they came up with the concept of
turning lanes:
Now, the car taking the turn can wait in a special area, the turning
lane. As such, the “straight” traffic is no longer blocked and can flow
in parallel to the turning lane (indicated by a now-green-again arrow).
However, the turning lane offers only finite space. So if too many cars
intend to take a left turn, and there is no gap in the “blue” traffic,
we end up with this well-known situation:
The turning lane is now filled up, resulting in a tailback of cars
intending to left turn on the main driving lane. The end result is that
“straight” traffic is again being blocked, just as in our initial
problem case without the turning lane. In essence, the turning lane has
provided some relief, but only for a limited amount of cars. Street
system designers now try to weight cost vs. benefit and create (costly)
turning lanes that are sufficiently large to prevent traffic jams in
most, but not all cases.
Now let’s dig a bit into the mathematical properties of turning
lanes. We assume that cars all have the same length. So, units of
cars, the length is alsways one (which is nice, as we don’t need to care
about that factor any longer ;)). A turning lane has finite capacity of
n cars. As long as the number of cars wanting to take a turn is less
than or eqal to n, “straigth traffic” is not blocked (or the other way
round, traffic is blocked if at least n + 1 cars want to take a
turn!). We can now find an optimal value for n: it is a function of
the probability that a car wants to turn and the cost of the turning
lane (as well as the probability there is a gap in the “blue” traffic,
but we ignore this in our simple sample). If we start from some finite
upper bound of n, we can decrease n to a point where it reaches
zero. But let’s first look at n = 1, in which case exactly one car can
wait on the turning lane. More than one car, and the rest of the traffic
is blocked. Our everyday logic indicates that this is actually the
lowest boundary for n.
In an abstract view, however, n can be zero and that works nicely.
There still can be n cars at any given time on the turning lane, it
just happens that this means there can be no car at all on it. And, as
usual, if we have at least n + 1 cars wanting to turn, the main
traffic flow is blocked. True, but n + 1 = 0 + 1 = 1 so as soon as
there is any car wanting to take a turn, the main traffic flow is
blocked (remember, in all cases, I assume no sufficiently large gaps in
the opposing traffic).
This is the situation our everyday perception calls “road without
turning lane”. In my math model, it is a “road with turning lane of size
0”. The subtle difference is important: my math model guarantees that,
in an abstract sense, there always is a turning lane, it may just be too
short. But it exists, even though we don’t see it. And now I can claim
that even in my small home village, all roads have turning lanes, which
is rather impressive, isn’t it? ;)
And now we finally have arrived at rsyslog’s queues! Rsyslog action
queues exists for all actions just like all roads in my village have
turning lanes! And as in this real-life sample, it may be hard to see
the action queues for that reason. In rsyslog, the “direct” queue mode
is the equivalent to the 0-sized turning lane. And actions queues are
the equivalent to turning lanes in general, with our real-life n being
the maximum queue size. The main traffic line (which sometimes is
blocked) is the equivalent to the main message queue. And the periods
without gaps in the opposing traffic are equivalent to execution time of
an action. In a rough sketch, the rsyslog main and action queues look
like in the following picture.
We need to read this picture from right to left (otherwise I would need
to redo all the graphics ;)). In action 3, you see a 0-sized turning
lane, aka an action queue in “direct” mode. All other queues are run in
non-direct modes, but with different sizes greater than 0.
Let us first use our car analogy: Assume we are in a car on the main
lane that wants to take turn into the “action 4” road. We pass action 1,
where a number of cars wait in the turning lane and we pass action 2,
which has a slightly smaller, but still not filled up turning lane. So
we pass that without delay, too. Then we come to “action 3”, which has
no turning lane. Unfortunately, the car in front of us wants to turn
left into that road, so it blocks the main lane. So, this time we need
to wait. An observer standing on the sidewalk may see that while we need
to wait, there are still some cars in the “action 4” turning lane. As
such, even though no new cars can arrive on the main lane, cars still
turn into the “action 4” lane. In other words, an observer standing in
“action 4” road is unable to see that traffic on the main lane is
blocked.
Now on to rsyslog: Other than in the real-world traffic example,
messages in rsyslog can - at more or less the same time - “take turns”
into several roads at once. This is done by duplicating the message if
the road has a non-zero-sized “turning lane” - or in rsyslog terms a
queue that is running in any non-direct mode. If so, a deep copy of the
message object is made, that placed into the action queue and then the
initial message proceeds on the “main lane”. The action queue then
pushes the duplicates through action processing. This is also the reason
why a discard action inside a non-direct queue does not seem to have an
effect. Actually, it discards the copy that was just created, but the
original message object continues to flow.
In action 1, we have some entries in the action queue, as we have in
action 2 (where the queue is slightly shorter). As we have seen, new
messages pass action one and two almost instantaneously. However, when a
messages reaches action 3, its flow is blocked. Now, message processing
must wait for the action to complete. Processing flow in a direct mode
queue is something like a U-turn:
The message starts to execute the action and once this is done,
processing flow continues. In a real-life analogy, this may be the route
of a delivery man who needs to drop a parcel in a side street before he
continues driving on the main route. As a side-note, think of what
happens with the rest of the delivery route, at least for today, if the
delivery truck has a serious accident in the side street. The rest of
the parcels won’t be delivered today, will they? This is exactly how the
discard action works. It drops the message object inside the action and
thus the message will no longer be available for further delivery - but
as I said, only if the discard is done in a direct mode queue (I am
stressing this example because it often causes a lot of confusion).
Back to the overall scenario. We have seen that messages need to wait
for action 3 to complete. Does this necessarily mean that at the same
time no messages can be processed in action 4? Well, it depends. As in
the real-life scenario, action 4 will continue to receive traffic as
long as its action queue (“turn lane”) is not drained. In our drawing,
it is not. So action 4 will be executed while messages still wait for
action 3 to be completed.
Now look at the overall picture from a slightly different angle:
The number of all connected green and red arrows is four - one each for
action 1, 2 and 3 (this one is dotted as action 4 was a special case)
and one for the “main lane” as well as action 3 (this one contains the
sole red arrow). This number is the lower bound for the number of
threads in rsyslog’s output system (“right-hand part” of the main
message queue)! Each of the connected arrows is a continuous thread
and each “turn lane” is a place where processing is forked onto a new
thread. Also, note that in action 3 the processing is carried out on the
main thread, but not in the non-direct queue modes.
I have said this is “the lower bound for the number of threads...”. This
is with good reason: the main queue may have more than one worker thread
(individual action queues currently do not support this, but could do in
the future - there are good reasons for that, too but exploring why
would finally take us away from what we intend to see). Note that you
configure an upper bound for the number of main message queue worker
threads. The actual number varies depending on a lot of operational
variables, most importantly the number of messages inside the queue. The
number t_m of actually running threads is within the integer-interval
[0,confLimit] (with confLimit being the operator configured limit, which
defaults to 5). Output plugins may have more than one thread created by
themselves. It is quite unusual for an output plugin to create such
threads and so I assume we do not have any of these. Then, the overall
number of threads in rsyslog’s filtering and output system is t_total
= t_m + number of actions in non-direct modes. Add the number of
inputs configured to that and you have the total number of threads
running in rsyslog at a given time (assuming again that inputs utilize
only one thread per plugin, a not-so-safe assumption).
A quick side-note: I gave the lower bound for t_m as zero, which is
somewhat in contrast to what I wrote at the begin of the last paragraph.
Zero is actually correct, because rsyslog stops all worker threads when
there is no work to do. This is also true for the action queues. So the
ultimate lower bound for a rsyslog output system without any work to
carry out actually is zero. But this bound will never be reached when
there is continuous flow of activity. And, if you are curios: if the
number of workers is zero, the worker wakeup process is actually handled
within the threading context of the “left-hand-side” (or producer) of
the queue. After being started, the worker begins to play the active
queue component again. All of this, of course, can be overridden with
configuration directives.
When looking at the threading model, one can simply add n lanes to the
main lane but otherwise retain the traffic analogy. This is a very good
description of the actual process (think what this means to the “turning
lanes”; hint: there still is only one per action!).
Let’s try to do a warp-up: I have hopefully been able to show that
in rsyslog, an action queue “sits in front of” each output plugin.
Messages are received and flow, from input to output, over various
stages and two level of queues to the outputs. Actions queues are always
present, but may not easily be visible when in direct mode (where no
actual queuing takes place). The “road junction with turning lane”
analogy well describes the way - and intent - of the various queue
levels in rsyslog.
On the output side, the queue is the active component, not the
consumer. As such, the consumer cannot ask the queue for anything (like
n number of messages) but rather is activated by the queue itself. As
such, a queue somewhat resembles a “living thing” whereas the outputs
are just tools that this “living thing” uses.
Note that I left out a couple of subtleties, especially when it
comes to error handling and terminating a queue (you hopefully have now
at least a rough idea why I say “terminating a queue” and not
“terminating an action” - who is the “living thing”?). An action
returns a status to the queue, but it is the queue that ultimately
decides which messages can finally be considered processed and which
not. Please note that the queue may even cancel an output right in the
middle of its action. This happens, if configured, if an output needs
more than a configured maximum processing time and is a guard condition
to prevent slow outputs from deferring a rsyslog restart for too long.
Especially in this case re-queuing and cleanup is not trivial. Also,
note that I did not discuss disk-assisted queue modes. The basic rules
apply, but there are some additional constraints, especially in regard
to the threading model. Transitioning between actual disk-assisted mode
and pure-in-memory-mode (which is done automatically when needed) is
also far from trivial and a real joy for an implementer to work on ;).
If you have not done so before, it may be worth reading the rsyslog
queue user’s guide, which most importantly lists all the
knobs you can turn to tweak queue operation.