A small story.
Keyword monitoring of logs that are completed in one line, such as apache logs, can be achieved relatively easily. For example, the apache error log is complete with one line and is monitored by keywords such as [error] and [alert] [crit].
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
When you get a stack trace in Java, it will be multiple lines. This should be monitored with the keyword ERROR.
2017-2-01 13:33:21 [main] ERROR Unexpected Error
java.lang.NumberFormatException: For input string: "1.1"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.<init>(Unknown Source)
at ExceptionPrintDemo.formatInt(ExceptionPrintDemo.java:7)
at ExceptionPrintDemo.main(ExceptionPrintDemo.java:14)
However, when monitoring the log with the monitoring tool, only one line can be recognized, so the following stack trace cannot be notified by email. ** As a result of being notified by e-mail of the message "2017-2-01 13:33:21 [main] ERROR Unexpected Error", "Oh ..." and see the log every time on the production machine. Did you go? ** **
In addition to **, you may want to monitor not only the keywords on the first line but also the keywords on the second and subsequent lines. ** ** Also, ** on older systems, I sometimes see dates on the first line, error codes on the second line, and error messages on the third line **, which makes keyword monitoring cry.
20170201 11:12:34
ERR001
○○ The code is invalid.
20170201 11:12:34
ERR002
Unable to connect to database
This time I tried to solve that area using fluentd. The monitoring tool uses mackerel, but I think it is the same for other monitoring tools.
Use fluentd to convert to a single line of json and monitor.
curl -L https://toolbelt.treasuredata.com/sh/install-redhat-td-agent2.sh | sh
/etc/td-agent/td-agent.conf
<source>
type tail
path /var/log/app/app.log
pos_file /var/log/td-agent/app/app-log.pos
tag app.log
multiline_flush_interval 10s
format multiline
format_firstline /\d{4}-\d{1,2}-\d{1,2}/
format1 /^(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}) \[(?<thread>.*)\](?<level>[^\s]+)(?<message>.*)/
</source>
<match app.log>
type file
path /var/log/td-agent/app/app.java.log
time_slice_wait 10m
</match>
Write a regular expression that represents one json break in format_firstline.
In format1, time, thread, and message are parsed with regular expressions.
Since it is * (?
multiline_flush_interval 10s
Every time a log flows, fluentd runs and analyzes it, but initially it is buffered in memory.
Then, after the analysis of the next log, the buffered one will be flushed. Flush this parameter every 10 seconds even if it is not extruded. It becomes the parameter.
The analyzed log is formatted as follows.
/var/log/td-agent/app/app.java.log
2017-02-01T13:33:21+09:00 app.log {"thread":"main","level":"ERROR","message":" Unexpected Error\njava.lang.NumberFormatException: For input string: \"1.1\"\n at java.lang.NumberFormatException.forInputString(Unknown Source)\n at java.lang.Integer.parseInt(Unknown Source)\n at java.lang.Integer.<init>(Unknown Source)\n at ExceptionPrintDemo.formatInt(ExceptionPrintDemo.java:7)\n at ExceptionPrintDemo.main(ExceptionPrintDemo.java:14)"}
[plugin.checks.app-log]
command = '''
check-log \
--file-pattern '/var/log/td-agent/app/app.java.log.*.log' \
--pattern '\"level\":\"ERROR\"' \
--return
'''
--return prints an error line and sends its contents to Mackerel
You should also monitor that there are two td-agent processes up.
[plugin.checks.process_td-agent]
command = "check-procs --pattern td-agent -C 2"
It's easy to forget to rotate td-agent.log. Let's set https://github.com/treasure-data/td-agent/blob/master/td-agent.logrotate
The contents of the stack trace are also output properly.
If you use fluentd like this time, it is easier to introduce and recommended. Isn't it very important in operational design whether or not such a small thing can be devised?