Monday, June 16, 2014

Monitoring of Erlang systems

Monitoring

One of the advantages of Erlang over other languages is scalability, which usually means running tasks on multiple machines. As the number of servers grows, monitoring the system becomes more and more complicated.

Monitoring in Erlang

What does monitoring mean for a functional language such as Erlang? It means collecting certain attributes of function calls. In my opinion the two most important ones are execution time and return value.
The return value may indicate the success or failure of an operation. For example, it is useful to track whether an HTTP server returned 200 or 404, or whether a payment transaction was successful or declined.
Monitoring execution time is important for diagnosing performance issues. It is very useful to see how long a certain SQL query takes.

Possible implementations

Logging

In most projects people start by adding monitoring code directly to functions.
The first naive implementation could simply log the metrics. It works more or less fine, except that extracting data from logs is a tedious task. Also, in most cases we are interested in a relative value rather than an absolute one. For example, the fact that some SQL query takes 2 seconds does not tell us much about the system without context.
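The naive logging approach described above could look roughly like the sketch below. The module name inline_logging_sketch, the function timed/4 and the log format are made up for illustration; timer:tc/3 and error_logger:info_msg/2 are standard OTP calls.

```erlang
-module(inline_logging_sketch).
-export([timed/4]).

%% Calls Module:Function(Args...), logs the duration and the result,
%% and returns the result unchanged. Name is a hypothetical metric label.
timed(Name, Module, Function, Args) ->
    %% timer:tc/3 returns {ElapsedMicroseconds, ReturnValue}.
    {Micros, Result} = timer:tc(Module, Function, Args),
    error_logger:info_msg("metric=~p duration_us=~p result=~p~n",
                          [Name, Micros, Result]),
    Result.
```

A call such as inline_logging_sketch:timed(reverse, lists, reverse, [[1, 2, 3]]) returns the reversed list and leaves one log line behind, which then has to be parsed out of the log to be of any use.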

Reporting

The context in that case is the history of the metric, and the most convenient way to represent it is a plot. Once people realise this, they set up plotting software such as Graphite, Zabbix or Opsview, and implement Erlang clients for reporting. One project worth mentioning here is folsom, an Erlang application for collecting metrics of a system.
When the code that reports data to the plotting service is integrated, developers realise that almost all tests are broken, because this functionality needs to be mocked. That can be done with ordinary mocking or by implementing an "empty" reporter, which does nothing.
It also turns out that the monitoring code is global, and it is hard to track which metrics are monitored.
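One way to build the "empty" reporter mentioned above is sketched below. The application name my_app, the configuration key metrics_reporter and the module name are assumptions for illustration, not part of any existing project; application:get_env/3 is the standard OTP call.

```erlang
-module(metrics_reporter_sketch).
-export([report/2, noop/2]).

%% Looks up the configured reporter as a {Module, Function} pair and
%% calls it. my_app and metrics_reporter are hypothetical names; when
%% nothing is configured (for example in tests) the no-op reporter runs.
report(Name, Value) ->
    {Mod, Fun} = application:get_env(my_app, metrics_reporter,
                                     {?MODULE, noop}),
    Mod:Fun(Name, Value).

%% The "empty" reporter: accepts any metric and does nothing.
noop(_Name, _Value) ->
    ok.
```

With this shape tests need no mocking at all, but every monitored function still contains a call to report/2, so the monitoring code stays scattered across the code base.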

Usage of tracing

Erlang provides a convenient way of tracing software: the erlang:trace function, which is reused by higher-level applications such as dbg, ttb and redbug. It can also be used for monitoring a system. The idea is very simple. In the demo project elmon, tracing is enabled for all newly spawned processes with this line of code:
    erlang:trace(new, true, [call, timestamp])
The call option enables tracing of function calls, and the timestamp option adds a timestamp to each trace message.
The next step is to specify which functions to trace.
    %% dbg:fun2ms/1 needs -include_lib("stdlib/include/ms_transform.hrl") in the module.
    MatchSpec = dbg:fun2ms(fun(_) -> exception_trace() end),
    erlang:trace_pattern(MFA, MatchSpec, [global]).
MFA here is the tuple {Module, Function, Arity}. The global option restricts tracing to global calls, that is, fully qualified calls to exported functions. The match specification sets up tracing of exceptional returns from the function as well as normal ones.
Tracing can also be enabled for local functions, but if you need that, it is probably time to refactor the code. The match specification could be more advanced, with unbound variables ('_') and so on, but needing that also suggests a refactoring is due. trace_pattern returns the number of functions matched; for the same reason it is expected to be one.
Together these two functions enable tracing of the specified function calls in all processes spawned afterwards. Trace information is sent as messages to the process that called erlang:trace. Each message contains the timestamp of the event. The events are the function call and the return from it.
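The two calls above can be put together in a small tracer process. This is a minimal sketch and not the elmon code: the module name, start/2 and the Sink argument are assumptions for illustration. The match specification is written as a literal, which is what dbg:fun2ms(fun(_) -> exception_trace() end) expands to.

```erlang
-module(trace_setup_sketch).
-export([start/2]).

%% Starts a tracer process for one exported function and forwards all
%% timestamped trace messages to Sink (a hypothetical consumer pid).
%% Returns {ok, TracerPid, NumberOfMatchedFunctions}.
start({Module, _Function, _Arity} = MFA, Sink) ->
    Parent = self(),
    Tracer = spawn(fun() ->
        %% trace_pattern only matches functions in loaded modules.
        _ = code:ensure_loaded(Module),
        %% Trace calls in every process spawned from now on.
        erlang:trace(new, true, [call, timestamp]),
        MatchSpec = [{'_', [], [{exception_trace}]}],
        Matched = erlang:trace_pattern(MFA, MatchSpec, [global]),
        Parent ! {self(), Matched},
        forward(Sink)
    end),
    receive
        {Tracer, Matched} -> {ok, Tracer, Matched}
    end.

%% Trace messages arrive in the mailbox of the process that called
%% erlang:trace/3, so that process has to stay alive and read them.
forward(Sink) ->
    receive
        Msg when element(1, Msg) =:= trace_ts ->
            Sink ! Msg,
            forward(Sink);
        stop ->
            ok
    end.
```

Tracing stops when the tracer process exits, so sending stop to the returned pid is enough to switch it off in this sketch.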
The only thing left to do is to store the timestamp of the call and subtract it from the timestamp of the matching return message. The key {Pid, MFA} is unique within the scope of one call, even for recursive functions, since the finish message is sent after the chain of tail-recursive calls has ended.
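The bookkeeping described above can be sketched as a pure function over trace messages. The module name, the shape of the emitted result and the use of a map are assumptions for illustration, not elmon's actual implementation. The sketch keeps one start timestamp per {Pid, MFA} key, so it assumes that calls with the same key are not nested.

```erlang
-module(trace_timing_sketch).
-export([handle_trace/2]).

%% State maps {Pid, {M, F, Arity}} to the timestamp of the pending call.
%% Returns {Finished, NewState}; Finished is a list of
%% {MFA, DurationInMicroseconds, Outcome} tuples (a hypothetical format).
handle_trace({trace_ts, Pid, call, {M, F, Args}, Ts}, State) ->
    %% Without the arity trace flag a call message carries the
    %% argument list, so the arity is computed here.
    Key = {Pid, {M, F, length(Args)}},
    {[], maps:put(Key, Ts, State)};
handle_trace({trace_ts, Pid, return_from, MFA, Result, Ts}, State) ->
    finish(Pid, MFA, {return, Result}, Ts, State);
handle_trace({trace_ts, Pid, exception_from, MFA, {Class, Reason}, Ts}, State) ->
    finish(Pid, MFA, {exception, Class, Reason}, Ts, State);
handle_trace(_Other, State) ->
    {[], State}.

finish(Pid, MFA, Outcome, Ts, State) ->
    case maps:take({Pid, MFA}, State) of
        {Start, Rest} ->
            %% timer:now_diff/2 returns Ts - Start in microseconds.
            {[{MFA, timer:now_diff(Ts, Start), Outcome}], Rest};
        error ->
            %% A return without a recorded call is ignored.
            {[], State}
    end.
```

For example, feeding it a call to m:f/1 stamped {0, 0, 0} followed by a return_from stamped {0, 2, 0} yields one finished entry with a duration of 2000000 microseconds and an empty state.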
In elmon the calculated trace information is reported to the gen_event subscribers registered for it. Monitoring can be disabled simply by not starting the application.
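A subscriber for such events could be an ordinary gen_event handler. The sketch below is not elmon's interface: the {metric, MFA, Micros, Outcome} event shape and the module name are assumptions for illustration, and the handler only prints and counts what it receives.

```erlang
-module(metric_printer_sketch).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

%% The handler state is the number of metrics seen so far.
init(_Args) ->
    {ok, 0}.

%% {metric, MFA, Micros, Outcome} is a hypothetical event format.
handle_event({metric, MFA, Micros, Outcome}, Count) ->
    io:format("~p took ~p us: ~p~n", [MFA, Micros, Outcome]),
    {ok, Count + 1};
handle_event(_Other, Count) ->
    {ok, Count}.

%% gen_event:call(Manager, metric_printer_sketch, count) returns the count.
handle_call(count, Count) ->
    {ok, Count, Count}.

handle_info(_Msg, Count) ->
    {ok, Count}.

terminate(_Reason, _Count) ->
    ok.

code_change(_OldVsn, Count, _Extra) ->
    {ok, Count}.
```

The handler is attached with gen_event:add_handler(Manager, metric_printer_sketch, []) and receives whatever is sent with gen_event:notify/2. A reporter for a plotting service would have the same shape, with the io:format call replaced by the client call.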
Sysops are usually also interested in metrics of the Erlang VM itself, such as memory usage and the number of processes. A separate monitoring application is a good place to collect them.
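A few of these VM metrics can be read with standard BIFs, as in the sketch below. The module name and the metric labels are made up for illustration; which metrics are worth collecting depends on the system.

```erlang
-module(vm_metrics_sketch).
-export([collect/0]).

%% Returns a snapshot of some VM-level metrics as a property list.
collect() ->
    [{memory_total, erlang:memory(total)},                % bytes
     {memory_processes, erlang:memory(processes)},        % bytes
     {process_count, erlang:system_info(process_count)},
     {run_queue, erlang:statistics(run_queue)}].
```

Calling collect/0 periodically and handing the result to the same reporting path as the function timings keeps all monitoring in one place.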

Conclusions

Of course, tracing has a significant drawback compared to injecting monitoring code directly into functions: performance. I will not provide any measurements, because the overhead varies depending on the process model of the software and many other factors.
Clear separation of concerns and easy enabling and disabling make Erlang tracing one of the options to consider when choosing a monitoring strategy.