Friday, July 18, 2014

Misuse of environment variables

Application environment variables

Environment variables are the main configuration mechanism for Erlang applications. The configuration of an Erlang node is essentially a list of application names, each paired with a list of that application's environment variables.
Variables are usually set once when the application starts and can be read at any time during its execution, from any place in the code. They can also be set or updated at runtime.
An application usually reads only its own environment variables, but it can access another application's environment as well.
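The calls involved can be sketched as follows. This is a minimal illustration that uses only functions from the standard application module; the application name my_app and the parameter names are hypothetical.

```erlang
-module(env_demo).
-export([run/0]).

%% my_app is a hypothetical application name used only for illustration.
run() ->
    %% Set (or update) a variable at runtime.
    ok = application:set_env(my_app, var2, 3),
    %% Read it back. The application name is given explicitly, so any
    %% process can read any application's environment this way.
    {ok, 3} = application:get_env(my_app, var2),
    %% A variable that was never set yields undefined ...
    undefined = application:get_env(my_app, missing),
    %% ... unless a default is supplied through get_env/3.
    42 = application:get_env(my_app, missing, 42),
    ok.
```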
With so much "freedom", this mechanism is often misused. Let's see an example.

Testing

An environment variable is read in the gen_server handle_call/3 callback:

handle_call(use, _From, State) ->
 {ok, Var2} = application:get_env(var2),
 {reply, Var2, State}.

This works fine until we start writing unit tests: application:get_env/1 returns undefined, the match against {ok, Var2} fails, and the process crashes. The next attempt could be to set the variable in the test case fixture.

basic_test_() ->
 {
 setup,
 fun() ->
   application:set_env(app_env_var, var2, 3),
   start_link()
 end,

If the test is launched from the gen_server's test suite without starting the application, it crashes again. This happens because the calling process does not belong to any application (application:get_application/0 returns undefined until the application is started), so application:get_env/1 cannot tell whose environment to read. The only reasonable fix in this case is to specify the application name when reading the variable.

{ok, Var2} = application:get_env(app_env_var, var2)

To conclude, reading application environment variables in low-level functions makes testing more difficult.
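Put together, the fixed version of this section's example might look like the sketch below. The module name env_reader and the demo/0 helper are hypothetical; app_env_var and var2 are the names used in the snippets above. Note that gen_server:stop/1 requires a reasonably recent OTP release.

```erlang
-module(env_reader).
-behaviour(gen_server).
-export([start_link/0, use/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

use(Pid) -> gen_server:call(Pid, use).

init([]) -> {ok, no_state}.

handle_call(use, _From, State) ->
    %% The explicit application name makes this work even when the
    %% app_env_var application itself has not been started.
    {ok, Var2} = application:get_env(app_env_var, var2),
    {reply, Var2, State}.

handle_cast(_Msg, State) -> {noreply, State}.

%% What a test fixture has to do: set the variable, start the server,
%% exercise it, and stop it again.
demo() ->
    ok = application:set_env(app_env_var, var2, 3),
    {ok, Pid} = start_link(),
    3 = use(Pid),
    ok = gen_server:stop(Pid),
    ok.
```

The test still has to know which environment variable the server reads internally, which is exactly the hidden coupling described above.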

Another approach

To overcome these testing difficulties, variables can be used in a different way. All environment variables are read in the application's start/2 callback and passed on to the supervisor and to the rest of the processes in the chain.

start(_StartType, _StartArgs) ->
 {ok, Var1} = application:get_env(var1),
 app_env_var_sup:start_link(Var1).

In tests, the setup changes: the appropriate value is passed to the gen_server's start_link function and saved in the state.

basic_test_() ->
 {
 setup,
 fun() -> start_link(3) end,

This way of using environment variables requires additional code to pass values to the place where they are actually used. Another disadvantage is that the node must be restarted to change a variable's value.
An environment variable can be compared to a global variable in languages like C++ or Java. It is accessible everywhere, which can be convenient for some tasks, but it has all the well-known disadvantages of global state.
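The gen_server side of this approach can be sketched as below. This is a minimal illustration rather than the code from the repository: the module name state_reader and the demo/0 helper are hypothetical, and the whole state is simply the configured value.

```erlang
-module(state_reader).
-behaviour(gen_server).
-export([start_link/1, use/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% The configured value arrives as an argument instead of being read
%% from the application environment.
start_link(Var) -> gen_server:start_link(?MODULE, Var, []).

use(Pid) -> gen_server:call(Pid, use).

%% The value is stored in the server state.
init(Var) -> {ok, Var}.

handle_call(use, _From, Var) -> {reply, Var, Var}.

handle_cast(_Msg, Var) -> {noreply, Var}.

%% A test needs no application environment at all.
demo() ->
    {ok, Pid} = start_link(3),
    3 = use(Pid),
    ok = gen_server:stop(Pid),
    ok.
```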

Performance

One more argument against using environment variables is performance. Two gen_servers were compared: one reads the value from its state, the other from an application environment variable. Here are the results.

gs1 (value from state accessed)  : 291802
gs2 (environment variable)       : 325638

Time is given in microseconds; the results are for 100000 runs of the same code.
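A comparison of this kind can be approximated with timer:tc/1. The sketch below is not the benchmark from the repository: it skips the gen_server call overhead and only contrasts reading a bound value with calling application:get_env/2. The names env_bench and bench_app are hypothetical.

```erlang
-module(env_bench).
-export([run/1]).

%% Returns {StateMicros, EnvMicros} for N reads of each kind.
run(N) when is_integer(N), N >= 0 ->
    ok = application:set_env(bench_app, var2, 3),
    Value = 3,
    %% Reading a value that is already bound, as with server state.
    {T1, ok} = timer:tc(fun() -> repeat(N, fun() -> Value end) end),
    %% Reading the same value from the application environment.
    {T2, ok} = timer:tc(fun() ->
        repeat(N, fun() -> {ok, _} = application:get_env(bench_app, var2) end)
    end),
    {T1, T2}.

%% Calls F exactly N times.
repeat(0, _F) -> ok;
repeat(N, F) -> F(), repeat(N - 1, F).
```

Calling env_bench:run(100000) returns two timings in microseconds; absolute numbers depend on the machine and OTP version.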

Conclusion

The code of all unit tests and benchmarks is provided in the GitHub repository.
Passing environment variables' values as arguments from the application's start/2 callback adds some boilerplate code and requires an application restart on configuration change, but it makes the code much cleaner and easier to test compared to reading env variables directly. It is important to find a balance between the two approaches.
For global functionality such as logging, passing configuration to each call through additional arguments costs a lot in code readability. If an application's configuration changes often, or restarting it is undesirable, accessing env variables directly is preferred. In most other cases the suggested approach fits better.

Saturday, July 5, 2014

tuples vs records vs proplists

Data structures for business logic

When a function is being designed, one of the most important questions is which data structure to use. Erlang has several compound "types": lists, records, maps and tuples. ETS is a separate case with its own specific usage, and in most cases it is not confused with the other data structures. Maps appeared in Erlang/OTP 17 and are not covered in this article, even though they combine properties of both lists and tuples.
A list contains a variable number of elements. One of its popular "subtypes" is the proplist, which is a list of {Key, Value} tuples. It does not allow pattern matching on elements other than the head.
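A short sketch of working with a proplist, using the standard proplists module. The module name proplist_demo and the sample data are for illustration only.

```erlang
-module(proplist_demo).
-export([run/0]).

run() ->
    User = [{name, <<"Eddy">>}, {surname, <<"Snow">>}, {age, 28}],
    %% Lookup is by key; a missing key yields undefined or a given default.
    28 = proplists:get_value(age, User),
    undefined = proplists:get_value(email, User),
    none = proplists:get_value(email, User, none),
    %% Pattern matching reaches only the head of the list, so it
    %% depends on the order of the elements.
    [{name, Name} | _Rest] = User,
    <<"Eddy">> = Name,
    ok.
```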
A tuple's arity is fixed, and client code in most cases has to know its exact size. Tuples lend themselves to heavy use of pattern matching. As the size grows, using tuples becomes error-prone.
Records tame the complexity of big tuples by adding syntactic sugar at compile time. In fact, records are translated to tuples. Using records might require some additional compile-time dependencies, such as shared header files.
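The translation to tuples can be observed directly. In this minimal sketch the module name record_demo is hypothetical; the user record mirrors the one used in this article.

```erlang
-module(record_demo).
-export([run/0]).

-record(user, {name, surname, age}).

run() ->
    U = #user{name = <<"Eddy">>, surname = <<"Snow">>, age = 28},
    %% At runtime the record is a tuple tagged with the record name.
    {user, <<"Eddy">>, <<"Snow">>, 28} = U,
    4 = tuple_size(U),
    %% Fields are read and updated by name, not by position.
    28 = U#user.age,
    U2 = U#user{age = 29},
    29 = U2#user.age,
    ok.
```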
Another alternative to tuples, records and proplists is a function signature with several arguments:
save_user(Name, Surname, Age)
where each property is a separate argument. It is mostly equivalent to
save_user({Name, Surname, Age}).
except that if the function's signature changes often, adapting the code takes much more work than with a single tuple argument.
In API design, the performance of operations on argument and return types is not as crucial as ease of use, protection from misuse and other criteria of "good code/design". That is why no benchmarks are provided.
The recipe for good design in this scope is quite straightforward: "A data structure shall be chosen based on its characteristics and the logic of the code". It is easier to illustrate this rule with an example.

Example

A service with a RESTful interface is being developed. User information is received in JSON format in a POST request.
{
 "name": "Eddy",
 "surname": "Snow",
 "age": 28
}
There is a general-purpose JSON object parsing function (arrays are not covered here for simplicity) that accepts a binary as its argument. The data structure for the return value is not that obvious.
First of all, client code needs to detect parsing errors. Assuming that exceptions are not used, the function should return the tuple {ok, Result} on success or {error, Reason} on failure. A tuple fits error handling perfectly here; there is no need to introduce a record, since the tuple size is two.
Next is the data structure for Result. Since the function parse_json_object is generic, it can parse any object and its result is dynamic. A proplist is a suitable type for it.
-spec parse_json_object(JSON::binary()) ->
  {ok, Result::list()} | {error, Reason::term()}.

Once the user information is correctly parsed, it may need to be validated and stored in the database, so some kind of internal user object is required. It could be tempting to continue using the proplist, but a record fits much better, because it brings a number of compile-time checks.
-record(
  user,
  {
     name,
     surname,
     age
  }
).
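The step from the generic proplist to the internal record might look like the sketch below. The module and function names and the validation rules are hypothetical, and the keys are assumed to be binaries, which depends on the JSON parser actually used.

```erlang
-module(user_conv).
-export([from_proplist/1]).

-record(user, {name, surname, age}).

%% Builds a #user{} from a parsed proplist, or reports the first
%% invalid or missing property. Binary keys are an assumption.
from_proplist(Props) when is_list(Props) ->
    Name = proplists:get_value(<<"name">>, Props),
    Surname = proplists:get_value(<<"surname">>, Props),
    Age = proplists:get_value(<<"age">>, Props),
    if
        not is_binary(Name) ->
            {error, {invalid, name}};
        not is_binary(Surname) ->
            {error, {invalid, surname}};
        not (is_integer(Age) andalso Age >= 0) ->
            {error, {invalid, age}};
        true ->
            {ok, #user{name = Name, surname = Surname, age = Age}}
    end.
```

From this point on the rest of the code works with #user{} only, so a misspelled field name is caught by the compiler instead of silently returning undefined.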

Advantages and disadvantages of records

Elements of a record are accessed by name, and the Erlang compiler verifies that only existing fields are read or set. If a tuple is used instead, data can be accessed only by element index, which is error-prone for big tuples.
It is sometimes claimed that records are not suitable for storing in Riak in the Erlang binary format, because when the record declaration changes, stored data no longer matches the new version. In fact, it is worth implementing a serialisation layer for this task, as versioning might also be required for a proplist or a tuple.
Another argument against records is hot code upgrade, which has the same problem with changing record declarations. In that case the rule of "not using records as tuples" can be broken in the code_change callback.
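A minimal sketch of such a code_change callback follows. It assumes a hypothetical state record that previously had the fields name and age and has gained a surname field in the new version; only the callback itself is shown, not a complete gen_server.

```erlang
-module(upgrade_demo).
-export([code_change/3]).

%% New version of the record. The hypothetical old version was
%% -record(state, {name, age}).
-record(state, {name, surname = undefined, age}).

%% The old state no longer matches the #state{} declaration, so it is
%% matched as a plain tuple and converted to the new record.
code_change(_OldVsn, {state, Name, Age}, _Extra) ->
    {ok, #state{name = Name, age = Age}};
%% State that already has the new shape is kept as it is.
code_change(_OldVsn, #state{} = State, _Extra) ->
    {ok, State}.
```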

Conclusion

To sum up, the general recommendations for choosing a data structure are:

  1. Do not use tuples with more than 3 elements.
  2. Use proplists only if the number of the "object's" properties varies.
  3. Consider records as the main alternative to big tuples.