Tuesday, October 27, 2015

When to spawn a process?

Process spawning model

One of the most common mistakes in Erlang development is choosing the wrong process spawning model: people spawn either too many or too few processes. There are many recommendations on the internet about when to spawn a process. One of them is "process per message", which encourages spawning a process for each concurrent entity. That advice is not always clear and can easily be misunderstood. The example below illustrates several variants of the spawning model.

Public key encryption example

Suppose we need to build an Erlang application that provides public key encryption. Encryption can take a significant amount of time, and the algorithm is implemented in pure Erlang. Clients of the application call a function encrypt/1, passing the data to be encrypted as its argument. The application should manage keys without exposing that complexity to end users. The encryption key is stored in a file on disk.

0 processes

The first naive implementation could be the following:
encrypt(Data) ->
  {ok, Key} = file:read_file(?KEY_FILE_PATH),
  encrypt(Key, Data).
On each encryption request the key file is read, and its contents are passed to the algorithm together with the data. A slow disk operation on every call is something we would like to avoid in our system.

1 process

A more advanced implementation is a gen_server that reads the key in its init/1 callback, caches it in its state, and performs the encryption in handle_call/3.
encrypt(Data) ->
  gen_server:call(?SERVER, {data, Data}).

start_link() ->
  gen_server:start_link({local, ?SERVER}, ?MODULE, [], []).

init([]) ->
  {ok, Key} = file:read_file(?KEY_FILE_PATH),
  {ok, #state{key = Key}}.

handle_call({data, Data}, _From, State) ->
  Enc = encrypt(State#state.key, Data),
  {reply, Enc, State}.
This approach eliminates the repeated disk reads, but the process's message queue might be overwhelmed by requests, because a new encryption cannot start until the previous one has finished.

1+N processes

Some people improve the performance of the previous example by spawning a new process in handle_call, which does the actual encryption.
handle_call({data, Data}, From, State) ->
  erlang:spawn(fun() ->
    Enc = encrypt(State#state.key, Data),
    gen_server:reply(From, Enc)
  end),
  {noreply, State}.
That speeds things up, but leads to complicated and error-prone logic in the main "dispatcher" process, which now has to monitor every process it spawns and take care of reporting errors back to clients.
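To make that extra bookkeeping concrete, here is a minimal sketch of a dispatcher that monitors its workers. It is an illustration under assumptions, not the implementation from this post: the module name enc_dispatcher, the pending map, the tagged {ok, _} / {error, _} replies and the toy XOR encrypt/2 stand-in are all hypothetical, and the key is passed to start_link/1 instead of being read from a file.

```erlang
-module(enc_dispatcher).
-behaviour(gen_server).

-export([start_link/1, encrypt/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(SERVER, ?MODULE).
%% pending maps each worker's monitor reference to the client waiting on it.
-record(state, {key, pending = #{}}).

start_link(Key) ->
    gen_server:start_link({local, ?SERVER}, ?MODULE, Key, []).

encrypt(Data) ->
    gen_server:call(?SERVER, {data, Data}).

init(Key) ->
    {ok, #state{key = Key}}.

handle_call({data, Data}, From, State = #state{key = Key, pending = Pending}) ->
    %% The worker replies to the client itself when it succeeds.
    {_Pid, MRef} = spawn_monitor(fun() ->
        gen_server:reply(From, {ok, encrypt(Key, Data)})
    end),
    {noreply, State#state{pending = maps:put(MRef, From, Pending)}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Every worker exit arrives here; only abnormal exits need an error reply.
handle_info({'DOWN', MRef, process, _Pid, Reason}, State = #state{pending = Pending}) ->
    case maps:take(MRef, Pending) of
        {From, Rest} ->
            case Reason of
                normal -> ok;
                _ -> gen_server:reply(From, {error, Reason})
            end,
            {noreply, State#state{pending = Rest}};
        error ->
            {noreply, State}
    end;
handle_info(_Other, State) ->
    {noreply, State}.

%% Toy stand-in for the real algorithm (assumption): XOR with a repeating key.
%% It crashes with function_clause on non-binary input.
encrypt(Key, Data) when is_binary(Data), byte_size(Key) > 0 ->
    Indexed = lists:zip(lists:seq(0, byte_size(Data) - 1), binary_to_list(Data)),
    << <<(B bxor binary:at(Key, I rem byte_size(Key)))>> || {I, B} <- Indexed >>.
```

Even in this small form the dispatcher needs a map of pending requests, a 'DOWN' handler and a decision about what a failed request returns, none of which the single-process version required.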

Pool of processes

Another alternative is to use one of the pooling libraries, which manage pre-allocated workers and take care of dispatching tasks. The worker's code is basically the same as in the single-process example, except that the worker must not be registered under a name, either locally or globally.
start_link() ->
  gen_server:start_link(?MODULE, [], []).

Again 1 process

But if we analyze the original task, we see that the only thing cached in the state is the key. So the only thing that needs to be done in handle_call is returning the key, and the heavy encryption call can be moved into the context of the client's process.
encrypt(Data) ->
  Key = gen_server:call(?SERVER, get_key),
  encrypt(Key, Data).

handle_call(get_key, _From, State) ->
  {reply, State#state.key, State}.

Getting the key from the process's state is a relatively fast operation, so the gen_server is no longer a bottleneck.
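For reference, the pieces of this last variant can be assembled into one module roughly as follows. This is a sketch under assumptions: the module name key_cache is hypothetical, the key file path is passed to start_link/1 rather than taken from the ?KEY_FILE_PATH macro, and encrypt/2 is a toy XOR stand-in for the real algorithm.

```erlang
-module(key_cache).
-behaviour(gen_server).

-export([start_link/1, encrypt/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(SERVER, ?MODULE).
-record(state, {key}).

%% KeyFile is the path to the key file (hypothetical parameter).
start_link(KeyFile) ->
    gen_server:start_link({local, ?SERVER}, ?MODULE, KeyFile, []).

%% Runs in the caller's process: only the key lookup goes through the server,
%% so concurrent clients encrypt in parallel.
encrypt(Data) ->
    Key = gen_server:call(?SERVER, get_key),
    encrypt(Key, Data).

%% The disk is read once, when the server starts.
init(KeyFile) ->
    {ok, Key} = file:read_file(KeyFile),
    {ok, #state{key = Key}}.

handle_call(get_key, _From, State) ->
    {reply, State#state.key, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Toy stand-in for the real algorithm (assumption): XOR with a repeating key.
encrypt(Key, Data) when is_binary(Data), byte_size(Key) > 0 ->
    Indexed = lists:zip(lists:seq(0, byte_size(Data) - 1), binary_to_list(Data)),
    << <<(B bxor binary:at(Key, I rem byte_size(Key)))>> || {I, B} <- Indexed >>.
```

A crash inside encrypt/2 now happens in the client's own process, so the server needs no monitoring or error-forwarding code at all.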

Conclusion

These examples illustrate how a good process spawning model leads to efficient, simple and elegant code.
A rule of thumb for spawning could be the following:
A new process should be started to serialize access to a shared resource. The resource could be cached memory, a file descriptor (socket) and so on. People from the C/C++ world can think of a process as a mutex. This rule might not fit all scenarios, but it is a good starting point for design, and it leads to elegant and highly concurrent code.
There are some exceptions, of course. The first is the "let it crash" principle, which encourages spawning a process for code that is likely to crash.
Also, cached memory should not be confused with memory for objects (in OOP terms). For example, in an online real-time strategy game there are a number of units that belong to a specific user. Each unit keeps its state in memory, but the developer should not spawn a process per unit. The shared resource in that case is the user's session, which is represented by a TCP socket. By analogy with an OS mutex, it is unlikely that a synchronization primitive is needed for each unit in the game.
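The game example could be sketched as one process per session that keeps all of the user's units as plain data in its state. Everything here is hypothetical and only illustrates the idea: the module name game_session, its API, and the representation of a unit as an {X, Y} position are assumptions, and the TCP socket handling is left out.

```erlang
-module(game_session).
-behaviour(gen_server).

-export([start_link/0, add_unit/3, move_unit/3, unit_position/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

add_unit(Session, UnitId, Pos) ->
    gen_server:call(Session, {add_unit, UnitId, Pos}).

move_unit(Session, UnitId, Delta) ->
    gen_server:call(Session, {move_unit, UnitId, Delta}).

unit_position(Session, UnitId) ->
    gen_server:call(Session, {unit_position, UnitId}).

%% The state is a map UnitId => {X, Y}; units are data, not processes.
init([]) ->
    {ok, #{}}.

handle_call({add_unit, UnitId, Pos}, _From, Units) ->
    {reply, ok, Units#{UnitId => Pos}};
handle_call({move_unit, UnitId, {Dx, Dy}}, _From, Units) ->
    case Units of
        #{UnitId := {X, Y}} ->
            {reply, ok, Units#{UnitId := {X + Dx, Y + Dy}}};
        _ ->
            {reply, {error, unknown_unit}, Units}
    end;
handle_call({unit_position, UnitId}, _From, Units) ->
    {reply, maps:find(UnitId, Units), Units}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

All updates to one user's units are serialized by the session process, which is the only place where serialization is actually needed.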
Another exception is the situation where an algorithm is easier to describe as a Finite State Machine (FSM). In that case it might be reasonable to spawn a process that implements the gen_fsm behaviour.