Monday, June 8, 2015

poolboy pitfall 2

The poolboy issue described in the earlier article has been fixed. To avoid it, just make sure that you use at least version 1.5.1 of the library. Unfortunately, another problem popped up.

Retries with poolboy

For some resources it is important to retry for a given period of time before reporting failure to the caller. At first sight, the poolboy library provides all the necessary features.

  1. Internally it uses a supervisor to restart terminated processes (for retries);
  2. The client can specify a timeout for waiting on a worker checkout.
So in the worker's code one just needs to terminate the process if the resource is not available, and the retries will be organised by poolboy.

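%% {stop, Reason, Reply, NewState}: the caller is sent 'dead',
%% the process stops with reason {error, died}.
handle_call(die, _From, State) ->
    {stop, {error, died}, dead, State}.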
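To make the snippet above concrete, here is a minimal self-contained sketch of such a worker. The module name `dying_worker` and the `demo/0` helper are hypothetical and only serve as illustration. `start_link/1` is included because poolboy starts its workers through that function. The demo uses `gen_server:start/3` rather than a link so that the calling process survives the worker's abnormal exit.

```erlang
-module(dying_worker).
-behaviour(gen_server).

-export([start_link/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Entry point in the shape poolboy expects from a worker module.
start_link(Args) ->
    gen_server:start_link(?MODULE, Args, []).

init(_Args) ->
    {ok, no_state}.

%% Resource not available: reply 'dead' and stop with reason {error, died}.
handle_call(die, _From, State) ->
    {stop, {error, died}, dead, State};
%% Resource available: plain reply, the process keeps running.
handle_call(ok, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Illustration only: start an unlinked worker, ask it to die and
%% report both the reply and the exit reason seen through a monitor.
demo() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    Ref = erlang:monitor(process, Pid),
    Reply = gen_server:call(Pid, die),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> {Reply, Reason}
    after 1000 ->
        {Reply, still_alive}
    end.
```

Calling `dying_worker:demo()` returns both the reply and the exit reason. Note that the caller only learns about the termination here because it set up the monitor itself.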

The issue

In fact, the situation is a bit more complicated. When a gen_server callback returns a stop tuple, it merely instructs the underlying OTP code to terminate the process. The documentation does not state whether the caller gets the response before the process terminates or the other way round. In addition, poolboy's supervisor is notified about the worker's termination, but the caller is not, unless it sets up an additional link or monitor.
So the worker's termination and the checkout from the pool are not synchronized. As a result, poolboy:checkout might return the Pid of a worker that has already terminated. Any further use of that Pid leads to an exit exception: {noproc, ...}. Obviously, the same can happen with poolboy:transaction.
The client can handle this error by reporting a failure, but that does not fulfill the requirement of retrying for a given period of time.
The issue has been reported, in the best traditions of TDD, as a pull request with failing tests.

Workarounds

Since it was not fixed quickly, the issue seems to be quite fundamental. To continue using poolboy, some workaround is required.
The issue popped up when I tried to organise retry logic by means of poolboy. An obvious workaround would be to move this logic to either the worker or the client, but my perfectionism did not allow me to expose such complexity at that level.
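For illustration, moving the retry logic to the client might look roughly like the sketch below. The module `retry_sketch`, the function `call_with_retries/3` and the pause between attempts are all assumptions made for this example and are not part of poolboy. The fun passed in stands for "check out a worker and call it". Only the `{noproc, ...}` exit discussed above is retried.

```erlang
-module(retry_sketch).
-export([call_with_retries/3]).

%% Run Fun repeatedly until it returns normally or TimeoutMs has
%% elapsed. Only exits of the form {noproc, _} trigger a retry.
call_with_retries(Fun, TimeoutMs, PauseMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    loop(Fun, Deadline, PauseMs).

loop(Fun, Deadline, PauseMs) ->
    try Fun() of
        Result -> {ok, Result}
    catch
        exit:{noproc, _} ->
            case erlang:monotonic_time(millisecond) >= Deadline of
                true ->
                    %% Time budget exhausted: report failure to the caller.
                    {error, timeout};
                false ->
                    %% Give the pool a moment to replace the dead worker.
                    timer:sleep(PauseMs),
                    loop(Fun, Deadline, PauseMs)
            end
    end.
```

Every client would have to wrap its calls this way, which is exactly the kind of complexity that retries organised by the pool were supposed to hide.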
Fortunately, the committers of the project suggested a simple technique to overcome the problem: the worker should check itself back in to the pool in case of success, so that termination and check-in are synchronised.
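%% Failure: terminate, so that the pool replaces the worker.
handle_call(die, _From, State) ->
    {stop, {error, died}, dead, State};
%% Success: the worker returns itself to the pool before replying.
handle_call(ok, _From, State) ->
    poolboy:checkin(pool_name, self()),
    {reply, ok, State}.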
This trick implies that poolboy:transaction cannot be used anymore. It also breaks separation of concerns and abstraction, because the worker starts "knowing" about the pool. But overall I find it a "good deal" compared to the other workarounds because of its simplicity.
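On the client side, this means checking the worker out by hand and never checking it in. A minimal sketch, assuming a pool whose workers check themselves in as described above: the module `pool_client` and the helper `call_worker/3` are hypothetical names, while `poolboy:checkout/3` is the library call that takes the pool, a blocking flag and a timeout in milliseconds.

```erlang
-module(pool_client).
-export([call_worker/3]).

%% Hypothetical helper, not part of poolboy.
call_worker(Pool, Request, Timeout) ->
    %% Block until a worker is available or Timeout (ms) expires.
    Worker = poolboy:checkout(Pool, true, Timeout),
    %% No poolboy:checkin/2 here: the worker checks itself in on
    %% success, or terminates and is replaced by the pool on failure.
    gen_server:call(Worker, Request).
```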
