Saturday, February 28, 2015

poolboy pitfall

Description of library

Poolboy is a popular Erlang library for managing pools of workers; for example, it is often used for RDBMS connections. Its API is extremely simple: once the pool is started, the client needs just one function, poolboy:transaction, which calls poolboy:checkout, runs the supplied fun, and calls poolboy:checkin in the after clause of a try block.
transaction(Pool, Fun, Timeout) ->
    Worker = poolboy:checkout(Pool, true, Timeout),
    try
        Fun(Worker)
    after
        ok = poolboy:checkin(Pool, Worker)
    end.

Restarting terminated workers, queueing, and other complications are completely hidden from the user.

Hidden restrictions

But there is one pitfall that developers should be aware of. By default, checkout is a blocking operation (and the transaction function uses it with the default settings), which means that client code will not return until a worker is allocated. But nothing lasts forever: poolboy:checkout is implemented as a gen_server:call/3 and takes a timeout argument (the default is 5 seconds).
-define(TIMEOUT, 5000).

checkout(Pool, Block, Timeout) ->
    try
        gen_server:call(Pool, {checkout, Block}, Timeout)
    catch
        Class:Reason ->
            gen_server:cast(Pool, {cancel_waiting, self()}),
            erlang:raise(Class, Reason, erlang:get_stacktrace())
    end.

If the timeout occurs, the calling process exits with reason {timeout, _}. Trying to handle that exit has even worse consequences.
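
As a rough illustration of what such a timeout looks like from the caller's side, the sketch below uses a bare gen_server instead of poolboy itself. The module name call_timeout_demo and the never_answered request are made up for the example; only gen_server:start/3, gen_server:call/3 and gen_server:stop/1 are real OTP calls. The exit is caught here purely to inspect its reason.

```erlang
-module(call_timeout_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Starts a server that never replies, calls it with a 50 ms timeout
%% and returns the exit reason the caller observes.
%% NOTE: catching the exit is done only to look at the reason. With
%% poolboy:checkout this is the kind of handling the post warns about.
run() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    Result =
        try
            gen_server:call(Pid, never_answered, 50)
        catch
            exit:{timeout, _} = Reason ->
                {caught_exit, Reason}
        end,
    ok = gen_server:stop(Pid),
    Result.

init([]) ->
    {ok, no_state}.

%% Hypothetical request: the server accepts it but never sends a reply,
%% so the caller's gen_server:call/3 has to time out.
handle_call(never_answered, _From, State) ->
    {noreply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Calling call_timeout_demo:run() should return a tuple of the form {caught_exit, {timeout, {gen_server, call, [Pid, never_answered, 50]}}}, which is the same shape of exit reason that the {timeout, _} pattern in the test case matches.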
Poolboy correctly recovers when a process that checked out a worker terminates (this test case passes). But if poolboy:checkout exits with an error and the client handles it without terminating the process, the worker might stay blocked (this test case fails).
transaction_timeout() ->
    {ok, Pid} = new_pool(1, 0),
    ?assertEqual({ready,1,0,0}, pool_call(Pid, status)),
    WorkerList = pool_call(Pid, get_all_workers),
    ?assertMatch([_], WorkerList),
    ?assertExit(
        {timeout, _},
        poolboy:transaction(Pid,
            fun(Worker) ->
                ok = pool_call(Worker, work)
            end,
            0)),
    ?assertEqual(WorkerList, pool_call(Pid, get_all_workers)),
    ?assertEqual({ready,1,0,0}, pool_call(Pid, status)).

One could argue that this issue only shows up with an extreme timeout value for checkout, but it can also happen when a worker is slow to start, because the worker is started inside the same gen_server handle_call when overflow is allowed. The call message may also sit in the queue for a long time while a worker is being restarted after termination; the result is the same exit from poolboy:checkout.
new_worker(Sup) ->
    {ok, Pid} = supervisor:start_child(Sup, []),
    true = link(Pid),
    Pid.

I reported the issue to the author of poolboy together with a PR that reproduces the problem in a unit test. I do not think it can be fixed easily within poolboy's current architecture, but following a simple rule in client code can prevent the problem.

Recommendation to avoid issues

Losing a worker from the pool can be avoided simply by not handling exits in the process that calls checkout. If you need to handle a timeout on worker checkout, just spawn a dedicated process that calls poolboy:transaction or poolboy:checkout, "let it crash", and handle its exit however you want.
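
One possible shape for such a dedicated process is sketched below. The module name isolated and the function run/2 are hypothetical helpers, not part of poolboy; the sketch relies only on spawn_monitor/1 and erlang:demonitor/2 from OTP. The fun runs in a short-lived monitored process, so a timeout exit kills only that process and the caller just sees a return value.

```erlang
-module(isolated).
-export([run/2]).

%% Runs Fun in a short-lived monitored process.
%% Returns {ok, Value} if Fun returned Value, or {error, Reason} if the
%% helper process died or did not finish within Timeout milliseconds.
run(Fun, Timeout) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Value} ->
            %% Normal completion: drop the monitor and any pending 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            {ok, Value};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The helper crashed, e.g. with {timeout, _} from a checkout.
            %% It is already dead, so nothing in this process handled the exit.
            {error, Reason}
    after Timeout ->
        %% The helper is stuck: kill it instead of leaving it half-done,
        %% then discard a result that may have raced with the kill.
        exit(Pid, kill),
        erlang:demonitor(MRef, [flush]),
        receive {Ref, _} -> ok after 0 -> ok end,
        {error, timeout}
    end.
```

With poolboy, a call might look like isolated:run(fun() -> poolboy:transaction(Pool, Work, 1000) end, 5000), where Pool and Work are the caller's own pool reference and fun. The outer timeout is an extra safety net added in this sketch; whether killing the helper is acceptable depends on what the worker is doing at that moment.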