Tuesday, September 16, 2014

lower bound

Lower bound

Sometimes lower bound algorithm, which returns an ordered subset of container's elements which are greater or equal to given value, is required in development. Surprisingly there is no implementation of it in OTP.

OTP containers

In Erlang there is a number of key-value containers with efficient search operation:
  1. ETS;
  2. mnesia;
  3. gb_trees;
  4. dict;
  5. orddict.
Several implementations of sets: gb_sets, sets, ordsets, which are more or less the same as key-value container besides that only keys are present.
DETS - is a specialised file storage. Maps are not covered in this article since the appeared in Erlang 17.
Basically there are some alternatives to base lower bound algorithm implementation on.

dict, sets

In documentation it's stated that the representation of a dictionary is not defined. Thereby implementation of lower bound algorithm requires reverse engineering and would be based on data structure which might be changed in next version.

orddict, ordsets

Sorted lists from the first sight is one of the best data for lower bound implementation, but in fact lists in Erlang do not provide random access to it's elements with constant complexity (singly linked list). In ordict:find/2 source code one can see that it just traverses the entire list starting from the head. So binary search is not used, as a result not efficient lower bound implementation is possible.

ETS, mnesia

ETS and mnesia are based on native code of Erlang VM and should have the best performance. Both containers have iterators so that resulting subset of lower bound algorithm could be traversed without copying. Unfortunately in ets:match_object as well as in mnesia:match_object key must be bound for efficient search, otherwise the entire container is searched.
In order to find out if optimisation exists for range matching in ETS, test is created. Two functions are implemented: first just finds element in ETS and returns it's key (key can be used as iterator), second - returns key of first element of lower bound . Both functions are implemented based on ets:select/3 with limit 1. ETS is initialised with 20000000 elements. Each element is a tuple {Key, Value}, where Key = Value. Integers incremented from 1 to 20000000  are used as keys. In performance test both functions are called with different element to search for in the same ETS. Here are function execution times.

KeyBound key time, sRange key time,s
100000.0000.002
1000000.0000.015
10000000.0000.163
100000000.0001.332
200000000.0001.537
From provided values it's obvious that ets:select/3 makes full table scan if the key is not bound. It's important to notice that select function stops after first matching element is found because of limit argument set to one.
Although optimisation of key range matching may appear in next version of Erlang at least for some types (ordered_set), ETS and mnesia are not suitable for lower bound algorithm implementation at the moment.

gb_sets, gb_trees

gb_trees has documented representation, it's ordinary binary search tree. Iterators are also available for this container. Implementation of lower bound algorithm is very elegant with recursion, function returns iterator to the first element of subset.

Conclusion

Surprisingly none of OTP containers has an implementation lower bound algorithm, which is very common for set theory. Implementation based on ETS of type ordered_set turned out to be inefficient in terms of complexity without any obvious reasons. The most suitable container for lower bound implementation is gb_trees, where binary tree search can be used. 

2 comments:

  1. Jacob Erlbeck found in ETS documentation of ets:next/2, that "If the table is of type ordered_set, the function returns the next key in order, even if the object does no longer exist." In combination with ets:lookup/2 this function might be used for efficient lower_bound algorith implementation on ETS. Thank you, Jacob.

    ReplyDelete
  2. I have found another lower_bound implementation in the current 18.0 release: gb_sets:iterator_from/2 and gb_trees:iterator_from/2 are probably exactly what you have looked for. The iterator can then be used with next/1 for an efficient traversal of the subsequent elements.

    1> S = gb_sets:from_list([1,5,8,11,56,78,89]).
    2> I1 = gb_sets:iterator_from(10, S).
    3> {E1, I2} = gb_sets:next(I1).
    {11,..}
    4> J1 = gb_sets:iterator_from(11, S).
    5> gb_sets:next(J1).
    {11,...}

    ReplyDelete