I Challenge you: Implement containers!

Qqwy · June 17, 2017, 7:37pm

tl;dr: I challenge you: let us implement containers for Elixir, and use this topic to coordinate the effort.

Hello, dear fellow Elixir developers.

Recently, multiple discussions (like this and this) have popped up about ‘Elixir not having a built-in data type XXX’.

While for some part we can rely on things that are already part of Erlang (such as :queue or :array), there are of course drawbacks to using Erlang modules:

No Protocol support.
No documentation (that is part of IEx).
No introspection, as they do not work on structs (structs are, after all, an Elixir invention).
Most of these libraries were made before Maps were a thing (Maps were introduced in OTP 17), and using maps we might create more efficient data structures (in space/time) in some cases.

On the other hand, the Elixir standard library only contains the container types Lists, Maps, and MapSets (and Range, but it is clearly the ‘odd one out’). It definitely is not the role of the Elixir standard library to provide built-in datatypes for everything, however. That is something where FOSS libraries can shine.

I myself have until now built Tensor, which introduces sparse vectors, matrices and n-dimensional tensors (collections that allow for amortized constant-time element access/update as they are built on top of maps).
I am currently in the process of creating Okasaki, which is a library containing multiple different versions of Queues and Deques (double-ended queues) with different time/memory properties.

One thing I realized, which is partly why I am starting this topic, is that the Enum and the Enumerable protocol will throw away the structure of the container you had, always returning lists as result.
I therefore am going to implement a common interface, based on the actual properties that are required to perform certain operations.

I challenge you to start implementing container data structures and release them as FOSS libraries. I think it would be wonderful to start a semi-coordinated effort through this forum, to ensure that Elixir gets a nice zoo of data structures that can be used to solve a wide variety of problems.

Suggested Container datatypes (And what libraries are working on them)

(feel free to suggest one)

Trees: (Many other container types can be built on top of trees)
- Binary trees
- Red-Black trees
- Rose trees
- 2-3 finger trees
- Patricia trees
Queues: Okasaki tries to implement multiple variants of these.
- FIFO Queues
- LIFO Stacks (List covers this somewhat, but maybe there are other ways to implement stacks with different guarantees as well? (And then have a unified interface for any kind of stack-like implementation? )
- Deques (Double-Ended Queues)
Priority Queues Prioqueue
Heaps
Sets: Implemented in Sets
- HashSet (Builtin)
- Tree-based Set
Multisets and Bags
Fast Random-Access Sequences
- Arrays implements some of these:
  - MapArray
  - :array (heap-based array)
- Something akin to Haskell’s Data.Sequence that allows O(1) access to both extremes of the container.
- Immutable arrays (á la Haskell IArray? or Erlang’s :array?)
- Someting working with DiffArrays?
Sparse Vectors/Matrices/Tensors: Tensor
Graphs
- Adjacency list
- Adjacency matrix
- Incidency matrix.

NobbZ · June 17, 2017, 8:03pm

There has been some effort put into exads, which at leasts interescted with your container proposal. Its a pity though that the effort stopped abprubtly, may be we cann pull something out of that?

Qqwy · June 18, 2017, 7:29pm

@NobbZ Thank you! I was unaware of exads. I’ll ask @sashaafm what the status of the project is. We might very well be able to pull something from there (or at least use it as inspiration).

One thing I am wondering about right now, if if it makes more sense to create one big library that contains many data structures (and possibly algorithms), or if it makes more sense to separate this in a library containing some protocols and libraries containing one or multiple implementations of data types.

NobbZ · June 18, 2017, 7:34pm

I think it does NOT make sense to put everything into a single package. Separation of concerns and stuff

I’m not sure though, how to organize packages, dependencies, structures and algorythms.

Qqwy · June 18, 2017, 7:45pm

Yes, I agree .

I think maybe the most sense would make to brush up fun_land to make sure it actually implements the ADT-hierarchy properly (which it already does, but there are some omissions) and most importantly: it becoming more clear in its usage.
It should be very friendly to use. (And maybe it needs a better name?)

The idea is to both create a set of behaviours that allows implementing the different ADT typeclasses, as well as a clear public API that internally dispatches on them, allowing functions like take_while, split, etc. to work on anything that implements the correct behaviours.

(Behaviours are used for two reasons over protocols here:

The functions that are part of the behaviour also become part of the public API of the implementor’s module. So Mappable.map(%Queue{}, fun) will dispatch to Queue.map(%Queue{}, fun), but the latter is also directly visible and introspectable.
It is possible to create default implementations for many operations if the data types implement a higher typeclass: For instance, map can be built out of pure and apply. I’d like to specify these as defoverridable to give programmers the flexibility to still implement them themselves (for e.g. extra speed). AFAIK, using defoverridable is impossible with protocols.

)

On the other hand, there are also multiple data structures that could be built on top of different practical implementations. An example: A binary tree could both be built as nested maps/tuples, or instead on top of an array (which could be built on top of a numerical-indexed map). It therefore makes sense to create a unified interface for these data types as well, and separate their external interface from their internal workings. Main drawback here might be performance, however.

rvirding · June 18, 2017, 8:23pm

Sorry I don’t quite understand what it is you want. Are you after some specific types of containers or are you after examples of different ways of implementing them? Take for examples trees, why have you given various examples of how to implement them [*]. However you look at it trees are just different ways of (efficiently ?) implementing key-value stores so why not just use a map [**]? That they are trees is just an implementation detail.

Same with the other containers. A set is set irrespective of whether it is a tree or a hash or a …

[*] Of which I have implemented most of them for fun. Data structures are fun.
[**] One reason is that maps don’t contain some access function you are after. I did this for my Lua implementation because I need a first/next function pair which allows me to step through the map. I used 2-3 trees.

Qqwy · June 18, 2017, 9:41pm

@rvirding I am interested in starting a coordinated effort to create multiple different types of containers that share common interfaces, ensuring that people can pick the most appropriate one for the job at hand.

The reason of listing multiple different possible implementation methods is to try the different possibilities and find out which one are more efficient (in space, time and developer efficiency) under certain conditions. A set can be built on top of multiple different ‘concreter’ data structures, but these different structures that are used as foundation for the set have a direct impact on the efficiency in certain conditions of the resulting set. Therefore I’d like to try multiple approaches in a coordinated fashion, to ensure a common API.

As for trees: It is true that trees are often used either as implementation for a binary heap (resulting in a priority queue) or a binary search tree (resulting in a key->value store). But there are also multiple other trees that just exist to contain relations of their own, such as Segment Trees, Rose Trees and R-trees.

Does that answer your questions?

OvermindDL1 · June 19, 2017, 3:29pm

If using them as a KV store sure, but I almost never do. Look at my ExSpirit library, it uses a tree to parse single characters at a time to efficiently get a ‘longest’ match. Doing that in maps would involve holding information on the ‘longest’ possible key or do a linear search or so and it just gets far more messy. ^.^

Honestly, I’d think all these structures are prime cases of using tuple modules. You could make a normal @behaviour and have each implement it, it would have things like map and fold_left and reduce and so forth perhaps with a use for overridable defaults based on a certain base required set, then you could use it like:

iex> q = Queue.new()
iex> q = q.push(1)
iex> q = q.push(2)
iex> q = q.push(3).push(4).push(5)
iex> {value, q} = q.pop()
iex> list_of_values = q.into([])

iex> g = Graph.new()
iex> g = g.addNode(:a).addNode(:b).addConnection(:a, :b, weight: 42).addNode(:c).addConnection(:a, :c)
iex> nodes = g.getNodes()
iex> path = g.path(from: :a, to: :b, solver: :shortest)

iex> inspect_each = fn c -> c.map(&IO.inspect/1) end
iex> inspect_each.(q) # Generic algorithms!
iex> inspect_each.(g) # Generic algorithms!

And so forth. It is using them like you use First-Class Modules in OCaml, it is a very powerful, functional, and immutable construct.

/me really really wishes people were not so scared of tuple-calls…

Qqwy · June 19, 2017, 9:53pm

2 posts were merged into an existing topic: Tuple Calls

Qqwy · June 19, 2017, 9:52pm

9 posts were split to a new topic: Tuple Calls

Qqwy · June 19, 2017, 10:11pm

NOTE: Because it became quite involved and off-topic, the discussion about Tuple Calls has been split off to its own topic, here: Tuple Calls

Instead of doing tuple calls, I can think of two ways to separate the interface from the implementation(s):

Protocols: As example, create a Queue protocol that can be implemented by any queue implementation in existence.
Drawbacks:

Non-protocol functions cannot be written in the same module, so an extra layer of indirection must be added (such as with Enumerable and Enum).
Implementations are restricted to be structs.
Protocols do not work for functions where the first argument is not the thing the protocol should dispatch on.
Creation cannot happen through the protocol directly, but only through the wrapper.
Functions that are part of the API are not part of the implementation’s API: The only way you can interact with the implementation is through the protocol. I’m not sure this is a good or a bad thing, but it definitely is a property.

Behaviours that paritally mimic protocols.
Drawbacks:

Sort of re-inventing the wheel.
To the user, a #Queue<[1,2,3]> remains a #Queue<[1,2,3]> even though the implementation might be different. (Is this good or bad?)
Possibility to use this for functions where we do not want to dispatch based on the first argument.

Hmm… after writing these out, I think that the proper approach is to go the Protocol + wrapper way. Maybe Queue.Interface for the protocol and Queue for the public API that dispatches some of its functionality to the protocol.
I furthermore think that it would be interesting and smart to create functions like:

new (data, opts \\ []) do
   impl = Keyword.get(opts, :implementation, Application.get_env(:data_types, :queue_implementation, Queue.DefaultImplementation))
   impl.new(data)
end

so people can alter the implementation that is used in the environment, or on a case-by-case basis, but they only need to be mindful about this if the default version is not to their liking.

Qqwy · June 21, 2017, 11:12am

I’ve been thinking more about the interfaces to containers.

Now, for all of the linear data types (lists, sets, multisets, queues, priority queues, arrays, maps, other kinds of associative arrays, etc.) it is easily possible to implement Collectable and Enumerable.

However, a closer look at Enumerable shows that it is not the inverse of Collectable at all: The caller is able to temporarily suspend the enumeration process, but it is impossible to only extract a single item, because the ‘enumerated-collection-in-progress’ is thrown away and only the accumulator is returned.
That is, Enumerable should maybe more accurately be named Foldable or Reducable. (It has two extra features on top of a pure reducable, which are the ability for the reducing function to stop before the end of the reducable is reached, and the ability to temporarily suspend an in-process enumeration if the caller asks for it. Only for things wrapping e.g. file handles or sockets the suspend-case might involve some extra clean-up code.)

However, there are many cases in which we are only interested in extracting one element from the current collection we have at hand. For lists, this should be the head, for a priority queue, the minimum(/maximum, depending on what kind of queue it is), for queues the front, for stacks the top, for unordered collections like sets and maps any element is equally fine, etc.

The signature of, let’s say: Extractable.extract is as follows:

extract(Extractable.t) :: {:ok, {item :: any, Extractable.t}} | :error

The implementations of this for the built-in types List and Tuple will be fast. Map and MapSet unfortunately not, as Erlang does not expose a way to get an arbitrary element from a Map, so we have to do a round-robin list conversion.

In other words: For Map and MapSet, the Enumerable implementation is basically an optimized version of wrapping Extractable!

rvirding · June 21, 2017, 9:11pm

How do you mean “arbitrary element” here? I have been requesting a function pair in the API first/next which would allow me efficiently to step through the keys of a map.

Qqwy · June 21, 2017, 9:20pm

With “arbitrary element” I mean that since map elements (above a certain size; but that is an implementation detail) are stored in ‘some’ order and thus also returned in ‘some’ order, which might differ from invocation to invocation.

So I mean: The caller has no guarantees on which element is returned first. But of course, after that, the result map does not contain this element any more, so the next time(s) the extraction function is called, a different {key, value} will be returned.

rvirding · June 21, 2017, 9:38pm

Ah. My requirements are that I need to be able to step through all the key/value pairs in a map. The order is irrelevant but I need to be sure that I will get all pairs, as long as I don’t modify the map. I also return {key,map} in my implementation in my 2-3 trees.

Qqwy · June 21, 2017, 10:10pm

I believe we are talking about the same thing, something akin to:

iex> original_map = %{a: 1, b: 2, c: 3}
iex> {:ok, {item, rest_map}} = Extractable.extract(original_map)
{:ok, {{:c, 3}, %{a: 1, b: 2}}}
iex> {:ok, {item, rest_map}} = Extractable.extract(rest_map)
{:ok, {{:a, 1}, %{b: 2}}}
iex> {:ok, {item, rest_map}} = Extractable.extract(rest_map)
{:ok, {{:b, 2}, %{}}}
iex> Extractable.extract(rest_map)
:error

rvirding · June 22, 2017, 5:17pm

Not really. One problem with your example is that you are updating the map every time which is inefficient. I was thinking something along the lines of how ETS does it with first/next:

10> {Key1,Val1} = maps:first(Map).
11> {Key2,Val2| = maps:next(Key1, Map).
12> {Key3,Val3} = maps:next(Key2, Map).
13> error = maps:next(Key3, Map).

(in Erlang) Definitely possible. ETS does it like this but only returns the key, in this case returning {Key,Value} is much better. I implemented this in my 2-3 trees as I needed it for Luerl.

Qqwy · June 27, 2017, 11:12pm

@rvirding do you know how ETS stores the internal state of the table-reading while exposing first/next? It seems something that is rather stateful. Is it something that is managed per-process somewhere?

This evening, I have written and published the first versions of Insertable and Extractable, as well as implementing them for the Okasaki Queues (which is not yet released on Hex).

net · June 28, 2017, 12:06am

You pass the last key to next(table, last_key) and ETS returns the next key according to the table’s specified or internal ordering. ETS doesn’t store the table-reading state, you do.

Qqwy · June 28, 2017, 7:04am

Thank you! I somehow missed that :ets.next took two arguments .