Arrays

anurag.peshne · September 23, 2017, 11:21pm

what is the closest thing to Arrays in Elixir. By arrays I mean, a container for values which I can access in constant time.

I’ve looked at tuple, but according to the documentation:

Tuples are not meant to be used as a “collection” type (which is also suggested by the absence of an implementation of the Enumerable protocol for tuples): they’re mostly meant to be used as a fixed-size container for multiple elements.

What I actually want to do:
I want to store n processes in an array and periodically pick a random process and send it a message.
I’m open to other suggestions too.

sjoconnor · September 23, 2017, 11:30pm

You’re probably looking for List. You could then do something like this to grab a random entry:

[1,2,3] |> Enum.take_random(1)

anurag.peshne · September 23, 2017, 11:31pm

Unfortunately this will traverse list linearly. I’m looking for constant time lookup.

kokolegorille · September 23, 2017, 11:41pm

You can use a map, and periodically pick a random key ?!

anurag.peshne · September 23, 2017, 11:43pm

Yes, that’s the last resort. I just want to make sure I’m not missing any data structure.

kokolegorille · September 23, 2017, 11:48pm

There is also Keyword, but the order of the key is not preserved over 32000 keys… and they act as List, so there is no constant time access.

For your use case, I would probably choose a map.

kylethebaker · September 23, 2017, 11:59pm

Does order matter? Do you need to access items by key or just at random? Map has been mentioned, but there is also MapSet if you don’t need keys. Both implement the enumerable protocol, so to get a random value you can pass it to Enum.random. What I’ve done in the past when I want constant time lookups by key and also maintain order is to create two structures: a map for the constant lookups, and a list of keys for the order.

You may also consider looking into using a Registry if you want a container for processes, though for your use case it might be easier to just keep simple collection of pids.

benwilson512 · September 24, 2017, 1:05am

You may want to elaborate on your use case. Various datastructures (including arrays) have various performance characteristics for different kinds of operations and without knowing what operations you have in mind, it’s hard to recommend the best approach.

Tuples have constant lookup time, but have to be copied wholesale for every change, which is generally undesirable.

benwilson512 · September 24, 2017, 1:24am

This seems wrong. A keyword list is just a list, and lists preserve their order at all sizes.

kokolegorille · September 24, 2017, 1:36am

@benwilson512 Yes You are right, it’s map which doesn’t keep order over 32 keys…

I have been mixing up, both types and numbers

benwilson512 · September 24, 2017, 2:01am

Ah yeah. For maps it’s best to treat them as if they have no order at all, the fact that they happen to preserve order under 32 keys is explicitly just an artifact of how the algorithm works and should never be relied upon.

anurag.peshne · September 24, 2017, 2:09am

I need a collection of process (I’m thinking of PIDs). Once I have spawned processes and added them to the collection, I won’t add any new process. But I may have to remove some process to simulate node failure, this operation would be rare and mostly the collection would be unchanged.

Also, I need to have cheap random access to elements.

benwilson512 · September 24, 2017, 2:25am

Ah yeah this changes things. You almost certainly want to use Registry, which will ensure that individual values are tied to the life cycle of the process.

Registry is backed by :ets, which is a very high performance key value store that also enables concurrent reads / writes.

NobbZ · September 24, 2017, 7:14am

As a direct answer to your question: there is the :array module from erlang.

That module uses a set of nested tuples and reading is O(1). I’m not sure about its writing runtime though, and you’d have to implement filling logic on delete by yourself (either take the last in the hole, or let every following item go one step forward, thats up to you, depending on if original order matters).

There is also the array package which is a wrapper around the erlang module, providing some protocols as well to make it easier to handle them from elixir site. But it hasn’t been updated in the last 3 years, therefore it might spit some warnings in a current elixir or not work at all.

To try to find another solution to your problem. It sounds a bit as if you want to create some kind of load balancer which is giving away its work randomly on the workers.

By picking randomly from a fixed set, you might get non-uniform distribution due to how PRNGs work.

Therefore I’d suggest one of these solutions:

Only shuffle once, use that as a base for roundrobbin. You can use :queue module for a datastructure which can pull from the head and push to the tail in amortized o(1).
Shuffle the original list and keep a copy of it around. Each time you need to send a job, use the head of the shuffled list until it is empty, then reshuffle the original list and keep going.

rvirding · September 25, 2017, 8:34pm

For what you describe a tuple actually seems like a good choice. Some questions though:

How many elements in this “array”?
How static is this “array” to be?
Is it to be globally accessed or through one process?
When randomly picking a process is it enough just to generate a random number and use that as index?
Is there any other semantics to this “array” than just storing the pids and picking one randomly?

Having very large tuples is inefficient when you update them. You basically make a copy of the whole tuple every time, but not the elements themselves of course. If it is relatively static then a tuple will work fine, doing your occasional deletes would probably work fine. Otherwise the :array module mentioned by @NobbZ is a good alternative. Maybe just using a map would work as well.

Tuples :array's and maps are local to one process so if multiple processes need to access it then perhaps using an ETS table would be better. These can be globally accessed however they have no built in transaction mechanism so you might need a process to manage them.

Qqwy · September 25, 2017, 8:46pm

I have two fine packages for you:

Arrays exposes a common array interface, with multiple different implementations that you can try if the default performance characteristics are not to your liking. (The two built-in implementations are a map-based implementation and an :arrays-based implementation.)

This is probably perfect for your needs, although the last time I had time to work on it I was unable to write proper documentation and tests, so these are still lacking.

If you need to do more mathematical stuff with them:

Tensor gives you one-dimensional vectors, two-dimensional matrices and n-dimensional tensors, and defines many common operations on them. Tensor stores these in a sparse format, which may or may not be suited well to your problem domain.

Tensor is very stable and greatly documented and tested.