Search random elements in a big ETS table (> 1M)

Hey,

I need help to find a scalable and efficient solution to:

  1. Store a large number of elements (usernames)
  2. The table is not static => an element gets removed from the table when its process is demonitored
  3. Being able to return random elements efficiently

The context is that I’m storing users in different ETS tables based on certain criteria, and I want to be able to return them in my UI as part of a search.

First thing:

I’m dealing with usernames as keys, but at the same time I think the easiest way to get random elements from an ETS table is to use sequential numbers as keys,

i.e.

:ets.lookup(tab, Enum.random(1..1000)) 

So I thought about using 3 ETS tables:

  1. The first to store each username with its corresponding index (“usernames”)
  2. The second to store all the user indexes (“user_indexes”)
  3. The last one to increment the total count of users added, to get the next index (“count”)

Add user scenario (sketched in code after this list):

  1. Get the next index from “count”
  2. Add a new index in “user_indexes”
  3. Add corresponding username in “usernames” with that index
  4. Monitor the process
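
Here is a minimal sketch of those four steps. The table names, the {index, username} row shape, and the assumption that this runs in a process that will receive the :DOWN messages are all illustrative, not a definitive implementation:

def add_user(username, pid) do
  # 1. Get the next index: the "count" table holds a single counter row
  index = :ets.update_counter(:count, :total, 1, {:total, 0})

  # 2. Record the new index in "user_indexes"
  :ets.insert(:user_indexes, {index})

  # 3. Store the username under that index in "usernames"
  :ets.insert(:usernames, {index, username})

  # 4. Monitor the process so the entries can be removed on :DOWN
  Process.monitor(pid)

  index
end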

That’s my initial idea, but the problem is that when users get removed from the table, the indexes are not sequential anymore (e.g. {1..4, 6..400, ...}).

In that situation, the lookup call I use to get random elements would not return the expected results:

:ets.lookup(tab, Enum.random(1..200)) 

There’s another solution, which is to work directly on the username keys:

first = :ets.first(tab)

# Random-walk the keys: keep stepping forward while the coin flip
# says so, then look up the key we stopped on. An anonymous function
# cannot call itself by name, so it is passed to itself.
walk = fn key, walk ->
  if function_that_may_return_true() do
    case :ets.next(tab, key) do
      :"$end_of_table" -> throw(:reached_end_of_table)
      next_key -> walk.(next_key, walk)
    end
  else
    :ets.lookup(tab, key)
  end
end

walk.(first, walk)

Though that solution is not efficient for large tables.

At this point I’m not sure what I could do.

Cheers

Isn’t an ordered_set type of ETS table good enough?
So have an ordered_set ETS table, insert and remove usernames, and randomly fetch an item based on the number of items you’ve inserted?

How would you fetch random keys in an ETS table whose keys you don’t know?

The only way to do that, to my knowledge, is to use a combination of first and next to walk through the table, but that is not efficient.

Let’s say I have 1M entries in an ordered_set.

How can I get the stored usernames at positions 700_000 to 700_070 without going through all the usernames before them?

One way to deal with those gaps would be to “draw again” when you get an unoccupied integer.

The efficiency of that approach is going to depend on how often elements get deleted; lots of churn will make for lots of retries. How common are deletes in your application?

By “draw again”, do you mean re-populating all that data in the table every time to fill those gaps?

I’d say on average a user would appear in those ETS tables for less than a minute or so.
So those deletes will be frequent, yes.

Assuming you have an ordered_set with integer keys, you can do:

:ets.next(tab, Enum.random(1..1000))

If it returns end of table, draw again, or fall back to first.
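
A minimal sketch of that, assuming an ordered_set keyed by integers and a current_max counter tracked elsewhere (both names are illustrative):

def random_entry(tab, current_max) do
  case :ets.next(tab, Enum.random(1..current_max)) do
    # The draw landed past the last occupied key: wrap to the first one
    :"$end_of_table" -> :ets.lookup(tab, :ets.first(tab))
    key -> :ets.lookup(tab, key)
  end
end

Note that :ets.next/2 returns the first key strictly greater than the given one, so it works whether or not the drawn integer is occupied.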

I meant “draw” in the lottery sense:

  • get an index with Enum.random(1..current_max)
  • fetch the element with that key
  • if no element, try again

This would work fine for systems with infrequent deletions, since most of the time the first lookup succeeds.

It would be extremely BAD for a high-churn system, since current_max would be increasing constantly but the table is mostly unoccupied. The average runtime of a “pick a random user” would just get worse and worse…
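
For illustration, the draw-and-retry loop is only a few lines (a sketch, assuming integer keys and a current_max tracked separately):

def draw(tab, current_max) do
  case :ets.lookup(tab, Enum.random(1..current_max)) do
    # Unoccupied slot: draw again
    [] -> draw(tab, current_max)
    [entry] -> entry
  end
end

The expected number of retries per call is roughly current_max divided by the number of live entries, which is exactly the degradation described above.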

That would not solve current_max increasing constantly while the table is mostly unoccupied, as pointed out by @al2o3cr.

What’s the use case here? What do you need to fetch random users out of a huge ETS table for?

I’m building an app where users can search for:

  1. Online users
  2. who play certain video games
  3. and challenge them on that game

When a user searches for players, my goal is to return a list of users that is as random as possible for each search.
If the same users are returned in the search results too many times, the same users would be challenged, which would create spam, while others would be left out.

The end goal is really to create couples (1vs1) as fast as possible.

For matchmaking, wouldn’t you want to prefer people having waited longer over people just having joined the queue? Also, would there really be a million people waiting in the queue, or rather a million people playing? If you form couples quickly enough, the queue might stay at a reasonable size.

For matchmaking, wouldn’t you want to prefer people having waited longer over people just having joined the queue?

Not sure “queue” is the right word in that context.
Users are not really waiting in line; they are not really waiting for anything.

It’s more like: at some point we have X users online, ready to play a certain game, and as a user, whether there are 100_000 of them or fewer, I want to see a random 100 of them displayed in a list that I can scroll through.

Therefore, I keep track of all those users at all times and remove them either when they disconnect or when they have found a partner to play with.

It’s more of a status-tracking system across different games and levels (beginner, good, high).

Also, would there really be a million people waiting in the queue, or rather a million people playing?

It could be a million users; some of them might look at the search results and scroll through them for some time (minutes) before finding someone they want to play with.

It’s difficult to predict how fast the number of users would decrease.

It depends on users’ preferences and how “good” the search results are.
Once again, that’s why I want the results to be as different as possible for everyone.

So I guess sometimes it could decrease really fast, and sometimes it could stagnate for a while.

If you form couples quickly enough, the queue might stay at a reasonable size

Well, if I manage to return user lists fast enough, and shuffle them enough that users don’t see the same users on their screen, I definitely have a higher chance of creating couples faster. But as I said, the ball is then in the court of the user scrolling through the results.

If he’s not pleased with it, he can still click “search” again and see another set of random users.

That’s basically the goal.

If there is constant churn, then current_max will be ever increasing. However, :ets.next/2 should still be O(log n), because ordered_set is a balanced binary tree under the hood; next works the same way regardless of whether the element exists or not.

If you want to keep your id space from getting too sparse, you can take a random jab into 1..current_max at allocation time and reuse that id if it’s not occupied.
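
As an illustration, the allocation-time probe could look like this (a sketch; next_index/0 stands in for whatever fresh-counter mechanism is used):

def allocate_slot(tab, username, current_max) do
  candidate = Enum.random(1..current_max)

  # insert_new/2 only succeeds if the key is unoccupied,
  # so probing and claiming the slot happen atomically
  if :ets.insert_new(tab, {candidate, username}) do
    candidate
  else
    index = next_index()
    :ets.insert(tab, {index, username})
    index
  end
end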

If there is constant churn, then current_max will be ever increasing. However, :ets.next/2 should still be O(log n),

Ok, I read somewhere that it would not scale well, but I guess I should prototype and benchmark this 🙂

If you want to keep your id space from getting too sparse, you can take a random jab into 1..current_max at allocation time and reuse that id if it’s not occupied.

Yes, I was thinking about storing the removed indexes and reusing them when a new user joins.
That could work.

Thanks

What if you were to insert into an :ordered_set table with a uniformly random key? Then the first N items in the table would be properly shuffled ahead of time. When you want to sample N items, you just take the first N you encounter with :ets.first/:ets.next and then reshuffle them by deleting and re-inserting (see the sketch after the list below).

Some disadvantages:

  • deleting and re-inserting should be atomic, hence the GenServer
  • collisions theoretically possible
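
A minimal sketch of that scheme, assuming random float keys; in a real system the sample-and-reshuffle step would live in a GenServer so the delete/insert pair stays atomic:

def insert_user(tab, username) do
  # A uniformly random float key pre-shuffles the table
  :ets.insert(tab, {:rand.uniform(), username})
end

def sample(tab, n) do
  # Collect the first n entries in key order
  entries =
    Stream.iterate(:ets.first(tab), &:ets.next(tab, &1))
    |> Stream.take_while(&(&1 != :"$end_of_table"))
    |> Enum.take(n)
    |> Enum.flat_map(&:ets.lookup(tab, &1))

  # Reshuffle the sample under fresh random keys so the next
  # caller sees a different front of the table
  for {key, username} <- entries do
    :ets.delete(tab, key)
    :ets.insert(tab, {:rand.uniform(), username})
  end

  entries
end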

I guess you’d incur the same perf problems with that, just not on lookup, but on reshuffling.

With 1M items, sampling 10 takes ~100 usec, sampling 1000 takes ~3 msec.

Oh, I was reading this as the whole table would be reshuffled.

Yeah good point. My method would make it very unlikely to ever sample an item that was inserted at the end. You’re right, reshuffling the whole table would be necessary.

Lol, I read the OP and knew there had to be matchmaking involved. I don’t think I fully understand the use case here, but putting that aside for a moment…

This is what I was going to suggest, but I see you figured it out first 🙂 I was feeling so clever, too!

But why do you have to sample from the start? Sample from a random index in the uniform space and just scan until you hit the quota. Wrap around if you hit the end.

I don’t think this is equivalent to a true random sample, but for this use case I don’t think anyone will be able to tell.

Edit: Actually, if you choose 100 random indices and call :ets.next_lookup/2 with them (instead of scanning from the first), I think this would be a truly random sample. Right?

1..100
|> Enum.map(fn _ -> Enum.random(1..1_000_000) end)
|> Enum.map(fn i ->
  # next_lookup/2 (OTP 27+) returns :"$end_of_table" when i is
  # past the largest key; wrap around to the first entry then
  case :ets.next_lookup(table, i) do
    {_key, [{_, value}]} -> value
    :"$end_of_table" ->
      {_key, [{_, value}]} = :ets.first_lookup(table)
      value
  end
end)

Of course you could get duplicates but there are ways to deal with that.
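
For instance, as a sketch: over-draw, dedup, and trim. fetch_one/1 here is a hypothetical wrapper around the :ets.next_lookup/2 call above, returning one value per index:

1..120
|> Enum.map(fn _ -> Enum.random(1..1_000_000) end)
|> Enum.map(&fetch_one/1)
# Nearby draws can land on the same entry, so dedup and trim
|> Enum.uniq()
|> Enum.take(100)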

Since the keys are already truly random, there should be no bias from adding/removing users, no matter which users they are.
