Matching a value in a list stored in ETS

Hi community,

As a newbie, I may be asking something that seems obvious to you functional veterans (I’m discovering new skies with Elixir and Erlang and, I have to admit, I like it :).

My use case is pretty close to the Elixir School one:
https://elixirschool.com/lessons/specifics/ets/#advanced-lookup

(actually, I store id => map entries, where each map stores lists)

So, considering a stored tuple containing a list, what is the golden path to match a tuple whose list contains a given value?

To follow the Elixir School example, how could I retrieve the people who know Java?

I’ve read the Erlang docs and Stack Overflow, and it seems that lists:member/2 is not allowed in guards.

Does the solution imply reordering the data into a second ETS set of known language => user?

Thanks for your tips!

Fred


You will need to retrieve the value and test it. Only calls that can be trivially inlined and have O(1) cost (or, more specifically, the ones that the BEAM VM can turn into special optimized instructions) can be used in guards. So yeah, just grab and test, or use something like PostgreSQL. :slight_smile:
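To make “grab and test” concrete, here is a minimal sketch. The %{languages: ...} row shape (and the sample data, adapted from the Elixir School lesson) are assumptions, not a fixed schema:

```elixir
# Store an id => map row, fetch it by key, then test the list
# in ordinary Elixir code instead of a guard.
table = :ets.new(:people, [:set])
:ets.insert(table, {"doomspork", %{name: "Sean", languages: ["Elixir", "Ruby", "Java"]}})

[{_id, %{languages: langs}}] = :ets.lookup(table, "doomspork")
"Java" in langs
#=> true
```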


With :ets you get fast lookups if your key is part of the match.

In the Elixir School example, “Java” is not part of the key but a value. In that case ETS has to do a full table scan, which means it has to try to match every row in the table. You are on the right track about reordering the data: inverting your values and keys and storing them in another ETS table should give you the best performance. Also know that there is a :bag type of ETS table available, which can be used in this case.
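As an illustration (a sketch only; the table names, row shapes, and sample data are made up), an inverted :bag index next to the main :set table could look like this:

```elixir
# Main table: id => map, as in the original question.
users = :ets.new(:users, [:set])
# Inverted index: language => user id. :bag allows several rows per key.
by_language = :ets.new(:by_language, [:bag])

:ets.insert(users, {"doomspork", %{name: "Sean", languages: ["Elixir", "Ruby", "Java"]}})

for lang <- ["Elixir", "Ruby", "Java"] do
  :ets.insert(by_language, {lang, "doomspork"})
end

# A key-based lookup instead of a full scan:
:ets.lookup(by_language, "Java")
#=> [{"Java", "doomspork"}]
```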

I had a similar situation where I had to look up first by country and then by city, which wouldn’t be the right fit, performance-wise, for ETS. So I flipped the order and maintained a bag of {city, country}.


Would ets:foldl/3 offer any benefit?
Edit: I guess it would, as you can just hand it the entire table.


OK, I understand the constraints on guards now. So basically I should handle a full table scan manually, which does not sound that efficient if the store is filled with a few thousand records. Unless…

… you build an index :slight_smile:

Thanks, guys, for bringing some light to my question.

Fred


The point of ets:foldl/3 is that you don’t have to handle the full table scan manually - the ets module does it for you (it’s still a full table scan, though). You simply give it the table, an initial value (likely an empty list), and a function fun((Element :: term(), AccIn) -> AccOut). The function is presented with each element in the table in turn, and it can then stuff any element “it likes” into the accumulator (e.g. the list). The completed accumulator is then returned to you.
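For example (a sketch assuming the {id, map} row shape from the question; `table` is a hypothetical ETS table of such rows):

```elixir
# Collect every {id, map} row whose :languages list contains "Java".
# Rows of any other shape are simply skipped.
java_devs =
  :ets.foldl(
    fn
      {_id, %{languages: langs}} = row, acc ->
        if "Java" in langs, do: [row | acc], else: acc

      _row, acc ->
        acc
    end,
    [],
    table
  )
```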

You should be able to test your function with List.foldl/3 (its fun takes the element and the accumulator in the same order).


If you’re thinking about indexing, you might want to consider Mnesia, because it supports that feature out of the box. As others suggested, you’ll likely need a bag table with the fields user_id (key) and language. Then you can add a secondary index on the language field.
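A minimal sketch of that setup (a RAM-only table; the :user_language name and the sample data are illustrative, not prescribed):

```elixir
:mnesia.start()

# Bag table: one {user_id, language} record per known language,
# with a secondary index on the :language field.
:mnesia.create_table(:user_language,
  type: :bag,
  attributes: [:user_id, :language],
  index: [:language]
)

:mnesia.dirty_write({:user_language, "doomspork", "Java"})
:mnesia.dirty_write({:user_language, "doomspork", "Elixir"})

# The secondary index makes this lookup fast:
:mnesia.dirty_index_read(:user_language, "Java", :language)
#=> [{:user_language, "doomspork", "Java"}]
```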


Hi peerreynders, OK, thanks for the explanation. It would definitely add some flexibility (matching a value in the list is only part of the problem; other filtering criteria will also be used).

Actually, to describe the big picture: as in this post, I’m trying to build a proof-of-concept ad server.

Basically, I need to filter our ad inventory (my current ETS store) based on an incoming ad slot description (the matching guard / foldl function). The filtering is done on several criteria, and the server is expected to output something in 10-20 ms under a load of 4,000-5,000 requests/sec.

All in all, I assume that having several indexes, or a Mnesia store (which I need to test) as @sasajuric mentioned, should be faster than a full scan.

So it sounds like a fun problem for learning Elixir :))


Correct - because you need a design that is highly optimized for particular types of queries, which means that the overhead of maintaining the indices is an accepted tradeoff.

But it’s also starting to sound like a problem that isn’t necessarily going to be solved by simply using some key language/technology features. After implementing “the simplest thing that can possibly work” with, let’s say, Mnesia, benchmark it to see how close you are getting to your required targets - it may turn out that you still need to exploit some peculiarity of your domain in order to meet them.

For example, if the data set is huge but there are some properties of the data (or queries) that make it possible to partition the data uniformly over multiple data sets, then it may make sense to spawn multiple parallel queries over those partitions and merge their results at the end. Maybe there is some kind of pattern in the requests that a custom caching solution can exploit. I’m not at all saying that any of this applies to your particular situation - they are just examples. Done poorly, or under the wrong set of circumstances, tactics like this can degrade performance and make solutions unnecessarily complex.

The Fallacy of Premature Optimization
“Premature optimization is the root of all evil” vs. “Never give up your performance accidentally”


Can’t agree more on that. Actually, there are already some use cases (geolocalisation, targeting by an exclusion list…) that will need something different (probably a two-phase filtering: one for finite sets of values, which could be done with a hash or bit mask, and another for non-finite sets). Well, whatever.

For now, let’s try to address the easy use case; the benchmark is the only justice of the peace :slight_smile:


Yes, but please look at your original post.

and now look at where we have arrived. You already had a solution in mind and asked for help with that solution, while at the same time not revealing what your problem at large actually was. Topic contributors can only help you effectively if you clearly outline the actual problem and the constraints you are working under - otherwise we may all end up in an X-Y Problem kind of situation.

The point is that there are probably forum members around who can give you some tips on how to establish a benchmark that gives a fair and realistic assessment of the technology’s capability to address your particular requirements, but they wouldn’t necessarily have been drawn in by your topic title or original post.


Not sure if I’m going to be any help, but I did something similar recently. I didn’t have to match based on a specific item in a list, but on the length. The match specifications used by :ets.select and others can get confusing, so I just used fun2ms in iex to create one.

Now on to your problem. I used :ets.first and :ets.next to iterate through the table and get every stored value. It was an unordered set, but the traversal still reaches every entry when the table isn’t being modified concurrently. It does take a few seconds for roughly 170k entries. After that point I just used a case statement and guards on every entry to find what I needed within the lists stored at each table entry. I did this just earlier today, so it could probably be optimized further, but the amount of time it takes isn’t too long for what I’m doing. If you need something in a few milliseconds, I’d go with a different approach.
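For reference, that first/next walk could look roughly like this (a sketch; the Scan module name and the %{languages: ...} row shape are invented for illustration):

```elixir
defmodule Scan do
  # Walk the whole table with :ets.first/1 and :ets.next/2,
  # collecting rows whose :languages list contains `wanted`.
  def find(table, wanted) do
    walk(table, :ets.first(table), wanted, [])
  end

  defp walk(_table, :"$end_of_table", _wanted, acc), do: acc

  defp walk(table, key, wanted, acc) do
    acc =
      case :ets.lookup(table, key) do
        [{^key, %{languages: langs}} = row] when is_list(langs) ->
          if wanted in langs, do: [row | acc], else: acc

        _ ->
          acc
      end

    walk(table, :ets.next(table, key), wanted, acc)
  end
end
```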


Hi @peerreynders

I understand your remark, but actually my initial question is definitely part of the problem. To extend the Elixir School example: I’m an intermediary recruiter, and as such I have a number of developer profiles with attributes describing their abilities (e.g. known languages, country, city, etc.). When I receive a job offer, my job is to match a set of devs against that offer.

Even if it’s a multi-criteria match, the main question I had (as a newbie - I repeat, I’m not by any means a functional programmer yet :), wanting to test something more evolved than a regular list of maps/structs, was how to filter on this list stored in a map in ETS. I deliberately reduced the scope because adding “andalso” clauses didn’t seem too complicated.

So, considering only this scope, everyone answered my question fully (and more, actually: I understood the complexity limitation chosen for guard clauses, got some alternative proposals, and some real-world experience). For now, being still on the learning slope, I’m just trying to address the functional problem (multi-criteria matching) before trying to optimize anything. My remark on benchmarks is just a rule of thumb concerning optimization investigations; I’m not at that stage now, so this post won’t flow to that side of the topic :wink:

Thanks again


Hey @mischief901, thanks for contributing :slight_smile:

In terms of volume, I will deal with less (10x-15x less, actually), so I will definitely give the foldl solution given by @peerreynders a try (which should have the same complexity as yours). Even if it does not fit, I will learn from this option.

Thanks again for your feedback!
