Hello!
I currently have a business case where I periodically (about every 5-10 seconds) need to check whether data being inserted already exists in the database. It will mostly be the case that the data was indeed inserted before.
Now I’m asking myself what would be the most efficient way to handle that.
Here are some solutions I came up with:
* Using |> unique_constraint(:device_id) in the changeset
* Setting a unique index on the identification criteria (every tuple to be inserted has its own device_id)
* Trying to fetch data with the given identification criteria (the device_id, which is known in advance)
* Keeping a state with all device_ids currently existing in the DB and checking against that
It would be cool if someone could give an elaborate explanation.
The first option alone will not do anything if there is no constraint set/checked at the DB level.
The second alone will likely crash your process, since the changeset does not know about the constraint.
So these two in combination sound like a viable approach.
Prone to races
As the DB has to be treated as if foreign entities are reading from and writing to it, you will have huge problems keeping that cache up to date, so you have to combine it with another approach that saves you from “accidental” writes.
As @NobbZ said: both. That’s the only truly reliable way.
Additionally, if you want to cry a lot and fix bugs at 3:00 in the morning, you can also introduce a cache that checks if a device_id has already been inserted before any inserting agent does it. But that’s only if you don’t value your personal time.
I would definitely go with the unique_constraint plus a unique index on the device_id column of the table, as the simplest and correct way of looking up these devices. (First make it work and correct.)
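To illustrate the combination above, here is a minimal sketch of how the two pieces fit together in Ecto. The module, table, and field names (MyApp.Device, devices, device_id) are illustrative assumptions, not from the original posts:

```elixir
# Migration: enforce uniqueness at the DB level (names are illustrative).
defmodule MyApp.Repo.Migrations.AddDeviceIdUniqueIndex do
  use Ecto.Migration

  def change do
    create unique_index(:devices, [:device_id])
  end
end

# Schema/changeset: unique_constraint tells Ecto about the DB constraint,
# so a violation comes back as {:error, changeset} instead of raising
# Ecto.ConstraintError and crashing the calling process.
defmodule MyApp.Device do
  use Ecto.Schema
  import Ecto.Changeset

  schema "devices" do
    field :device_id, :string
  end

  def changeset(device, attrs) do
    device
    |> cast(attrs, [:device_id])
    |> validate_required([:device_id])
    |> unique_constraint(:device_id)
  end
end
```

With both in place, a plain Repo.insert/1 is race-free: the database rejects duplicates atomically and your code gets a normal error tuple to pattern-match on.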
To go faster (after making it work and correct), you could deploy various techniques, from very simple to more complex (the latter giving more headaches because of a higher risk of race conditions).
Simplest in my mind would be a cache with a TTL (time to live) for every record you look up, so you’ll always get a database hit on the first try and then it ‘sticks’ for x seconds or minutes within your cache. This is only applicable if it’s acceptable to let a record / device access linger for up to the TTL after a deletion. If that is not allowed, then you need some mechanism to update the cache after a deletion (see Postgres pub/sub listen/notify below).
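The TTL cache described above can be sketched with a plain ETS table. This is a minimal illustration, not a production cache; the module name, table name, and 30-second TTL are all assumptions:

```elixir
# Minimal TTL cache sketch backed by ETS (names and TTL are illustrative).
defmodule DeviceCache do
  @table :device_cache
  @ttl_ms 30_000

  def init do
    :ets.new(@table, [:set, :public, :named_table])
    :ok
  end

  # Returns true only if the device_id was cached recently enough.
  # A miss means the caller should take the slow path: look the
  # device up in the database and call put/1 on a hit.
  def known?(device_id, now \\ System.monotonic_time(:millisecond)) do
    case :ets.lookup(@table, device_id) do
      [{^device_id, expires_at}] when expires_at > now -> true
      _ -> false
    end
  end

  # Call after a successful DB lookup so the entry "sticks" for the TTL.
  def put(device_id, now \\ System.monotonic_time(:millisecond)) do
    :ets.insert(@table, {device_id, now + @ttl_ms})
    :ok
  end
end
```

Expired entries are simply ignored on lookup here; a real cache would also sweep them out periodically so the table doesn’t grow without bound.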
Now if you want to make it really fast, even on a first lookup, you could keep a simple map (fast lookup) on the Elixir side if you don’t expect too many entries (otherwise it would take up too much RAM). To keep this map up to date you could refresh it completely every x minutes (again, depending on how big the dataset is), or you could use LISTEN/NOTIFY (if you’re using Postgres) as a very lightweight pub/sub on modifications to your table.
(see for examples here and here)
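As a rough sketch of the LISTEN/NOTIFY side with Postgrex: the channel name devices_changed is an assumption, and this presumes you have set up a trigger on the table that runs NOTIFY on insert/delete (not shown here):

```elixir
# Hedged sketch: subscribe to a Postgres notification channel via Postgrex.
# Assumes a DB trigger sends NOTIFY devices_changed on table modifications.
{:ok, pid} = Postgrex.Notifications.start_link(MyApp.Repo.config())
{:ok, _ref} = Postgrex.Notifications.listen(pid, "devices_changed")

# Notifications arrive as ordinary messages to the listening process:
receive do
  {:notification, _pid, _ref, "devices_changed", payload} ->
    # e.g. invalidate or refresh the cache entry named in the payload
    IO.inspect(payload)
end
```

In practice you would run this inside a GenServer that owns the cache, so every notification updates the map in one place.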
To avoid data races, it’s best if you always do your writes directly to the database and ignore your cache; the cache should only be updated by your pub/sub mechanism or by a full snapshot every x seconds. You can always treat a cache hit as a true positive: if you can’t find the entry in your cache, you take the ‘slow path’ of looking it up in the database. That way you are always correct, and fast whenever it’s a returning client.
I’ve used the above technique myself to cache access tokens on the Elixir side (which can refresh themselves outside the HTTP request/response loop), bringing API calls down to microseconds because they don’t have to do a database round trip to check whether they are still allowed. In this case it doesn’t matter that it takes a couple of seconds for the cache to update after somebody’s access gets revoked, and if a token isn’t found in the cache I always make a database call to see if it exists. So it really speeds up the case where API clients return often.
In conclusion: it always depends on the context and requirements, but always go for working & correct first. Time it and see if it’s fast enough, because the database will also keep a lot of your data hot in memory (the last fetched records and, most of the time, your index), so it should be fast enough; if not, you can always put in more effort.