Hello, I’m working on a Phoenix application and I’d like to get a design advice.
I’m my application users can request to “track” an URL and get notified when it changes.
During this track request, the app attempts to download the page. If it succeeds it stores it in the database. If it fails it returns an error to the user.
This design is failing me if multiple users request to track the same URL at the same time. While the back-end is downloading the page for one request, it might get another request and I’ll end up with duplicates in the database.
I can solve this in different ways:
I could do some table locking, but it would drastically reduce concurrency
I could add a unique constraint in the database and deal with failures on duplicate insertions
I could have parallel download processes and a single process that does the insertion in the database
I could have UPSERT, if that’s supported by Ecto. I would do the download twice but at least I’d have a consistent database state.
What’s the best and the most idiomatic solution to this issue?
I would handle this in the application. Create an OTP app with a server that manages requests, spawns workers to fetch specific URLs and subscribes clients to the appropriate worker for the feedback. The worker can then check for the existence of the URL in the database and persist it if not there.
To extend @atimberlake’s answer, make just one worker for each tracked url (not sure if this is what @atimberlake meant). Put your database api call in those workers, to eliminate the race. Of course, your specific needs may require some variation on this.
Thanks @adamk, that’s exactly what I meant.
Each worker should then be able to handle multiple users/subscribers who can get notifications back about the success/failure of that URL download.
Nice. This is a elegant solution. The pattern of serializing database writes is used in a generalized manner in datomic via the Transactor. Reads can be parallelized but writes go through the Transactor. Thankfully Elixir/ erlang have all the tools we need
This is an option, but keep in mind this gets harder in a cluster. You’ve got to have a globally registered process to deal with this.
Postgres has very advanced concurrency handling built in. You can go through the effort of building your global registry and handling race conditions when multiple nodes try to claim the name and all of that, then serialize writes. Or you can just do a postgres query and let it handle it. The latter is definitely easier, and quite probably more performant.