I am required to work with an external web service riddled with issues and quirks, such as not being very RESTful and, for no obvious reason, becoming remarkably slow once record counts approach roughly 5,000.
A central resource in the web service provides tens of thousands of records (usually between 20k and 40k) and takes at least 15 seconds to respond (a “won’t fix”, unfortunately). The payload runs to tens of megabytes.
The web service does not offer any method of querying records and no such method will be supplied in the foreseeable future.
While I’m no stranger to the fact that caching comes with significant trade-offs, given the above I cannot see that I have any choice but to cache.
Needs
The practical use case calls for monitoring for changes, i.e. updated or new records. Given that, I would think the cache has to be backed by reasonably competent persistence. Postgres would be a personal preference, but I’m open to other layers if they are more suitable.
While initially pondering options, I’ve been thinking along the lines of a recurring job that fetches all records from the web service and inserts or updates changed or new records (roughly sketched below). Obviously this risks deadlocks and similar contention, but that’s where I’ve landed so far.
Luckily, the records at least include a last-updated timestamp.
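For concreteness, here is a rough, non-authoritative sketch of that recurring job as a plain GenServer (Oban or Quantum could handle the scheduling just as well). `MyApp.RemoteClient.fetch_all/0` and `MyApp.Sync.upsert_records/1` are hypothetical names standing in for the web-service client and the persistence step.

```elixir
defmodule MyApp.RemoteSync do
  use GenServer

  # Assumed polling interval; tune to however fresh the cache needs to be.
  @interval :timer.minutes(15)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Kick off the first sync shortly after boot instead of blocking startup.
    schedule_sync(:timer.seconds(5))
    {:ok, %{}}
  end

  @impl true
  def handle_info(:sync, state) do
    # Fetch everything (the slow 15s call) and hand it to the upsert step.
    with {:ok, records} <- MyApp.RemoteClient.fetch_all() do
      MyApp.Sync.upsert_records(records)
    end

    schedule_sync(@interval)
    {:noreply, state}
  end

  defp schedule_sync(delay), do: Process.send_after(self(), :sync, delay)
end
```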
Question
Trying to be a good Ash citizen, how would one generally approach this kind of problem?
Prior art
I’ve looked through the forum and the Discord but haven’t been able to find anything resembling the case above.
Any input regarding the general direction and any Ash conveniences would be greatly appreciated.
If the data is for reference only, not to be edited in your system, then you can pull it in the background at some appropriate interval, compare, and only insert/update new or changed records.
This should be fine to do in Postgres, and unless you need the 40k records to be transactionally coherent with each other, you can avoid locking everything in a single transaction.
Meaning you shouldn’t see brutal contention. The trade-off between lock time and batch size is potentially interesting along the way: I don’t know whether an update_all with a bit of upsert logic for 40k records will hold locks for long compared to doing the same in 40 batches of 1k.
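As a non-authoritative illustration of that batching idea, a plain Ecto sketch might look like the following. `MyApp.Repo`, the `remote_records` table, its `external_id` / `payload` / `remote_updated_at` columns, and the `records` list are all assumptions; Ash’s bulk create actions with upsert options should cover similar ground if you’d rather stay at the resource layer.

```elixir
# Sketch only: batched upserts with hypothetical table/column names, assuming
# timestamptz (:utc_datetime) columns and `records` already decoded into maps.
now = DateTime.utc_now() |> DateTime.truncate(:second)

rows =
  Enum.map(records, fn r ->
    {:ok, remote_updated_at, _offset} = DateTime.from_iso8601(r["updated_at"])

    %{
      external_id: r["id"],
      payload: r,
      remote_updated_at: remote_updated_at,
      inserted_at: now,
      updated_at: now
    }
  end)

# 1k-row chunks keep each statement's lock window short instead of one huge upsert.
rows
|> Enum.chunk_every(1_000)
|> Enum.each(fn batch ->
  MyApp.Repo.insert_all("remote_records", batch,
    on_conflict: {:replace, [:payload, :remote_updated_at, :updated_at]},
    conflict_target: [:external_id]
  )
end)
```

Note that `{:replace, ...}` overwrites on every conflict; if you want to skip rows whose remote_updated_at hasn’t changed, you’d either filter against the existing rows before chunking or use a query-based on_conflict.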
In-memory storage like ETS is more fun, but the annoying thing is that your system would cold-start without the data at all; if the data matters to the operation of the system, that means a 15-second delay on every start and a general nuisance.
There is a way to determine whether an upsert was an UPDATE or an INSERT, but that requires using xmax / ctid from PostgreSQL internals etc.; otherwise, check the updated_at timestamp?
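For reference, the usual shape of that xmax trick is roughly the following. It leans on undocumented Postgres internals (xmax is 0 only for a row freshly inserted by the statement), and the table/column names are the same hypothetical ones as in the sketch above.

```elixir
defmodule MyApp.Sync do
  # Hypothetical helper: upserts a single record via raw SQL and reports whether
  # the upsert resulted in an INSERT or an UPDATE, using Postgres' xmax.
  def upsert_one(%{external_id: id, payload: payload, remote_updated_at: updated_at}) do
    sql = """
    INSERT INTO remote_records (external_id, payload, remote_updated_at, inserted_at, updated_at)
    VALUES ($1, $2, $3, now(), now())
    ON CONFLICT (external_id) DO UPDATE
      SET payload = EXCLUDED.payload,
          remote_updated_at = EXCLUDED.remote_updated_at,
          updated_at = now()
    RETURNING (xmax = 0) AS inserted
    """

    {:ok, %{rows: [[inserted?]]}} = MyApp.Repo.query(sql, [id, payload, updated_at])

    if inserted?, do: :inserted, else: :updated
  end
end
```

Comparing the remote last-updated timestamp against what is already stored, as suggested above, is the simpler route if you don’t need to distinguish the two cases at write time.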