Multi-step orchestration with 3rd party API

For context, I’m building a Phoenix application with LiveView and I’m using Ecto with Postgres.

In one part of the application, users need to interact with a 3rd party API via my UI. The flow is as follows:

  1. [browser] The user picks an entity to connect to their account
  2. [server] I do two sequential calls to create two records in the 3rd party system (the second object needs the ID of the first one)
  3. [server] I store these two records in my database as well (for easier bookkeeping and fewer requests to the 3rd party API); see the sketch after this list
  4. [browser] The user is then redirected through an authentication flow with the 3rd party system
  5. [browser] After the user has authenticated, they are redirected back to my application
  6. [server] I fetch more information from the 3rd party API along with updating the two records I created in step 2
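
A minimal sketch of steps 2 and 3, assuming a hypothetical `ThirdParty` client module and a single `Connection` schema holding both external IDs (all names here are illustrative, not from my actual code):

```elixir
defmodule MyApp.Connections do
  alias MyApp.Repo
  alias MyApp.Connection

  # `ThirdParty.create_parent/1` and `ThirdParty.create_child/1` are
  # hypothetical wrappers around the 3rd party API; the second call
  # needs the ID returned by the first, hence the sequential `with`.
  def connect(entity) do
    with {:ok, parent} <- ThirdParty.create_parent(entity),
         {:ok, child} <- ThirdParty.create_child(parent.id) do
      # Store both external IDs locally, flagged as "pending" until
      # the auth flow in steps 4-6 completes (or cleanup removes them).
      %Connection{}
      |> Connection.changeset(%{
        parent_external_id: parent.id,
        child_external_id: child.id,
        status: "pending"
      })
      |> Repo.insert()
    end
  end
end
```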

One of the problems is that I need to store the two records created in step 2 in my database before the user has authenticated with the 3rd party API, since that flow can take a while to complete (if it completes at all). I also need to clean up the records in the 3rd party API if anything goes wrong during the flow, and update the corresponding rows in my database.

I found Sage which seems to fit this scenario perfectly. However, while developing the solution using Sage, it kept “eating” the exceptions that were raised.
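
For reference, a sketch of what the saga might look like with Sage’s `run/4` (a transaction plus a compensation per step); the `ThirdParty` functions are the same hypothetical wrappers as above:

```elixir
def connect_saga(entity) do
  Sage.new()
  |> Sage.run(
    :parent,
    fn _effects, %{entity: entity} -> ThirdParty.create_parent(entity) end,
    fn parent, _effects, _attrs ->
      # Compensation: remove the remote record if a later step fails.
      ThirdParty.delete_parent(parent.id)
      :ok
    end
  )
  |> Sage.run(
    :child,
    fn %{parent: parent}, _attrs -> ThirdParty.create_child(parent.id) end,
    fn child, _effects, _attrs ->
      ThirdParty.delete_child(child.id)
      :ok
    end
  )
  |> Sage.run(:persist, fn %{parent: parent, child: child}, _attrs ->
    %Connection{}
    |> Connection.changeset(%{
      parent_external_id: parent.id,
      child_external_id: child.id,
      status: "pending"
    })
    |> Repo.insert()
  end)
  |> Sage.execute(%{entity: entity})
end
```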

I’m wondering if there are any other approaches to this problem or how people have solved this in their own applications.

For cleanup there is a good thread here: How to run a job periodically

Many apps will have other jobs and things they need to schedule, or background transactions whose completion they must guarantee, and for that Oban is an excellent library.
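
A sketch of how that periodic cleanup might look with Oban’s cron plugin; the queue name, schedule, cutoff, and the `Connection` schema are assumptions carried over from the sketches above:

```elixir
# config/config.exs: schedule the reaper every 30 minutes
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [maintenance: 1],
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [{"*/30 * * * *", MyApp.Workers.ConnectionReaper}]}
  ]
```

```elixir
defmodule MyApp.Workers.ConnectionReaper do
  use Oban.Worker, queue: :maintenance

  import Ecto.Query
  alias MyApp.{Repo, Connection}

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # Remove connections that never finished the auth flow. You would
    # also call the 3rd party API here to delete the remote records.
    from(c in Connection,
      where: c.status == "pending" and c.inserted_at < ago(1, "hour")
    )
    |> Repo.delete_all()

    :ok
  end
end
```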

Thank you for the suggestion. I am familiar with periodic jobs and Oban.

As you suggest, I could run a periodic job to clean up rows from the database. It does feel somewhat inelegant, though.

It would be inelegant only if you had a completely deterministic flow in terms of time, i.e. the user always authenticates at most 30s after you get the tokens.

But since that’s very likely not true, having a “reaper” periodic task works best. One of the customers I worked for needed just that: users initiate actions and state is saved in the DB as a result. If the user doesn’t continue the action within X days, a periodic task eventually deletes the DB records, because it was very easy for the table in question to have 5000+ records of which only 50-60 were actually pertinent. That is not a problem at that scale, but let the system run for 1-2 years and you can easily end up with 200K+ such records, at which point querying can start to slow down and affect other, more active users.


I understand why you call periodic cleanup inelegant. Care needs to be taken to avoid race conditions (use after deletion) between the front-end tasks and the background task. However, it is a solvable problem, and I am all ears for a better idea.

There won’t be race conditions in a transactional database if you are updating the state of those records before the “deadline” is met. Once the deadline is exceeded, the app should behave as if the records never existed. The fetching of records in the authentication path should respect the maximum permitted age of a record, both to honour the deadline in case the cleanup job has not run yet for some reason and to avoid a transaction conflict. You can also run the cleanup job at an interval of deadline*2 or deadline+X to keep the cleanup task well outside the usage window.
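
In Ecto terms, respecting the maximum age can be as simple as building the deadline into every read on the authentication path (again using the hypothetical `Connection` schema; the 60-minute window is made up):

```elixir
import Ecto.Query

# Only rows still inside the validity window are ever returned, so a
# late cleanup job can never hand the auth flow a stale record.
def fetch_pending_connection(id) do
  from(c in Connection,
    where: c.id == ^id and c.status == "pending",
    where: c.inserted_at > ago(60, "minute")
  )
  |> Repo.one()
  |> case do
    nil -> {:error, :expired_or_not_found}
    conn -> {:ok, conn}
  end
end
```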

Another approach you can use, if you have a fast-growing table, is table partitions: just drop the oldest partition each “period”. It’s much faster than deleting rows, but it’s not necessary unless you’re dealing with a large number of rows. Oban uses this technique for its job queues, which is one of the reasons it can handle millions of jobs per minute on very modest hardware. A rough migration sketch follows below.
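
The DDL has to be raw SQL since Ecto’s migration DSL has no partitioning support; all names here are illustrative:

```elixir
defmodule MyApp.Repo.Migrations.PartitionConnections do
  use Ecto.Migration

  def up do
    # Postgres requires the partition key to be part of the primary key.
    execute """
    CREATE TABLE connections (
      id bigserial,
      status text NOT NULL,
      inserted_at timestamptz NOT NULL,
      PRIMARY KEY (id, inserted_at)
    ) PARTITION BY RANGE (inserted_at)
    """

    # One partition per week; cleanup is then a cheap
    # `DROP TABLE connections_2024_w01` instead of a bulk DELETE.
    execute """
    CREATE TABLE connections_2024_w01 PARTITION OF connections
      FOR VALUES FROM ('2024-01-01') TO ('2024-01-08')
    """
  end

  def down do
    execute "DROP TABLE connections"
  end
end
```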

Again, reaching for Oban is the elegant solution.


I have seen rare bugs where the front end did a first lookup and got some id, then later inserted a record in another table that references that id. If the garbage-collection routine sneaks in between and deletes the referenced row (very rare), boom, the transaction fails.

The only fix I can think of is to have some built-in guard band. E.g. say rows are considered stale after one hour in the “infant” state; all front-end logic must then treat any row that has been in the “infant” state for more than 59 minutes as invalid. However, the 1-minute guard band is still not bulletproof. What if the front-end logic takes exceptionally long?

Yes, it can happen if the operations are done in different transactions.

Applications can also do SELECT FOR UPDATE to lock those rows, causing a conflict for any other transaction that attempts to touch them.
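
In Ecto that is the `:lock` query option; a sketch using the same hypothetical schema as earlier in the thread:

```elixir
import Ecto.Query

def update_locked(id, fun) do
  Repo.transaction(fn ->
    conn =
      from(c in Connection,
        where: c.id == ^id and c.status == "pending",
        lock: "FOR UPDATE"
      )
      |> Repo.one()

    # While this transaction holds the row lock, a concurrent cleanup
    # pass blocks (or conflicts) until we commit.
    if conn, do: fun.(conn), else: Repo.rollback(:not_found)
  end)
end
```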


It is not always possible to put the lookup and update in the same transaction. For example, there can be computation in between, or worse, a 3rd party API lookup.

It sounds like you’re looking for every way to do it wrong.

It really doesn’t matter whether it is a single transaction or not: if the first thing you do is select for a valid state, meaning a state within the deadline, you exclude invalid, stale states from being progressed. If there is no valid state available then you can’t proceed, and it’s an error.

There is only an “imaginary problem” if you violate the business rule above and continue processing using a stale state well beyond the deadline. Given that’s not what the logic above does, there is no problem.

If your maximum deadline for continuing the request is, say, 3 hours, then the cleanup interval can be 6 hours: even a request continued at the 2 hour, 59 minute and 59 second mark will be a further 3 hours past its 3-hour deadline by the time the cleanup job tries to remove the row.

If you want “extend the deadline” behavior, then that first step needs to be a “touch” that kicks the state along another X hours, whatever the sensible deadline is to complete the process. The cleanup job will never pull the rug out from under you, because it selects on the inverse of a valid state, i.e. rows that are already older than the deadline.
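
A sketch of that “touch” as a single atomic update, using the same hypothetical schema with an assumed `touched_at` column and a hard-coded 60-minute deadline; zero updated rows means the state has already expired:

```elixir
import Ecto.Query

def touch_connection(id) do
  {count, _} =
    from(c in Connection,
      where: c.id == ^id and c.status == "pending",
      where: c.touched_at > ago(60, "minute")
    )
    |> Repo.update_all(set: [touched_at: DateTime.utc_now()])

  # Either the row was still valid and its window was extended,
  # or it had already expired and must be treated as gone.
  if count == 1, do: :ok, else: {:error, :expired}
end
```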