Transactional user registration

I would like to ask for a “best practice” suggestion on how to wrap a user registration process in a transaction that will either:

  • store the user data in the database AND send confirmation email

if both storing and sending is successful

  • neither store the user’s data nor send the email

if any of the two (storing data, sending email) fails

I currently use new_user_changeset to validate passed params and hash the supplied password before doing Repo.insert() and that works fine. I also created a Mailer based on “swoosh”, which I tested from the console to work as expected. Now a) where to put the Mailer invocation so that it is invoked only once the the new user is stored and b) how to wrap the process in a transaction and roll the database insert back if the mailer fails?

Sounds like a job for Ecto.Multi.run functions (3- and 5-arity). Have you looked at them?

This cannot be done. Postgres transactions can only provide guarantees between postgres operations, and sending an email isn’t a postgres operation. What you can do is put the user data in a “pending” state, and change that state to “complete” or something after the email successfully fires.

5 Likes

I am talking about framework-level transactions, not database-level ones. The framework-level transactions can (and should) utilise underlying DBMS mechanisms but rolling back can be triggered by various events. In Rails I could simply do:

User.transaction do
  user.save!
  user.send_confirmation_email!
end

regardless of which (transactions supporting) database backend is in use, Postgres or other. If the second message raises, the changes get rolled back. Or I could put the send_email_confirmation! e.g. in an after_create callback (which is transaction-wrapped) to achieve that result.

I find it hard to believe that “This cannot be done” in Phoenix.

Not yet. I understand you mean these, right? Checking out and looking for examples now.

This didn’t work in Rails either. You can do it, but it doesn’t actually work the way you think it does. The email might go out, and then the transaction might fail because the database terminates, and then the email is not unsent. This means that it is not actually transactional.

It does mean that if the email fails to send, the user will be rolled back. BUT the inverse isn’t true. The transaction might rollback after the email has gone out for any number of reasons, and the email isn’t reverted because it’s an external side effect.

Arguably, sending emails inside a transaction is actually an anti pattern because you hold a database transaction open while waiting on an external API call which is unaffected by the transaction itself.

The after create is in the transaction sure, but the mail server is not in your database. Eventually your “in transaction” calls become API or SMTP calls to the mail server, and at that point stop being managed by the transaction.

You can do the same psuedo transaction in Elixir:

Repo.transaction(fn ->
  user |> Repo.insert!
  user |> send_welcome_email!()
end)

This suffers from the same limitations though. If the sending fails, then the user rolls back, which is good. BUT if the email works, and the transaction aborts for any number of reasons, the user will be rolled back, but the email won’t be unsent. This is just how the transactions work.

9 Likes

Yep, those.

@benwilson512 is right that in principle you shouldn’t do this but I understand there are times when you would want to abuse DB transactions to try and encapsulate such more complex workflows. (You do also have to keep some transient state like “in progress” or something.)

Do take special care. These 3rd party API calls (in your case, sending an email) must finish very quickly – 2s maximum.

Perhaps you can apply the “saga” pattern to come close to what you actually want?

(Never used it, only heard about it in podcasts)

In addition to the issues already mentioned here, there is another problem. If the connection between the app server and the mail server breaks before the mail server managed to send a confirmation response, you might end up rolling back a perfectly valid transaction, while the mail is already sent. This is just a how distributed transactions work. As @benwilson512 said, you can’t use db transaction mechanism to commit or rollback the e-mail.

This is basically impossible to guarantee. If the network is a bit slower, the network request might take longer. You could try to enforce some timeout, but that will increase the likelihood of the aforementioned scenario (mail is sent while the transaction is rolled back).

IMO the proper way to tackle this is to use a db-backed queue to send mails. My current clients have switched to using oban for that. Basically, inside a db transaction an Oban job is created. Once the transaction is committed, the send mail job will be started. If this job fails (e.g. mail server is not reachable), we’ll retry it (after some delay).

A nice consequence is that with such approach we don’t need to pay so much attention in the transaction code. If you’re doing an immediate send from the transaction, you need to make sure it’s the very last thing you do (otherwise you increase the chance of e-mail being sent but the transaction failing). With the persistent queue, such issue doesn’t exist, since we enqueue the job in the same db, so we can do it anywhere inside the transaction. The job will only be executed if the transaction is committed.

Here’s another thing that won’t work reliably with immediate send:

Repo.transaction(fn ->
  do_stuff_with_db()
  send_mail_to_user_1()
  send_mail_to_user_2()
end)

If the second send fails, we’ll rollback the transaction, but one user will be notified about the success, and we can’t undo that. Using the queue approach, things will work as expected.

So tl;dr use the queue (managed in the same db) to enqueue the pending send jobs (or any other kinds of jobs which involve an external remote system).

9 Likes

True. Might also happen the other way around. I might think I “sent” the mail but I actually only placed it into send queue, which means it may still fail to send while the DB data remains stored. I am aware of that and I know it works the way I think it does :slight_smile: That was a simplified example to explain what I understand as framework-level transaction (capable of encompassing non-DB operations as potential rollback triggers).

I don’t say it’s the right approach. I asked for “best practice” suggestion so how would you suggest approaching this type of problem?

Normally they return very quickly (local stuff). but if there’s a better approach, I’d be happy to use it.

Hm. Doesn’t this suffer the same limitation as previously mentioned naive approaches? And even more? Yes, I can put another queue in the same DB and transactionally insert both the user data and sending job, but then the data is stored, and nothing knows what happens further with the job. So I might be missing something but to me it looks like adding complexity (and dependency) for doing virtually the same as sending it immediately to local sendmail (which is what I use in such situations), which maintains its own queue, does retries, etc. It could be of help if we wanted to deliver to final MX (or even to another, external SMTP) directly from the application, but I avoid such ways.

This is the inherent problem with e-mails (or any distributed system for that matter). For example, a user might have mistyped their e-mail or the mail may have ended up in spam, or it has been filtered by some custom filter, and ultimately the user never sees the e-mail.

E-mail delivery is a “best effort” operation. You can only know that the user received it if they click on some unique link in that e-mail, which will lead to a request made to the site.

So we need to keep this property in mind and design the system accordingly. For example, if the e-mail is an activation e-mail, we’ll keep the user in the activating state until they click on the link, optionally removing or invalidating the user if they don’t active within some time-frame. That is indeed added complexity, but that complexity is inherent to the challenge. You can alternatively pretend that these issues don’t exist, and it might work mostly fine, but given enough time and users these problems will surface sooner or later.

Using a local mail server will certainly reduce the chances of something going wrong, but it won’t eliminate the problems completely. If the mail-server stops, or if it’s very busy, or if the rest of the system is very busy, you still might experience a scenario where an e-mail is enqueued and delivered, but the application gets an error and rolls the transaction back. See the “Application-level failures” in this article for some real-world examples.

Beyond that, the issues mentioned by @benwilson512 still remain. You might send an e-mail successfully, but something may fail on the PostgreSQL side. To reduce the chance of this happening, you should strive to send the e-mail as the very last thing in the transaction, which is cumbersome and error-prone.

Ultimately, the naive approach increases the chances of e-mail being sent while the transaction is rolled back. I find this to be a serious issue, since we may end up misinforming the user (e.g. we tell the user that they have registered, while the database entry is in fact missing).

OTOH the persisted queue eliminates such possibility. Of course, this means you might end up saving something to the db while the e-mail is not sent, but as I already said, it’s always possible that the user won’t receive an e-mail, so you need to plan for such scenario anyway.

3 Likes

Oh, I am not saying it can be guaranteed and I fully agree with you and Ben that it’s a bad practice to pretend otherwise.

It’s just that for a lot of scenarios people can have some good probability of success and for the better or the worse, many devs go with that and get surprised in the future. :slight_smile:

Didn’t know Oban can give those guarantees though, pretty good!

Sure. The ideal option is to use something like Oban which uses postgres as a queue, and then you can in a single transaction store the the user and a job which tracks that the email should be sent. The job could of course fail, but then it will be retried. All of the retries could fail, but then at least you have a record in the DB which lets you know that it failed so that you can fix the issue and try again later.

Really, the main scenario you want to avoid is that the email goes out BUT the user row doesn’t exist. Then someone tries to sign in with a user account that doesn’t exist.

2 Likes

Yep, I second @benwilson512’s suggestion. Using something like Oban to keep the job queues in the same database allows for atomically creating the record and scheduling the job. Also, it fails in ways that can be asserted (e.g. reaching a maximum number of retries) and leave a trail that can be inspected for manual intervention.

In case the job queues are not in the same DB, scheduling the job and creating the record cannot be done atomically. In one project where delivery is quite critical (not email in my case), we separated the problem:

  • We make sure that the job queues are persistent, and that, once scheduled, a job cannot be lost.
  • After successfully scheduling the job, we mark a flag in the record. A check runs every few minutes to ensure that all records older than 1 minute successfully scheduled their job. If any record fails this check, an alert is triggered for manual inspection.
  • On the background workers side, we implemented a retry mechanism with backoff. If the maximum number of retries is reached, an alert is triggered and manual inspection is necessary.

Basically, the whole idea was to have the system handling automatically the vast majority of cases, and alert in the very few exceptional cases that cannot be automatically recovered (but can be automatically asserted). Usually, it is also necessary to decide explicitly on the trade offs (like at-least-once vs. at-most-once delivery).

I would say that the “best practice” is, in general, to be aware of the possible failing modes, explicitly decide on the trade offs, and make sure that failures are minimized but also reported. How to do that in each specific case depends on the trade offs. @benwilson512’s suggestion is probably the best way in the case presented in this thread.

4 Likes

Those are all cases I can possibly live with. If someone mistypes the email twice, sorry all bets are off. If my message landed in his spambox, sorry I can’t do anything about it either. Etc. The two cases I try to account for (as much as reasonably possible) is that only one of the two operations are executed successfully. Like I send the mail but there’s no account in the DB to be confirmed and the opposite - the account is created but mail is never sent (“sent” is what I care about, not “received” as this is out of scope here). There will always be some situations that I won’t be able to account for but that’s an “accept” type of risk resolution.

That’s how I normally do it. I have an activated_at column, which remains NULL until the account is activated. The created_at is used for counting time before inactive accounts are removed.

As I mentioned, usually the local MTA did this for me (queueing, retrying, notifications, …). I admit it may not always be the best approach but - in any case - whether this is User.transaction or something like send_email! if user.save / rescue SendmailError equivalent or any more sophisticated approach with Oban, the more Phoenix specific question from a noob in the Phoenix world is where would those who know better put the code handling these things?

It looks like the most robust thing, designed to handle such situations. But isn’t it a bit too heavy artillery for the so much typical case? How do you Phoenix experts handle it? Do you all use Oban or sage for registering users and sending activation token?

If the mail is not sent, then it is not received. If you can live with the case when the mail is not received, you can also live with the case when the mail is not sent. Hence this is not a problem:

the account is created but mail is never sent

While this is:

I send the mail but there’s no account in the DB to be confirmed

And thus the solution is to enqueue the job as a part of the db transaction, and then after the transaction is committed do the “best effort” of sending that e-mail, which means retrying to send for some reasonable amount of time.

OTOH if you insist on sending an e-mail inside the db transaction you won’t solve neither the first issue (an e-mail might still not be received, while the db record exists), nor the second one (an e-mail might be delivered but the db record is not present).

That’s all I have to say on the subject. Make of it what you will, and best of luck!

5 Likes

It might result in the same outcome, but in terms of responsibilities/control it’s different. If I did successfully send the email off to a mail provider and something happens afterwards it’s out of my control. Me not successfully sending it off to the provider on the other hand is something I’m in control about and it’s therefore on my list of responsibilities.

To answer prev. questions: I feel saga’s are interesting in more high risk scenarios like e.g. external payment gateways and such things. They’re also not a good fit, as “email” doesn’t really have great compensation actions. If an email was sent there’s no turning back. You could send another mail “Sorry, something went wrong. Your user wasn’t created”, which seems sketchy at best.

Personally I feel using oban is a good middleground: You make sure queuing the email and creating the user is transactional, so once it comes to sending the email off you’re sure the pre-conditions were successful. So the only thing you have to deal with are errors …

  • …, where your provider did already receive and send of the email. Usually sending a mail multiple times is ok(ish) and oban does retry.
  • …, where you got a success response, where there was an error. This really shouldn’t happen, but even if it does is not something in your control, but a problem with the provider. See the start of this post.
2 Likes

Exactly!

I don’t insist on anything. I admit being out of my depth on both the language (which I certainly don’t “feel” currently yet) and the framework. My typical approach in “the other” language and framework was something like I quoted above:

send_activation_email! if user.save 
[...]
rescue SendmailError => e

where I handle the unlikely situation of local error, while mail queueing, retrying, async error notification, etc. is handled by the MTA itself. Whether something like that is anywhere close to best practice/idiomatic approach in Phoenix world I have of course no idea. Hence my request for some light on how it is best/typically done in phx.