A case for validating uniqueness

nathanl · February 10, 2017, 3:01pm

Ecto makes a point of checking uniqueness via database constraints, not via validations. The reasoning is that if we rely on a validation, we’re open to race conditions - two simultaneous requests claim the same username, for example, because each process asks the database “is this username taken?”, each gets the answer “no”, and each then inserts it.

Ecto.Changeset has built-in validations like validate_required and validate_format which can be checked without talking to the database. If all of those pass, we attempt the insert/update. If that triggers a constraint error and we’ve told the changeset to look for it (using unique_constraint(:username) or whatever), the error gets parsed into a friendly user message.

I get all this, and I’m glad that Ecto has great support for constraints, because it’s truly impossible for application code to guarantee uniqueness without using database constraints, table locks, serializable transactions, or some other database mechanism I don’t know about.

However, I think we should also check for uniqueness (and other ‘conflicting data in the database’ situations) in validations. My reasoning is this:

If someone has taken the username I want, the odds are overwhelmingly large that a validation that queries for conflicting records would catch that. The other user almost certainly took that username 1 year ago, 1 day ago, or even 1 second ago. The only time the validation won’t catch it is if they took it (say) 1 millisecond ago, after my data was validated but before it was inserted. In other words, assuming they took it at some random time between when the application was created and now, the odds that that time was less than a millisecond ago are extremely small.
If I submit form data that would violate 3 validations and 2 constraints, I may have to submit it 3 times in a row to fix everything: the first time to fix the validations (when the constraints haven’t been checked yet), the second time to fix the first constraint error, and the third to fix the last constraint error. This is because while validations can tell us every problem they detect at once, databases (or at least PostgreSQL) only show the first constraint violation they detect; you have to fix that one and try the INSERT again to see the next constraint error. Making users submit a form where they’ve fixed all the error messages, only to present them with another round of error messages, is a bad user experience.

Therefore, if you need to ensure something like “username must be unique”, I argue that you should 1) check it in the validation phase by running a query and present the error along with things like “name can’t be blank”, and 2) also have a database constraint in case of race conditions, and use unique_constraint(:username) in your changeset function to ensure that error is handled gracefully if it happens. (If you’re both validating uniqueness and checking a constraint, the constraint error could say “whoops, looks like someone just took that username”, since that’s the only situation where they’d see it.).

The downsides of doing this are that 1) your validation phase now requires the database, which makes testing a little worse 2) it’s not very DRY 3) it’s extra work. But 1) can be dealt with by breaking having a run_database_dependent_validations function and testing it separately. 2) and 3) are developer pain that I think is justified by the better user experience.

Please add your rebuttals, questions, congratulations, murmurs of assent, etc.

nathanl · February 10, 2017, 3:18pm

Here’s an example of validating that an Event has a unique name. There may be ways to improve the code, but at least it shows what I’m talking about.

def validate_unique_name(changeset) do
  name = get_field(changeset, :name)
  if is_binary(name) do
    dups_query = from e in Event, where: e.name == ^name

    # For updates, don't flag event as a dup of itself
    id = get_field(changeset, :id)
    dups_query = if is_nil(id) do
      dups_query
    else
      from e in dups_query, where: e.id != ^id
    end

    dups = dups_query |> Reservations.Repo.all
    if Enum.any?(dups) do
      add_error(
        changeset,
        :name,
        "has already been taken",
        [validation: :validate_unique_name]
      )
    else
      # no duplicates found
      changeset
    end
  else
    # eg, name is nil
    changeset
  end
end

benwilson512 · February 10, 2017, 4:10pm

If this were to be implemented I don’t think you’d want to do it as part of the changeset functions themselves. Changing the purity of changesets is an enormous change. Rather you’d want to set some kind of callback in the changeset that would get called whenever you passed the changeset to Repo.insert or Repo.update

nathanl · February 10, 2017, 4:17pm

Changing the purity of changesets is an enormous change.

What, specifically, does it hurt? (I mentioned that it makes testing require a db, but that we could use separate functions for in-memory and database-dependent validations and test them separately.)

Rather you’d want to set some kind of callback in the changeset that would get called whenever you passed the changeset to Repo.insert or Repo.update

Ecto no longer supports callbacks, right? And what’s the advantage of doing it in one of the locations shown below vs the other?

    # here
    changeset = Event.changeset(%Event{}, event_params)

    # vs here
    case Repo.insert(changeset) do
      {:ok, _event} ->
      ...

benwilson512 · February 10, 2017, 4:23pm

You break composability.

def changeset_name(changeset, params) do
  thing
  |> cast(params, [:name])
  |> validate_unique([:name])
end

def changeset_email(changeset, params) do
  thing
  |> cast(params, [:email])
  |> validate_unique([:email])
end

if you do thing |> changeset_name(params) |> changeset_email(params) you’ve done 2 db requests when you could actually do it in just 1.

Ecto got rid of callbacks on the model / schema sure but that’s not what I’m talking about. I’m talking about using https://hexdocs.pm/ecto/Ecto.Changeset.html#prepare_changes/2 which is already the spot for doing impure stuff with changeset.s

In general though in a functional language any time you take a process that’s pure and you make it impure it’s a pretty major breaking change.

Troush · February 10, 2017, 4:26pm

In case of adding guard to race condition you can use database transaction to solve this problem, by using Ecto.Multi (https://hexdocs.pm/ecto/Ecto.Multi.html) and relaying on on_conflict call back there https://hexdocs.pm/ecto/Ecto.Repo.html#c:insert/2-options

nathanl · February 10, 2017, 4:55pm

I hadn’t seen the on_conflict option. That would be useful if the conflicting record belonged to the same user and we wanted to just update it if it exists, but if somebody else has the username I want, the system shouldn’t update their account to have my email address and password.

nathanl · February 10, 2017, 5:02pm

if you do thing |> changeset_name(params) |> changeset_email(params) you’ve done 2 db requests when you could actually do it in just 1.

So this is bad for performance? It doesn’t seem like a huge problem to me. I suppose we could do both in one query with an OR, but then you’d have to check the resulting record(s) to see which error(s) to set.

Ecto got rid of callbacks on the model / schema sure but that’s not what I’m talking about. I’m talking about using Ecto.Changeset — Ecto v3.11.1 which is already the spot for doing impure stuff with changeset.s

The docs show using prepare_changes to update the comment_count on a post when we insert a comment. It doesn’t say whether prepare_changes runs if there are already errors on the changeset (I’m guessing not?) or whether adding errors at that point would abort the insert/update.

My goal is that if somebody’s form data has a missing email address, a password that’s too short, and a username that’s taken, we show them all of these problems at the same time. I’m happy with any approach that allows that.

benwilson512 · February 10, 2017, 5:49pm

I think you’re misunderstanding me.

I think your goal here is good, and worth while. I am merely suggesting that it be implemented in such a way such that the database access happens at the same point in the code we already expect it to, not somewhere else.

Whether you personally see the benefits of that or not, surely a way to get what you want with the fewest changes to others’ expectations is desirable.

I’m not saying it has to be literally prepare changes, just that it would be LIKE prepare changes in so far as its execution model is concerned.

nathanl · February 10, 2017, 6:44pm

I think you’re misunderstanding me.

Sorry.

Ah, ok. That does make sense.

I’m not saying it has to be literally prepare changes, just that it would be LIKE prepare changes in so far as its execution model is concerned.

Gotcha. You’re saying that there could be explicit ways for Ecto to enable checking for conflicting data in validations, and that would be better than my example.

But given that I’m using Ecto 2.1.3 and it doesn’t have such an affordance, maybe the best I can do is to split my validations by type explicitly for testing? Eg:

  def run_validations(struct) do
    struct
    |> validate_in_memory
    |> validate_against_database
  end

michalmuskala · February 10, 2017, 7:04pm

Another thing to consider is what happens with API-driven applications. In that case, I would say that the uniqueness validation being done on the server is actually not a good idea. Front-end apps normally do all sorts of validations on the form inputs, and uniqueness shouldn’t be different. How would that work? Front-end issues an additional request to check for uniqueness and show the error in the UI as soon as the field is filled. This also allows to show the error right away and not after the server round-trip of full form submission.

In that case, the changeset uniqueness validation becomes truly a fallback mechanism, and it’s completely fine that it runs only after all the other validations, since it’s going to be triggered rarely.

nathanl · February 10, 2017, 7:10pm

@michalmuskala

If I understand you correctly, what you’re describing is:

User fills the form
Frontend makes an AJAX request specifically to check for uniqueness, and may set an error accordingly
When frontend says the form is valid, it sends the data to the backend,
Backend does not bother checking uniqueness in the validation step because that’s already been done, but does react as usual if there’s a constraint violation

That makes sense to me. It’s the same principle, carried further; the user typically can correct a uniqueness issue even before submitting the form.

josevalim · February 11, 2017, 11:00am

This sums it up really well. I would add:

“Making testing a little worse” may no longer be a downside on Phoenix v1.3. With the new generators we are getting rid of the pure/impure separation in controllers due to the new contexts and the fact we can now run tests that talk to the database concurrently. You will still want to write pure code, especially in views, but it is not a separation we will be pushing between controllers and models (which won’t even exist).
However, adding validate_unique makes it very hard to push people towards the proper direction. It is very easy to fallback to “bad habits” if they are made easily accessible. We saw that happening with callbacks. I can see people trying out validate_unique out of muscle memory alone. If we are going to add such, I would rather call it unsafe_validate_unique or something that makes it less reachable and more explicit.

sasajuric · February 11, 2017, 12:40pm

This is precisely how we did it in my previous company. On a change event we’d issue an ajax to validate the user name. For other inputs, we’d validate them directly on the client. For the vast majority of cases this would ensure that if the submit button is enabled, the request would succeed. Of course, you still need to verify correctness on the server side, but now you don’t need to pollute DB with the validation code.

nathanl · February 16, 2017, 9:59pm

I agree. I’ll try implementing this and get back to you.

steve · February 17, 2017, 1:31am

I’ve found myself implementing a similar function for forms that require multiple constraints, especially when a database transaction is needed to insert data across more than one table. The requirements were that the forms need to work without JavaScript, so using AJAX wasn’t an option. I’d like to share my experiences with it.

For a practical example, let’s take a look at the use case in @josevalim’s schemaless query blog post under the Schemas are Mappers section. We have a Registration form that splits data between a Profile and an Account. Let’s say that the profile schema also has a username that needs to be unique. If the email and the username turned out to be taken already, it would be nice to show the user both errors at once instead of making them submit the form twice.

I’m making the following function up off the top of my head, so please excuse any errors or inefficiency in it, but the idea for a function that validates uniqueness would be along the lines of the following:

defp validate_unique_email(%Ecto.Changeset{} = changeset, options \\ [])
    when is_list(options) do
  if changeset.errors[:email] do
    changeset
  else
    if is_email_unique?(changeset.changes.email) do
      changeset
    else
      put_unique_validation_error(changeset, :email, options)
    end
  end
end

Keep in mind, the changeset has already been validated before entering this function; it is local to the registration module. If there is already an existing error on the email field, the database call is bypassed. If there is no pre-existing error, the is_email_unique? function hits the database returns a boolean about whether or not the email already exists. If the email is unique, the changeset is simply passed along. If the email already exists, the utility function put_unique_validation_error adds the validation error to the changeset (passing along the options, such as options[:message]). The username would go through a similar process.

Since there are a couple times uniqueness is validated, the put_unique_constraint_error function is generalized to allow developers to easily add unique constraint violations to the changeset. (Speaking of, there is no error information added to unique constraints in Ecto outside of the messages at the moment. For example, a length validation error adds [validation: :length] to the error. It might be a good idea to add something like [constraint: :unique] or [constraint: :foreign_key] for when the constraints fail.)

After that, when the data is actually inserted into the database, both fields still have the unique_constraint placed on both of them before being inserted. This prevents any sort of race from happening.

This is a rare use case, but one I found myself needing. It would be nice to have it built into Ecto, but I agree that it should stay outside of Changeset. If there’s only a single constraint, the normal constraints more than suffice.

Edit: I do want to say that adding this functionality to my application was trivial. There were only two separate cases where it was needed, and the context was bound within the module responsible for that particular form. Because of that, it was very clear that this check was hitting the database to do the validation. The ease of implementation and clarity of action could be used to argue that the functionality doesn’t really need to be added since it’s such an edge case. If it were to be added, I see Jose’s recommendation of adding it as something called unsafe_validate_unique as a good middle ground.

anders · February 19, 2017, 5:58am

The general version of the problem you describe has been discussed at length in the DDD community under the rubric of “cross aggregate rule validation”. A number of solutions have been suggested. You might get some ideas looking in to those discussions.

Here is a good place to start that discusses aggregate design and rule validations: http://dddcommunity.flywheelsites.com/library/vernon_2011/

nathanl · December 11, 2018, 9:35pm

FWIW, this discussion eventually led to the creation of https://hexdocs.pm/ecto/Ecto.Changeset.html#unsafe_validate_unique/4

A case for *validating* uniqueness

A case for validating uniqueness