How to validate both form changes as well as already persisted data?

gar · June 8, 2022, 9:24am

One of our Phoenix apps has a complicated Company model (which includes attributes and associations for things like addresses, registered owners, credit checks, etc.).

Our users use a single form to input all this information. Their workflow has three major stages:

They capture company data over several sessions (e.g. enter company name today, owner information tomorrow, edit owner information the next day). During this stage, no validation is performed on the data, all data is provisional.
At a point when all information has been entered, users can verify a company. At this point, validations are performed, e.g. the registration number must be in a certain format, an owner must be present, etc.
The users can continue to keep the company data up to date (e.g. updating the address), but all validation must continue to pass.

We are trying to use Ecto.Changeset to perform these validations, but our problem is that the changeset validations will only validate changes. If a user, at Stage 1, has entered a company name which is too long (over 50 characters), that persisted change may not be part of the changeset that is being validated at Stage 2.

We would like to use Ecto.Changeset so we can continue to run validations in Stage 3.

My assumption is that we need to add the persisted changes to the changeset for validation. What is the best way to do this? Or are their fundamental problems with this approach?

trisolaran · June 8, 2022, 10:16am

I think there are a few ways to solve your problem, but if you already have a persisted Company model and you want to validate what’s in the model in a changeset, you could cast the data into one:

Ecto.Changeset.cast(%Company{}, Map.from_struct(existing_company), Company.__schema__(:fields))
|> validate_required([:name, :registration])
|> validate_length(:registration, min: 10)
...

My assumption is that we need to add the persisted changes to the changeset for validation. What is the best way to do this? Or are their fundamental problems with this approach?

My beef with this approach is that you’re letting the user enter invalid data to begin with. I wouldn’t allow that and enforce validation from step 1.

msimonborg · June 8, 2022, 12:24pm

I agree with this comment. If a provisional stage with lenient validations is really required then I would personally prefer to use a different database table, such as “provisional_companies” or “registrations” with its own set of validation rules, keep enforcement of validations on “companies” strict

gar · June 11, 2022, 1:35pm

Thanks for your excellent input @trisolaran and @msimonborg. I’m definitly coming around to the idea of keeping two sets of models - one for the provisional company data, and the other for validated company data.

We are doing something along the lines of the code snippet you shared @trisolaran, but due to this issue and others, it feels like we are fighting the framework to get it working.

I think this is roughly the approach we will take, although instead of creating a copy of the existing “companies” schema (and therefore maintaining what are essentially two copies of the same data structures), we might end up using a simple table to capture the form details needed in Stage 1 (maybe even storing the data as a JSONB Map in the database to make future structure changes easier) and then converting that to a real company and its associations when we get to Stage 2.

Thanks again for your help!

sbuttgereit · June 11, 2022, 3:01pm

Wanted to give an alternative perspective.

So my career has been largely spent implementing, designing, and building enterprise business systems (think ERP, accounting, and the like) where processes like the one you describe are not at all uncommon: you have some business records that have what can be described as a life-cycle that moves between some sort of “authoring/not actionable” to “ready/actionable”. In my experience you wouldn’t solve the problem you describe with multiple tables representing those states.

The patterns that I see around this problem usually add a field to the most parent/header/etc. relation which captures that life-cycle state of the information. Validation is then conditionally based on this state field and checked before allowing the transition of state from the “authoring” state to the “ready” state.

Why do we do it that way? There are both conceptual and practical reasons why the single table approach with a life-cycle state column somewhere is more typical, but I’ll just mention some of the more practical reasons here.

Multiple tables related to data state are more brittle in times when the shape of the business data changes: you now have to reflect the changes in multiple tables and the routines which copy that data from State A table to State B table.
Research actions like searches may be more complex: User sees that there’s a new business to setup and asks themselves, “do we already have this business in our database?” the answer to that question now must come from two places: maybe the business is setup and complete and the new business isn’t that new after all, or maybe someone else started to set them up already… you have multiple places to look since the status is different.
Sometimes life-cycles can move beyond simple provisional/ready states. For example, the business decides that new business records need to be approved prior to being more fully usable. You still have the provisional requirement, but records should be complete prior to being made available for approval. Which table? When a business relationship ends with an external company, you often need a way to mark that kind of record as being old/deprecated/etc. or if there’s a suspension of business for some reason. So even with multiple tables representing the onboarding phase of a business relationship and the regular business phase… you still might even still need some sort of life-cycle state in that business data to represent the ending for the business relationship life-cycle and not just the starting part you’re trying to solve for now.

The downsides of single table approach are:

Your data validation is slightly more complex. You need to conditionally consider the life-cycle state of the record in addition to the state of the data you’re validating.
Assuming that the business relationships are strictly binary (they’re “provisional” or they’re “established”) then you’ll have a state field to check in other business logic indicating if the record is usable or not. In the strictly binary state case, this wouldn’t be true in the multi-table approach.
The single table size will be larger since it will contain records in progress. If onboarding process described ever gets abandoned for any reason, this could bloat the main working tables with dead data.

Anyway, in these complex sprawling business applications, we typically find that we’re dealing with a more complex life-cycle than just the provisional/ready states and that conceptually a business relationship is started once we begin the on-boarding process rather than once it’s completed. So the technical concerns and information retrieval concerns are better served by a single representation of the data rather than multiple representations depending on state.

Your problem description only included the two provisional/ready states, but having the provisional state at all is a flag that more complex business problems may be involved which is why I’m presenting the counter argument for single table approaches.