I’m looking for the cleanest way to implement polymorphism in Ecto. My problem: I have an Ecto schema (Dataset) that I want to give a territory, this territory being composed of one or many territories, which may be Cities, Counties, States, etc., each having its own schema. (In reality, it’s French equivalents, but let’s keep English names and scales for clarity.) So one Dataset has a territory made of (for example) one city, or ten cities, or two counties, or one state plus one city, etc. (yes, I want to be able to mix them).
What I want to achieve:
From a Dataset, easily get its territory, i.e. the collection of cities, counties, etc.
From a Dataset form, be able to edit its territory, i.e. pick a selection of cities, counties, etc., and save them.
Possibly get information the other way around: which datasets are linked to which city.
I have read the guide here: Polymorphic associations with many to many — Ecto v3.12.5. The problem: this guide explains quite clearly how to implement polymorphism when you need to reach and write things from all the different objects that embed one similar object. In the guide’s example, you may write a todo-item either from a todo-list or from a project, etc. In my case, that would translate to writing the dataset from the city. But that’s not what I want to do.
My problem is rather the other way around: I want to group as a single concept / abstraction (“dataset territories”) things that are in reality in different schemas and relate to different tables. In the guide, it would translate to being able to tell, from the todo_item, which object (todo list or project) it belongs to, and write the changeset from there.
Any idea?
There are two ways of dealing with polymorphism in the database:
Have multiple join tables, i.e. a dataset_cities table, a dataset_counties table, and a dataset_regions table.
Have a single join table, i.e. a dataset_territory table with multiple nullable columns (city_id, county_id, region_id).
I’ve explored the first solution (multiple join tables) for now, which the guide suggests is the cleaner way. But the struggle is to create the single “dataset_territory” abstraction. I can just add three many_to_many relationships (many_to_many cities, many_to_many counties, etc.), then “cast” things to these relationships from a single territory struct to write them, and rearrange them after preloading into a single struct, but that adds a lot of noise to the dataset schema.
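For concreteness, a minimal sketch of that first option (module and join-table names here are illustrative):

defmodule Dataset do
  use Ecto.Schema

  schema "datasets" do
    # One join table per administrative level
    many_to_many :cities, City, join_through: "dataset_cities"
    many_to_many :counties, County, join_through: "dataset_counties"
    many_to_many :states, State, join_through: "dataset_states"
  end
end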
Perhaps I should go with the other database solution, a single table with a schema associated to it?
Is there any hierarchy between the different types of territories?
My approach to this would be to make it a single table, just the territories table, with a type column specifying how you should consider that territory. In case of a hierarchy between the rows, I’d add a parent_id as an FK to itself, making sure that a county can only have a city as parent, or whatever the hierarchy between the types is.
If you really need to keep them in different tables, I’d make a territories virtual field in the Dataset and just load all three many_to_many relations into the virtual field.
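A minimal sketch of what I mean, assuming the three many_to_many relations from the sketch above (put_territories is a hypothetical helper):

defmodule Dataset do
  use Ecto.Schema

  schema "datasets" do
    many_to_many :cities, City, join_through: "dataset_cities"
    many_to_many :counties, County, join_through: "dataset_counties"
    many_to_many :states, State, join_through: "dataset_states"

    # Not backed by a column; filled in by application code
    field :territories, {:array, :map}, virtual: true
  end

  # Merge the three preloaded relations into the virtual field
  def put_territories(%__MODULE__{} = dataset) do
    %{dataset | territories: dataset.cities ++ dataset.counties ++ dataset.states}
  end
end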
Using Single Table Inheritance (adding a type column) only works well IMO if you have a high number of writes and a low number of reads at a given time. And if you do not care about referential integrity.
Just to make it clear, I’m not talking about a “polymorphic relation” where the type indicates which table it relates to… I’m talking about a single table:
create table("territories") do
add :type, :text, null: false
add :parent_id, references("territories")
...
end
This won’t break referential integrity and AFAIK doesn’t have issues with a lot of reads (but it may require some enforcement of the hierarchies through application code that would be better handled by the DB)… and you could use a recursive CTE to build the entire hierarchy with a single query. I’m not sure how it would behave if the hierarchy is shaped more like a “tree”, but I’ve worked with recursive CTEs before where it was a list of entities, and it was fast enough.
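The recursive CTE part could look roughly like this in Ecto (a sketch, assuming a Territory schema mapped onto the territories table above):

defmodule Territories do
  import Ecto.Query

  # Load a territory and all of its descendants in one query,
  # via a recursive CTE named "subtree".
  def subtree(root_id) do
    initial = from t in Territory, where: t.id == ^root_id

    recursion =
      from t in Territory,
        join: s in "subtree", on: t.parent_id == s.id

    cte = union_all(initial, ^recursion)

    {"subtree", Territory}
    |> recursive_ctes(true)
    |> with_cte("subtree", as: ^cte)
  end
end

Repo.all(Territories.subtree(some_id)) would then return the whole subtree as Territory structs.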
Worth pointing out that you can have multiple different schemas mapping application concepts onto the same table. So you could always do something like @cevado’s:
Or, per my proposed terminology:
create table("administrative_divisions") do
add :level, :text, null: false
add :parent_id, references("administrative_divisions")
...
end
Then have multiple concrete schemas, with aliases on relationships, so that properties of discriminated records feel special-cased in a more natural way:
defmodule County do
  use Ecto.Schema

  schema "administrative_divisions" do
    # Polymorphic discriminated self-references,
    # assumes they are correct in the db
    belongs_to :state, State, foreign_key: :parent_id
    has_many :cities, City, foreign_key: :parent_id

    # Normal m2m
    many_to_many :datasets, Dataset, join_through: DatasetAdministrativeDivision
  end
end
# Repeat for State, City, Region, etc
You’d use these discriminated records alongside the generic ones:
# Generic non-discriminated schema
defmodule AdministrativeDivision do
  use Ecto.Schema

  schema "administrative_divisions" do
    field :level, :string

    belongs_to :parent, AdministrativeDivision, foreign_key: :parent_id

    # "immediate_children" since we want to emphasize
    # that this does not recurse down the tree
    has_many :immediate_children, AdministrativeDivision, foreign_key: :parent_id

    many_to_many :datasets, Dataset, join_through: DatasetAdministrativeDivision
  end
end
defmodule DatasetAdministrativeDivision do
  use Ecto.Schema

  schema "dataset_administrative_divisions" do
    belongs_to :dataset, Dataset
    belongs_to :administrative_division, AdministrativeDivision
  end
end
defmodule Dataset do
  use Ecto.Schema

  schema "dataset" do
    many_to_many :administrative_divisions, AdministrativeDivision,
      join_through: DatasetAdministrativeDivision
  end
end
Effectively, what you’re doing here is saying that from a data storage perspective, these things are homogeneous; but from an application perspective, they can be treated as distinct concepts, as well as homogeneous ones.
Drawbacks
The Ecto docs do talk about this strategy in the polymorphism docs, and have some valid critiques. Notice there is no mention of the “discriminator” column (in my proposal, administrative_division.level, often just type) anywhere in the code above.
The problem is that you cannot have Postgres manage discriminated references to enforce consistency and correctness; you have to do this at the application level. That requires a bunch more boilerplate and still never guarantees your data’s relationships are not corrupt; having concrete discriminated schemas is essentially a best attempt at making this clearer at a type-like level.
Other languages’ ORMs (at least, many Ruby and Python ones) call this “STI” (single table inheritance), if you’re looking for more reading on trade-offs and pain points.
Application-level integrity
If you want to take this approach, here is an example of some of the book-keeping you will need at the application level to keep this distinction straight.
Schema discriminator
You’ll probably want to model the :level field on your discriminated schemas; if you do, explore the :autogenerate and :writable options to field/3.
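For example (a sketch; :writable requires Ecto 3.10+, and default_level is a hypothetical helper):

defmodule County do
  use Ecto.Schema

  schema "administrative_divisions" do
    # Fill the discriminator on insert if unset,
    # and never allow it to change on update
    field :level, :string,
      autogenerate: {__MODULE__, :default_level, []},
      writable: :insert
  end

  @doc false
  def default_level, do: "county"
end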
Query helper
You’ll probably need a way to ensure you are querying discriminated schemas correctly. I usually define a general-purpose query helper for all my schemas and use that as the entrypoint for any Ecto queries I write anyway, to apply common domain-logic access patterns. It’s particularly useful in this case.
defmodule County do
  import Ecto.Query

  @level_discriminator "county"

  @doc """
  Entrypoint for correctly querying this schema.

  Returns an `Ecto.Queryable`.
  """
  def query, do: query(__MODULE__)

  @doc """
  Entrypoint for correctly querying this schema.

  Accepts an id, a record, or an existing `Ecto.Queryable`. Returns an `Ecto.Queryable`.
  """
  def query(%__MODULE__{id: id}) when is_binary(id) do
    query(id)
  end

  def query(id) when is_binary(id) do
    query(from record in __MODULE__, where: [id: ^id], limit: 1)
  end

  def query(query) do
    query
    |> Ecto.Queryable.to_query()
    # The important bit here: only match rows with our discriminator
    |> where([record], record.level == ^@level_discriminator)
  end
end
Now if we do Repo.all(County.query()) we should get only the correct records.
Changeset helper
You’ll want to consider how to make sure all writes also enforce the discriminator. I don’t know how these compare to or play with some of the schema field settings mentioned above, but if you’re strictly using changesets, you probably already have a changeset helper that you can do this in, something like:
defmodule County do
  import Ecto.Changeset

  @level_discriminator "county"

  @doc """
  Entrypoint for correctly updating this schema.

  Returns an `Ecto.Changeset`.
  """
  def changeset(%__MODULE__{} = county, changes) when is_list(changes) do
    changeset(county, Map.new(changes))
  end

  def changeset(%__MODULE__{} = county, changes) when is_map(changes) do
    county
    # Important bit: always (re)set the discriminator
    |> change(level: @level_discriminator)
    # Normal bits (placeholders)
    |> cast(changes, allowed_fields)
    |> foreign_key_constraint(etc)
    |> cast_assoc(etc)
  end
end
Coercion helpers
Now when you’re dealing with reading and writing a County directly, things should mostly work. If you use County to load relationships, they’ll have similar experiences. If you’re operating on the tree at large through the generic AdministrativeDivision, you can do that with normal nested form mechanisms, and when loading data from the tree you can use the discriminated versions.
The point of all this schema duplication effort is to never need to change a record back and forth between representations, only use a pure view of each depending on how you need to traverse them. However, getting all the administrative divisions for a dataset will now give you a list of generics, and once you start doing that you’re going to need to think about if, how, and when you convert between the two.
This nuance is pretty application-specific so I do not have many recommendations here.
And with all this effort, it’s still easy to accidentally mis-relate data, so again: take the Ecto docs warnings here to heart.
Of course, it depends on what hardware you are running it on, whether you have writes and reads happening at the same time, and the distribution of the values in the type column (e.g. if 90% of the values are “city”, an index is not going to help that much).
We were running it on a custom VM instance; I think it was one of the largest AWS offerings?
True, but the thing is that I’ve never seen databases/tables of that size since working on that project, so I would personally go with this solution and migrate once the performance is not acceptable.
I would not want to be the person doing that migration. Splitting up the 100GB table into separate ones was not a fun task. How would you do that without downtime?
Copy all the data to new tables (you can even set up triggers for new inserts/updates; the nice thing is that we didn’t have updates/deletes).
Then the only tricky part is when you rename the old/new tables; that is where you will have a few seconds of downtime. Obviously this can get more complex, but the idea is to do the migration in multiple small steps.
the nice thing is that we didn’t have updates/deletes
This is probably why you had a good time using it. It works great as an append-only log.
But what if you have references inside the table to other types? Or inheritance; you would have to know which ids other tables use to refer to the rows. If it is an append-only log, the migration is easy.
The database I mention had a taxonomy stored in STI (they interacted with headless WordPress), with other tables referencing the rows in the STI table. It was a real pain to migrate, because these relationships had to be set correctly too, and everything had to be tested.
This is the beauty of choosing simplicity over advanced features; that database design was intentionally done like that, and it served us very well.
Maybe the performance could have been better; however, for that specific project, that solution was more than enough, and it was especially handy for our operators, who could query the database to track when incidents happened.
You are very right! The database design should be adapted to the application’s needs and the operations that are performed on the stored data. OP is not talking about an append-only log, which is why I think that using STI in the OP’s case is not the best solution.
Hi all! Thanks for your answers, I’ll start reading them. Just for a start, yes, there is a kind of hierarchy between territories (but not a full one). But the separate tables are already there in the database, already used elsewhere in the application, and separate schemas and tables somehow make sense, as these territories have really different roles and attributes here in France (let’s say, a city has a mayor, a region has a president, and a mayor doesn’t have a role comparable to a president’s). So instead of polymorphism, I’d have to deal with Single Table Inheritance, which also has its problems.
I’ll look into the virtual field; do you have any reference or example for that?
So what you are proposing for my administrative divisions is essentially what other frameworks call single table inheritance (having a single table that is used by many schemas). Thanks a lot for the term I was searching for!
My main problem is that the app has already had different tables and schemas, used elsewhere in the application for 8 years, and that they are not fully consistent in their fields. To be honest, the main fields are consistent; it’s more in what belongs to what and in some secondary fields that they diverge. So I was not thinking about changing this (which would require some heavy refactoring of the app), but perhaps I should look into it.
So I have three solutions:
Keep the existing distinct tables for administrative divisions, and have multiple join tables towards dataset (dataset_cities, dataset_counties, dataset_states). The problem is then how to make a virtual field administrative_division appear on dataset, and how to make it behave correctly without it being a mess of changesets, preloads, etc.
Keep the existing distinct tables for administrative divisions, and have a single dataset_administrative_divisions join table, but to avoid the data-integrity problem with PostgreSQL references shown in the guide, instead of a single join column administrative_division_id relating to many tables, have many nullable join columns (city_id, county_id, state_id) on this single table. This is OK in terms of database design and performance, and the guide hints it’s a possible and valid solution, but doesn’t provide code examples, so I’d be looking for those if anybody has already done this (see the sketch after this list).
Your solution: have a single table for all administrative divisions, which is indeed elegant, but I’d need to rewrite chunks of the 8-year-old app, and I’m not sure I want to do this now. Anyway, thanks a lot for all the code examples! (I’m sure they will help others with less legacy code who find this thread.)
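For reference, here is roughly how I picture the join table of solution 2 as a migration (a sketch; names are illustrative, and the CHECK constraint enforces that exactly one FK is set per row):

create table("dataset_administrative_divisions") do
  add :dataset_id, references("datasets"), null: false
  add :city_id, references("cities")
  add :county_id, references("counties")
  add :state_id, references("states")
end

# Reject rows that point at zero or several division tables
create constraint("dataset_administrative_divisions", :exactly_one_division,
  check: """
  (city_id IS NOT NULL)::int +
  (county_id IS NOT NULL)::int +
  (state_id IS NOT NULL)::int = 1
  """
)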
Taking that into account and looking at your initial problem statement
What is the current database state? You have the dataset table, the individual entity tables and the intermediary tables; is that correct to assume? Will the “abstraction list” be used in any way to handle forms or any sort of write operation, or is the abstraction exclusively used for reading?
I’m asking because, since you mentioned that the existing entity tables are somewhat coherent in their fields, you could try creating a view (or materialized view, depending on your needs) with a union query… something like this, where you would have the dataset id and the common fields of the abstraction you’re trying to build:
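A sketch of such a view, created from a migration, using the English table names from earlier in the thread (column names are illustrative):

execute """
CREATE VIEW dataset_territories AS
SELECT dc.dataset_id, 'city' AS level, c.id, c.name
  FROM cities c JOIN dataset_cities dc ON dc.city_id = c.id
UNION ALL
SELECT dc.dataset_id, 'county' AS level, c.id, c.name
  FROM counties c JOIN dataset_counties dc ON dc.county_id = c.id
UNION ALL
SELECT ds.dataset_id, 'state' AS level, s.id, s.name
  FROM states s JOIN dataset_states ds ON ds.state_id = s.id
"""

You could then map a read-only Ecto schema onto dataset_territories for the read path.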
On the main branch, there is indeed the dataset table, and separate tables for each administrative level.
You can see it here for the Dataset module, and each of the administrative division layers, Commune, EPCI, Departement and Region. As you can see, at least the main fields (name, official INSEE code, geometry) are consistent, but there are some specific attributes, and how they belong to each other is not a simple Russian doll (unfortunately).
As you can also see, there are already relationships between these administrative divisions and dataset, but they are a mess: for some administrative layers it’s just a belongs_to, for others a many_to_many, and some are missing, restricting what can be done. It’s normal; it’s just simple assumptions from the beginning of the project and then 8 years of development on top of them.
On my development branch, I indeed just added one intermediary table for each administrative division level, but I hesitated between that and a single intermediary table (and I’ll try both solutions, I think).
Both.
Reading is the most important, and it needs to be efficient (think: getting a list of datasets through the API).
The form for writing the relationship is only for admins, but admins are still real users and need a form. And the idea would be to manage the whole association in a single form field (like adding administrative layers as a list of tags, by their name).
What I’m trying to do right now: have a virtual field for reading, and a smart function that casts the right association to each layer in the changeset, something like the sketch below. It may work. I hope both play well together.
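The writing side could be something like this (a sketch; put_territory is a hypothetical helper and assumes the dataset was loaded with all three relations preloaded):

import Ecto.Changeset

# Group a mixed list of division structs by their schema module
# and put each group on its own many_to_many relation
def put_territory(changeset, divisions) when is_list(divisions) do
  groups = Enum.group_by(divisions, & &1.__struct__)

  changeset
  |> put_assoc(:cities, Map.get(groups, City, []))
  |> put_assoc(:counties, Map.get(groups, County, []))
  |> put_assoc(:states, Map.get(groups, State, []))
end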
Oh, that’s smart. I’ll give it a good thought. To be honest, I’m reluctant to depend too much on PostgreSQL features and have my app logic split between PostgreSQL and the Elixir app, but it could be a way to work around the fact that the legacy code has separate tables while I wish I had a single one here.
I would personally go with 3 separate relations in my Ecto schemas (AKA the first approach you mentioned) plus some custom logic that maps everything for the presentation layer, and call it a day. I personally don’t see any real benefit in trying to integrate all this logic directly into Ecto.
My thoughts exactly. I’ve seen this done before, and what happens down the line is that you end up writing migrations consisting of huge raw SQL for every small change to that view.
If referential integrity is important to you and you don’t want the burden of maintaining multiple join tables, one per relationship pair, you might want to try the exclusive arc modeling approach. I wrote about this on my blog some time ago; check it out: Polymorphic associations with Ecto and Postgres