Design discussion - Ecto is your application

Hello, everyone! This should probably be a blog post, but I don’t have a blog, so here we go :smile: My thoughts and frustrations regarding domain modeling in the face of persistence, distilled into a few paragraphs. I’m interested in your opinions.

This is probably going to get a bit philosophical, but hopefully also stay practical. I'd like to talk about the age-old problem of ORMs. When it comes to the impedance mismatch problem, Ecto is in the exact same situation as ORMs in other technologies. For that reason, I'm going to refer to Ecto as an ORM in this post.

The impedance mismatch

The so-called object-relational impedance mismatch is a misnomer (like so many things in computing). The same problem arises without objects, because the mismatch really exists between graphs and relations.

Graphs are incredibly useful for adding meaning to the data, allowing us to work effectively in a given context. On the other hand, the relational representation is removed from any context, which makes it a great choice for storing data. That way, we can easily add new behaviour even if it needs to shape the data very differently.

But this versatility is both a blessing and a curse. It is a jack of all trades, but a master of none. Therefore, we either accept a significantly worse design for the application code, or end up mapping the relational representation into graphs based on our current use case.

A practical example

I’m going to show a simple example to illustrate the problem of designing code in fear of persistence. The idea is inspired by a video about DDD I watched recently.

Let’s say we need to work with books that have an identifier, a title and an edition. We can ignore everything else to focus on the actual issue. An edition can either be ordinal, identified with a number, or seasonal, identified by a season and year.

Design without persistence

In a world ignorant of persistence, where we can use the full power of idiomatic Elixir, the given description could easily translate into something like this:

defmodule Bookstore.Book do
  defstruct [:id, :title, :edition]
end

defmodule Bookstore.Edition do
  @seasons [:spring, :summer, :autumn, :winter]

  def ordinal(n) when is_integer(n), do: {:ordinal, n}

  def seasonal(season, year) when season in @seasons and is_integer(year) do
    {:seasonal, season, year}
  end
end

The code clearly explains what is going on with the two types of editions: each type is unambiguously identified and contains only the relevant data. Great!
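
For instance, a hypothetical describe/1 added to Bookstore.Edition could pattern match directly on the variants (purely illustrative, not part of the original design):

# Hypothetical addition to Bookstore.Edition: consumers match on the tag.
def describe({:ordinal, n}), do: "edition #{n}"
def describe({:seasonal, season, year}), do: "#{season} #{year} edition"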

Design constrained by persistence

When tasked to implement this requirement in a real world application, the design thought process would likely be very different, though. We would most likely start by creating an Ecto migration, because after all, everything needs to conform to the database.

Since relational databases aren’t known for their excellent support of sum types, the migration would probably end up looking similar to this:

defmodule Bookstore.Repo.Migrations.CreateBooks do
  use Ecto.Migration

  def change do
    create table(:books) do
      add :title, :string, null: false
      add :edition_type, :integer, null: false
      add :edition_number, :integer
      add :edition_season, :string
      add :edition_year, :integer
    end
  end
end

Inevitably, there is going to be an Ecto.Schema corresponding to the database table.

The table could hold all kinds of invalid data, so we would of course attempt to contain the mess at the application level. We would use changeset validations and hopefully define an enum for edition_type and edition_season.
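
To sketch what that containment might look like (module and field names are illustrative, matching the migration above), the schema and changeset could be roughly:

defmodule Bookstore.Ecto.Schemas.Book do
  use Ecto.Schema
  import Ecto.Changeset

  schema "books" do
    field :title, :string
    field :edition_type, Ecto.Enum, values: [ordinal: 1, seasonal: 2]
    field :edition_number, :integer
    field :edition_season, Ecto.Enum, values: [:spring, :summer, :autumn, :winter]
    field :edition_year, :integer
  end

  def changeset(book, attrs) do
    book
    |> cast(attrs, [:title, :edition_type, :edition_number, :edition_season, :edition_year])
    |> validate_required([:title, :edition_type])
    |> validate_edition()
  end

  # The cross-field rules live here, invisible to the database.
  defp validate_edition(changeset) do
    case get_field(changeset, :edition_type) do
      :ordinal -> validate_required(changeset, [:edition_number])
      :seasonal -> validate_required(changeset, [:edition_season, :edition_year])
      _ -> changeset
    end
  end
end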

This would however result in all our application code knowing about these different fields, carefully checking the type and knowing which other fields are relevant based on the type.

Of course, I’m being generous. In many real world applications, there would be no enum for edition_type, the column probably wouldn’t exist at all. Instead, all the code would check which of the other fields are nil and dispatch logic based on that.

Even worse, it’s possible that somebody smart enough to do this, but not yet wise enough not to do this, would reuse the same column for edition_number and edition_year, since they both map to integers.

At this point, there is no easy way to use the data correctly and no intuitive way to understand what the invariants even are, because the code does not contain that information.

The worst of both worlds?

Let’s face it, this design with a bunch of nullable columns is both terrible application design and suboptimal database design. Yet, it is the design I see every day in the projects I work on. I think that’s because the tools we have make it the only easy option.

Improving the database design would involve normalizing the data and splitting it into multiple tables. Just imagine the nightmare of all those JOINs and Ecto.Schema associations we would need in our application. That is clearly not worth the extra complexity.

On the other hand, we could decide that the application design is the only thing that matters and simply serialize the edition as JSON. This would allow us to have the design we wanted with just a custom Ecto.Type. But we would be giving up on so many features of the database.
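
For illustration, such a custom type could look roughly like this (assuming a single :map/jsonb edition column and the tagged tuples from above; all names are mine):

defmodule Bookstore.Ecto.Types.Edition do
  use Ecto.Type

  @impl true
  def type, do: :map

  @impl true
  def cast({:ordinal, _} = edition), do: {:ok, edition}
  def cast({:seasonal, _, _} = edition), do: {:ok, edition}
  def cast(_), do: :error

  # dump/1 turns the tagged tuple into a plain map for the jsonb column.
  @impl true
  def dump({:ordinal, n}), do: {:ok, %{"type" => "ordinal", "number" => n}}
  def dump({:seasonal, season, year}),
    do: {:ok, %{"type" => "seasonal", "season" => Atom.to_string(season), "year" => year}}
  def dump(_), do: :error

  # load/1 rebuilds the tagged tuple when reading from the database.
  @impl true
  def load(%{"type" => "ordinal", "number" => n}), do: {:ok, {:ordinal, n}}
  def load(%{"type" => "seasonal", "season" => season, "year" => year}),
    do: {:ok, {:seasonal, String.to_existing_atom(season), year}}
  def load(_), do: :error
end

The schema would then declare something like field :edition, Bookstore.Ecto.Types.Edition over a :map column.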

Custom mapping?

I have to admit that I am now entering a territory that I have not yet explored in a serious project, so there will be some speculation.

Considering the significant impact this trivial requirement has on the application design, it seems that custom mapping may be the best answer to anything beyond basic CRUD. In the case of books and editions, the mapping could look like this:

defmodule Bookstore.Ecto.Mappers.Books do
  alias Bookstore.Ecto.Schemas.Book, as: BookSchema
  alias Bookstore.Book
  alias Bookstore.Edition

  def to_domain(%BookSchema{} = data) do
    edition =
      case data.edition_type do
        :ordinal -> Edition.ordinal(data.edition_number)
        :seasonal -> Edition.seasonal(data.edition_season, data.edition_year)
      end

    %Book{id: data.id, title: data.title, edition: edition}
  end

  def from_domain(%Book{} = book) do
    data = %BookSchema{id: book.id, title: book.title}

    case book.edition do
      {:ordinal, number} ->
        %{data | edition_type: :ordinal, edition_number: number}

      {:seasonal, season, year} ->
        %{data | edition_type: :seasonal, edition_season: season, edition_year: year}
    end
  end
end

Now we can design the application exactly how we want and choose any storage implementation we decide appropriate. The only thing that will need to change is the mapper. Of course, normalizing into multiple tables would require a larger change of the mapper, but the domain model would still stay the same.
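
For example, a context function could do the translation at the boundary (names are illustrative):

defmodule Bookstore.Catalog do
  alias Bookstore.Repo
  alias Bookstore.Ecto.Schemas.Book, as: BookSchema
  alias Bookstore.Ecto.Mappers.Books

  # Reads translate at the boundary, so the rest of the app only sees %Bookstore.Book{}.
  def get_book!(id) do
    BookSchema
    |> Repo.get!(id)
    |> Books.to_domain()
  end

  # Writes go the other way; a new book has no id yet, so a plain insert works.
  def create_book!(%Bookstore.Book{id: nil} = book) do
    book
    |> Books.from_domain()
    |> Repo.insert!()
  end
end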

Unfortunately, this approach has downsides as well. We need to write the mapping code on our own, but what’s worse, we lose important features of Ecto. Change tracking is now gone and we will need to perform even more mapping for data that comes from the outside, duplicating many of the fields.

Perhaps most importantly, enforcing these mappers and keeping their design consistent is going to be difficult and require some discipline from everyone involved in the codebase. After all, it is easier to follow design decisions set by a framework.

A possible compromise?

Mapping everything on our own is clearly a difficult task. Moreover, we are throwing away more than we would like. After all, what’s the point of ORMs if we need to do the mapping ourselves anyway? This is their job!

Using Ecto.Schemas as our domain data structures may be a good trade-off. But we need a way to model the domain without conforming everything to the database design. This includes nested data, sum types, mapping multiple columns into one field and probably a way to build one schema from multiple tables. Maybe then Ecto could be “good enough” as an ORM.

It's entirely possible that this way of thinking leads directly into the trap described in The Vietnam of Computer Science. I may simply be too inexperienced to see that. Maybe it's not worth it to add all this extra complexity to Ecto. But in that case, custom mapping seems like the only option left.

Conclusion

Just like many other frameworks and ORMs, Phoenix and Ecto present a devil's bargain. As long as we are building a web interface for a database, where one form field maps to one column and the application does not need to do anything complicated with the data, everything is simple. But anything beyond that quickly starts to hurt.

Custom mapping may be a lot of work, but in order for Ecto to be good enough as an ORM, I’m afraid it would need to evolve way beyond what it is now. Assuming that an ORM can actually be good enough.

In the end, if we want to keep all the nice benefits that Ecto provides, the following will always remain true. Phoenix may not be your application, but Ecto is.

13 Likes

Interesting observations and certainly worth thinking about!

While I agree with you that the naive/“obvious” schema leaves a fair amount to be desired, I’m not sure that I agree with your final conclusion. In particular:

Could you clarify which specific features you’re referring to here? Assuming Postgres, you’ll find pretty rich JSON indexing and querying support.

2 Likes

Thanks for reading! The JSON approach would surely be a good enough choice in many situations. It’s true that Postgres has great support for JSON.

We are still giving up something, though. The application is now completely responsible for ensuring that the data is in a valid shape, and as far as I know, constraints like foreign keys cannot be used on JSON fields.

I can’t shake the feeling that this is abusing the JSON type, but it could very well be a sufficiently practical implementation. On the other hand, now our application design is forcing the design of the database, since I would consider using JSON only because there is no simple way to map multiple columns to one Ecto.Schema field.
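
For what it's worth, querying such a JSON column from Ecto would still be possible with fragments, something roughly like this (assuming Postgres jsonb and the hypothetical edition column from above):

# A sketch, assuming `import Ecto.Query` and a jsonb `edition` column on the books table.
def seasonal_in_year(year) do
  from b in BookSchema,
    where: fragment("?->>'type' = 'seasonal'", b.edition),
    where: fragment("(?->>'year')::int = ?", b.edition, ^year)
end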

IMO this is fundamentally repeating the mapping that Ecto’s already doing to turn the list-of-lists response from the database into BookSchema structs.

It also sounds like you’re looking for a feature like ActiveRecord’s composed_of.

If I was building the system you’re describing, I’d reach for polymorphic_embed and have Bookstore.Edition.Seasonal and Bookstore.Edition.Ordinal embedded schemas. That’s just as straightforward to pattern-match as a tagged tuple, and can participate in things like protocols.
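
A rough sketch of those two embedded schemas (the wiring of the edition field into the parent schema is what polymorphic_embed handles, so its macros are left out here):

defmodule Bookstore.Edition.Ordinal do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field :number, :integer
  end
end

defmodule Bookstore.Edition.Seasonal do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field :season, Ecto.Enum, values: [:spring, :summer, :autumn, :winter]
    field :year, :integer
  end
end

Pattern matching on %Bookstore.Edition.Ordinal{} and %Bookstore.Edition.Seasonal{} then reads much like matching on the tagged tuples.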

7 Likes

Thanks for your input! I agree, repeating the mapping isn’t great.

ActiveRecord's composed_of is exactly one of the things I thought of. I guess something like that could be achieved with a virtual field, although I'm not sure how to feel about the data being duplicated in the schema. Ideally, the raw values would not be exposed, so code in other places cannot depend on them.
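
To sketch what I mean by the virtual field approach (purely illustrative; the raw columns are still there underneath):

# In the schema: the :any type is allowed because the field is virtual.
field :edition, :any, virtual: true

# Populated after each read, e.g. in the context:
def put_edition(%BookSchema{} = book) do
  edition =
    case book.edition_type do
      :ordinal -> {:ordinal, book.edition_number}
      :seasonal -> {:seasonal, book.edition_season, book.edition_year}
    end

  %{book | edition: edition}
end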

I have seen polymorphic_embed before and I really like the idea. In this case, it would be another option when choosing JSON as the underlying representation.

EDIT: There is actually a proposal for something like composed_of, but it doesn't seem active anymore. Proposal: Derived fields by greg-rychlewski · Pull Request #4261 · elixir-ecto/ecto · GitHub

3 Likes

This is a nice write-up. I very much do view Ecto as my application insofar as I generally push a non-trivial amount of business logic to the database. Mostly calculations and aggregations. I do this purely in Ecto, i.e. no stored procedures or triggers.

Otherwise this write-up, especially your examples, has me thinking even more about designing functions-first, or verbs-first, or is it just "functional design"? I'm currently involved in yet another domain modeling exercise where all the database tables are being designed upfront. I have a nagging feeling that a classic mistake is being made for the N-millionth time. Does anyone have any go-to resources they like for not falling into this trap? I remember watching a talk a few years ago which I'm having trouble finding now. Am I even making sense??

3 Likes

Polymorphic embed looks cool, thanks for sharing! If I get it correctly it seems like it’s a nice way to handle your embedded schemas, and they’ll end up being persisted as JSON(B) or something in the database? Seems cool and I can definitely see a few use cases.

JSON fields in your DB

I’ve walked the path of using too many JSON fields in my tables before. I’d say the main downsides IMO are:

  • You're putting more responsibility on the application logic to make sure the data is sane. The database won't help you with cascading deletes, updates and such like it does with a traditional relational design. You can use stuff like CHECK constraints with JSON functions to make sure the data doesn't get "out of shape" (see the sketch after this list), but it's very easy to forget to add the right ones.
  • Data migrations get scarier, and bugs tend to pop up after a few years. Even though there is an embedded schema in the Elixir code this too will evolve over time. Even when writing JSON migrations it’s easy to mess up and get data in an unexpected state.
  • You'll need new GIN/GiST indexes for good JSON query performance. If it's a high-throughput table, adding extra indexes might have a significant impact. Most, but not all, queries over indexed JSON data perform well.
  • The syntax for querying with JSON functions is there, but it’s not nearly as nice as classical SQL/Ecto imo.
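
To illustrate the CHECK constraint idea from the first point (column and constraint names are made up, assuming a jsonb edition column), an Ecto migration could add something like:

defmodule Bookstore.Repo.Migrations.AddEditionCheck do
  use Ecto.Migration

  def change do
    # Reject rows whose edition JSON is missing the fields its type requires.
    create constraint(:books, :edition_shape,
             check: """
             (edition->>'type' = 'ordinal' AND edition->>'number' IS NOT NULL)
             OR (edition->>'type' = 'seasonal'
                 AND edition->>'season' IS NOT NULL
                 AND edition->>'year' IS NOT NULL)
             """
           )
  end
end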

I hope I don’t come across as being an absolutist against storing JSON in a relational database. There can certainly be a few cases where it makes sense. I just tend to avoid it as a default nowadays, unless I have a good reason to go for it. If you do use a JSON field, do add a “version” key to it though. Helps a lot when it comes to data migrations and backwards compatibility.

One of the few cases where JSON fields make sense to me is where the data that's stored is mostly treated like an "opaque blob" by the server-side code. Imagine having a client-side rich text editor which stores its content as JSON. The Elixir code would probably never really care what the content is, as long as it's valid JSON. The advantage compared to storing it as an actual blob is that you could easily do stuff like ad-hoc queries on the data using the JSON functions if you need to in the future. In that case you might even extract some part of the data from the JSON into a separate column.

Database table inheritance

Databases like Postgres actually do support table inheritance if you want to. It’s definitely not a solution for every kind of problem, but I’ve had it come in handy every now and then.

Python model inheritance and polymorphism

I’ve done Python for 10+ years before moving over to Elixir, and two prominent “solutions” to this problem there are Django model inheritance and SQLAlchemy polymorphic tables.

In a nutshell they abstract away the issue that you have different variations of the “edition” in your case. They do this by creating one or more extra DB tables under the hood.

Will you get more JOINs and will it complicate your life? A little bit, but not as much as you’d think. Those two options are pretty solid abstractions, and they do a LOT of the heavy lifting for you. I’m not saying that such a solution is a silver bullet, but I wouldn’t rule it out without trying it for your use case.

Do I know the best way to do the same in Elixir - no :frowning:

Ash is kinda cool

Since you mentioned you’re into DDD-stuff I think you should check out Ash if you haven’t already. It has some really nice ideas, and its goal is to allow you to “model your domain, derive the rest”.

I’ve only played around with it a tiny bit. It was a bit mind-bending at first but there’s a lot of pretty clever stuff in there that I like. My next toy project will most likely use Ash to learn it a bit better.

3 Likes

Thanks for your response, I’m glad I’m not the only one with these thoughts.

I definitely agree about designing the functions/behaviour first and just like you, when I hear people start talking about the database design first, I feel like we’re heading into a trap but don’t know how to change direction.

2 Likes

Thanks for such a thorough reply. I certainly agree that using JSON columns for data with a known shape is a bit weird for all the reasons you mentioned.

I mostly have experience with Ruby and Rails, where polymorphic associations exist, although they are mostly managed at the application level.

I know that either option is usable, but neither seems optimal. We can have the application design we want at the cost of giving up some features of the database, or have an appropriate database design at the cost of more application-level complexity. The question is how much this can be improved with more features in Ecto.

I haven’t yet found an opportunity to play around with Ash, but it definitely looks very interesting.

1 Like

I don't disagree with your and @sodapopcan's premise, but it also has to be said that usually code follows data, i.e. the most important thing you will do is to design the data (well or not). Throughout my long career I've noticed that a lot of deficient code has been written just to accommodate bad data design.

Obviously this doesn't mean an all-hands meeting where every minute schema detail is discussed. But it can't be brushed away as unimportant either.

This also ties to the JSON columns: they are a good escape hatch so you don’t have to use a NoSQL DB or a plain KV store. Shove semi-unstructured data in there, write meticulous code to make use of them, observe and watch and when certain common usage patterns emerge, start plucking data out of the JSON columns into RDBMS columns proper.

Not easy by any stretch but I haven’t found a better way.

5 Likes

If you don't need to query the schemaless data, you could also dump it into a TEXT field as an Erlang binary:

  def serialize(term) do
    term
    |> :erlang.term_to_binary()
    |> Base.encode64()
  end

  # Plug.Crypto.non_executable_binary_to_term() is used as events may contain user input
  # and deserializing executable code, such as functions, is unsafe.
  #
  # For more information see: https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/serialisation.html
  #
  def deserialize(binary) do
    binary
    |> Base.decode64!()
    |> Base.decode64!()
    |> Plug.Crypto.non_executable_binary_to_term()
  end

Unlike JSON, you'll get the exact struct (or other term) back (e.g. dates will be dates, not strings).
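
A quick roundtrip, assuming the two functions above live in a hypothetical Serializer module:

# The term survives intact, including the Date struct.
term = {:seasonal, :spring, ~D[2024-03-01]}

term
|> Serializer.serialize()
|> Serializer.deserialize()
#=> {:seasonal, :spring, ~D[2024-03-01]}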

1 Like

I’m currently involved in yet another domain modeling exercise where all the database tables are being designed upfront. I have a nagging feeling that a classic mistake is being made for the N-millionth time. Does anyone have any goto resources they like for not falling into this trap? I remember watching a talk a few years ago which I’m having trouble finding now. Am I even making sense??

You can try Event Storming; it will focus you on the domain, not the data.

2 Likes

Your comment made me realize something interesting. I think we’re encountering the same issue as the object-relational (or graph-relational) mismatch. When designing the database first, we end up conforming not just the application code to it (which can be acceptable to a certain degree), but also the way our software can be used easily.

And the users won’t be happy when using the app the way they want is awkward because of the decisions we made about the storage mechanism. A classic example would be when data can be entered into the system partially, like information about products being sold. But of course, in order for customers to buy the product, all of the information needs to be in the system.

When thinking about the process first, we would most likely realize the need to strictly separate the work-in-progress products from ones that can be sold. But in practice, there’s probably just one table for products with a column which marks that the product can be sold. Inevitably, products with incomplete information end up being sold, creating problems all over the place.

I've actually done example mapping before! It was on a project with a very small model, though, and the primary domain modelling had already been done. It was a very UI-heavy app (a Gantt chart) and I think there were a total of maybe 8 business models? I feel like it was fewer, but this was a few years ago. Everything had already been modelled and there were far more changes to verbs than nouns in general.

I'm not arguing that data isn't the most important thing, but that modeling it based on how it's stored is not ideal. Using behaviour to inform its design is (likely) better. This is what I understood @Sorc96's premise to be. I think I agree, but I just don't have the experience or knowledge to back that up, which is why I asked if anyone had any really good resources. I assume it's limited, as I don't know of a web framework that doesn't push you to model data based on its storage, and it seems like one of those things that most people are going to say "It's what everyone already knows and is good enough."

I didn’t respond to this earlier but having composed_of in Ecto would be great! A friend of mine actually did a really good 2018 Rails Conf talk that includes composed_of. I feel a little dirty slipping a plug in here, but it’s a really good talk!

1 Like

I wonder if this is really true though?

I think there can certainly be certain features of frameworks that make it more or less difficult to design storage agnostic models but at the end of the day there will always be tradeoffs for doing so. ActiveRecord’s philosophy was very clearly to help you forget about the DB and as a novice developer I drank deeply of that particular kool aid (even before NoSQL became a real buzzword) because I found SQL to be a boring, terrible API in comparison to Ruby, and as the runtime env for my application the latter felt much closer to the “life” of my application. But even AR had aspects that clearly were geared to relational dbs (hard to imagine “associations” outside the context of a relational db). And isn’t that because relational dbs have features that, even if they sometimes seem like purely technical, implementation details, are really part of the value you are offering your users? “Persistence” isn’t some weird accidental part of software, it’s a real world need with its own logic that most frameworks want to expose.

So I guess the answer is yes, frameworks do push that, but not in an essentially different way than they push you to validate inputs, etc.

2 Likes

I think this is a good thing, honestly. I love that you can write your own queries in Ecto and you don't have to rely on the higher-level abstractions that ORMs from languages like C#/Java offer. The problem with those types of ORMs is that they are opinionated in a lot of their decisions, not flexible enough and not extensible.

I always tend to showcase how easily Ecto queries can be extended:

  defmacro st_transform(wkt, srid) do
    quote do: fragment("ST_Transform(?, ?)", unquote(wkt), unquote(srid))
  end
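
For example (Location and geom are made-up names), the macro drops straight into a query:

  # Assuming `import Ecto.Query` and that the module defining st_transform/2 is imported.
  from l in Location,
    select: st_transform(l.geom, 4326)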

Good luck implementing that with an ORM that tries to abstract everything.

1 Like

I think Ecto has a hard time being explained because, being a data mapping tool, it is great at two edges of the system – the one where data comes in and the one where data goes out (usually for storage). In a lot of cases, where data is shaped to allow for it, it's very easy to just not think of those as two edges, but to wire data right through. That's therefore a lot of what we see in code shared around Elixir. Imo this is however just a lazy man's shortcut, fair to be used where possible, but which needs to be acknowledged as not the way forward where it no longer works. If there's a need for richer domain modeling, which deviates from the constraints of storing the data, then that's imo no longer Ecto's job, even if Ecto might be part of implementing it. That's one of the places where Ecto deviates hard from what ORMs commonly do. Ecto wants to model the data stored in your db by embracing that the data needs to go into a db, instead of finding magic ways to let you do things dbs cannot do well, like polymorphic relationships.

If one is fine with the lack of referential integrity of ref_id, ref_type columns then that can be built with ecto. Others might go with abstract tables polymorphism, which retains referential integrity. Or you might not need to have foreign keys, so you opt for polymorphic data in a json column. One might even use all three options depending on their distinct tradeoffs.
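
As a rough sketch of the first option (names made up), it's just two plain columns plus application-level dispatch:

# In a migration: no foreign key, so the database cannot enforce integrity.
add :ref_type, :string, null: false
add :ref_id, :integer, null: false

# In the application: resolve the reference manually based on ref_type.
def resolve(%{ref_type: "book", ref_id: id}), do: Repo.get(Book, id)
def resolve(%{ref_type: "author", ref_id: id}), do: Repo.get(Author, id)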

6 Likes

I have the same experience. It's probably because of how easy it is to get started with this approach. Most people, especially less experienced devs, don't want to start thinking about mapping between UI, domain and persistence layers. Just define one model/schema, have it go from the HTML form all the way to the database, and you're done.

But as the application becomes more and more complex, this approach starts to take its toll. Suddenly, domain logic is doing way too many checks for underlying details of the persistence layout, database tables are designed suboptimally to avoid even more application level complexity, and custom mapping starts happening all over the place because the UI doesn’t match the database 1:1.

I’m not sure if I will ever be truly happy with any solution, but I’m currently at a point where I think more ORM features (like composed_of) could help a lot.

1 Like

I just realized the same thing after writing the initial post. Ecto has amazing capabilities for working with the database and changesets work well for validating the data that comes in from the outside. But having these two concerns (and also domain logic) align is a happy accident that cannot be relied upon.

I think the weakest part is probably Ecto.Schema. It provides enough features that I don’t want to work without it, while also constraining design significantly. In order to make schemas work better without custom mapping to other data structures, they would need to support more mapping features like composed_of mentioned above.

I also think schemas would need more versatility in terms of the underlying representation. Maybe a schema should not be tied to a single table. Why not allow any queryable? That way, schemas could automatically be limited to certain rows by a where clause or combine data from multiple tables using join. This would allow schemas to be used with the full power of Ecto.Query. However, I am aware that this could also make insert/update impossible, but I still think the benefits to read operations would be worth it.

EDIT: To make my point clearer, I think the problem is that we have to either use Ecto.Schema and give up on the powerful features of Ecto.Query, or use the full potential of Ecto.Query in our custom mapping functions, but then we're giving up the benefits of Ecto.Schema.

2 Likes

def schema_from_two_tables do
  # assumes `import Ecto.Query`; the virtual-ish :from_b key just needs to exist on Schema
  from a in "a",
    join: b in "b",
    on: a.id == b.a_id,
    select: %Schema{id: a.id, from_b: b.data}
end

Nobody requires you to query based on the table defined on the schema macro. The name provided to the schema table is also never validated to be an actual valid table name. It is only required to be valid if you try to query by it. E.g. {"filtered_people", Person} is a queryable, to query values using a schema from a separately provided table.
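
For example (assuming a filtered_people view or table whose columns match the Person schema):

# Reads Person structs, but from the "filtered_people" source instead of "people".
Repo.all(from p in {"filtered_people", Person})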

So I'd argue you already have all the freedom to query schemas how you like; I think we're just having a hard time communicating all of that to people. You are correct, though, that inserts/updates are a lot trickier that way. If you need inserts/updates you can look into updatable views with Postgres, but I'd also argue that CQS, where you have separate schemas/models for querying and writing, is something reasonable to look at.

5 Likes