Schema "inheritance" in Ecto

I currently have a good use case with Ecto schemas: I need to extend one schema with vector (text embedding) fields that I do not want to interact with through the regular schema. And I do not want to repeat all the regular fields and the primary key / timestamp type attributes on the extended schema.

I am afraid Ecto schemas muddy the waters: since they define structs, do you want to copy all struct fields? What about the database? Are you going to copy the database table structure as well?

@dimitarvp mentioned the XY problem before. You should describe the problem and outline the different solutions you have considered. Perhaps extension does achieve the best trade-offs. But I am afraid a paragraph describing a possible use case is not enough context.

4 Likes

This is what I meant by schema inheritance:

defmodule MyApp.MySchema do
  import Ecto.Changeset
  use Ecto.Schema

  @primary_key {:id, :binary_id, autogenerate: true}
  @timestamps_opts [type: :utc_datetime_usec]
  schema "my_table" do
    field :name, :string
    field :age, :integer
    # ...
    # ...
    # lots of fields

    timestamps()  
  end
end

defmodule MyApp.MySchemaWithEmbeddings do
  import Ecto.Changeset
  use Ecto.Schema

  schema "my_table" do
    copy_all_fields_from MyApp.MySchema

    field :some_embedding_1, Pgvector.Ecto.Vector
    field :some_embedding_2, Pgvector.Ecto.Vector
    field :some_embedding_3, Pgvector.Ecto.Vector
    field :some_float_1, :float
    field :some_float_2, :float
    field :some_float_3, :float
  end
end

Basically the base schema is used often, and most of its fields are useful, so we just select all of them with from(s in MySchema).

But when working with embeddings that are vectors of 512 floats, we may not want to select all that data all the time, handle it in the changeset function, and so on.

So having a schema that extends another one could be useful: we could work with the embeddings and related data only when needed, since they are just a small feature of the app.

Now I guess writing that copy_all_fields_from macro would not be hard, and I do not mind copying the @primary_key or @timestamps_opts attributes. But maybe having an extend_schema variant of the schema macro could be nice.
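For reference, here is roughly what I imagine copy_all_fields_from could look like (untested, the helper module name is made up, and it only handles plain field types, i.e. no virtual, parameterized or embedded fields):

defmodule MyApp.SchemaHelpers do
  # Hypothetical helper: re-declares the plain fields of another, already
  # compiled schema, skipping the primary key and timestamp columns since
  # the extended schema declares those itself.
  defmacro copy_all_fields_from(schema) do
    schema = Macro.expand(schema, __CALLER__)
    skipped = [:id, :inserted_at, :updated_at]

    for name <- schema.__schema__(:fields), name not in skipped do
      type = schema.__schema__(:type, name)

      quote do
        field unquote(name), unquote(type)
      end
    end
  end
end

The extended schema would then import MyApp.SchemaHelpers and call copy_all_fields_from MyApp.MySchema inside its schema block.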

1 Like

Thanks for the context. I think this is a separate discussion: unlike inheritance, which inherits data and behaviour, you just want to share the same schema fields. For this reason, I moved it to a separate thread.

I can think of two other alternatives:

  • Have the embeddings in a separate table, which you treat as a has_one
  • Have the embeddings in the same table, but model them in Ecto using an association that points to itself

The options above have downsides, as they need additional queries to load data, so copy_all_fields_from (and potentially extend_schema in the future) do not sound like bad options.
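A rough, untested sketch of the first option, reusing the field names from your example (the extra table and module names are made up); the second option would look similar, except the extra schema would point back at "my_table" itself:

defmodule MyApp.MySchemaEmbeddings do
  use Ecto.Schema

  @primary_key {:id, :binary_id, autogenerate: true}
  schema "my_table_embeddings" do
    belongs_to :my_schema, MyApp.MySchema, type: :binary_id

    field :some_embedding_1, Pgvector.Ecto.Vector
    field :some_float_1, :float
    # ...
  end
end

# and in MyApp.MySchema:
#   has_one :embeddings, MyApp.MySchemaEmbeddings

Then Repo.preload(record, :embeddings) loads the heavy columns only when you need them.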

PS: I think you can change the title and tags of this thread but, if you cannot, let me know and I can do it!

1 Like

One thing you could do is make a virtual field for the vector data, then only load it when you need it.

EDIT: oops sorry @josevalim I responded right after you split the thread.

The title and tags look fine :slight_smile:

I can think of two other alternatives:

Yeah, I am not ready to test these yet; so far I have only worked on the maths parts in a separate demo codebase inherited from coworkers, and we will now try to make it work in the real app.

The self-pointing association looks cool; I guess the performance hit should be minimal. And I am not sure everyone would be happy with a macro that forces them to learn macros and what __schema__/1 is.

Thank you!

We will decide as a team but I'll update this topic if anyone is interested :slight_smile:

1 Like

That's a very interesting idea! More than once I have tried to model a recurring set of fields on a schema without resorting to a separate database table or an embedded schema that has to live in a jsonb column.

For example: many schemas in an application I am working on need to be associated with a mail address (the analog one, street/city/etc.). I resorted to using associations and a separate table for each of the schemas that needed an address (users got a sibling table users_address, readers got a sibling table reader_address, and so on, with FKs between them). I used the "custom source" option when referencing the Address struct (has_one :address, {"users_address", Address}) to override which table to use each time I associated a schema with an Address, to avoid them all being in the same table (which might not be a problem after all, I guess :thinking: ).
Embedding the address didn't seem right at the time. I needed to be able to search on parts of the address, for example (I know that a lot is possible with jsonb columns, but I didn't have much knowledge about that can of worms at the time).

When I was setting this up I wished I could reuse an Address schema, and have its data in the table of the parent schema. This would have been the best way to model the data from the perspective of the database.

Another alternative I thought about was doing something like timestamps/1 to "inject" the set of fields that model an address into each schema that needs it. But then an address would still seem like a bag of fields, without a "home" of some kind. I'm sure this alternative has other downsides too.

The "self-referencing" association seems like a good alternative. I understand that it has the downside of an additional query, because Ecto doesn't actually know that it can get both schemas' data in one query. Maybe that's something Ecto could improve by making this pattern a first-class citizen. Would there be other downsides?

1 Like

We could optimize some cases, such as joins, but preloads are by definition always separate queries. But that may not be a problem, given the whole intent is that the data may be loaded in different places?
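For example, if there is an :embeddings association as in the sketch above, a join-based preload already keeps it to a single query:

import Ecto.Query

query =
  from s in MyApp.MySchema,
    # left_join so rows without embeddings are still returned
    left_join: e in assoc(s, :embeddings),
    preload: [embeddings: e]

MyApp.Repo.all(query)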

1 Like

I admit that in my example of addresses the Address assoc is not needed in most cases. We're just fine preloading it in scenarios where we actually need the Address.

But I don't think it would be bad to load the assoc-in-same-table by default, as there is not much downside to having it there in case you need it. The only downsides I can think of are the increased memory footprint and the time required to actually load the data from the database into the Ecto struct. That might make a difference, though.

I guess this approach would conflate the difference between fields and assocs a little bit: fields are obviously part of a schema, and are loaded by default (iirc you can opt-out of this default loading behavior), while common assocs have to be preloaded explicitly.
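I believe the opt-out I am thinking of is the :load_in_query option on field, which would also fit the embeddings case from the original post:

# Heavy columns can be excluded from the default select:
field :some_embedding_1, Pgvector.Ecto.Vector, load_in_query: false

Such a field can then be selected explicitly (for example via select or select_merge) only when it is actually needed.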

Old school macros can help here too, no?

defmacro foo_fields() do
  quote do
    field ...
    field ...
    field ...
  end
end

schema "table" do
  foo_fields()
  field ....
end

schema "table" do
  foo_fields()
  field ...
end
8 Likes

I just realized my response is semi-nonsense. It could work, but it is probably more trouble than it is worth, especially compared to the other answers here.

If it isn't a hard requirement that they be separate schemas, using different selects is a pretty simple solution.

defmodule MyApp.MySchema do
  # ... schema definition as above ...

  import Ecto.Query

  @embedding_fields [
    :some_embedding_1,
    :some_embedding_2,
    :some_embedding_3
  ]

  def without_embeddings(query) do
    from query,
      select: ^(__schema__(:fields) -- @embedding_fields)
  end
end

defmodule MyApp.MyContext do
  alias MyApp.{MySchema, Repo}

  def get_schema(id) do
    MySchema
    |> MySchema.without_embeddings()
    |> Repo.get(id)
  end

  def get_schema_with_embeddings(id) do
    Repo.get(MySchema, id)
  end
end

Of course, if your consumer code checks for the presence of the embedding fields, this isn't going to work.

1 Like

Doh, of course! My favorite answer so far.

3 Likes

This is my favorite part of Elixir! Because the compiler is just executing Elixir code, you get all the benefits that most other languages need "structural" features like inheritance for.

Want to share fields between structs?

defstruct [:a, :b] ++ Something.shared_fields()

# elsewhere
defstruct [:c, :d] ++ Something.shared_fields()

Want to share functions between multiple modules?

defmacro shared_functions() do
  quote do
     def function() do
       ...
     end
  end
end

The Elixir compiler gives you all the tools for code reuse without needing to adopt any confusing patterns. All you need to learn is how macros work, and the sky is the limit.

(Obviously you already know these things :smiley:)

Thanks for making the best programming language of all time :heart:

6 Likes

Exactly what I was going to say, and I was wondering if I was missing something basic about how Ecto works that made that difficult. I think it's much more in the spirit of Elixir for devs to roll their own macros for these cases, using whatever naming conventions/API they like best.

1 Like

If the macro can be defined in the same module as the main schema then yes; otherwise I'd rather not add the indirection. Actual inheritance "feels" more straightforward.

On mobile right now, but IIRC the schema block is evaluated at compile time.

I don't think it can be defined in the same module, primarily because modules cannot call macros that they define from their own module body. I'd personally consider that a good thing :slight_smile:

Yeah, well, just define two modules in one file.

Right, of course :laughing:

defmodule Vehicle.Fields do
  defmacro fields() do
    quote do
      field :capacity, :integer
    end
  end
end

defmodule Vehicle do
  use Ecto.Schema
  require Vehicle.Fields
  
  schema "schema" do
     Vehicle.Fields.fields()  
  end
end

defmodule Boat do
  use Ecto.Schema
  require Vehicle.Fields
  
  schema "schema" do
     Vehicle.Fields.fields()  
     field :wheel_count, :integer
  end
end

I've personally dealt with a lot of clarity issues that can arise from the implicit aspects here, so I'm just throwing out a potential alternative: what if you wrote something that used schema introspection to verify this information, instead of injecting it?

defmodule Vehicle.Fields do
  @required_fields [
    capacity: :integer
  ]
  
  defmacro __using__(_) do
    quote do
      @after_compile Vehicle.Fields
    end
  end
  
  def __after_compile__(env, _) do
    for {field, required_type} <- @required_fields do
      type = env.module.__schema__(:type, field)
      
      if !type do
        raise "Must define the field `#{inspect(field)}` on #{inspect env.module}"
      end
      
      if type != required_type do
        raise "The field `#{inspect(field)}` on #{inspect env.module} must be of type #{inspect(type)}, got: #{type}"
      end
    end
  end
end

defmodule Vehicle do
  use Ecto.Schema
  use Vehicle.Fields
  
  schema "schema" do
    field :capacity, :integer
  end
end

defmodule Boat do
  use Ecto.Schema
  use Vehicle.Fields
  
  schema "schema" do
    field :wheel_count, :integer
  end
end

That last module definition would then yield

** (RuntimeError) Must define the field `:capacity` on Boat
    iex:31: anonymous fn/3 in Vehicle.Fields.__after_compile__/2
3 Likes

While true, it's also a tool with drawbacks. Now that I think of it, those drawbacks mainly consist of others not knowing macros... so we should educate everyone!

I do. But I doubt I would use a macro for it. The drawbacks of reduced visibility, some tools struggling with macros, and the available alternatives are, for me, reasons to do a bit more manual work instead of writing a new macro.

For libs it's great, though, as users don't have to update their imports and the lib maintainer can change the fields without worrying about a 'migration guide'.

Your own tool might make the last reason obsolete though :grinning:

2 Likes

Interesting. The only pushback I'd have is that this validation could also be done in a unit test. The code is basically asserting that the fields exist, and maybe that's not something that needs to be checked on every compilation pass (just imagine this in a library: it would then also be checked when compiling the lib after installing it, which is probably not the best timing).
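Something along these lines, for example (a minimal sketch assuming the Vehicle and Boat schemas from above):

defmodule VehicleFieldsTest do
  use ExUnit.Case, async: true

  test "all vehicle schemas define the shared fields" do
    for schema <- [Vehicle, Boat] do
      assert schema.__schema__(:type, :capacity) == :integer,
             "expected #{inspect(schema)} to define :capacity as an :integer"
    end
  end
end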