Schema "inheritance" in Ecto

I currently have a good use case with Ecto schemas: I need to extend one schema with vector (text embedding) fields that I do not want to interact with through the regular schema. And I do not want to repeat all the regular fields and the primary key / timestamp type attributes on the extended schema.

I am afraid Ecto schemas muddy the waters: since they define structs, do you want to copy all struct fields? What about the database? Are you going to copy the database table structure as well?

@dimitarvp mentioned the XY problem before. You should describe the problem and outline the different solutions you have considered. Perhaps extension does achieve the best trade-offs. But I am afraid a paragraph describing a possible use case is not enough context.

4 Likes

This is what I meant by schema inheritance:

defmodule MyApp.MySchema do
  import Ecto.Changeset
  use Ecto.Schema

  @primary_key {:id, :binary_id, autogenerate: true}
  @timestamps_opts [type: :utc_datetime_usec]
  schema "my_table" do
    field :name, :string
    field :age, :integer
    # ...
    # ...
    # lots of fields

    timestamps()  
  end
end

defmodule MyApp.MySchemaWithEmbeddings do
  import Ecto.Changeset
  use Ecto.Schema

  schema "my_table" do
    copy_all_fields_from MyApp.MySchema

    field :some_embedding_1, Pgvector.Ecto.Vector
    field :some_embedding_2, Pgvector.Ecto.Vector
    field :some_embedding_3, Pgvector.Ecto.Vector
    field :some_float_1, :float
    field :some_float_2, :float
    field :some_float_3, :float
  end
end

Basically the base schema is used often, and most of its fields are useful, so we just select all of them with from(s in MySchema).

But when working with embeddings that are vectors of 512 floats, we may not want to select all that data all the time, handle it in the changeset function, and so on.

So having a schema that extends another one could be useful: we could work with the embeddings and related data only when needed, since they are just a small feature of the app.

Now I guess writing that copy_all_fields_from macro would not be hard, and I do not mind copying the @primary_key or @timestamps_opts attributes. But maybe having an extend_schema variant of the schema macro could be nice.
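For reference, here is roughly what I imagine copy_all_fields_from could look like (untested, the helper module name is made up, and it only handles plain field types, i.e. no virtual, parameterized or embedded fields):

defmodule MyApp.SchemaHelpers do
  # Hypothetical helper: re-declares the plain fields of another, already
  # compiled schema, skipping the primary key and timestamp columns since
  # the extended schema declares those itself.
  defmacro copy_all_fields_from(schema) do
    schema = Macro.expand(schema, __CALLER__)
    skipped = [:id, :inserted_at, :updated_at]

    for name <- schema.__schema__(:fields), name not in skipped do
      type = schema.__schema__(:type, name)

      quote do
        field unquote(name), unquote(type)
      end
    end
  end
end

The extended schema would then import MyApp.SchemaHelpers and call copy_all_fields_from MyApp.MySchema inside its schema block.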

1 Like

Thanks for the context. I think this is a separate discussion: unlike inheritance, which inherits data and behaviour, you just want to share the same schema fields. For this reason, I moved it to a separate thread.

I can think of two other alternatives:

  • Have the embeddings in a separate table, which you treat as a has_one
  • Have the embeddings in the same table, but model them in Ecto using an association that points to itself

The options above have downsides, as they need additional queries to load data, so copy_all_fields_from (and potentially extend_schema in the future) do not sound like bad options.
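A rough, untested sketch of the first option, reusing the field names from your example (the extra table and module names are made up); the second option would look similar, except the extra schema would point back at "my_table" itself:

defmodule MyApp.MySchemaEmbeddings do
  use Ecto.Schema

  @primary_key {:id, :binary_id, autogenerate: true}
  schema "my_table_embeddings" do
    belongs_to :my_schema, MyApp.MySchema, type: :binary_id

    field :some_embedding_1, Pgvector.Ecto.Vector
    field :some_float_1, :float
    # ...
  end
end

# and in MyApp.MySchema:
#   has_one :embeddings, MyApp.MySchemaEmbeddings

Then Repo.preload(record, :embeddings) loads the heavy columns only when you need them.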

PS: I think you can change the title and tags of this thread but, if you cannot, let me know and I can do it!

1 Like

One thing you could do is make a virtual field for the vector data, then only load it when you need it.

EDIT: oops sorry @josevalim I responded right after you split the thread.

The title and tags look fine :slight_smile:

I can think of two other alternatives:

Yeah, I am not ready to test these yet; so far I have only worked on the maths parts in a separate demo codebase inherited from coworkers, and we will now try to make it work in the real app.

The self-pointing association looks cool; I guess the performance hit should be minimal. And I am not sure everyone would be happy with a macro that forces them to learn macros and what __schema__/1 is.

Thank you!

We will decide as a team but I'll update this topic if anyone is interested :slight_smile:

1 Like

That's a very interesting idea! More than once I have tried to model a recurring set of fields on a schema without resorting to a separate database table or an embedded schema that has to live in a jsonb column.

For example: many schemas in an application I am working on need to be associated with a mail address (the analog one, street/city/etc.). I resorted to using associations and a separate table for each of the schemas that needed an address (users got a sibling table users_address, readers got a sibling table reader_address, and so on, with FKs between them). I used the "custom source" option when referencing the Address struct (has_one :address, {"users_address", Address}) to override which table to use each time I associated a schema with an Address, to avoid them all being in the same table (which might not be a problem after all, I guess :thinking: ).
Embedding the address didn't seem right at the time. I needed to be able to search on parts of the address, for example (I know that a lot is possible with jsonb columns, but I didn't have much knowledge about that can of worms at the time).

When I was setting this up I wished I could reuse an Address schema, and have its data in the table of the parent schema. This would have been the best way to model the data from the perspective of the database.

Another alternative I thought about was doing something like timestamps/1 to "inject" the set of fields that model an address into each schema that needs it. But then an address would still seem like a bag of fields, without a "home" of some kind. I'm sure this alternative has other downsides too.

The "self-referencing" association seems like a good alternative. I understand that it has the downside of an additional query, because Ecto doesn't actually know that it can get both schemas' data in one query. Maybe that's something Ecto could improve by making this pattern a first-class citizen. Would there be other downsides?

1 Like

We could optimize some cases, such as joins, but preloads are by definition always separate queries. But that may not be a problem, given the whole intent is that the data may be loaded in different places?
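For example, if there is an :embeddings association as in the sketch above, a join-based preload already keeps it to a single query:

import Ecto.Query

query =
  from s in MyApp.MySchema,
    # left_join so rows without embeddings are still returned
    left_join: e in assoc(s, :embeddings),
    preload: [embeddings: e]

MyApp.Repo.all(query)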

1 Like

I admit that in my example of addresses the Address assoc is not needed in most cases. We're just fine preloading it in scenarios where we actually need the Address.

But I don't think it would be bad to load the assoc-in-same-table by default, as there is not much downside to having it there in case you need it. The only downsides I can think of are the increased memory footprint and the time required to actually load the data from the database into the Ecto struct. That might make a difference, though.

I guess this approach would conflate the difference between fields and assocs a little bit: fields are obviously part of a schema, and are loaded by default (iirc you can opt-out of this default loading behavior), while common assocs have to be preloaded explicitly.
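I believe the opt-out I am thinking of is the :load_in_query option on field, which would also fit the embeddings case from the original post:

# Heavy columns can be excluded from the default select:
field :some_embedding_1, Pgvector.Ecto.Vector, load_in_query: false

Such a field can then be selected explicitly (for example via select or select_merge) only when it is actually needed.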

Old school macros can help here too, no?

defmacro foo_fields() do
  quote do
    field ...
    field ...
    field ...
  end
end

schema "table" do
  foo_fields()
  field ....
end

schema "table" do
  foo_fields()
  field ...
end
8 Likes

I just realized my response is semi-nonsense. It could work, but it is probably more trouble than it is worth, especially compared to the other answers here.

If it isn't a hard requirement that they be separate schemas, using different selects is a pretty simple solution.

defmodule MyApp.MySchema do
  # ... schema definition as above ...

  import Ecto.Query

  @embedding_fields [
    :some_embedding_1,
    :some_embedding_2,
    :some_embedding_3
  ]

  def without_embeddings(query) do
    from query,
      select: ^(__schema__(:fields) -- @embedding_fields)
  end
end

defmodule MyApp.MyContext do
  alias MyApp.{MySchema, Repo}

  def get_schema(id) do
    MySchema
    |> MySchema.without_embeddings()
    |> Repo.get(id)
  end

  def get_schema_with_embeddings(id) do
    Repo.get(MySchema, id)
  end
end

Of course, if your consumer code checks for the presence of the embedding fields, this isn't going to work.

1 Like

Doh, of course! My favorite answer so far.

3 Likes

This is my favorite part of Elixir! Because the compiler is just executing Elixir code, you get all the benefits that most other languages need "structural" features like inheritance for.

Want to share fields between structs?

defstruct [:a, :b] ++ Something.shared_fields()

# elsewhere
defstruct [:c, :d] ++ Something.shared_fields()

Want to share functions between multiple modules?

defmacro shared_functions() do
  quote do
     def function() do
       ...
     end
  end
end

The Elixir compiler gives you all the tools for code reuse without needing to adopt any confusing patterns. All you need to learn is how macros work, and the sky is the limit.

(Obviously you already know these things :smiley:)

Thanks for making the best programming language of all time :heart:

6 Likes

Exactly what I was going to say, and I was wondering if I was missing something basic about how Ecto works that made that difficult. I think it's much more in the spirit of Elixir for devs to roll their own macros for these cases, using whatever naming conventions/API they like best.

1 Like

If the macro can be defined in the same module as the main schema then yes; otherwise I'd rather not add the indirection. Actual inheritance "feels" more straightforward.

On mobile right now, but IIRC the schema block is evaluated at compile time.

I don't think it can be defined in the same module, primarily because modules cannot call macros that they define from their own module body. I'd personally consider that a good thing :slight_smile:

Yeah, well, just define two modules in one file.

Right, of course :laughing:

defmodule Vehicle.Fields do
  defmacro fields() do
    quote do
      field :capacity, :integer
    end
  end
end

defmodule Vehicle do
  use Ecto.Schema
  require Vehicle.Fields
  
  schema "schema" do
     Vehicle.Fields.fields()  
  end
end

defmodule Boat do
  use Ecto.Schema
  require Vehicle.Fields
  
  schema "schema" do
     Vehicle.Fields.fields()  
     field :wheel_count, :integer
  end
end

I've personally dealt with a lot of clarity issues that can arise from the implicit aspects here, so I'm just throwing out a potential alternative: what if you wrote something that used schema introspection to verify this information, instead of injecting it?

defmodule Vehicle.Fields do
  @required_fields [
    capacity: :integer
  ]
  
  defmacro __using__(_) do
    quote do
      @after_compile Vehicle.Fields
    end
  end
  
  def __after_compile__(env, _) do
    for {field, required_type} <- @required_fields do
      type = env.module.__schema__(:type, field)
      
      if !type do
        raise "Must define the field `#{inspect(field)}` on #{inspect env.module}"
      end
      
      if type != required_type do
        raise "The field `#{inspect(field)}` on #{inspect env.module} must be of type #{inspect(type)}, got: #{type}"
      end
    end
  end
end

defmodule Vehicle do
  use Ecto.Schema
  use Vehicle.Fields
  
  schema "schema" do
    field :capacity, :integer
  end
end

defmodule Boat do
  use Ecto.Schema
  use Vehicle.Fields
  
  schema "schema" do
    field :wheel_count, :integer
  end
end

That last module definition would then yield

** (RuntimeError) Must define the field `:capacity` on Boat
    iex:31: anonymous fn/3 in Vehicle.Fields.__after_compile__/2
3 Likes

While true, it's also a tool with drawbacks. Now that I think of it, those drawbacks mainly consist of others not knowing macros... so we should educate everyone!

I do. But I doubt I would use a macro for it. The drawbacks of reduced visibility, some tools struggling with macros, and the available alternatives are, for me, reasons to do a bit more manual work instead of writing a new macro.

For libs it's great, though, as users don't have to update their imports and the lib maintainer can change the fields without worrying about a 'migration guide'.

Your own tool might make the last reason obsolete though :grinning:

2 Likes

Interesting. The only pushback I'd have is that this validation could also be done in a unit test. The code is basically asserting that the fields exist, and maybe that's not something that needs to be checked on every compilation pass (just imagine this in a library: it would then also be checked when compiling the lib after installing it, which is probably not the best timing).
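Something along these lines, for example (a minimal sketch assuming the Vehicle and Boat schemas from above):

defmodule VehicleFieldsTest do
  use ExUnit.Case, async: true

  test "all vehicle schemas define the shared fields" do
    for schema <- [Vehicle, Boat] do
      assert schema.__schema__(:type, :capacity) == :integer,
             "expected #{inspect(schema)} to define :capacity as an :integer"
    end
  end
end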