Ecto.Schema / Ecto.Changeset - optional, conditionally present fields

I have a – let's call it 'Parent' – schema, which embeds_one a child schema: basically a set of largely mutually exclusive key/value pairs. Something like this (simplified here; the actual schema is far larger):

    embedded_schema do
      field :type, :string
      field :time, :time
      field :dom, :integer
      field :dow, :integer
    end

Depending on the value of :type, one of the other fields is required while the rest are not relevant. I do all the casting and validating, including a conditional validate_required in custom validation functions, and I end up with a ready-to-use, valid changeset that gets persisted to a JSON column in the DB.
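The conditional part looks roughly like this (a simplified sketch rather than the actual code; the function names, and the possible :type values other than "dow", are just for illustration, and the module imports Ecto.Changeset):

    def changeset(child, attrs) do
      child
      |> cast(attrs, [:type, :time, :dom, :dow])
      |> validate_required([:type])
      |> validate_fields_for_type()
    end

    # Only the field matching :type is required; the others stay nil.
    defp validate_fields_for_type(changeset) do
      case get_field(changeset, :type) do
        "time" -> validate_required(changeset, [:time])
        "dom" -> validate_required(changeset, [:dom])
        "dow" -> validate_required(changeset, [:dow])
        _ -> changeset
      end
    end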

The problem is that ALL the fields get stored, even when they are not needed. I would like only the relevant fields to end up in the DB, but at best I get something like:

{
  "type": "dow", 
  "time": null,
  "dom": null,
  "dow": 1
}

What can I do to have only

{
  "type": "dow", 
  "dow": 1
}

persisted, instead?


Instead of embedding a record, can't you just have that be a free-form singular JSON field whose contents depend on the type column? That's what I would do, because it also closely emulates what strongly statically typed languages do with sum types (enums in Rust).


Theoretically yes, of course. Although:

  1. in practice the schema is far more complex and there are also subtypes involved;
  2. that would mean I'd have to rewrite quite a bit and then handle validation and errors (on the embedded schema) myself.

Thanks for an idea though.

I suspected that the complexity would make this non-viable; shame.

Then I suggest having a number of custom Ecto.Types whose dump callbacks skip fields with nil values. That seems like an okay-ish compromise, though whether you'll find the resulting scattering of validation logic across more files beneficial or annoying is up to you.
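Something along these lines, for example (an untested sketch; the module name and the :map storage type are assumptions about your setup):

    defmodule MyApp.ScheduleSpec do
      use Ecto.Type

      # Stored as a JSON/JSONB map column in the DB.
      def type, do: :map

      # Casting/validation can stay wherever it already lives; accept plain maps here.
      def cast(%{} = map), do: {:ok, map}
      def cast(_), do: :error

      # Drop nil values right before the data is written to the DB.
      def dump(%{} = map),
        do: {:ok, map |> Enum.reject(fn {_k, v} -> is_nil(v) end) |> Map.new()}

      def dump(_), do: :error

      # Whatever was stored comes back as-is.
      def load(%{} = map), do: {:ok, map}
      def load(_), do: :error
    end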

Indeed. What I have now works, but I feel like I am shovelling tons of pure "noise" into the DB. Currently the signal-to-noise ratio in that column is less than 1:10, and it will possibly get worse later…

Thank you for an even better idea.

I take it there's no way of defining a custom type for the whole schema, is there? So it comes down to defining each field in the embedded schema as a custom type, right?

Each composite field, yes. For example, if a column's content needs to vary wildly depending on the value of another column, then that first column is a prime candidate for a custom type.

You could also do it field by field, but that would be overkill IMO.

What you’re really looking for here is a “polymorphic embedded schema”, which is something that comes up every so often. I think someone even wrote a library for it, though I’ve never used it.

My personal opinion: If your data is unstructured, just use a map field (JSON column) and don’t worry so much about the schema - that’s what unstructured means, after all!

If your data is structured and you are worried about the schema, then you should probably be using actual tables. That’s what they’re for!

Of course, if you wanted to go in the opposite direction with an even more degenerate solution, I bet you could put the embedded schema in a virtual field, parse it back out at the end of the parent's changeset, drop the nulls, and dump the result into a real map field. Then again, your data wouldn't be structured on the way out of the DB, though you could probably cook up a solution to that too if you really wanted. The Ecto.Type hack is probably better, though.
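Very roughly, something like this (an off-the-cuff sketch I haven't run; the Config and :config_data names are made up, and it assumes `import Ecto.Changeset` in the parent module):

    # In the parent's changeset: validate the embedded struct separately,
    # then write only its non-nil fields into a plain :map column.
    defp put_clean_config(changeset, config_attrs) do
      config_cs = Config.changeset(%Config{}, config_attrs)

      if config_cs.valid? do
        cleaned =
          config_cs
          |> apply_changes()
          |> Map.from_struct()
          |> Enum.reject(fn {_k, v} -> is_nil(v) end)
          |> Map.new()

        put_change(changeset, :config_data, cleaned)
      else
        add_error(changeset, :config_data, "is invalid")
      end
    end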


So maybe I misunderstood, actually. My embedded_schema consists of many fields of mostly primitive types, and I kind of hoped I could use that "model" as a custom type, where I would implement the callbacks (especially dump/1) to serialise the map into compacted JSON on its way to the DB. I guess what you're saying now is that I may need to rethink it and split the current schema into multiple complex custom types which would be stored alternatively, so that the fields which are not relevant for a given combination of values would not make it into the DB. But that also means I would need to rewrite the validations (as you mentioned, I now think). No idea how yet, though.

There’s a library available for defining custom types, which allows you to dump or load values based on those types. It might be helpful for your use case: Parameter — Parameter v0.14.1

That probably depends on how we define "unstructured". It is a rather large configuration structure (which is what I understand as "structured") with multiple possible combinations of elements and rules governing which "slave" fields need to be present for a given combination of values in some "master" fields. Still, I don't feel like using an associated table for it, because a) for a 1:1 relation that means unnecessarily more complex and less performant queries, and b) the config schema/structure is being actively developed and iterated upon, so I expect many more changes to come, which would mean lots of DB schema migrations with all their possible drawbacks. Doing it in a single map/JSON column seems like the more suitable approach, especially since in the end the selected, relevant set of configuration fields is always much smaller than the full schema.

That sounds about right :slight_smile:

Don’t get me wrong, you can and you should. My main point is that you probably should not go for a singular God field.

Consider these sum types (expressed as Rust enums):

enum SocialNetwork {
  GitHub,
  GitLab,
  BitBucket,
  LinkedIn,
  // ...
}

enum UserProfile {
  NormalUser {
    first_name: String,
    last_name: String,
    email: String
  },
  SocialNetworkUser {
    profile_id: String,
    social_network: SocialNetwork
  }
}

I'd advocate for you to have two DB fields: type and content, where type would be "normal_user" or "social_network_user" and content would hold everything else.
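In Ecto terms that would be roughly (just a sketch; the table and field names are illustrative):

    schema "user_profiles" do
      field :type, :string   # "normal_user" | "social_network_user"
      field :content, :map   # only the keys relevant for that type
    end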

What I was trying – unsuccessfully – to convey to you is to separate at least types and contents, multiple times if necessary. You already mentioned these types of data will be in flux for a while so maybe a singular God field is the best option and you should just focus on app-level validation. Up to you. Maintainability should be the first priority.

I think this is very reasonable. Don’t take my reductive opinion too seriously :slight_smile:

I agree config data tends to be further on the “unstructured” side of the spectrum, though I don’t have any formal theories as to why (it really just is). Personally I’ve taken to storing such data as pure key/value pairs in the database (one KV per row) and then validating them (and choosing defaults) at runtime. This was the only approach that offered enough flexibility for me, and I think it’s in line with most configuration systems in general (it is very common to store configs as yml or json files for non-web apps, which are very unstructured formats).
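For a concrete picture, the table is basically just this (a sketch in migration terms; the names and the per-user scoping are only one example):

    create table(:config_entries) do
      add :user_id, references(:users)
      add :key, :string, null: false
      add :value, :string   # encoded on write, decoded/validated on read
    end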

I did try an embedded schema approach first (as you have), but there is actually another performance problem lurking which you should probably consider: it is impossible to selectively update the JSON blobs in an RDBMS (I’m using Postgres, probably you as well).

So every update has to rewrite the entire config. I wanted to keep the option open to store a lot of config pairs if needed (it’s somewhat involved but my configs are often per-object), and I didn’t want to have to rewrite all of the other options each time. Plus, there’s the fact that Ecto dumps all the null keys into the blob too (which you have, of course, discovered).

Even worse, Postgres has a habit (due to its ancient and bad MVCC storage) of writing a brand-new copy of a tuple every time you update it, so this is even more costly than it seems. Those copies will eventually be vacuumed, of course (which is not free either!). Note that this is not necessarily as bad as it sounds because of HOT updates, which should kick in since I doubt you are indexing your config blobs!

But anyway, if you imagine having a hundred or a thousand config keys, one big blob starts to sound like a pretty bad idea.

The question is “how”. I don’t see any straightforward path towards that, unless I drop the embeds_one from my “parent” model.

I see, roger.

Thanks for pointing that out. This doesn't bother me that much, though, as this config is read-heavy. If it gets completely replaced once in a while, that matters far less than it being read "all the time".

Yup, precisely that.

No, I don't :slight_smile: I index data by which I want to retrieve (possibly aggregated) information, and I don't need to aggregate any information on configs as of now :wink:

It does. While I am not in the thousand-key range and don't expect to ever get there, a hundred doesn't seem completely improbable. Which is the main reason I started this thread – how do I store only what is needed for a given configuration rather than just "everything", 90% of which is noise?


To expand a bit on what I did (and to be clear, I'm not necessarily suggesting you do this; it depends on your use case): I essentially decided to bypass Ecto for my config storage. I wrote a Config module that directly interfaces with the underlying key/value rows and then wrote my own load/dump functions to encode whatever types I want (I think I still only have strings and bools; maybe some day I'll need numbers, but probably not much more than that).

Throw in a function to grab multiple keys at once (to avoid round trips) and that's about all I ended up needing for basic config functionality. And again, for my use case I really like that I can write keys individually without having to re-encode the whole blob. Validation is done on the way out, which means I can, for example, set a default with a simple Config.get(user, :string, "key", "default"), and the data stays sparse in the DB.
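As a rough sketch of the shape of that module (heavily simplified; the module and table names, the per-user scoping, and the type atoms here are just illustrative, not the real code):

    defmodule MyApp.Config do
      import Ecto.Query
      alias MyApp.Repo

      # Read a single raw value with a schemaless query (no Ecto schema involved)
      # and decode it at runtime, falling back to the given default.
      def get(user, type, key, default) do
        query =
          from c in "config_entries",
            where: c.user_id == ^user.id and c.key == ^key,
            select: c.value

        case Repo.one(query) do
          nil -> default
          raw -> decode(type, raw, default)
        end
      end

      defp decode(:string, raw, _default), do: raw
      defp decode(:boolean, "true", _default), do: true
      defp decode(:boolean, "false", _default), do: false
      defp decode(_type, _raw, default), do: default
    end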

This is just one path you could take of course, and if you’ve already shipped something then you probably wouldn’t want to bother migrating. Just offering another perspective :slight_smile: