Ecto embedded_schema storing extra/default values for fields that aren't specified

onomated · November 23, 2020, 12:53pm

I’m in the process of migrating a codebase and accompanying data to Elixir/Ecto, and I notice that my embedded schema (backed by jsonb columns) now include all fields defined in the schema, even when values for those fields are not provided.

Consider the following data model:

Asset model to hold info regarding an attachment asset. Not all info will be known/provided:

defmodule Asset do
  use Ecto.Schema

  @primary_key false

  embedded_schema do
    field :url, :string
    field :filename, :string
    field :mime_type, :string
    field :size, :integer
    field :width, :integer
    field :height, :integer
  end

  def changeset(asset, attrs) do
    asset
    |> cast(attrs, [:url, :mime_type, :width, :height])
  end
end

Attachment model holds info regarding media attachments which includes different assets of varying “editions” i.e. original and derived editions of the asset such as thumbnails, medium squares etc

defmodule Attachment do
  use Ecto.Schema

  schema "attachments" do
    field :attachment_provider, AttachmentProvider
    field :attachment_type, AttachmentType

    field :attached_count, :integer, read_after_writes: true

    timestamps(updated_at: false)

    embeds_one :content_data, ContentData, on_replace: :delete, primary_key: false do
      field :provider_id, :string
      
      embeds_one :original, Asset
      embeds_one :full, Asset
      embeds_one :medium, Asset
      embeds_one :thumb, Asset
    end
  end

Attachments may come from direct user uploads, or third party links. With third party links there are no editions. On saving third party links with no editions:

attrs = %{
          attachment_provider: :some_provider,
          attachment_type: :image,
          content_data: %{
            provider_id: "deadbeef",
            original: %{
              url: "https://example.com/image-link"
            }
          }
        }

%Attachment{}
    |> change(attrs)
    |> Repo.insert!()

This results in the following jsonb content_data column data:

{
  "full": null,
  "thumb": null,
  "medium": null,
  "original": {
    "url": "https://example.com/image-link",
    "size": null,
    "width": null,
    "height": null,
    "filename": null,
    "mime_type": null
  },
  "provider_id": "deadbeef"
}

Is there a way to get this to store just the provided data? So the goal here is to store the following only:

{
  "original": {
    "url": "https://example.com/image-link"
  },
  "provider_id": "deadbeef"
}

Take another example, where there are editions provided, but not all the fields are known:

attrs = %{
          attachment_provider: :us,
          attachment_type: :image,
          content_data: %{
            full: %{
              url: "https://example.com/our-uploads-full.jpg",
              filename: "our-uploads-full.jpg",
              mime_type: "image/jpeg"
            },
            medium: %{
              url: "https://example.com/our-uploads-med.jpg",
              filename: "our-uploads-med.jpg",
              mime_type: "image/jpeg"
            },
            thumb: %{
              url: "https://example.com/our-uploads-thumb.jpg",
              filename: "our-uploads-thumb.jpg",
              mime_type: "image/jpeg"
            }
          }
        }

%Attachment{}
    |> change(attrs)
    |> Repo.insert!()

The resulting jsonb content_data column data is:

{
  "full": {
    "url": "https://example.com/our-uploads-full.jpg",
    "size": null,
    "width": null,
    "height": null,
    "filename": "our-uploads-full.jpg",
    "mime_type": "image/jpeg"
  },
  "thumb": {
    "url": "https://example.com/our-uploads-thumb.jpg",
    "size": null,
    "width": null,
    "height": null,
    "filename": "our-uploads-thumb.jpg",
    "mime_type": "image/jpeg"
  },
  "medium": {
    "url": "https://example.com/our-uploads-med.jpg",
    "size": null,
    "width": null,
    "height": null,
    "filename": "our-uploads-med.jpg",
    "mime_type": "image/jpeg"
  },
  "original": null,
  "provider_id": null
}

vs:

{
  "full": {
    "url": "https://example.com/our-uploads-full.jpg",
    "filename": "our-uploads-full.jpg",
    "mime_type": "image/jpeg"
  },
  "thumb": {
    "url": "https://example.com/our-uploads-thumb.jpg",
    "filename": "our-uploads-thumb.jpg",
    "mime_type": "image/jpeg"
  },
  "medium": {
    "url": "https://example.com/our-uploads-med.jpg",
    "filename": "our-uploads-med.jpg",
    "mime_type": "image/jpeg"
  }
}

This results in a ton of extra space taken for millions of records. I feel I can accomplish what I need by storing raw maps, but would like to harness the utilities of schemas i.e. validation, custom types etc. Is this possible with embedded schemas?

thojanssens1 · November 23, 2020, 1:38pm

Did you try using a custom Ecto type? You can then filter out the empty values in your type.

onomated · November 23, 2020, 1:48pm

Haven’t tried custom types, as this doesn’t “feel custom”. I was expecting the behavior I outlined to be the default out of the box. But if that cannot be accomplished, I can consider custom types instead of embedded schemas. Wonder what the trade-offs are… For one, my codebase has a decent amount of inline embeds which are only relevant to the specific model. Moving those out to custom Ecto types loses that convenience.

Great suggestion to try though.

onomated · November 23, 2020, 3:04pm

@thojanssens1

After looking into the implementation of Polymorphic embeds:

And listening to this podcast on ParametizedTypes and the challenges of embeds:

I agree that the use custom types is the way to go since the on_replace option in this particular case is :delete. Wonder how to address this for embeds_many or in cases where on_replace is set to :update. Also will lose embed introspection which I rely on to generate my Absinthe schema.

al2o3cr · November 23, 2020, 4:40pm

Have you measured this? Not trying to be snarky, but PG is complicated enough that “obvious” things frequently aren’t.

onomated · November 23, 2020, 11:27pm

I haven’t measured. But jsonb (i.e. binary json) columns store not only “null” as values (which are likely optimized), but also the corresponding map keys (i.e. field names) for every record. These keys don’t hold any useful information, so still a waste. My particular case above is just one permutation of several factors i.e. number of fields in embeds and number of records. And I’m migrating data from another platform where json was “structured” but didn’t have the unneeded fields stored in the columns. So that’s driving my expectations here. I can measure and report something quantitative, but on a fundamental level, feels like having only the specified fields in the data is not a stretch of expectation

dimitarvp · December 14, 2020, 6:03pm

IMO you should measure anyway. I once thought I am killing my workstation with a DB that has ~10 million records in the entire database (spread over 20+ tables) so I did a before and after and my disk usage wasn’t even 4GB. And that was data accumulated over 3.5 years.

But, if you aren’t satisfied with that… then are you positive that those empty fields are actually stored in the DB at all? Could be that the Elixir code just instantiates a map and only fills the known fields. So you might be observing an artifact of Ecto and not the DB itself.

LostKobrakai · December 14, 2020, 6:19pm

This is not how ecto functions. Embeds are modeled to closely mirror working with assocs – and you cannot skip columns in tables as well. The other thing is: schema definitions are not suggestions. They’re a fixed format for how data has to look like. There’s no skipping keys, there’s just values not changed from the default. Just like the keys for an struct are always present. If you don’t want fixed structure you can always use :map or a custom type.

onomated · December 14, 2020, 6:40pm

This is a 50GB database I’m migrating that uses jsonb features quite extensively. So a lot of data. This isn’t a 1-to-1 migration as some of the json columns were migrated to actual table columns for better DB performance. , stale records pruned etc. Also, due to the migration time already spent prior to noticing the issue in a staging environment, I couldn’t restart and get actual numbers, but I would say it was in the order of a few gigs in extra space, and that’s with some columns migrated from json. Either way, its just more data added by virtue of the app layer that I wasn’t expecting. But I created a custom :map type that utilizes an embedded schema definition for validation, and that got me back to what was expected. I’ll share below.

Whoops sorry @dimitarvp, didn’t address your second point there. Yes, I’m sure the empty fields are stored in the DB as an artifact of Ecto’s embedded_schema representation. The migration code creates changesets from raw maps, not Ecto structs. The maps are only populated with “non-empty” fields. On writing to the DB, non-specified fields are populated with default/null values. The data model I’m migrating used jsonb in several instances for flexibility in modeling polymorphic fields that are present or not based on data context.

onomated · December 14, 2020, 6:49pm

Agreed, as I absolutely like that schema is enforced. I think the semantics I bring up here is how “empty/non-existent” is represented in the actual DB storage. The rails codebase I’m migrating from enforced jsonb schema, but represented empty (which is different from null), as truly empty in the storage.
But Ecto was awesome enough for me to get to the representation I desired. Here’s the custom type I created, which was inspired by the links I shared earlier. Only disadvantage of using the map type is that I cannot represent incremental additions to the stored data. Its all or nothing.

defmodule Ecto.CompactEmbed do
  @moduledoc """
  Custom ecto type for embedded map types that get compacted on saving to data.
  "Raw" ecto embedded schemas include all fields in the json representation in the database even if the field is not specified.
  This compacts the json fields by removing all null fields or applying the supplied `:on_compact` option.

  For more discussion see https://elixirforum.com/t/ecto-embedded-schema-storing-extra-default-values-for-fields-that-arent-specified/35739

  ## Options

    * `:schema` (*Required*) - The (embedded) schema module would otherwise have been specified as the field type.
      By default the schema's changeset/2 function is used to cast the data if provided.
      Otherwise the data is casted to only include the fields defined.

    * `:on_compact` - The function that is invoked to compact the map data before writing to the database.
      This must be function with an arity of 1 that takes the schema map data and compacts/modifies it.
      An MFA tuple may be passed as well.  For example `on_compact: {Utils, :remove_empty}` or `on_compact: {Utils, :remove_empty, []}`
      If this is not specified, all null fields will be removed by default

  ## Example:
    # Define schema of compacted embed
    defmodule Asset do
      use Ecto.Schema

      @primary_key false

      embedded_schema do
        field :url, :string
        field :filename, :string
        field :mime_type, :string
        field :size, :integer
        field :width, :integer
        field :height, :integer
      end
    end

    # Use in schema field
    defmodule Attachment do
      use Ecto.Schema

      schema "attachments" do
        field :asset, Ecto.CompactEmbed, schema: Asset
      end
  """

  # Loosely modeled after:
  #   https://github.com/elixir-ecto/ecto/blob/master/lib/ecto/embedded.ex
  #   https://github.com/mathieuprog/polymorphic_embed/blob/master/lib/polymorphic_embed.ex

  use Ecto.ParameterizedType

  alias __MODULE__

  @type t :: %CompactEmbed{
          schema: Ecto.Schema.t(),
          on_compact: fun() | {module(), atom()} | {module(), atom(), []}
        }
  defstruct [:schema, :on_compact]

  ## Type behavior

  @impl true
  def type(_params), do: :map

  @impl true
  def init(opts) do
    schema = Keyword.fetch!(opts, :schema)
    on_compact = opts[:on_compact]

    on_compact =
      case on_compact do
        nil ->
          &Utils.Presence.sans_nil(&1)

        {mod, fun} when is_atom(mod) and is_atom(fun) ->
          fn map ->
            apply(mod, fun, [map])
          end

        {mod, fun, []} when is_atom(mod) and is_atom(fun) ->
          fn map ->
            apply(mod, fun, [map])
          end

        compacter when is_function(compacter, 1) ->
          compacter

        _ ->
          raise ArgumentError,
                "invalid `:on_compact` option for #{inspect(on_compact)}. " <>
                  "This must be a function with an arity of 1 that takes a map and returns the compacted/modified map." <>
                  "An MFA may be passed as well.  For example `on_compact: {Utils, :remove_empty}` or `on_compact: {Utils, :remove_empty, []}`"
      end

    %CompactEmbed{schema: schema, on_compact: on_compact}
  end

  @impl true
  def cast(nil, _), do: {:ok, nil}

  def cast(attrs, %{schema: schema}) do
    schema
    |> cast_to_changeset(attrs)
    |> case do
      %{valid?: true} = changeset ->
        {:ok, Ecto.Changeset.apply_changes(changeset)}

      changeset ->
        {:error, build_errors(changeset)}
    end
  end

  @impl true
  def load(nil, _loader, _params), do: {:ok, nil}

  def load(data, _loader, %{schema: schema}) do
    struct =
      cast_to_changeset(schema, data)
      |> Ecto.Changeset.apply_changes()

    {:ok, struct}
  end

  @impl true
  def dump(nil, _dumber, _params), do: {:ok, nil}

  def dump(data, _dumper, %{on_compact: on_compact}) do
    case Ecto.Type.dump(:map, data) do
      {:ok, map} -> {:ok, compact_map(map, on_compact)}
      error -> error
    end
  end

  @impl true
  def embed_as(_, _), do: :dump

  @impl true
  def equal?(term1, term2, %{on_compact: on_compact}) when is_map(term1) and is_map(term2) do
    case MapDiff.diff(compact_map(term1, on_compact), compact_map(term2, on_compact)) do
      %{changed: :equal} -> true
      _ -> false
    end
  end

  def equal?(term1, term2, _params), do: term1 == term2

  ## Privates

  defp cast_to_changeset(%schema{} = struct, attrs) do
    if function_exported?(schema, :changeset, 2) do
      schema.changeset(struct, attrs)
    else
      embed_fields = schema.__schema__(:embeds)
      non_embed_fields = schema.__schema__(:fields) -- embed_fields

      Ecto.Changeset.cast(struct, attrs, non_embed_fields)
      |> cast_embeds_to_changeset(embed_fields)
    end
  end

  defp cast_to_changeset(module, attrs) when is_atom(module) do
    cast_to_changeset(struct(module), attrs)
  end

  defp cast_embeds_to_changeset(changeset, embed_fields) do
    Enum.reduce(embed_fields, changeset, fn embed_field, changeset ->
      Ecto.Changeset.cast_embed(
        changeset,
        embed_field,
        with: fn embed_struct, data ->
          cast_to_changeset(embed_struct, data)
        end
      )
    end)
  end

  defp compact_map(%_module{} = struct, on_compact) do
    struct
    |> Map.from_struct()
    |> compact_map(on_compact)
  end

  defp compact_map(map, on_compact) when is_map(map), do: on_compact.(map)

  defp build_errors(%{errors: errors, changes: changes}) do
    Enum.reduce(changes, errors, fn {field, value}, all_errors ->
      case value do
        %Ecto.Changeset{} = changeset ->
          Keyword.merge([{field, {"is invalid", changeset.errors}}], all_errors)

        _ ->
          all_errors
      end
    end)
  end
end

ericdude4 · February 24, 2021, 6:17pm

Hey @onomated, did you end up using this Ecto.CompactEmbed approach in your application? Was it successful? Did you end up making an open source project for this?

onomated · February 25, 2021, 2:35pm

@ericdude4 Yes I did and so far so good. I should open source as the sample code I shared above has some bugs since writing specs for it. Haven’t released a package for elixir before so this will be my first. Let me know if you want to work on this together. Regardless, I’ll create a package and let you know when its up.

catra · February 27, 2024, 8:14pm

Hey @onomated! I know is being a while from your last reply but I’m wondering if you could share a gist or something similar on how you ended building your Ecto.CompactEmbed.

onomated · February 27, 2024, 8:53pm

Oh yes! I meant to contribute this as a hex package. Only hesitation is that I haven’t released a hex package before. It ended up being a couple of modules that took heavy inspiration from Ecto embeds, and it’s been working great. Let me see if I can find guidance on releasing hex packages so I can release as version 0.0.1 and refine as others try it out