How to represent Keyword list in Ecto / Ecto.Schema

silverdr · May 3, 2024, 8:26pm

I have a data structure that is best represented as List (order matters) rather than Map but each element is a key/value pair so I thought of using Keyword list for that purpose as it fits the bill nicely. Now the question - how does one represent it in Ecto so that it serialises well into JSON column? Do I need to create custom type? Or how would you do it?

ruslandoga · May 4, 2024, 3:57am

@silverdr

Would a list of single KV maps be an option? It would probably be serializable to JSON automatically. It also doesn’t appear to be much more expensive than keyword lists.

iex(3)> :erts_debug.size [%{a: :b}, %{c: :d}]
16
iex(4)> :erts_debug.size [a: :b, c: :d]
10
iex(5)> :erts_debug.size Enum.map(1..100, fn i -> %{i => i} end)
800
iex(6)> :erts_debug.size Enum.map(1..100, fn i -> {i, i} end)
500

03juan · May 4, 2024, 6:51am

Does it matter what the data stored in the JSON looks like, do you need to issue db queries againt it?

If not then an option could be to unzip the KV list into a 2d array in the changeset operation before storing/updating, and post-process the repo query to zip it back up when taking it out again.

fruits = ["apple", "banana", "orange"]
counts = [3, 1, 6]
Enum.zip(fruits, counts)
[{"apple", 3}, {"banana", 1}, {"orange", 6}]

from zipping cheatsheet

Since you’re contemplating a Keyword list I assume your data structure has known atom fields, in which case you can also transform the keys into known atoms during the zipping, otherwise you’d run into atom table issues.

But if keys are known and have set order, then you could just store the values as a sparse array and rehydrate your Keyword list from there

This is probably a good candidate for a custom Ecro type, but without knowing more about your data structure it’s hard to give a more definitive answer.

silverdr · May 4, 2024, 7:13am

That’s how I initially did it but didn’t like what I had. Not even because of the cost you referred to. But rather because accessing values by keys becomes… well… ugly I guess is the word.

Say I have something like:

[a: 12, b: 102, c: 9]

vs.

[%{a: 12}, %{b: 102}, %{c: 9}]

How do I quickly get the value for b: (or whichever) key?

ruslandoga · May 4, 2024, 7:20am

iex(1)> kv = [%{a: 12}, %{b: 102}, %{c: 9}]
[%{a: 12}, %{b: 102}, %{c: 9}]
iex(2)> Enum.find_value(kv, fn kv -> kv[:b] end)
102

silverdr · May 4, 2024, 7:22am

Although I hope(d) for an array of JSON “objects” what you say is probably the easier, possible scenario. Yes, the keys come from a known, limited set and splitting keys and values into separate arrays would allow for relatively easy pre/post processing. “Not good, not terrible”

silverdr · May 4, 2024, 8:17am

Heh… yes, there’s already a function which hides all the ugliness under its hood! But at least it doesn’t hurt the eyes. Thank you, this might be an option in this case - it’s not a traffic heavy spot in the application so let’s see

03juan · May 4, 2024, 8:26am

Because I can’t leave a thread flapping around in my head I tried things out and it’s not easy to use the existing Ecto functionality to store an array of arrays, but you can approximate it with embedded structures and functions to marshall between your data. For example:

defmodule KWList do
  use Ecto.Schema
  import Ecto.Changeset

  schema "kwlists" do
    embeds_many :kw_pairs, Pair do
      field :key, :string
      field :value, :integer
    end

    field :data, :any, virtual: true

    timestamps(type: :utc_datetime)
  end

  @doc false
  def changeset(kw_list, attrs) do
    attrs = data_to_pairs(attrs)

    kw_list
    |> cast(attrs, [])
    |> cast_embed(:kw_pairs, with: &pairs_changeset/2)
  end

  def data_to_pairs(attrs) do
    {data_key, pairs_key} = typed_keys(attrs)

    pairs = attrs[data_key] || []

    kw_pairs =
      for {k, v} <- pairs do
        %{key: k, value: v}
      end

    Map.put(attrs, pairs_key, kw_pairs)
  end

  def typed_keys(attrs) do
    case Enum.take(attrs, 1) do
      [{k, _}] when is_binary(k) -> {"data", "kw_pairs"}
      _ -> {:data, :kw_pairs}
    end
  end

  def pairs_changeset(schema, attrs) do
    schema
    |> cast(attrs, [:key, :value])
  end

  def convert_keys(%KWList{kw_pairs: pairs} = schema) when is_list(pairs) do
    data =
      for %{key: k, value: v} <- pairs do
        {k, v}
      end

    %{schema | data: data}
  end

  def convert_keys(%KWList{} = schema) do
    schema
  end
end

edit: Note I fixed a bug in the typed_keys case statement, match should be [{k, _}] when is_binary(k), not {k, _} when ...

iex(2)> c = KWList.changeset(%KWList{}, %{data: [{"d", 2}, {"b", 3}]})
#Ecto.Changeset<
  action: nil,
  changes: %{
    kw_pairs: [
      #Ecto.Changeset<
        action: :insert,
        changes: %{value: 2, key: "d"},
        errors: [],
        data: #KWList.Pair<>,
        valid?: true
      >,
      #Ecto.Changeset<
        action: :insert,
        changes: %{value: 3, key: "b"},
        errors: [],
        data: #KWList.Pair<>,
        valid?: true
      >
    ]
  },
  errors: [],
  data: #KWList<>,
  valid?: true
>
iex(3)> Repo.insert(c)
...logs...
{:ok,
 %KWList{
   __meta__: #Ecto.Schema.Metadata<:loaded, "kwlists">,
   id: 2,
   kw_pairs: [
     %KWList.Pair{
       id: "fc0e5ca8-a369-4b48-ae15-a80834bbb122",
       key: "d",
       value: 2
     },
     %KWList.Pair{
       id: "166f7beb-4830-4506-9070-3e4a173aaf89",
       key: "b",
       value: 3
     }
   ],
   data: nil,
   inserted_at: ~U[2024-05-04 08:21:47Z],
   updated_at: ~U[2024-05-04 08:21:47Z]
 }}


iex(4)> Repo.all(KWList) |> Enum.map(&KWList.convert_keys/1)
[debug] QUERY OK source="kwlists" db=6.4ms queue=3.6ms idle=1924.7ms
SELECT k0."id", k0."kw_pairs", k0."inserted_at", k0."updated_at" FROM "kwlists" AS k0 []
↳ :elixir.eval_external_handler/3, at: src/elixir.erl:405
[
  %KWList{
    __meta__: #Ecto.Schema.Metadata<:loaded, "kwlists">,
    id: 2,
    kw_pairs: [
      %KWList.Pair{
        id: "fc0e5ca8-a369-4b48-ae15-a80834bbb122",
        key: "d",
        value: 2
      },
      %KWList.Pair{
        id: "166f7beb-4830-4506-9070-3e4a173aaf89",
        key: "b",
        value: 3
      }
    ],
    data: [{"d", 2}, {"b", 3}],
    inserted_at: ~U[2024-05-04 08:21:47Z],
    updated_at: ~U[2024-05-04 08:21:47Z]
  }

And here’s how it looks in the db:

# \d kwlists
                                          Table "public.kwlists"
   Column    |              Type              | Collation | Nullable |               Default               
-------------+--------------------------------+-----------+----------+-------------------------------------
 id          | bigint                         |           | not null | nextval('kwlists_id_seq'::regclass)
 kw_pairs    | jsonb                          |           |          | 
 inserted_at | timestamp(0) without time zone |           | not null | 
 updated_at  | timestamp(0) without time zone |           | not null | 

# select kw_pairs from kwlists;
                                   kw_pairs                                   
-----------------------------------------------------------------------------------
 [{"id": "uuid1", "key": "d", "value": 2}, {"id": "uuid2", "key": "b", "value": 3}]

03juan · May 4, 2024, 8:28am

Or just:

schema "kwlist" do
  field :keys, {:array, :string}
  field :values, {:array, :integer}
end

and always make sure both arrays are mutated at the same time.

Though converting to known Keyword list with atoms makes traversing to find data much easier than writing your own reducers and manual loops.

arcyfelix · May 4, 2024, 8:41am

With this solution, you could have a scenario where there are more keys than values and vice versa, right?

silverdr · May 4, 2024, 8:41am

Love the attitude - highly appreciated!

Huh, yes - I already was halfway there with array of maps but that seems like an overkill for a theoretically simple problem, doesn’t it?

BTW in Ruby, starting with a don’t remember exactly which version (2.0?) the guys made Ruby hashmaps always preserve the order of keys. That one step made this class of problems as here simply disappear.

03juan · May 4, 2024, 8:47am

As always, it depends on the needs. The embedded schema implementation at least uses the pre-existing Ecto niceties and you can decide what structure makes the most sense for your real business requirements.

Ah yes, good old hashmapwithindifferentaccess, unfortunately we have to contend with erlang’s map implementation in C, but hey, you can always convert the list of tuples to gb_trees

03juan · May 4, 2024, 8:49am

Yes, exactly why the schema should just be a way to store the data in the DB, and you’d write a module for the real data structure that hides these implementation details. Like MapSet is just opaque Map under the hood.

silverdr · May 4, 2024, 9:51am

No, that’s a different thing (and not a Ruby one AFAIR). Indifferent access means that accessing may be done using either atoms or binaries (indifferent) without having to worry how the hashmap was created. What I refer to is LMGTFM… this:

https://ruby-doc.org/3.3.1/Hash.html#class-Hash-label-Entry+Order

I think it was introduced in 2.0 some years ago and I even recall I used it in similar problem and guy called me on that during code review, until I showed that it’s a documented feature already

fmn · May 4, 2024, 10:02am

how quick it is, for that data type is relative to its size (or rather length), because you need to scan it whole or until you will find :b, if those “keys” are unique.

to find it really quickly, you probably would need to use a MapSet or a Map, i guess?

EDIT: or a pattern match if you know its length? but i am not sure how fast it is.

thousandsofthem · May 4, 2024, 10:09am

Very much depends on exact restrictions.
If keys can not be repeated, just do plain map (jsonb).
If you need to store exact value type, serialize it and store as binary blob

:erlang.term_to_binary([a: 5, b: "five", c: :five])

joey_the_snake · May 4, 2024, 10:26am

In my opinion the best way to solve issues like this is to first decide how you would represent it in the database if you weren’t using Ecto. Keep in mind Ecto is just defining a mapping between Elixir and your persistence layer. If you don’t know what you want to do in your persistence layer you don’t know what you want to do in Ecto.

silverdr · May 4, 2024, 10:27am

Keys are unique but their order is important. I’d love to just serialise the Map and be done but in Elixir (Erlang/OTP) that does not guarantee that the order will be preserved.

It needs to fit into an existing JSON structure. While it doesn’t really need to be human-readable, it would still be nice to have it as an array of JSON “objects” along with everything else. But yes, I guess this could be doable this way, although this would be the only unreadable JSON part then.

silverdr · May 4, 2024, 10:35am

That’s how I decided I’d prefer to have it but I ran into limitations. I did implement it as a List of single-element Maps but didn’t like the complexity and ineffectiveness of handling those. Still - after reading the comments and realising (thank you @ruslandoga) that I don’t have to reinvent the wheel this seems like the “least bad” approach, short of custom types

fmn · May 4, 2024, 10:47am

if you want to encode it to native JSON types, 1 to 1 - it’s not possible.
if you are flexible, you can encode it into JSON array of two element arrays.

if you want to encode anything more complex - like “ordered sets” or “indexed lists” , you need to:

find a way to implement your data type in Elixir (there are no types in Elixir as far as i know supporting all the properties you are looking for; maybe you can find a wrapper around Erlang’s gb_tree?)
invent a way of encoding it into JSON and decoding it from JSON (because at “type-to-type” level, types you are looking for are incompatible with JSON types).

open question remains: what’s the cost of integrating it into Ecto?