Storing a list of positions in a column with Ecto

Harrygr · January 23, 2022, 5:37pm

I’d like to store a list of timeseries data in a column in my Ecto schema. Let’s call it positions.

I’ve tryed using embeds_schema and this works pretty well, but it does seem to be a bit inefficient as each item in the list has a UUID and keys for the position fields. E.g.

%PositionSample{
    id: "e7de8428-5d71-4b07-beb2-3a772407f262",
    lat: 51.41985923051834,
    lng: -0.06825794465839863,
    timestamp: ~U[2021-11-21 11:10:59Z]
  }

There can be many PositionSamples for a single record so I figured it might be worth it to try and optimize this a bit.
I thought I might try and store the samples as a list of tuples (serialized as arrays when in json format). E.g.

["2021-11-21 11:10:59Z", [51.41985923051834, -0.06825794465839863]]

is this worth it, or is this a case premature optimisation?
What tradeoffs are there? With embedded schemas it seems like Ecto handle the json parsing and casting to schema, as well as UUID generation. I guess I’d have to implement the appropriate types etc.

shad · January 23, 2022, 8:26pm

You can prevent Ecto from generating an id (the UUID in your case, which is, by default, the primary key), using the schema attributes @primary_key. In your case, if i understand correctly, you’d do:

@primary_key false
embedded_schema do
# schema definition
end

Concerning premature optimisations, i’d say yes (assuming you’re using Postgres). In any case, considering you seem to have a well defined list of fields, meaning, JSON fields in Postgres strength to handle difficult to predict data structure isn’t really the needed tool here, i’d go for another table before anything else, if i was concerned with performance. For a small set of data though, that you don’t need to query directly (or too often, or that require doing things like using geospatial data in your case), it’s fine.

dimitarvp · January 23, 2022, 9:48pm

If you are using PostgreSQL it does quite fine with storing and searching JSON-like objects so I’d intuitively bet that you are doing premature optimization here.

kartheek · January 24, 2022, 1:02am

Few questions:

How often do you create these positions or you just import from some other source.
Will a position sample be updated after it is created?
How big can these positions grow inside a row - saw if you are embedding in a jsonb.
What kind of queries are you going to run on the data.

JSONB Considerations

I have cloned Kinto in Elixir using Ecto and Postgres. Some of my observations after working on that project:

Storing is easy in Ecto.
Json schema validation can be using libraries by JsonXema.
Queries involving JSON data are little tricky using Ecto.
Query involving a json attribute will use fragment()

Some sample queries:

op == Filter.contains() and is_list(value) ->
    query
    |> where([o], not is_nil(fragment("? #> ?", o.data, ^fields)))
    |> where([o], fragment("? #> ? @> ?", o.data, ^fields, ^value))

# array types
op == Filter.contains_any() and is_list(value) ->
    query
    |> where([o], not is_nil(fragment("? #> ?", o.data, ^fields)))
    |> where([o], fragment("jsonb_typeof(? #> ?) = 'array'", o.data, ^fields))
    |> where(
          [o],
          fragment(
            "(SELECT array_agg(elems) FROM jsonb_array_elements(? #> ?) elems) && (?)::jsonb[]",
            o.data,
            ^fields,
            ^value
          )
        )

Indexes can be created on JSONB attributes.

Write some queries if you have to use JSONB in plain SQL and try them from ecto.

03juan · January 24, 2022, 6:46am

If you’re storing timescale and position data one would assume that at some point you’d want to filter that data by location or time windows, in which case I’d definitely go for a separate table for easier bound, indexed lookups. And probably with PostGIS if location searches are part of your core business model.

Harrygr · January 24, 2022, 9:40am

shad:

You can prevent Ecto from generating an id (the UUID in your case, which is, by default, the primary key), using the schema attributes @primary_key. In your case, if i understand correctly, you’d do:
@primary_key false
embedded_schema do
# schema definition
end
Concerning premature optimisations, i’d say yes (assuming you’re using Postgres). In any case, considering you seem to have a well defined list of fields, meaning, JSON fields in Postgres strength to handle difficult to predict data structure isn’t really the needed tool here, i’d go for another table before anything else, if i was concerned with performance. For a small set of data though, that you don’t need to query directly (or too often, or that require doing things like using geospatial data in your case), it’s fine.

This seems like the best approach. It’s nice that embedded schemas give you validation and casting functionality so it makes sense to use it.

Harrygr · January 24, 2022, 9:44am

I’m currently using a separate table for points, but insertion is slow and querying can return thousands of rows. The points represent a path will only ever be used collectively.

I think PostGIS etc is overkill for my use case. I don’t have any need to search or filter by location.