Looking for reviewers: a howto guide on associations

mfrasca · June 25, 2019, 3:37pm

Hi all. I’m writing a new page on navigating associations, it would become a new chapter in the ecto/guides/howtos.

I would very much appreciate comments and (sharp) constructive criticism, you can add it here, or in the pull request, also as suggested patches, I guess.

The document guides you building a database, adding data to it from SQL, then navigating it using associations, from the iex shell. Then we shift focus on a different type of association, so that I am also showing how to migrate the database schema at each step.

I wish to include migrating the content, too: regarding this last point, the data migration, I do not quite manage to appreciate the relevance of the documentation I am reading.

mfrasca · June 25, 2019, 6:28pm

I stumbled upon this post, looking interesting.

mfrasca · June 25, 2019, 7:23pm

so I have the impression I’m following the logic. the difficulty now here is that I want to update two tables at the same time, so this is more or less what I mean.

I am adding a new table (accessions), and I’m splitting the field plants.name according to the pattern (YYYY.####).(#+), where the trailing digit (or number) after the last dot becomes the new field plants.code, and the leading two parts, except the second dot are the accessions.code. The accession is created on the fly, from the plants data, but it might have already been created from a previous plants record.

My first guess would look like this, but I do not know how to create or lookup a record in a different table than I am scanning.

defmodule Botany.Repo.Migrations.PlantNameToAccessionCode do
  use Ecto.Migration

  def up do
    create table(:accessions) do
      add :code, :string
      add :species: string
      add :orig_quantity, :integer
      add :bought_on, :utc_datetime
      add :bought_from, :string
    end

    alter table(:plants) do
      add :code, :string
      add :accession_id, references(:accessions)
    end

    flush()

    from(p in "plants",
      update: [set: [code:
                     p.name
                     |> (&(Regex.run(~r{(.*)\.([^\.]*)}, &1))).()
                     |> Enum.at(2)]]
    ) |> MyApp.Repo.update_all([])

    alter table(:plants) do
      remove :name
    end
  end
.
.
.
end

now the point is, I see how to alter the plants content, but how do I copy all the field content from plants to accessions, and how do I get accessions.id for the matching record, that I would store in plants.accession_id?

wolfiton · June 25, 2019, 8:12pm

I think this link that shows ecto joins should help you https://www.pompecki.com/post/sql-joins-with-ecto/.

Also maybe do a couple of select with the info you need and stored them in some atoms then add them to the query.

mfrasca · June 26, 2019, 12:27am

hi @wolfiton, it is indeed helpful. I am doing something like this now:

    accessions = from(p in "plants", select: [:id, :bought_on, :bought_from, :name]) |>
      Botany.Repo.all |>
      Enum.map(&(%{id: &1.id, 
                   bought_on:     &1.bought_on,
                   bought_from:     &1.bought_from,
                   code: Enum.at(Regex.run(~r{(.*)\.([^\.]*)}, &1.name), 1),
                  }))
    Botany.Repo.insert_all("accessions", accessions)

now I have a couple of doubts.

one is formal: if there’s a way to |> the accessions list to insert_all, but how do I do that since it should reach it as its second argument?
another is about the accessions list … I need to remove duplicates, I only need the first occurrence of any given code. I would not know how in Elixir.
still an other, about the use of models (Botany.Plant), or straight table names ("plants"). I guess I should not use models in the migrations, since I’m writing the migration today, and at some later stage the model could disappear. or what’s the deal?

about the |>, I could define an ad hoc insert_all function, where the parameters are inverted, but would it be more readable? or are there by any chance also the forms |2>, |3>, etc? lots of syntactic sugar I’m afraid.

mfrasca · June 26, 2019, 2:08am

I have the feeling of being still miles away from understanding the underlying logic.

if I’m in the middle of a migration, I do not see how I can rely upon the data abstraction contained in my modules, given that the migration file is a static object, while the module and its data model can be edited in time. (or do I define a local model, valid only within the migration?)

let’s say that the above “accession” migration works, once I solve the few issues, but how do I update the “plants” table?

get_third = fn(x) -> Enum.at(Regex.run(~r{(.*)\.([^\.]*)}, x), 2) end
collection = from(p in "plants", select: [:id, :name]) |> 
    Botany.Repo.all |>
    Enum.map(fn(%{id: i, name: x})->%{id: i, code: get_third.(x)} end)

and then what?

my earlier guess was “easier”, but the system kept complaining about Enum not being allowed in that position, for at needed be a macro.

from(p in "plants", select: [:id, :name],
      update: [set: [code: Enum.at(Regex.run(~r{(.*)\.([^\.]*)}, p.name), 2),
                     accession_id: p.id]]
    ) |> Botany.Repo.update_all()

the complete error message: ** (Ecto.Query.CompileError) `Enum.at(Regex.run(~r"(.*)\\.([^\\.]*)", p.name()), 2)` is not a valid query expression. If you want to invoke Enum.at/2 in a query, make sure that the module Enum is required and that at/2 is a macro

wolfiton · June 26, 2019, 7:25am

I am sorry but for the moment I can’t help you much.I am also a new to elixir and phoenix and trying to make something first get a grasp of all the ecosystem.

Hope that someone else with more experience will provide further assistance to you.

Best of luck

peerreynders · June 26, 2019, 11:51am

Is at the core of the problem:

Enum.at(Regex.run(~r{(.*)\.([^\.]*)}, p.name), 2)

isn’t something that Ecto can “translate” so that the database can do it (i.e. the database can’t run Elixir code).

PostgreSQL does have regex functionality

=# SELECT substring('2018.0044.2' from '\d+$');
 substring 
-----------
 2
(1 row)
=# SELECT substring('2018.0044.2' from '^\d+\.\d+');
 substring 
-----------
 2018.0044
(1 row)

set expects a value. I don’t know if set will accept a fragment (example).

But the idea is to give Ecto the information it needs so that it can generate a single update statement the database can understand that ends up being something like:

UPDATE plants SET
code = substring(name from '\d+$'),
accession_id = id;

Another option is to just execute the raw SQL.

iex(1)> query =
...(1)>   from(p in "plants",
...(1)>     update: [set: [code: fragment("substring(? from \'\d+$\')", p.name)]],
...(1)>     update: [set: [accession_id: p.id]]
...(1)>   )
#Ecto.Query<from p0 in "plants",
 update: [set: [code: fragment("substring(? from '\d+$')", p0.name)]],
 update: [set: [accession_id: p0.id]]>
iex(2)> Ecto.Adapters.SQL.to_sql(:update_all, Repo, query)
{"UPDATE \"plants\" AS p0 SET \"code\" = substring(p0.\"name\" from '\d+$'), \"accession_id\" = p0.\"id\"",
 []}
iex(3)

Filling the accessions table probably looks something like this

INSERT INTO accessions (id, bought_on, bought_from, code)
(SELECT p.id, p.bought_on, p.bought_from, substring(p.name from '^\d+\.\d+') FROM plants AS p
INNER JOIN (SELECT min(t.id) AS id FROM plants AS t GROUP BY substring(t.name from '^\d+\.\d+')) AS u ON p.id = u.id);

I’m not aware of a way to “build” this type of query from Ecto, so execute and query/4 may be options.

Alternately you could use a query to create the structures you need to run in a separate insert:

iex(1)> sub = from(t in "plants", select: %{id: min(t.id)}, group_by: fragment("substring(? from \'^\d+\.\d+\')", t.name))
#Ecto.Query<from p0 in "plants",
 group_by: [fragment("substring(? from '^\d+.\d+')", p0.name)],
 select: %{id: min(p0.id)}>
iex(2)> query =
...(2)>   from(p in "plants",
...(2)>     select: %{id: p.id, bought_on: p.bought_on, bought_from: p.bought_from, code: fragment("substring(? from \'^\d+\.\d+\')", p.name)},
...(2)>     join: u in subquery(sub), on: u.id == p.id 
...(2)> )
#Ecto.Query<from p0 in "plants",
 join: p1 in subquery(from p0 in "plants",
  group_by: [fragment("substring(? from '^\d+.\d+')", p0.name)],
  select: %{id: min(p0.id)}),
 on: p1.id == p0.id,
 select: %{id: p0.id, bought_on: p0.bought_on, bought_from: p0.bought_from, code: fragment("substring(? from '^\d+.\d+')", p0.name)}>
iex(3)> Ecto.Adapters.SQL.to_sql(:all, Repo, query)
{"SELECT p0.\"id\", p0.\"bought_on\", p0.\"bought_from\", substring(p0.\"name\" from '^\d+.\d+') FROM \"plants\" AS p0 INNER JOIN (SELECT min(p0.\"id\") AS \"id\" FROM \"plants\" AS p0 GROUP BY substring(p0.\"name\" from '^\d+.\d+')) AS s1 ON s1.\"id\" = p0.\"id\"",
 []}
iex(4)>

As such your migration plan still doesn’t quite work because of the accession_id. You should probably

First generate the accessions records from the plants table, auto generating the ids in the accessions table
Populate the plants.accession_id by performing a join between accessions.code and substring(plants.name from '^\d+\.\d+'))
Populate plants.code from substring(plants.name from '\d+$'))

mfrasca · June 26, 2019, 7:37pm

Hi @peerreynders, I just managed to read your message, we’ve been disconnected for the whole morning (I’m in rural Panamá, which is sort of disaster area these days, for no reason at all).

while disconnected I had worked out something similar to your 3-steps schema, except for the middle step, which looks quite more efficient the way you pose it. I’m unsure on your step 1, for how do I make sure we don’t create two records for the same accession code?

what I’m doing now is this (I also just managed to push this to github):

  def up do
    create table(:accession) do
      add :code, :string
      add :species, :string
      add :orig_quantity, :integer
      add :bought_on, :utc_datetime
      add :bought_from, :string
    end

    alter table(:plant) do
      add :code, :string
      add :accession_id, references(:accession)
    end

    flush()

    from("plant", select: [:id, :bought_on, :bought_from, :name, :species, :accession_id, :code]) |>
      Botany.Repo.all |>
      Enum.each( TODO <%%> FILL IN THE BLANKS )

    alter table(:plant) do
      remove :name
      remove :species
    end
  end

and the function to fill in the blanks is here:

    import Ecto.Query
    split_plant_create_accession = fn(plant) ->
      [_, acc_code, plt_code] = Regex.run(~r{(.*)\.([^\.]*)}, plant.name)
      query = from(a in "accession", select: [:id, :code], where: a.code==^acc_code)
      accession = case (query |> Botany.Repo.one()) do
                    nil -> (accession = %{id:         plant.id,
                                         bought_on:   plant.bought_on,
                                         bought_from: plant.bought_from,
                                         code:        acc_code,
                                         species:     plant.species,
                                         };
                      Botany.Repo.insert_all("accession", [accession]);
                      accession)
                    x -> x
                  end
      from(p in "plant", where: p.id==^plant.id, select: p.id) |>
        Botany.Repo.update_all(set: [accession_id: accession.id, code: plt_code])
    end

I’ll experiment with your code, but not the part with raw SQL, I don’t think it would be a good way to explain ecto, telling people to use raw SQL.
I guess the only advantage of the structure in what I wrote is that it’s easy to follow.
fragments are new to me. I’ll have a look, thank you for that, too.

peerreynders · June 26, 2019, 8:12pm

I don’t think it would be a good way to explain ecto, telling people to use raw SQL.

Ecto.Query is a query builder. Knowing SQL helps with Ecto.Query. Personally I tend to start with the SQL that I want and then figure out how to build the query with Ecto so that it can handle the annoying details. And there are edge cases that Ecto doesn’t cover, hence the existence of features like fragment and execute.

how do I make sure we don’t create two records for the same accession code?

GROUP_BY substring( plants.name from '^\d+\.\d+') is preparation for running an aggregate function on each group of records with the same ####.#### name prefix (the accession code - if I understand correctly). The aggregate function is min(plants.id) - meaning we just want the lowest id value for the whole set of records which share the same ####.#### prefix. So the query only returns one id for each set of ####.#### records (the lowest one). So this query simply returns a list of ids, each referring to a unique ####.#### pattern. This query is used as a subquery.
The next query uses that list of ids to create the data necessary for the accession records.
The result of the complete query should be a list of the data that is needed to insert the accession records - one record for each unique ####.#### pattern.

The advantage of this approach is that the database prepares the data for you in a single query against plants - you then just turn around and insert the data it generated into accessions.

mfrasca · June 26, 2019, 8:59pm

hello @peerreynders. I had missed the min part, sorry, it’s clear how it works.

I will probably use my version to explain the steps, highlight the inefficiency, and use your version as a second iteration. I’m not so sure how to make this a beginners friendly text. we will see. thank you for your time!

I will likely use the fragment part for the next migration, linking accession to taxon. And I’m afraid I will use your hint “write the SQL first!” this time.

mfrasca · June 26, 2019, 10:05pm

I don’t know, but what I see is that we are dealing with schema-less queries, so we should be able to handle the result to a insert_all. Let me try … yes, this part works fine. Except for the order of the arguments, since the collection goes second, so I can’t pipe |> it. Unless you know how to do that.

acl = from(t in "plant", 
    select: %{id: min(t.id), 
              code: fragment("substring(? from '^\\d+\\.\\d+')", t.name)}, 
    group_by: fragment("substring(? from '^\\d+\\.\\d+')", t.name)) |> 
  Botany.Repo.all
Botany.Repo.insert_all("accession", acl)

(oh, sorry, I was getting so confused with plural forms, that I switched to singulars)

mfrasca · June 26, 2019, 10:32pm

I don’t see how. I have this SQL, but how do I make this Ecto-like?
update plant p set accession_id=(select id from accession where code=substring(p.name from '^\d+\.\d+'));
or you meant something else?

anyhow, I would like to help the reader, and I fear we’re making things more complex. which is definitely being useful to me. I’ll close it for today.

peerreynders · June 27, 2019, 12:19am

I don’t see how.

UPDATE plant SET
  accession_id = sq.accession_id
FROM (
  SELECT
    a.id AS accession_id,
    p.id AS plant_id
  FROM
    accession AS a
    INNER JOIN plant AS p ON a.code = substring(p.name from '^\d+\.\d+')
) AS sq
WHERE
  id = sq.plant_id

Again, I’m not aware how to do that directly in Ecto, but … something like

def update_plant_accession(%{accession_id: accession_id, plant_id: plant_id}) do
  query =
    from(p in "plant",
      update: [set: [accession_id: ^accession_id]],
      where: p.id == ^plant_id
    )
  
  Repo.update_all(query, [])
end

def select_plant_accession() do
  query =
    from(p in "plant",
      join: a in "accession", on: fragment("substring(? in \'\d+\.\d+\')", p.name) == a.code,
      select: %{plant_id: p.id, accession_id: a.id}
    )
  
  Repo.all(query)
end

def run(), 
  do: Enum.each(select_plant_accession(), &update_plant_accession/1)

so I can’t pipe |> it. Unless you know how to do that.

Don’t worry about the pipeline operator. It’s supposed to be a convenience, not an obligation/stylistic obsession.

Also stylistically you are supposed to avoid single pipes:

What bothers me about your version is that you are inserting the plant.id as the accession.id and I expect that is going to create problems with the autogenerated Accession.id. Also it may be easier to just use distinct/3. Something like …

def select_plant_accession_codes() do
  query =
    from(p in "plant",
      select: %{code: fragment("substring(? in \'\d+\.\d+\')", p.name)},
      distinct: fragment("substring(? in \'\d+\.\d+\')", p.name)
    )

  Repo.all(query)
end

def run(), 
  do: Repo.insert_all("accession", select_plant_accession_codes())

mfrasca · June 27, 2019, 12:49am

looks like that the fragment approach works here, too.
but what is the biggest counter-indication? there must be a counter indication, not?

q = from(p in "plant", 
  update: [set: [
    accession_id: fragment(~S"(select id from accession where code=substring(? from '^\d+\.\d+'))",
                           p.name)]])  
Botany.Repo.update_all(q, [])

there is no need to select anything either, not even the name we use in the fragment.

mfrasca · June 27, 2019, 12:50am

it bothers me to, only slightly though. but I’ll consider it. thank you.

peerreynders · June 27, 2019, 12:55am

there must be a counter indication, not?

In my opinion the fragment is now so large that you are close to issuing raw SQL.

And if we are issuing raw SQL I prefer my version as there is only a single SELECT - your version issues a separate SELECT for each row to be updated.

mfrasca · June 27, 2019, 1:58pm

it might be interesting doing some timings. my version has indeed a select matching a single row in the accession table for each plant row, but isn’t it the same as your inner join? we are both working with or without an index (I should include in some of the add instruction that we want indexes), so I’m not sure there should be any difference. but I have no clue how to measure it.

mfrasca · June 27, 2019, 8:00pm

I’ve been editing the text, trying to summarize the hints I received here.
if you from Elixir/Ecto want to express your opinion on whether it fits in the documentation site?
some native speaker reviewing the English form?
I even tried to put a capital letter at the beginning of each sentence!
about formatting, I would not know.

about the subject I chose, well, I’m using this myself, and to do this it felt as if I had to cover practically all possible aspects of migrations and associations. apart from the planned one_to_one, I’m anyway not done yet: I miss the ‘on delete -> action’, the length of the strings, and we definitely need to add links to reference material.

so let me renew the invitation to come to the pull request and add additional hints or commits there. you can also put a pull request on my fork if that’s easier for you.

peerreynders · June 29, 2019, 12:11am

Benchmarking will give more clarity in each particular case.

A priori I’m guided by the following heuristics:

prefer joins over subqueries
prefer one subquery over many

That second one leads to a particular style where rather than using subqueries to “fetch that one value I need”, one gathers all the necessary data with one query into something like a “virtual table” that then can be used to drive the selection of further data in the query that wraps it.

So

  SELECT
    a.id AS accession_id,
    p.id AS plant_id
  FROM
    accession AS a
    INNER JOIN plant AS p ON a.code = substring(p.name from '^\d+\.\d+')

is used to generate the needed result set of (plant_id, accession_id) tuples - which in turn can drive what the wrapping UPDATE needs to do:

UPDATE plant SET
  accession_id = sq.accession_id
FROM (
  ...
  (plant_id, accession_id)
  ...
) AS sq
WHERE
  id = sq.plant_id

There aren’t any guarantees performance-wise but more often than not, it’s not too bad.