Preloading nested associations leads to fewer results

Hello.

I want to mass-dump orders into CSV reports. Simply put, when I combine left_join and preload clauses in a single Ecto.Query for nested associations, I get fewer results than when I only preload direct associations. My test uses an offset of 0 and a limit of 1000. Tinkering with disabling various nested associations changes the number of received records anywhere from 10 to 300. A simplified example follows below.

Please ignore the anti-pattern of using offset and limit for now. I am asking my question here because I can't figure out how to properly use Repo.stream to mass-encode orders as CSV rows, but that takes a backseat for now. The problem at hand is that my first simplified code (which doesn't use Repo.stream) can't even rely on receiving 1000 records per batch (except the last one, of course).

(NOTE: I am using PostgreSQL 9.6)

defmodule Context.Order do
  import Ecto.Query

  def eager(query) do
    from(x in query,
      left_join: user in assoc(x, :user),
      left_join: verification_details in assoc(user, :verification_details),
      preload: [
        user: {user, verification_details: verification_details}
      ]
    )
  end

  def crippled_eager(query) do
    from(x in query,
      left_join: user in assoc(x, :user),
      preload: [
        user: user
      ]
    )
  end

  def payment_captured(query) do
    from(x in query, where: x.payment_state in ^~w(paid credit_owed balance_due))
  end
end

The structure is a boringly classic one: Order has a user_id column and VerificationDetail has a user_id column as well, so an order has one user, which in turn has many verification details. I want to preload all of that with a single query, and I followed the common wisdom. A sketch of the schemas follows below.
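For reference, a minimal sketch of the schemas involved (the module names here are my shorthand, not the real ones):

defmodule MyApp.Order do
  use Ecto.Schema

  schema "orders" do
    field :payment_state, :string
    belongs_to :user, MyApp.User                              # orders.user_id
  end
end

defmodule MyApp.User do
  use Ecto.Schema

  schema "users" do
    has_many :verification_details, MyApp.VerificationDetail
  end
end

defmodule MyApp.VerificationDetail do
  use Ecto.Schema

  schema "verification_details" do
    belongs_to :user, MyApp.User                              # verification_details.user_id
  end
end

However, in iex: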

iex> records1 = Order |> Context.Order.payment_captured |> Context.Order.eager |> offset(0) |> limit(1000) |> Repo.all
iex> IO.puts length(records1)
268
:ok
iex> records2 = Order |> Context.Order.payment_captured |> Context.Order.crippled_eager |> offset(0) |> limit(1000) |> Repo.all
iex> IO.puts length(records2)
1000
:ok

(One observation: since the Elixir app I am creating is aimed to be a partial clone of a Rails Spree mastodon, I can compare the SQL queries between Ecto and ActiveRecord. I noticed that AR does inner joins on all immediate associations and left outer joins on all nested associations, while Ecto only does left outer joins.)
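For illustration only, a mix of join types like AR's would be written in Ecto roughly as follows (same bindings as my eager/1 above; join: produces an INNER JOIN, left_join: a LEFT OUTER JOIN):

from(x in query,
  join: user in assoc(x, :user),                          # INNER JOIN on the immediate association
  left_join: vd in assoc(user, :verification_details),    # LEFT OUTER JOIN on the nested association
  preload: [user: {user, verification_details: vd}]
)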

Any idea why the difference in record counts above might happen at all? I am out of ideas. Also, I commented out the preload part of the eager query and it still yields fewer than 1000 records (so apparently the problem is the left_join on a nested association?). So, really, what gives?

Any hints are greatly appreciated!
// Dimi.

I would probably look at what SQL is generated by Ecto and check by hand whether it is correct. I would also write the SQL by hand to return what I need and compare that query with the one generated by Ecto. Sadly, I'm not proficient enough in Ecto to tell on the spot what may be wrong.

The problem is that Ecto.Adapters.SQL.to_sql(:all, MyRepo, my_query) gives me non-executable SQL: ..."payment_state" = ANY($1) ..., with the parameters listed separately as ["paid", "credit_owed", "balance_due"]. When I copy the entire SQL expression and manually correct that to ..."payment_state" IN ('paid', 'credit_owed', 'balance_due'), and execute it through Ecto or inside psql, both queries above work as expected.


Still, I tried it again. The only adjustment is the one noted above: correcting the pseudo-SQL = ANY($1) to IN (...). With that, both queries return 1000 rows in psql.

Hmm, I’m unsure myself… The output result should be the same count (though different structure)… @michalmuskala any ideas?

Update, after implementing a recommendation from @josevalim:

He advised me to take the query that yielded fewer than 1000 results, extract its SQL into a variable, and execute it like so:

query = Order |> Context.Order.payment_captured() |> Context.Order.eager() |> offset(0) |> limit(1000)

# Render the Ecto query to raw SQL plus its bound parameters
{sql, params} = Ecto.Adapters.SQL.to_sql(:all, Repo, query)

# Execute the raw SQL directly, bypassing Ecto's struct loading and preloader
results = Ecto.Adapters.SQL.query!(Repo, sql, params)
length(results.rows)

That returns 1000, as expected. José told me that if that approach returns the proper number of rows but the original Ecto query doesn't, it means there is a bug in the preloader.

How should we proceed in helping the Ecto contributors fix this? Obviously I can just use the approach above, and I am mostly OK with it, but how will this forum thread help a future fix? I cannot just dump a customer's database into a bug report.

Honorable mention / CC: @michalmuskala


The easiest way for us to fix it is if we have a mechanism to reproduce the error. It can even be a small application that we would fetch and run.

I will try to devise such an app. I can also take a part of the DB and obfuscate it to serve as a demonstration of the bug. I'll need time; this is obviously a side project and I have to make sure my paid duties come first.

I’ll get back to you and Michal in time.


Who can I contact to provide a project and DB sample to help with the bugfix in Ecto?

@josevalim? @michalmuskala?


If the repo cannot be public, you can send me a direct message on the forum. If it can be public, just open up an issue in the repo!

@josevalim invited you to a private session that I initiated with @michalmuskala a few days ago. I can't provide the DB publicly, so it has to be like that.

Quoting @josevalim

So basically, I shouldn't do left_joins and should just do normal preloading, which runs number_of_assocs + 1 queries. Already tested, and it works perfectly.
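For reference, the plain preload version I ended up testing looks roughly like this (a sketch; Ecto issues one extra query per association instead of joining):

orders =
  Order
  |> Context.Order.payment_captured()
  |> offset(0)
  |> limit(1000)
  |> preload(user: :verification_details)
  |> Repo.all()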

Or use subqueries, which will usually be faster for most data.

Would you please give an example? I know subqueries can be used, just not sure how in this context.

You can basically do the same query you'd do with a preload as a subquery (or potentially just a join in most cases) and join them together, as long as the primary (non-preload) data is small enough. When that data is too large, or the preloads are too rare, it can be better to preload as a secondary step. It just depends on your data workload.
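Roughly, something like the sketch below (untested; it assumes an Ecto version that supports preloading from a subquery join, and reuses the shorthand schema names from earlier):

def eager_via_subquery(query) do
  # the same query you would otherwise hand to preload, as a subquery
  details = from(vd in MyApp.VerificationDetail)

  from(x in query,
    left_join: user in assoc(x, :user),
    left_join: vd in subquery(details), on: vd.user_id == user.id,
    preload: [user: {user, verification_details: vd}]
  )
end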

Eh. To be fair, I didn't want to delve too deep into the data layer, but you can't get away with that every time.

Maybe I'll revisit this problem one day. I am a curious guy, and I sometimes make mighty stupid mistakes, as is apparent in the solution to my problem linked above. Still, I rarely leave stones unturned.

However, right now the classic preload works much better than I expected. I initially started this mini project because of an over-obsession with eager data loading and the absolute minimum of DB queries. It turns out that I can export ~571k records to a CSV report in 15 minutes, of which 6 are spent iterating the DB via a PG cursor (I think Repo.stream uses one, but I never checked) and then loading all the associations I am interested in.
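In very rough strokes, the final export is shaped like this (a simplified sketch; CSV.encode is assumed to come from the csv package, and export_row/1 stands in for the actual row building):

def orders_to_csv(path) do
  # Repo.stream must run inside a transaction
  Repo.transaction(fn ->
    Order
    |> Context.Order.payment_captured()
    |> Repo.stream(max_rows: 500)                 # walks the result set via a PG cursor
    |> Stream.chunk_every(500)                    # preload in batches rather than per record
    |> Stream.flat_map(&Repo.preload(&1, user: :verification_details))
    |> Stream.map(&export_row/1)                  # turn each order into a list of CSV fields
    |> CSV.encode()
    |> Stream.into(File.stream!(path))
    |> Stream.run()
  end, timeout: :infinity)
end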

And my final code turned out quite alright and impressively quick (especially compared to the Rails counterpart, which runs for ~130 minutes in the best-case scenario). Plus, a big set of joins didn't turn out to be more efficient in this case; even loading a single record took anywhere from 100 to 1100 ms.

Not worth it! I won't pursue it further for now. The current solution is good enough.