Ecto produces duplicate result

vfsoraki · February 6, 2020, 6:59pm

Here is the query I’m using to fetch news with some tags:

    query =
      from(n in News,
        join: t in assoc(n, :tags),
        where: like(t.name, ^"%#{tag}%"),
        order_by: [desc: n.published_at],
        limit: ^limit,
        preload: [:tags]
      )

    Repo.all(query)

I don’t know if I’m doing something wrong or not. But the result contains duplicate result. I’m using binary IDs and checking result of query, the duplicates are really duplicates with the same ID.

I also googled, and found some issues on Github but they seemed unrelated to my quite simple query.

Any help is appreciated.

EDIT: I just noticed that some result are repeated 3 times.

EDIT2: If you are here because you have the same problem, read until the end of topic. It contains some good insights.

jwarlander · February 6, 2020, 7:29pm

When this happens to me, I usually assume that something fishy is going on with a join somewhere… If you’ve got multiple matches (eg. a news article matches multiple tags), you’ll get a row per match.

If that’s NOT the case… Well, then it’s harder to say without seeing more of the code

vfsoraki · February 6, 2020, 8:03pm

By your reply I assume Ecto does not dedup query result? I thought it would do it internally, since I’m specifically querying an schema with id field.

andreaseriksson · February 6, 2020, 8:07pm

This is because you are selecting all new and it seems that every news row can have multiple tags.

This means that if one news article has 3 tags, ot will be 3 rows.

So, I think it might be as simple as remove the join and just use preload. That should spawn 2 queries. One for getting all the news, and one for getting all tags belonging to the list of news ids.

So, try that.

benwilson512 · February 6, 2020, 8:18pm

If you do preload: [tags: t] I think it dedups, cause it knows the joined tag rows are supposed to be under each news item.

andreaseriksson · February 6, 2020, 8:23pm

I can confirm that this works. Just tested. However, I don’t like it since the query, if you run it in a sql-terminal returns duplicate rows.
I would have gone with an EXISTS-subquery

vfsoraki · February 6, 2020, 8:35pm

This surely was the solution to dedup results, but another thing happens. The limit in query is not applied as one would think.
The query will return 10 rows for example, but then Ecto dedups them and makes the number smaller.
Also, when doing preload: [:tags] Ecto performs two queries, while using preload: [tags: t] results in one query.
I think I need to refactor my query.

vfsoraki · February 6, 2020, 8:37pm

Yes it did work, But see my last reply for how it turned out.
Care to explain how I can use an exists subquery please?

andreaseriksson · February 6, 2020, 8:42pm

That was what I meant with that I don’t like the solution. You can’t use it with limit and offsets (pagination) since Ecto does magic behind the scenes. I would instead to:

where: fragment("EXISTS(SELECT 1 FROM tags WHERE tags.news_id = n0.id AND tags.name = ?)", ^"%#{tag}%")

benwilson512 · February 6, 2020, 8:43pm

Right I mean this is how the raw SQL works too. Limit 10 wont’ do 10 tags per news item, it limits the total result set to 10 rows. Ecto’s querying is basically 1:1 with the underlying SQL.

vfsoraki · February 6, 2020, 8:45pm

The thing about this fragment is using n0. Not elegant.
Is there a way to use Ecto’s subquery with exists?

vfsoraki · February 6, 2020, 8:47pm

And, I forgot to mention that the association is many-to-many with a news_tag table

I got my query with exists working anyway, but still waiting to see if Ecto’s subqueries can be used instead of plain SQL.

andreaseriksson · February 6, 2020, 8:50pm

I think you can just do:

fragment("EXISTS(SELECT 1 FROM tags WHERE tags.news_id = ? AND tags.name = ?)", n.id, ^"%#{tag}%")

andreaseriksson · February 6, 2020, 8:52pm

And that is why you should avoid join on a has_many relationship. That is almost never the outcome you want. In that case, you would query from tags and join to news.

vfsoraki · February 6, 2020, 8:55pm

If I understand correctly, you mean to perform two separate queries, first find tags with some criteria, and then use their IDs to select the news in another query.
This would make 3 queries in my case (many-to-many).
I am currently trying to implement the exists with an Ecto subquery, but it is failing (maybe obviously).

andreaseriksson · February 6, 2020, 8:59pm

Actually, no. One query should be enough. Something like:

query =
      from(n in News,
        where: fragment("EXISTS(SELECT 1 FROM tags INNER JOIN news_tags ON tag_id = tags.id WHERE  news_tags.news_id = ? AND tags.name = ?)", n.id, ^"%#{tag}%"),
        order_by: [desc: n.published_at],
        limit: ^limit,
        preload: [:tags]
      )

    Repo.all(query)

Also, don’t shy away from sql, The database is the truth, and Ecto an abstraction. If its too complicated to convert to Ecto, just go with fragment

vfsoraki · February 6, 2020, 9:02pm

Yes, thanks. I already got that working. Here is what is working for me now.

query =
      from(n in News,
        where: fragment("EXISTS(select id from tags as t join news_tag nt on t.id = nt.tag_id where nt.news_id = n0.id and t.name like ?)", ^"%#{tag}%"),
        order_by: [desc: n.published_at],
        limit: ^limit,
        preload: [:tags]
      )

Literally the same thing as you wrote

Thanks again to both of you.

andreaseriksson · February 6, 2020, 9:03pm

Perfect! Glad to help!

dimitarvp · February 6, 2020, 9:05pm

Mark the guy’s comment as the solution for the future forum readers, if that’s okay?

vfsoraki · February 6, 2020, 9:07pm

I already marked the solution, the conversation showed me that I was doing something wrong. I think the marked solution is the answer of the topic. I also added a note to first post to show that this topic contains more information.