Very slow inner joins on many-to-many tables

I’ve been creating an online platform that allows users to follow other users and tags. The table where you can follow other users is called the Relationships table.

  schema "relationships" do
    belongs_to :follower, Writer, type: :binary_id
    belongs_to :following, Writer, type: :binary_id

    timestamps()
  end

The table where you can follow other tags is called the interested_in_topics table.

  schema "interested_in_topics" do
    belongs_to :writer, Writer, type: :binary_id
    belongs_to :topics, Topics, type: :binary_id
  end

All articles can be placed under up to five tags and this information is stored in the categories_articles table


  schema "categories_articles" do
    belongs_to :categories, Topics, type: :binary_id
    belongs_to :articles, Articles, type: :binary_id
  end

When I want to build the user feed, I get all the articles written by each writer the user follows and all the articles tagged with the topics the user follows, then return the distinct set. The query takes about 15 seconds; is there any way I can speed this up? I’ve already indexed the foreign keys in the many-to-many tables above.

  def list_feed_articles(writer_id, params \\ %{}) do
    query =
      from(articles in Articles,
        preload: [:categories, :writer],
        left_join: r in Relationships,
        on: r.follower_id == ^writer_id,

        left_join: t in InterestedinTopics,
        on: t.writer_id == ^writer_id,
        left_join: ac in CategoriesArticles,
        on: t.topics_id == ac.categories_id,
        where: articles.writer_id == r.following_id or
        articles.writer_id == ^writer_id,
        or_where: ac.articles_id == articles.id,
        order_by: [desc: articles.inserted_at],
        select: [:id, :title, :description, :cover_photo, :writer_id, categories: [:id, :slug, :title],  writer: [:id, :username, :name, :avatar]],
        distinct: [articles.id]
      )

    Repo.paginate(query, params)
  end

When all else fails, running EXPLAIN ANALYZE on your query is the way to go. One way is to call:

iex> {sql, params} = Ecto.Adapters.SQL.to_sql(:all, Repo, query)
iex> Ecto.Adapters.SQL.query!(Repo, "EXPLAIN ANALYZE " <> sql, params)

This blog post will help you understand what it is telling you.

You will want to check for anything that looks like a sequential scan (“Seq Scan”) on anything but articles, which would be a sign that indexes aren’t being used as you expect.
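To make this repeatable, one option is a small helper that runs the plan and prints it line by line. This is a sketch of my own (not from the thread), assuming a Postgres-backed `Repo` and any Ecto queryable:

```elixir
# Hypothetical helper, assuming a Postgres-backed Repo.
defmodule Explain do
  def analyze(repo, query) do
    # Render the Ecto query to the SQL string and its bound parameters.
    {sql, params} = Ecto.Adapters.SQL.to_sql(:all, repo, query)

    repo
    |> Ecto.Adapters.SQL.query!("EXPLAIN ANALYZE " <> sql, params)
    # Each row of an EXPLAIN result is a one-element list holding a plan line.
    |> Map.fetch!(:rows)
    |> Enum.each(fn [line] -> IO.puts(line) end)
  end
end
```

Called as `Explain.analyze(Repo, query)` from iex, it prints the same kind of plan shown later in this thread.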

  1. DO NOT USE LEFT JOIN, especially when you then do distinct. Either use INNER JOIN or use preload to fetch associations.
  2. left_join: t in InterestedinTopics, on: t.writer_id == ^writer_id (and so on) does something completely different from what you want it to do.
  3. I do not see migrations. Do you have proper indices set?
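To expand on point 2: `on: t.writer_id == ^writer_id` compares the joined row against a parameter rather than correlating it with `articles`, so every matching `InterestedinTopics` row gets paired with every article (effectively a cross join), and all real filtering is deferred to the `where`. A hedged sketch of a correlated version (my rewrite, not the original author’s):

```elixir
# Sketch only: correlate each join with the articles row so the database
# can filter while joining, instead of producing an articles × topics
# product that is trimmed afterwards.
join: t in InterestedinTopics, on: t.writer_id == ^writer_id,
join: ac in CategoriesArticles,
  on: ac.categories_id == t.topics_id and ac.articles_id == articles.id
```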
  1. I’ve changed all the left joins to joins and the query speed is still around the same…
  2. I’m trying to use the on: clause to find all the records in the many-to-many InterestedInTopics table that the user has followed. What is the query actually doing? How do I rectify the problem?
  3. I set all my indices after the fact, on all the tables.

Relationships table

  def change do
    create table(:relationships) do
      add :follower_id, references(:writers, on_delete: :delete_all, type: :uuid)
      add :following_id, references(:writers, on_delete: :delete_all, type: :uuid)
      timestamps()
    end

    create unique_index(:relationships, [:follower_id, :following_id])
  end

CategoriesArticles table

  def change do
    create table(:categories_articles) do
      add :categories_id, references(:topics, on_delete: :delete_all, type: :uuid)
      add :articles_id, references(:articles, on_delete: :delete_all, type: :uuid)

      timestamps()
    end

    create unique_index(:categories_articles, [:categories_id, :articles_id])
  end

InterestedInTopics table

  def change do
    create table(:interested_in_topics) do
      add :writer_id, references(:writers, on_delete: :nothing, type: :uuid)
      add :topics_id, references(:topics, on_delete: :nothing, type: :uuid)
    end

    create unique_index(:interested_in_topics, [:writer_id, :topics_id])
  end

Indexing all the tables

def up do
    create index(:categories_articles, [:articles_id])
    create index(:categories_articles, [:categories_id])

    create index(:interested_in_topics, [:writer_id])
    create index(:interested_in_topics, [:topics_id])

    create index(:writes_about_topics, [:topics_id])
    create index(:writes_about_topics, [:writer_id])

    create index(:relationships, [:follower_id])
    create index(:relationships, [:following_id])
  end

This is the EXPLAIN ANALYZE for my query

  ->  Sort  (cost=348867.92..349168.86 rows=120375 width=625) (actual time=219114.862..219542.702 rows=169575 loops=1)
         Sort Key: a0.id, a0.inserted_at DESC
         Sort Method: external merge  Disk: 20688kB
         ->  Nested Loop  (cost=3.05..304969.97 rows=120375 width=625) (actual time=0.516..218052.555 rows=169575 loops=1)
               Join Filter: ((a0.writer_id = r1.following_id) OR (a0.writer_id = 'da0a3e33-a56a-4126-84ae-55be0e95ef8e'::uuid) OR (c3.articles_id = a0.id))
               Rows Removed by Join Filter: 44200425
               ->  Seq Scan on articles a0  (cost=0.00..1542.11 rows=6811 width=625) (actual time=0.036..56.830 rows=6800 loops=1)
               ->  Materialize  (cost=3.05..343.93 rows=2225 width=32) (actual time=0.003..15.606 rows=6525 loops=6800)
                     ->  Nested Loop  (cost=3.05..332.80 rows=2225 width=32) (actual time=0.417..159.986 rows=6525 loops=1)
                           ->  Seq Scan on relationships r1  (cost=0.00..1.04 rows=1 width=16) (actual time=0.016..0.027 rows=3 loops=1)
                                 Filter: (follower_id = 'da0a3e33-a56a-4126-84ae-55be0e95ef8e'::uuid)
                           ->  Hash Join  (cost=3.05..309.51 rows=2225 width=16) (actual time=0.129..42.863 rows=2175 loops=3)
                                 Hash Cond: (c3.categories_id = i2.topics_id)
                                 ->  Seq Scan on categories_articles c3  (cost=0.00..287.00 rows=7300 width=32) (actual time=0.011..19.267 rows=7300 loops=3)
                                 ->  Hash  (cost=2.03..2.03 rows=82 width=16) (actual time=0.332..0.334 rows=81 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 12kB
                                       ->  Seq Scan on interested_in_topics i2  (cost=0.00..2.03 rows=82 width=16) (actual time=0.015..0.179 rows=81 loops=1)
                                             Filter: (writer_id = 'da0a3e33-a56a-4126-84ae-55be0e95ef8e'::uuid)
                                             Rows Removed by Filter: 1
 Planning time: 1.506 ms

Hi @PabloG6, please post your latest Ecto query (list_feed_articles).
You have joined InterestedinTopics on a specific writer_id and then filtered articles by writer_id. Can you try t.writer_id == articles.writer_id?

Here you go @tenzil

  def list_feed_articles(writer_id, params \\ %{}) do
    query =
      from(articles in Articles,
        preload: [:categories, :writer],
        join: r in Relationships,
        on: r.follower_id == ^writer_id,

        join: t in InterestedinTopics,
        on: t.writer_id == ^writer_id,
        join: ac in CategoriesArticles,
        on: t.topics_id == ac.categories_id,
        where: articles.writer_id == r.following_id or
        articles.writer_id == ^writer_id,
        or_where: ac.articles_id == articles.id,
        order_by: [desc: articles.inserted_at],
        select: [:id, :title, :description, :cover_photo, :writer_id, categories: [:id, :slug, :title],  writer: [:id, :username, :name, :avatar]],
        distinct: [articles.id]
      )

    ## I'm using Scrivener to paginate my queries.
    Repo.paginate(query, params)
  end

Any difference from changing t.writer_id == articles.writer_id in the joins?
Also, ac is already joined against articles, so is “or_where: ac.articles_id == articles.id” required?

  1. on: t.writer_id == articles.writer_id made no significant change in the speed of the query.
  2. Your second suggestion reduces the query time by about 20x (super fast), but it also removes a significant chunk of the data from my query. I’ll try to look into my query and see if I can restructure it.

That’s great news :star_struck:. I would suggest you add your data columns to the select explicitly, and even try removing the preload and selecting only the required columns. Do ping your latest SQL query.

I found the answer. I fixed the query by adding a second join condition for the CategoriesArticles many-to-many table. This is my current solution.

  def list_feed_articles(writer_id, params \\ %{}) do
    query =
      from(articles in Articles,
        preload: [:categories, :writer],
        join: t in InterestedinTopics,
        on: t.writer_id == ^writer_id,
        join: ac in CategoriesArticles,
        on:  t.topics_id == ac.categories_id,
        on: articles.id == ac.articles_id,
        join: r in Relationships,
        on: r.follower_id == ^writer_id,
        where: articles.writer_id == r.following_id,
        or_where: articles.writer_id == t.writer_id,
        or_where: articles.id == ac.articles_id,
        order_by: [desc: articles.inserted_at],
        select: [:id, :title, :description, :cover_photo, :writer_id, categories: [:id, :slug, :title],  writer: [:id, :username, :name, :avatar]],
        distinct: [articles.id]
      )

    {string_query, _data} = Ecto.Adapters.SQL.to_sql(:all, Repo, query)
    IO.puts(string_query)
    IO.puts(writer_id)

    Repo.paginate(query, params)
  end

I’m not entirely sure why it works, because I’ve only used SQL within the context of ORMs like Ecto and Django, but if anybody can give an explanation that’d be fantastic. Thanks!
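An attempt at the explanation asked for above, based on my reading of the EXPLAIN output earlier in the thread, so treat it as a sketch: without the second `on:`, nothing ties CategoriesArticles back to articles, so the planner pairs all ~6,800 articles with all ~6,525 joined rows and only afterwards discards the mismatches in the join filter; that is the `Rows Removed by Join Filter: 44200425` line. Attaching `articles.id == ac.articles_id` to the join itself means each article is only ever paired with its own category rows, so the intermediate result shrinks from millions of rows to thousands. Side by side:

```elixir
# Before: the join has no link back to articles, so it behaves like a
# cross join that is trimmed later by where/or_where (the ~44M discarded rows).
join: ac in CategoriesArticles, on: t.topics_id == ac.categories_id

# After: equivalent to the two on: clauses in the fixed query above; the
# article/category match is enforced while joining, not afterwards.
join: ac in CategoriesArticles,
  on: t.topics_id == ac.categories_id and articles.id == ac.articles_id
```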


There are 4 Seq Scans in that query plan. Some can be okay in certain situations, but it’s generally a red flag, especially with 120k+ rows coming back.

Indexing certain columns (usually the join IDs) can help improve the speed. This will become more pronounced on larger data sets.
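In this particular plan the join IDs are already indexed; the remaining Seq Scan is on articles itself, whose writer_id column appears in the join filter and whose inserted_at drives the sort. A hypothetical follow-up migration (column names taken from the query earlier in the thread, not something the poster confirmed running):

```elixir
# Sketch of an additional migration; assumes the articles table
# used throughout this thread.
def change do
  # articles.writer_id is compared against relationships.following_id
  create index(:articles, [:writer_id])
  # may help order_by: [desc: articles.inserted_at] avoid an external sort
  create index(:articles, [:inserted_at])
end
```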
