Postgres recursive CTE on single table

Hello, I have a single SQL table looking as follows:

create table("git_commits", primary_key: false) do
  add :oid, :binary, primary_key: true
  add :parents, {:array, :binary}, null: false
end

Each commit has one or multiple parents. I would like to create a recursive Common Table Expression in order to traverse the entire commit history.

Let’s say I want to start from commit 161b3955043064f33e3684183837fb7160cccfea and traverse all the parent commits.

I came up with the following SQL query:

  WITH RECURSIVE dg AS (
    SELECT oid FROM git_commits c WHERE c.oid = '\x161b3955043064f33e3684183837fb7160cccfea'
    UNION
    SELECT p.oid FROM git_commits AS c JOIN git_commits AS p ON p.oid = ANY(c.parents)
  )
  SELECT count(*) FROM dg;


The query is valid but does not return the right number of parent commits (it returns more than it should).

So my question is how is the first (non-recursive) SELECT statement related to the second?
Is this kind of recursive CTE traversal even possible when working on a single table?

Thank you in advance.


So when writing a recursive CTE, you need to join back to the CTE, which I don’t see you doing here. I believe the second one should be select c.oid from git_commits as c join dg on dg.oid = ANY(c.parents)

First question is why store git commits in Postgres instead of using the existing DB, which is Git (yes, Git is also a database, a document one, but it still counts).

So you want to count how many commits are ancestors of a given commit, do I understand correctly? If so, then:

WITH RECURSIVE dg AS (
  SELECT c.oid, c.parents FROM git_commits c WHERE c.oid = …
  UNION
  SELECT p.oid, p.parents FROM git_commits p INNER JOIN dg ON p.oid = ANY(dg.parents)
)
SELECT count(*) FROM dg;
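As a sanity check on the logic, here is a toy Python model of the same walk (the commit graph is made up purely for illustration; a set plays the role of UNION’s duplicate removal):

```python
# Toy stand-in for the git_commits table: oid -> parent oids
# (hypothetical data, purely for illustration).
commits = {
    "d": ["b", "c"],  # merge commit with two parents
    "b": ["a"],
    "c": ["a"],
    "a": [],          # root commit
}

def ancestors(start):
    """Mimic the CTE: seed the working set with the anchor row,
    then repeatedly join the parents of the rows found so far."""
    dg = {start}       # like the non-recursive SELECT
    frontier = [start]
    while frontier:
        nxt = []
        for oid in frontier:
            for parent in commits[oid]:
                if parent not in dg:  # UNION semantics: keep each row once
                    dg.add(parent)
                    nxt.append(parent)
        frontier = nxt
    return dg

print(len(ancestors("d")))  # 4 — d, b, c and the shared root a, counted once
```

The shared root is reached through both sides of the merge but only kept once, which is why the UNION-based CTE handles merges correctly.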

I’m working on a Git hosting platform and trying to get better performances when browsing repositories. I’m using libgit2 and traversing/walking the commit history is quite slow by design.

When working on big repositories (the Linux kernel source tree, >800,000 commits) it can take up to 10 seconds to compute something like the commit count. I’ve been able to work around quite a few performance bottlenecks using different tricks (clever pagination in combination with Stream, etc.) but it still needs to get faster than that…

I’m currently experimenting with storing/caching some metadata in the database and hope to achieve better results.

Thank you for the query, it works like a charm :hugs: I’m surprised that it counts merges correctly. I would have assumed that it would count twice as many commits each time the recursion hits more than one parent…


UNION removes duplicates by default; to include duplicated entries you need to use UNION ALL. As for the rest of my answer, I need to clone the Linux repo first :wink:

:+1: Yes, I forgot about UNION having the same effect as DISTINCT. My SQL is quite rusty.
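For what it’s worth, the merge case can be sketched in a few lines of Python (made-up graph; `walk_all` mirrors UNION ALL, `walk_union` mirrors UNION):

```python
# Tiny commit graph with one merge: oid -> parent oids (made-up data).
commits = {"d": ["b", "c"], "b": ["a"], "c": ["a"], "a": []}

def walk_all(oid):
    """UNION ALL analogue: every path is followed, so the root "a",
    reachable through both sides of the merge "d", is counted twice."""
    return 1 + sum(walk_all(p) for p in commits[oid])

def walk_union(oid, seen=None):
    """UNION analogue: duplicates are dropped, so each commit
    contributes once no matter how many paths lead to it."""
    seen = set() if seen is None else seen
    if oid in seen:
        return 0
    seen.add(oid)
    return 1 + sum(walk_union(p, seen) for p in commits[oid])

print(walk_all("d"))    # 5 — "a" counted once per path
print(walk_union("d"))  # 4 — "a" counted only once
```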

As my git_commits table is not going to be write-heavy, would it make sense to create a RECURSIVE VIEW instead? And maybe in combination with a MATERIALIZED VIEW?

Can I pass parameters/variables to a VIEW, say pass my required repo_id and oid, and benefit from PG caching the result?

  @doc """
  Returns the number of ancestors for the given `repo_id` and commit `oid`.
  """
  @spec count_ancestors(pos_integer, Git.oid) :: {:ok, pos_integer} | {:error, term}
  def count_ancestors(repo_id, oid) do
    query = """
    WITH RECURSIVE dg AS (
      SELECT c.oid, c.parents FROM git_commits c WHERE c.repo_id = $1 AND c.oid = $2
      UNION
      SELECT p.oid, p.parents FROM git_commits p INNER JOIN dg ON p.oid = ANY(dg.parents)
    )
    SELECT count(*) FROM dg
    """
    case Ecto.Adapters.SQL.query!(DB, query, [repo_id, oid]) do
      %Postgrex.Result{rows: [[count]]} -> {:ok, count}
    end
  end

  @doc """
  Returns the number of ancestors for the given `commit`.
  """
  @spec count_ancestors(t) :: {:ok, pos_integer} | {:error, term}
  def count_ancestors(%__MODULE__{repo_id: repo_id, oid: oid} = _commit) do
    count_ancestors(repo_id, oid)
  end
Or is it even better to write an IMMUTABLE FUNCTION?

My git_commits table is append-only as Git objects are never changed. The function will always return the same ancestor tree for a specific oid.

However it seems that both Git hosting platforms I have checked compute that directly in Git (with some caching).

So maybe, instead of duplicating commit data almost exclusively for counting, it would be better to use a similar approach. I once found information that GitHub uses Git as a DB for most of their data, with a “conventional” DB storing only some metadata.

That’s what I’m doing right now. I use exclusively libgit2 to interact with the underlying git objects.
I currently have two interchangeable storage backends available:

  • :filesystem (default) – .git directory
  • :postgres (experimental) – using git_references and git_objects tables.

The problem is that traversing the commit history is quite slow by design. There is no fast way to query the history of a given pathspec (tree, blob) or even to count commit ancestors.

I’m not quite sure how you would implement caching effectively. I don’t want to cache on demand; the first hit would always be slow…

Currently, I’m experimenting with storing metadata for commit and tree objects, and the performance boost I’ve got so far is impressive.

In the future I would like to extend my :postgres backend to write Git objects to separate tables (references, commits, tags, blobs, trees). Being able to perform more complex queries (authors, committers, query avatars, GPG signatures, ACLs, etc.) directly on the DB would be a :raised_hands:


You can cache on upload once.

You could use git notes attached to commits for storing metadata and cached values, like the commit “depth”. Alternatively, instead of storing all commit data you could just store cached metadata in Postgres, though I would say that would still be a little overkill. But this is how I would solve it; if your way works for you, then OK.

The problem is that this will not scale well over time, as PostgreSQL wasn’t meant to store BLOBs.
