Postgres recursive CTE on single table

Hello, I have a single SQL table looking as follows:

create table("git_commits", primary_key: false) do
  add :oid, :binary, primary_key: true
  add :parents, {:array, :binary}, null: false
end

Each commit has one or multiple parents. I would like to create a recursive Common Table Expression in order to traverse the entire commit history.

Let’s say I want to start from commit 161b3955043064f33e3684183837fb7160cccfea and traverse all the parent commits.

I came up with the following SQL query:

  WITH RECURSIVE dg AS (
    SELECT oid FROM git_commits c WHERE c.oid = '\x161b3955043064f33e3684183837fb7160cccfea'
    UNION
    SELECT p.oid FROM git_commits AS c JOIN git_commits AS p ON p.oid = ANY(c.parents)
  )
  SELECT count(*) FROM dg;


The query is valid but does not return the right number of parent commits (it returns more than it should).

So my question is how is the first (non-recursive) SELECT statement related to the second?
Is this kind of recursive CTE traversal even possible when working on a single table?

Thank you in advance.


So when writing a recursive CTE, you need to join back to the CTE, which I don’t see you doing here. I believe the second one should be select c.oid from git_commits as c join dg on dg.oid = ANY(c.parents)

First question is why store git commits in Postgres instead of using the existing DB, which is Git (yes, Git is also a database, a document one, but it still counts).

So you want to count how many commits are ancestors of a given commit, do I understand correctly? If so, then:

WITH RECURSIVE dg AS (
  SELECT c.oid, c.parents FROM git_commits c WHERE c.oid = …
  UNION
  SELECT p.oid, p.parents FROM git_commits p INNER JOIN dg ON p.oid = ANY(dg.parents)
)
SELECT count(*) FROM dg;
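As a sanity check on the logic, here is a toy Python model of the same walk (the commit graph is made up purely for illustration; a set plays the role of UNION’s duplicate removal):

```python
# Toy stand-in for the git_commits table: oid -> parent oids
# (hypothetical data, purely for illustration).
commits = {
    "d": ["b", "c"],  # merge commit with two parents
    "b": ["a"],
    "c": ["a"],
    "a": [],          # root commit
}

def ancestors(start):
    """Mimic the CTE: seed the working set with the anchor row,
    then repeatedly join the parents of the rows found so far."""
    dg = {start}       # like the non-recursive SELECT
    frontier = [start]
    while frontier:
        nxt = []
        for oid in frontier:
            for parent in commits[oid]:
                if parent not in dg:  # UNION semantics: keep each row once
                    dg.add(parent)
                    nxt.append(parent)
        frontier = nxt
    return dg

print(len(ancestors("d")))  # 4 — d, b, c and the shared root a, counted once
```

The shared root is reached through both sides of the merge but only kept once, which is why the UNION-based CTE handles merges correctly.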

I’m working on a Git hosting platform and trying to get better performances when browsing repositories. I’m using libgit2 and traversing/walking the commit history is quite slow by design.

When working on big repositories (the Linux kernel source tree, >800,000 commits) it can take up to 10 seconds to compute something like the commit count. I’ve been able to work around quite a few performance bottlenecks using different tricks (clever pagination in combination with Stream, etc.) but it still needs to get faster than that…

I’m currently experimenting with storing/caching some metadata in the database and hope to achieve better results.

Thank you for the query, it works like a charm :hugs: I’m surprised that it counts merges correctly. I would have assumed that it would count twice as many commits each time the recursion hits more than one parent…


UNION removes duplicates by default; to include duplicated entries you need to use UNION ALL. As for the rest of my answer, I need to clone the Linux repo first :wink:

:+1: Yes, I forgot about UNION having the same effect as DISTINCT. My SQL is quite rusty.
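For what it’s worth, the merge case can be sketched in a few lines of Python (made-up graph; `walk_all` mirrors UNION ALL, `walk_union` mirrors UNION):

```python
# Tiny commit graph with one merge: oid -> parent oids (made-up data).
commits = {"d": ["b", "c"], "b": ["a"], "c": ["a"], "a": []}

def walk_all(oid):
    """UNION ALL analogue: every path is followed, so the root "a",
    reachable through both sides of the merge "d", is counted twice."""
    return 1 + sum(walk_all(p) for p in commits[oid])

def walk_union(oid, seen=None):
    """UNION analogue: duplicates are dropped, so each commit
    contributes once no matter how many paths lead to it."""
    seen = set() if seen is None else seen
    if oid in seen:
        return 0
    seen.add(oid)
    return 1 + sum(walk_union(p, seen) for p in commits[oid])

print(walk_all("d"))    # 5 — "a" counted once per path
print(walk_union("d"))  # 4 — "a" counted only once
```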

As my git_commits table is not going to be write-heavy, would it make sense to create a RECURSIVE VIEW instead? And maybe in combination with a MATERIALIZED VIEW?

Can I pass parameters/variables to a VIEW, say pass my required repo_id and oid, and benefit from PG caching the result?

  @doc """
  Returns the number of ancestors for the given `repo_id` and commit `oid`.
  """
  @spec count_ancestors(pos_integer, Git.oid) :: {:ok, pos_integer} | {:error, term}
  def count_ancestors(repo_id, oid) do
    query = """
    WITH RECURSIVE dg AS (
      SELECT c.oid, c.parents FROM git_commits c WHERE c.repo_id = $1 AND c.oid = $2
      UNION
      SELECT p.oid, p.parents FROM git_commits p INNER JOIN dg ON p.oid = ANY(dg.parents)
    )
    SELECT count(*) FROM dg
    """
    case Ecto.Adapters.SQL.query!(DB, query, [repo_id, oid]) do
      %Postgrex.Result{rows: [[count]]} -> {:ok, count}
    end
  end

  @doc """
  Returns the number of ancestors for the given `commit`.
  """
  @spec count_ancestors(t) :: {:ok, pos_integer} | {:error, term}
  def count_ancestors(%__MODULE__{repo_id: repo_id, oid: oid} = _commit) do
    count_ancestors(repo_id, oid)
  end
Or is it even better to write an IMMUTABLE FUNCTION?

My git_commits table is append-only as Git objects are never changed. The function will always return the same ancestor tree for a specific oid.

However it seems that both Git hosting platforms I have checked compute that directly in Git (with some caching).

So maybe, instead of duplicating commit data almost exclusively for counting, it would be better to use a similar approach. I once found information that GitHub uses Git as a DB for most of their data, with a “conventional” DB storing only some metadata.

That’s what I’m doing right now. I use exclusively libgit2 to interact with the underlying git objects.
I currently have two interchangeable storage backends available:

  • :filesystem (default) – .git directory
  • :postgres (experimental) – using git_references and git_objects tables.

The problem is that traversing the commit history is quite slow by design. There is no fast way to query the history of a given pathspec (tree, blob) or even to count commit ancestors.

I’m not quite sure how you would implement caching effectively. I don’t want to cache on demand; the first hit would always be slow…

Currently, I’m experimenting with storing metadata for commit and tree objects, and the performance boost I’ve got so far is impressive.

In the future I would like to extend my :postgres backend to write Git objects to separate tables (references, commits, tags, blobs, trees). Being able to perform more complex queries (authors, committers, query avatars, GPG signatures, ACLs, etc.) directly on the DB would be a :raised_hands:


You can cache on upload once.

You could use git notes attached to commits for storing metadata and cached values, like the commit “depth”. Alternatively, instead of storing all commit data you could just store cached metadata in Postgres, though I would say that would still be a little overkill. But this is how I would solve it; if your way works for you, then OK.

The problem is that this will not scale well over time, as PostgreSQL wasn’t meant to store BLOBs.
