MarioFlach
Postgres recursive CTE on single table
Hello, I have a single SQL table looking as follow:
create table("git_commits", primary_key: false) do
add :oid, :binary, primary_key: true
add :parents, {:array, :binary}, null: false
end
Each commit has one or multiple parents. I would like to create a recursive Common Table Expressions in order to traverse the entire commit history.
Let’s say I want to start from commit 161b3955043064f33e3684183837fb7160cccfea and traverse all the parent commits.
I came with the following SQL query:
WITH RECURSIVE dg AS (
SELECT oid FROM git_commits c WHERE c.oid = '\x161b3955043064f33e3684183837fb7160cccfea'
UNION
SELECT p.oid FROM git_commits AS c JOIN git_commits AS p ON p.oid = ANY(c.parents)
)
SELECT COUNT(oid) FROM dg;
The query is not invalid but does not return the number of parents (it returns more that it should).
So my question is how is the first (non-recursive) SELECT statement related to the second?
Is this kind of recursive CTE traversal even possible when working on a single table?
Thank you in advance.
Most Liked Responses
MarioFlach
I’m working on a Git hosting platform and trying to get better performances when browsing repositories. I’m using libgit2 and traversing/walking the commit history is quite slow by design.
When working on big repositories (linux kernel source tree, >800,000 commits) it can take up to 10 seconds when computing something like commit count. I’ve been able to work around quiet a few performance bottlenecks using different tricks (clever pagination in combination with Stream, etc.) but it still needs to get faster than that…
I’m currently experimenting with storing/caching some meta informations in the database and hope to achieve better results.
Thank you for the query, it works like a charm
I’m surprised that it counts merges correctly. I would have assumed that it would return twice as much as commits each time the recursion queries more than one parent…
MarioFlach
That’s what I’m doing right now. I use exclusively libgit2 to interact with the underlying git objects.
I currently have two interchangeable storage backends available :
:filesystem(default) –.gitdirectory:postgres(experimental) – usinggit_referencesandgit_objectstables.
The problem is that traversing the commit history is quiet slow by design. There is no fast way of querying the history of a given pathspec (tree, blob) or even counting commit ancestors.
I’m not quite sure how you would implement caching effectively. I don’t want to cache on demand, the first hit will always be slow…
Currently, I’m experimenting with storing meta data for commit and tree objects and the performance boost I got so far is impressive.
In future I would like to extend my :postgres backend to write Git objects in separated tables (references, commits, tags, blobs, trees). Being able to perform more complex queries (authors, committers, query avatars, GPG signatures, ACLs, etc.) directly on the DB would be a ![]()
hauleth
You can cache on upload once.
You could use git notes attached to commits for storing metadata and cached values, like the commit “depth”. Alternatively instead of storing all commit data you could just store cached metadata in Postgres, however I would say that this still would be a little overdoing. But this is how I would solve that, if your way works for you then ok.
The problem is that this will not scale well in time as PSQL wasn’t meant to store BLOBs.








