GitGud, GitHub clone entirely written in Elixir

MarioFlach · August 22, 2019, 8:37am

Wow, I didn’t know about Xgit. Looks very promising.
I really like the repository storage abstraction. Is Xgit.Repository.OnDisk fully compatible with the Git filesystem storage format?

My approach is very different: I wrote NIFs to wrap a good part of libgit2. I also wrote an experimental storage backend that works with libpq.

Ditching libgit2 altogether for an Elixir-based Git implementation would be

scouten · August 22, 2019, 8:23pm

Hi Mario,

The intention is that Xgit.Repository.OnDisk is compatible with command-line git. The approach I take when building disk-related components is to run the equivalent operations in command-line git and Xgit and folder-diff the two. For example:

https://github.com/elixir-git/xgit/blob/d12191843f4b4e6c3bae16b13915e344eda1120e/test/xgit/repository/on_disk/put_loose_object_test.exs#L16


alias Xgit.Core.Object
alias Xgit.Repository
alias Xgit.Repository.OnDisk

import FolderDiff

describe "put_loose_object/2" do
  @test_content 'test content\n'
  @test_content_id "d670460b4b4aece5915caf5c68d12f560a9fe3e4"

  test "happy path matches command-line git (small file)", %{ref: ref, xgit: xgit} do
    Temp.track!()
    path = Temp.path!()
    File.write!(path, "test content\n")

    {output, 0} = System.cmd("git", ["hash-object", "-w", path], cd: ref)
    assert String.trim(output) == @test_content_id

    assert :ok = OnDisk.create(xgit)
    assert {:ok, repo} = OnDisk.start_link(work_dir: xgit)

That said, I’ll admit that at this early stage of the project I’m not aiming for 100% coverage of all variations of every piece of the on-disk format. As an example, I’m working on writing and parsing the .git/index file. For the moment, I only support version 2 because that’s what’s generated every time I do anything locally.

My primary goal for the on-disk version of things is to prove that I’ve got the abstractions right and that I can implement the core data model of git correctly.
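For readers unfamiliar with the loose-object format that the test above exercises, here is a minimal sketch of how a blob ID and its on-disk location can be derived. The module and function names (`LooseObjectSketch`, `encode_blob/1`, and so on) are made up for illustration and are not Xgit's API; only the `blob <size>\0<content>` encoding, the SHA-1 hash, and the zlib compression come from Git itself.

```elixir
defmodule LooseObjectSketch do
  # Hypothetical helper module; not part of Xgit.

  # Git hashes and stores a blob as "blob <size>\0" followed by the raw content.
  def encode_blob(content) when is_binary(content) do
    "blob #{byte_size(content)}\0" <> content
  end

  # The object ID is the lowercase hex SHA-1 of the encoded object.
  def object_id(content) do
    :sha
    |> :crypto.hash(encode_blob(content))
    |> Base.encode16(case: :lower)
  end

  # Loose objects live under objects/<first 2 hex chars>/<remaining 38>.
  def loose_path(<<dir::binary-size(2), file::binary>>) do
    Path.join(["objects", dir, file])
  end

  # The file content is the zlib-compressed encoded object.
  def compressed(content), do: :zlib.compress(encode_blob(content))
end

id = LooseObjectSketch.object_id("test content\n")
IO.puts(id)
IO.puts(LooseObjectSketch.loose_path(id))
```

The first line printed should match the `@test_content_id` value used in the test.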

That said, to anybody who wants to fill in some of those edge cases (e.g. versions 3 and 4 of the index file format): I will happily review a PR. (Parsing index files is mostly done; writing index files is my work in progress this week. I expect to finish that feature by the weekend and have a 0.1.6 release out then.)
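To illustrate what "only version 2" means in practice, here is a small sketch of reading the 12-byte index header (the `DIRC` signature, a 32-bit version, and a 32-bit entry count) and rejecting other versions. `IndexHeaderSketch` is a hypothetical module written for this example and is not how Xgit implements it.

```elixir
defmodule IndexHeaderSketch do
  # Hypothetical module; only the header layout is taken from Git's index format.
  @supported_versions [2]

  # Header: 4-byte "DIRC" signature, 32-bit big-endian version, 32-bit entry count.
  def parse_header(<<"DIRC", version::32, entry_count::32, _rest::binary>>) do
    if version in @supported_versions do
      {:ok, %{version: version, entry_count: entry_count}}
    else
      {:error, {:unsupported_version, version}}
    end
  end

  def parse_header(_other), do: {:error, :invalid_format}
end

IO.inspect(IndexHeaderSketch.parse_header(<<"DIRC", 2::32, 3::32>>))
IO.inspect(IndexHeaderSketch.parse_header(<<"DIRC", 4::32, 3::32>>))
```

Supporting versions 3 and 4 would mean extending `@supported_versions` and handling their entry-level differences, which this sketch does not attempt.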

m3rl1n · September 21, 2019, 7:04am

This seems pretty interesting. I might have a decent use case for this. The only question I have is: if you can use a Git library to store data in any back end, which back ends are actually possible, and how can we improve on the ones we currently have with GitHub or GitLab?

A super slick data store might really improve things, something like Spanner or similar.

MarioFlach · September 23, 2019, 9:15am

libgit2 lets you implement your own custom backend for Git. I’ve written an experimental Postgres backend built with libpq; check postgres_backend.c for implementation details.

My current plan is to provide:

a filesystem storage (the default backend) coupled with Erlang/Elixir OTP goodness for distributing Git repositories in a multi-node setup. The GitRekt.GitAgent already supports repository access through message-passing (GenServer); the missing part is the “dispatcher” (a consistent hash ring) that will be responsible for finding the right node for a given USER/REPO.
a libgit2 backend for a distributed key/value store including failover, redundancy, etc.
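As a rough illustration of the “dispatcher” idea mentioned above, here is a minimal consistent hash ring. Everything in it (the `HashRingSketch` module, the number of virtual points per node, the use of `:erlang.phash2/2`) is an assumption made for this sketch and not code from the project.

```elixir
defmodule HashRingSketch do
  # Hypothetical dispatcher sketch; not part of GitGud or GitRekt.
  @points_per_member 64
  # :erlang.phash2/2 accepts a range of up to 2^32.
  @space 4_294_967_296

  # Builds a sorted ring of {point, member} pairs, with several points per member.
  def new(members) do
    points =
      for member <- members, i <- 1..@points_per_member do
        {:erlang.phash2({member, i}, @space), member}
      end

    Enum.sort(points)
  end

  # Finds the first point at or after the key's hash, wrapping around if needed.
  def node_for(ring, key) do
    hash = :erlang.phash2(key, @space)

    case Enum.find(ring, fn {point, _member} -> point >= hash end) do
      {_point, member} -> member
      nil -> ring |> hd() |> elem(1)
    end
  end
end

ring = HashRingSketch.new([:fra, :lax])
IO.inspect(HashRingSketch.node_for(ring, "redrabbit/git-limo"))
```

The point of the ring is that removing a member only remaps the keys that member owned; keys owned by the other members stay where they are.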

MarioFlach · September 23, 2019, 9:25am

I’m also tackling the issue system right now. The UX is built with React and Relay; the backend uses GraphQL (Absinthe) and Phoenix.Presence.

All events are handled in real-time through GraphQL subscriptions and Phoenix.Channel.
You’ll receive real-time events when:

somebody writes a new comment or edits an existing comment
the issue is closed/reopened/renamed
the issue is referenced in a Git commit
somebody is typing a new message (Phoenix.Presence).
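To give a feel for the publish/subscribe pattern behind these events, here is a small sketch using Elixir's built-in Registry. This is deliberately not Absinthe subscriptions or Phoenix.Channel (which the project actually uses); the `IssueEventsSketch` module and the event shapes are invented for illustration.

```elixir
defmodule IssueEventsSketch do
  # Hypothetical pub/sub sketch built on Registry; the real project uses
  # GraphQL subscriptions and Phoenix.Channel instead.
  @registry __MODULE__.Registry

  def start, do: Registry.start_link(keys: :duplicate, name: @registry)

  # The calling process subscribes to events for one issue.
  def subscribe(issue_id) do
    {:ok, _owner} = Registry.register(@registry, {:issue, issue_id}, nil)
    :ok
  end

  # Sends the event to every process subscribed to that issue.
  def broadcast(issue_id, event) do
    Registry.dispatch(@registry, {:issue, issue_id}, fn subscribers ->
      for {pid, _value} <- subscribers, do: send(pid, {:issue_event, issue_id, event})
    end)
  end
end

{:ok, _pid} = IssueEventsSketch.start()
:ok = IssueEventsSketch.subscribe(42)
:ok = IssueEventsSketch.broadcast(42, {:comment_created, "Looks good"})

receive do
  {:issue_event, id, event} -> IO.inspect({id, event})
after
  1_000 -> IO.puts("no event received")
end
```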

MarioFlach · September 23, 2019, 9:33am

Once version 0.2.9 is released (I need to polish a few things), I will deploy the new version to https://git.limo. I will allow pushing repositories but set a limit in order to prevent pushing very large repositories.

Exadra37 · November 22, 2019, 11:29pm

No LiveView?

Congrats on the work done so far.

MarioFlach · November 24, 2019, 11:21am

Nope. I’ve experimented with LiveView a bit. In the end I was not convinced it’s the right tool for the project.

I don’t like that each websocket connection has to refetch data for each request (when socket mounts). Also I use JS to improve client-side UX in most places and don’t like to have the server consume resources for that kind of stuff.

GraphQL provides a lot of the features I need (batching, subscriptions, async, etc.), and :absinthe and Relay work pretty much without hassle.

Exadra37 · November 24, 2019, 12:03pm

Maybe I am wrong, but from what I’ve read and observed about LiveView, it sounds like your use case is a good fit for it.

But this doesn’t happen on every request. From what I understand, it happens only when the internet connection is lost or when it’s a new session.

How does your JS implementation differ?

From what I understand, LiveView has hooks that allow you to do that.

As far as I know, LiveView doesn’t consume many resources, and it only sends the diffs to the client.

It sounds like you are already very familiar with the frontend complexity and thus not easily buying into LiveView, which is fair enough.

OvermindDL1 · November 25, 2019, 5:16pm

Cachex, among other patterns, fixes that (depending on the type of data), but it’s by design because each one is a different process and so forth.

The server resources are pretty light, but yeah I agree, I use unpoly.js for most of my things to ‘enhance’ a page to work faster, I highly recommend looking at it, you can even optimize sending requests back to minimize data sent too (or don’t change anything on the server side).

Absinthe on the other hand is awesome, but it requires a lot of client-side scripting, just make sure you don’t break text clients like elinks or links2 or so.

ityonemo · November 25, 2019, 6:29pm

A super slick data store (s3 in elixir) would be freaking amazing. I have a use case for this at work and if I don’t see anything good crop up I might try to implement this.

PS: iirc (correct me if I’m wrong) Spanner requires highly synchronized atomic clocks, so unless you’re Google you don’t really get to use it. But also the geniuses at Google don’t necessarily have an underlying architecture that respects the theory of relativity in quite the same way that the BEAM does.

MarioFlach · October 5, 2021, 1:10pm

I’ve continued my journey and committed to the project every now and then.

https://git.limo/redrabbit/git-limo

So here’s a little update.

Distributed Setup

The current version is running on Fly. I’ve got a small cluster (two nodes: FRA, LAX) set up with :libcluster, so adding new nodes should work automatically.

Each Fly instance has its own storage attached for storing Git repositories. When a user creates a new repository, it is assigned to the local node’s storage and all the Git objects will be stored there:

I live in Austria, so my Git repositories are stored on the closest instance running in Frankfurt.

When accessing one of my repos from the US, the instance in Los Angeles will route all Git commands to the right node.

In order to get things working without too much latency, I had to refactor a big chunk of code to batch Git commands together and keep the number of roundtrips between instances low. In the end I’m quite happy with the results.

Try for yourself:

redrabbit/elixir is running in FRA.
redrabbit/phoenix is running in LAX.

Repository Pools

I’ve implemented some kind of distributed routing pool on top of Erlang’s :global.

Here’s a screenshot of the supervision tree:

When a node starts up, GitGud.RepoSupervisor does a few things:

tags the local storage (if not already tagged) and registers the resulting id across the cluster.
starts a GitGud.RepoStorage worker for handling filesystem operations.
starts a GitGud.RepoPool supervisor for handling Git commands.
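The first step in the list above (tagging the storage and registering its id across the cluster) could look roughly like the sketch below. The `VolumeTagSketch` module, the `VOLUME` file name, and the id format are assumptions made for illustration; only the use of Erlang’s :global for cluster-wide registration is taken from the post.

```elixir
defmodule VolumeTagSketch do
  # Hypothetical sketch; the real GitGud.RepoStorage may work differently.

  # Reads the volume id from a tag file, creating one if the storage is untagged.
  def ensure_tag(dir) do
    path = Path.join(dir, "VOLUME")

    case File.read(path) do
      {:ok, id} ->
        String.trim(id)

      {:error, :enoent} ->
        id = Base.encode16(:crypto.strong_rand_bytes(4), case: :lower)
        File.write!(path, id)
        id
    end
  end

  # Registers a process as the owner of a volume, visible to all connected nodes.
  def register(volume, pid \\ self()), do: :global.register_name({__MODULE__, volume}, pid)

  # Returns the owning pid (possibly on another node) or :undefined.
  def whereis(volume), do: :global.whereis_name({__MODULE__, volume})
end

dir = Path.join(System.tmp_dir!(), "volume_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(dir)
volume = VolumeTagSketch.ensure_tag(dir)
:yes = VolumeTagSketch.register(volume)
IO.inspect(VolumeTagSketch.whereis(volume) == self())
```

With such a registration in place, any node can resolve a repository’s volume to the process responsible for it.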

The GitGud.Repo schema has a :volume field which points to the storage VOLUME where its data is stored. When a repository is created, it is assigned to the local storage:

field :volume, :string, autogenerate: {GitGud.RepoStorage, :volume, []}

With this in mind, let’s see how we can run Git commands on a specific repository:

repo = GitGud.RepoQuery.user_repo("redrabbit", "git-limo")
{:ok, agent} = GitRekt.GitRepo.get_agent(repo)
{:ok, head} = GitRekt.GitAgent.head(agent)
{:ok, commit} = GitRekt.GitAgent.peel(agent, head)
{:ok, commit_msg} = GitRekt.GitAgent.commit_message(agent, commit)
IO.puts commit_msg

The above example prints the HEAD commit message for redrabbit/git-limo.

The interesting part here is GitRekt.GitRepo.get_agent/1 which is implemented by GitGud.Repo:

  defimpl GitRekt.GitRepo, for: GitGud.Repo do
    def get_agent(repo), do: GitGud.RepoPool.checkout(repo)
  end

As you can see, it relies on GitGud.RepoPool for retrieving a Git agent from the pool on the right node. Let’s dive into it.

Internally, the pool can be seen as a DynamicSupervisor of DynamicSupervisors, with GitGud.RepoPool.checkout/1 as the entry point for fetching agents. It also provides a few nice things:

auto scale – grows/shrinks the number of agent processes based on demand.
global cache – agents in a pool share a global ETS table.
round robin – agents are distributed using round-robin.
node aware – a pool will always start on the right node based on the repo’s VOLUME.
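The round-robin part of the list above can be sketched in a few lines. This is a simplified, single-node illustration with a fixed pool size: `RoundRobinPoolSketch` is a hypothetical module, the workers are plain Agent processes standing in for Git agents, and auto-scaling, the shared cache, and node awareness are left out.

```elixir
defmodule RoundRobinPoolSketch do
  # Hypothetical sketch; GitGud.RepoPool is more involved than this.

  # Starts `size` worker processes under a DynamicSupervisor.
  def start(size) do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

    workers =
      for _ <- 1..size do
        {:ok, pid} = DynamicSupervisor.start_child(sup, {Agent, fn -> :idle end})
        pid
      end

    # A shared counter decides which worker is handed out next.
    table = :ets.new(:pool_counter, [:public])
    :ets.insert(table, {:next, -1})
    %{sup: sup, workers: List.to_tuple(workers), table: table}
  end

  # Atomically bumps the counter and picks the matching worker.
  def checkout(%{workers: workers, table: table}) do
    n = :ets.update_counter(table, :next, 1)
    elem(workers, rem(n, tuple_size(workers)))
  end
end

pool = RoundRobinPoolSketch.start(2)
first = RoundRobinPoolSketch.checkout(pool)
second = RoundRobinPoolSketch.checkout(pool)
third = RoundRobinPoolSketch.checkout(pool)
IO.inspect({first != second, first == third})
```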

Git Agents

The GitRekt.GitAgent module is the backbone for running Git commands. While the public API is quite easy to grasp, it hides a lot of complexity.

An agent is basically a wrapper around GitRekt.Git. Here’s a very basic usage example:

{:ok, agent} = GitRekt.GitAgent.start_link("path/to/workdir")
{:ok, tags} = GitRekt.GitAgent.tags(agent)
for tag <- tags do
  IO.puts "Tag #{tag.name} -> #{Base.encode16(tag.oid, case: :lower)}"
end

In the above example, agent is a dedicated process for running Git commands.

Note that you can also run Git commands directly in the current process:

{:ok, agent} = GitRekt.Git.repository_open("path/to/workdir")
{:ok, branches} = GitRekt.GitAgent.branches(agent)
for branch <- branches do
  IO.puts "Branch #{branch.name} -> #{Base.encode16(branch.oid, case: :lower)}"
end

In the above example, agent is a NIF-resource representing a libgit2 repository.

Transactions

GitRekt.GitAgent provides support for transactions, i.e. batching a bunch of Git operations into one call. This is very important when running Git commands on a separate node:

# agent is a PID running on another node
{:ok, head} = GitRekt.GitAgent.head(agent) #1
{:ok, commit} = GitRekt.GitAgent.peel(agent, head) #2
{:ok, commit_msg} = GitRekt.GitAgent.commit_message(agent, commit) #3
IO.puts commit_msg

Running the above code would make three separate GenServer.call/2 invocations, each a roundtrip to the other node, resulting in quite some latency. We can fix this by batching the commands in a transaction:

# agent is a PID running on another node
{:ok, commit_msg} =
  GitRekt.GitAgent.transaction(agent, fn agent ->
    with {:ok, head} <- GitRekt.GitAgent.head(agent),
         {:ok, commit} <- GitRekt.GitAgent.peel(agent, head) do
      GitRekt.GitAgent.commit_message(agent, commit)
    end
  end)

In the above example, the three commands are executed in a single call on the dedicated agent process, reducing the overall latency.
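To make the mechanism concrete, here is a minimal sketch of how such batching can work with a GenServer: the function passed to the transaction is sent to the server once and run there, so the operations inside it no longer cross process (or node) boundaries. `BatchAgentSketch` and its map-based "repository" are invented for this example and are not GitRekt.GitAgent’s implementation.

```elixir
defmodule BatchAgentSketch do
  # Hypothetical sketch; a plain map stands in for the libgit2 repository.
  use GenServer

  def start_link(repo), do: GenServer.start_link(__MODULE__, repo)

  # Each command works on a pid (one call per command) or directly on the repo.
  def head(agent), do: exec(agent, :head)
  def commit_message(agent, oid), do: exec(agent, {:commit_message, oid})

  # Ships the whole function to the server: one call, however many commands.
  def transaction(agent, fun) when is_pid(agent), do: GenServer.call(agent, {:transaction, fun})

  defp exec(agent, op) when is_pid(agent), do: GenServer.call(agent, {:op, op})
  defp exec(repo, op) when is_map(repo), do: run(repo, op)

  @impl true
  def init(repo), do: {:ok, repo}

  @impl true
  def handle_call({:op, op}, _from, repo), do: {:reply, run(repo, op), repo}
  def handle_call({:transaction, fun}, _from, repo), do: {:reply, fun.(repo), repo}

  defp run(repo, :head), do: Map.fetch(repo, :head)

  defp run(repo, {:commit_message, oid}) do
    case repo.commits do
      %{^oid => message} -> {:ok, message}
      _ -> {:error, :not_found}
    end
  end
end

{:ok, agent} = BatchAgentSketch.start_link(%{head: "b662d32", commits: %{"b662d32" => "Initial commit"}})

result =
  BatchAgentSketch.transaction(agent, fn repo ->
    with {:ok, head} <- BatchAgentSketch.head(repo) do
      BatchAgentSketch.commit_message(repo, head)
    end
  end)

IO.inspect(result)
```

Inside the function the commands receive the repository itself rather than the pid, which is why they run without further message passing.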

Caching

An additional feature of GitRekt.GitAgent is caching. When running transaction/3 we can pass a cache key as the 2nd argument:

# Assumes `alias GitRekt.GitAgent` is in scope.
def commit_info(agent, commit) do
  GitAgent.transaction(agent, {:commit_info, commit.oid}, fn agent ->
    with {:ok, author} <- GitAgent.commit_author(agent, commit),
         {:ok, committer} <- GitAgent.commit_committer(agent, commit),
         {:ok, message} <- GitAgent.commit_message(agent, commit),
         {:ok, parents} <- GitAgent.commit_parents(agent, commit),
         {:ok, timestamp} <- GitAgent.commit_timestamp(agent, commit),
         {:ok, gpg_sig} <- GitAgent.commit_gpg_signature(agent, commit) do
      {:ok, %{
        oid: commit.oid,
        author: author,
        committer: committer,
        message: message,
        parents: Enum.to_list(parents),
        timestamp: timestamp,
        gpg_sig: gpg_sig
      }}
    end
  end)
end

You may have noticed the {:commit_info, commit.oid} tuple given to transaction/3. This tells the agent that the transaction should be cached using this key.

Calling commit_info/2 two times in a row would result in the following log output:

[debug] [Git Agent] transaction(:commit_info, "b662d32") executed in 361 µs
[debug] [Git Agent] > commit_author(<GitCommit:b662d32>) executed in 6 µs
[debug] [Git Agent] > commit_committer(<GitCommit:b662d32>) executed in 5 µs
[debug] [Git Agent] > commit_message(<GitCommit:b662d32>) executed in 1 µs
[debug] [Git Agent] > commit_parents(<GitCommit:b662d32>) executed in 4 µs
[debug] [Git Agent] > commit_timestamp(<GitCommit:b662d32>) executed in 11 µs
[debug] [Git Agent] > commit_gpg_signature(<GitCommit:b662d32>) executed in 6 µs
[debug] [Git Agent] transaction(:commit_info, "b662d32") executed in ⚡ 3 µs

We can observe that the first call executes the different commands one by one and caches the result, while the second one fetches the result directly from the cache without having to actually run the transaction.
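The behaviour seen in the log can be approximated with a small ETS-backed sketch. The `CachedTransactionSketch` module is hypothetical, and the choice to cache only `{:ok, _}` results is an assumption of this example, not something the post states about GitRekt.GitAgent.

```elixir
defmodule CachedTransactionSketch do
  # Hypothetical sketch of keyed transaction caching using ETS.

  def new_cache, do: :ets.new(:transaction_cache, [:set, :public])

  # Runs `fun` on a cache miss and stores successful results under `key`.
  def transaction(cache, key, fun) do
    case :ets.lookup(cache, key) do
      [{^key, result}] ->
        result

      [] ->
        result = fun.()
        # Assumption: only successful results are worth caching.
        if match?({:ok, _}, result), do: :ets.insert(cache, {key, result})
        result
    end
  end
end

cache = CachedTransactionSketch.new_cache()

load = fn ->
  IO.puts("running transaction")
  {:ok, %{message: "Initial commit"}}
end

CachedTransactionSketch.transaction(cache, {:commit_info, "b662d32"}, load)
CachedTransactionSketch.transaction(cache, {:commit_info, "b662d32"}, load)
```

Here "running transaction" is printed only once, because the second call is served from the cache.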

There’s a lot more to tell about GitRekt.GitAgent’s internals (streaming support, a mechanism to prevent the garbage collector from deleting NIF resources, etc.). If you’re interested I can write a small post about it.

LiveView

On the frontend, I’ve managed to introduce Phoenix LiveView and replace all my React/Relay components with a LiveView counterpart. For example, the GitGud.Web.TreeBrowserLive is used to navigate across a Git repository tree. Here’s a list of all views/components:

GitGud.Web.BlobHeaderLive
GitGud.Web.BranchSelectLive
GitGud.Web.CommentFormLive
GitGud.Web.CommentLive
GitGud.Web.CommitDiffLive
GitGud.Web.CommitLineReviewLive
GitGud.Web.GlobalSearchLive
GitGud.Web.IssueEventLive
GitGud.Web.IssueFormLive
GitGud.Web.IssueLabelSelectLive
GitGud.Web.IssueLive
GitGud.Web.MaintainerSearchFormLive
GitGud.Web.TreeBrowserLive

Fast Git Backend Server

I also refactored the Git backend aka. GitRekt.WireProtocol which was slow and consumed a lot of resources.

When pushing a repository, the incoming PACK file is now streamed directly to disk. This increased raw performance by about 700% and greatly reduced the amount of RAM and CPU used for the operation.

The performance boost makes it possible to fetch/push across nodes in a cluster setup. When you push via SSH, the PACK is sent to the nearest node, which then streams it to the right node in the cluster.
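As a rough sketch of the streaming idea, the code below writes incoming chunks to a file as they arrive while keeping only a running SHA-1 and the last 20 bytes in memory, then checks the checksum trailer that ends every PACK. `PackStreamSketch` is a hypothetical module and not GitRekt.WireProtocol; only the PACK header layout and the SHA-1 trailer come from Git’s pack format.

```elixir
defmodule PackStreamSketch do
  # Hypothetical sketch; not the project's actual wire protocol code.
  @trailer_size 20

  # Writes each chunk to disk as it arrives and verifies the SHA-1 trailer.
  def write(chunks, path) do
    File.open!(path, [:write, :binary], fn io ->
      {ctx, tail} = Enum.reduce(chunks, {:crypto.hash_init(:sha), <<>>}, &consume(io, &1, &2))

      if byte_size(tail) == @trailer_size and :crypto.hash_final(ctx) == tail do
        :ok
      else
        {:error, :bad_trailer}
      end
    end)
  end

  # Hashes everything except the last 20 bytes seen so far (the candidate trailer).
  defp consume(io, chunk, {ctx, tail}) do
    :ok = IO.binwrite(io, chunk)
    data = tail <> chunk
    body_size = max(byte_size(data) - @trailer_size, 0)
    <<body::binary-size(body_size), new_tail::binary>> = data
    {:crypto.hash_update(ctx, body), new_tail}
  end

  # PACK header: 4-byte signature, 32-bit version, 32-bit object count.
  def header(<<"PACK", version::32, object_count::32, _rest::binary>>) do
    {:ok, %{version: version, object_count: object_count}}
  end

  def header(_other), do: {:error, :invalid_header}
end

body = <<"PACK", 2::32, 0::32>>
pack = body <> :crypto.hash(:sha, body)
path = Path.join(System.tmp_dir!(), "pack_sketch_#{System.unique_integer([:positive])}.pack")
IO.inspect(PackStreamSketch.write([binary_part(pack, 0, 10), binary_part(pack, 10, 22)], path))
```

Memory use stays proportional to the chunk size rather than the size of the PACK, which is the property the refactoring described above is after.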

MarioFlach · October 5, 2021, 1:42pm

Also, the project is sponsored by AppSignal. You can check the appsignal branch if you’re interested.

You can also check out the fly branch if you want to deploy on your own Fly instances.

EdmondFrank · October 10, 2021, 3:23pm

Very cool

mattei · October 11, 2021, 1:07am

Look, I just want to mention… your taste in library naming is top notch. GitRekt

MarioFlach · October 14, 2021, 8:27pm

GitRekt for the dangerous, not very safe NIF/C code (low-level Git), GitGud for the Elixir counterpart (schemas, supervision tree, etc.).

MarioFlach · November 2, 2021, 12:32am

I’ve updated to Phoenix 1.6 and LiveView 0.16 and started refactoring my LiveViews to work with the new live sessions.

Now when browsing a repository, codebase navigation happens with live_redirect/2. This works across most live views and makes the entire user experience much snappier.

I’m really happy about this new feature. While it makes the frontend faster, it also takes a lot of load off the backend.

I’ve also added support for topbar.js in order to show a progress indicator.

You can give it a try here:

https://git.limo/redrabbit/git-limo

derek-zhou · November 29, 2021, 7:04pm

I’ve always wanted a self-hosted GitHub, so I tried the demo server before deploying one myself. I registered a user at git.limo and uploaded an SSH public key; however, it is not clear to me how to push. The brief instructions for a newly created project suggested:

git remote add origin https://git.limo/derek-zhou/test.git
git push -u origin master

I’d assume HTTPS is for read-only access? Anyway, it does not work:

fatal: unable to access 'https://git.limo/derek-zhou/test.git/': server certificate verification failed. CAfile: none CRLfile: none

I’ve also tried the github convention:

git remote add origin git@git.limo:derek-zhou/test.git

But then it just hangs.

From the git-limo repo linked above, it seems to want:

git remote add origin ssh://derek-zhou@git.limo:1022/derek-zhou/test.git

But it hangs too. @MarioFlach ?

MarioFlach · November 30, 2021, 3:20pm

Hi @derek-zhou,

Pushing over HTTPS is supported. You must authenticate with your git.limo username/password though.

From the error message, it looks like you are missing root certificates somehow. Depending on your OS, you might want to install/update them.

Pushing over SSH is actually also supported. You should be able to authenticate with

your git.limo username/password.
your git.limo username and associated SSH key.

Note that this differs from other Git hosting platforms, which only provide SSH key authentication with the ‘git’ user.

Now, because Fly.io does not allow for external port 22, I’m using 10022. You should be able to push via:

git remote add origin ssh://derek-zhou@git.limo:10022/derek-zhou/test.git

Edit: There is a bug when showing the SSH clone URL for a repository. It shows the internal port 1022 and not the external one 10022… Will fix that.

derek-zhou · November 30, 2021, 3:55pm

Thanks. After disabling sslVerify in git, I can push and pull over HTTPS. However, there is still something wrong with SSH push/pull. SSH itself works with my key:

derek@mail:~/projects/notes$ ssh derek-zhou@git.limo -p 10022
You are not allowed to start a shell.
Connection to git.limo closed.

However, git over it does not.
I have seen 2 distinct error messages:

error: failed to push some refs to 'ssh://derek-zhou@git.limo:10022/derek-zhou/test.git'

and:

error: remote unpack failed: %GitRekt.GitError{code: -1, message: "missing trailer at the end of the pack"}
fatal: the remote end hung up unexpectedly

My git version is a bit old though:

derek@mail:~/projects/notes$ git --version
git version 2.20.1

The strange thing is https works.

EDIT:
A newer git version seems to work:

git version 2.30.2