GitGud, a GitHub clone written entirely in Elixir

MarioFlach · March 6, 2018, 12:58am

Hello,
I want to share a project I’ve been working on for a while:

Background

Some time ago I came across a talk: How we scaled GitLab for a 30k employee company.

(basic overview of the system architecture)

The presentation was about how the team at GitLab solved scaling issues on their platform. After a few slides I wondered how this could be approached with languages like Erlang and Elixir.

After some moments of reflection, I had it! A basic concept on how things could fit together:

(joke aside, that’s pretty much what I came up with)

Building blocks

Erlang/Elixir and OTP provide a lot of building blocks to power a scalable Git platform. The idea was to use nothing but Elixir and libgit2.

So here’s a more detailed overview of the architecture:

Phoenix does multiple things:
- handles incoming Git HTTP commands.
- renders HTML for browsing users, repositories, etc.
- provides a basic GraphQL API for alternative clients (mobile, etc.).
The SSH server implements :ssh_daemon_channel to handle Git SSH commands.
Ecto stores application data such as users, repositories, etc.

Authentication

A really nice thing about :ssh is that it provides support for authentication via password and public/private keys out of the box:

Password based authentication is supported both for HTTP and SSH.
Users can provide one or more SSH public keys to authenticate with.

NIFs / libgit2

If you are not familiar with libgit2, it’s an implementation of the Git core methods and functions written in C. One distinctive feature of the library is that you can provide your own storage backend, which means you can plug in your own distributed K/V database instead of writing everything to the filesystem.

I heavily used code from the Erlang :geef library, refactored a good part and added a bunch of missing functions. Check the Elixir module and the C bindings.

Git transfer protocol & Packfile format

This is the fun part of the project.

libgit2 does not support server-side commands; it only focuses on the client implementation. In the first iteration I cheated and used Ports to execute git-upload-pack and git-receive-pack. It worked well, both for SSH and HTTP.

But I wanted to have more control over the process (hooks, etc.) and having to depend on git only for the transfer protocol was a shame…

So I started digging into the protocol internals and docs. I have worked with a lot of different network protocols in my career (medical field, DICOM, HL7, etc.) but I must admit, the Git transfer protocol and the Git Packfile format were quite heavy sh**t to grasp.

It has lots of different binary optimisations.
It uses zlib to deflate chunks but only gives you the resulting size of the inflated data, so I had to come up with my own zlib C implementation.
The transfer protocol differs depending on the transport protocol.
Documentation is, ~~hard to find~~, ~~scarce~~, well hmm.
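
To make the zlib point above concrete, here is a minimal Python sketch (not the project's Elixir/C code) of reading packfile entries. It assumes only what the Git pack format documents: each entry header stores the object type and the inflated size in a variable-length encoding, and the compressed length has to be discovered by the decompressor. The function names and the build_entry helper are hypothetical and exist only for this demo; the 12-byte pack header, delta base references and the trailing checksum are left out.

```python
import zlib

# Object type codes used in packfile entry headers.
OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
             6: "ofs_delta", 7: "ref_delta"}


def parse_entry_header(buf, pos=0):
    """Return (type_code, inflated_size, next_pos) for the entry at buf[pos]."""
    byte = buf[pos]
    pos += 1
    type_code = (byte >> 4) & 0x07    # bits 4-6 hold the object type
    size = byte & 0x0F                # low 4 bits are the first size bits
    shift = 4
    while byte & 0x80:                # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return type_code, size, pos


def inflate_entry(buf, pos, expected_size):
    """Inflate the zlib stream starting at buf[pos]; return (data, next_pos).

    The pack does not store the compressed length, so it is derived from
    the bytes the decompressor left untouched after the stream ended.
    """
    inflater = zlib.decompressobj()
    data = inflater.decompress(buf[pos:])
    if not inflater.eof:
        raise ValueError("truncated zlib stream")
    if len(data) != expected_size:
        raise ValueError("inflated size does not match entry header")
    consumed = len(buf) - pos - len(inflater.unused_data)
    return data, pos + consumed


def build_entry(type_code, payload):
    """Hypothetical demo helper: encode one non-delta entry."""
    size = len(payload)
    byte = (type_code << 4) | (size & 0x0F)
    size >>= 4
    header = bytearray()
    while size:
        header.append(byte | 0x80)    # flag that more size bytes follow
        byte = size & 0x7F
        size >>= 7
    header.append(byte)
    return bytes(header) + zlib.compress(payload)


# Two entries back to back, as they would appear inside a pack.
blob_a = b"hello world\n" * 25
blob_b = b"GitGud"
stream = build_entry(3, blob_a) + build_entry(3, blob_b)

pos = 0
objects = []
while pos < len(stream):
    type_code, size, pos = parse_entry_header(stream, pos)
    data, pos = inflate_entry(stream, pos, size)
    objects.append((OBJ_TYPES[type_code], data))
```

The second entry can only be located because inflate_entry reports how many compressed bytes the first one used, which is exactly the information the format does not provide up front.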

It’s currently quite messy, but have a look here for implementation details.

Project state

Still a proof of concept: it’s working, but there are almost no tests, so unexpected things may happen.
If you are interested, download the code and give it a try. PRs are very welcome.

nsuchy · March 6, 2018, 1:09am

Wow, this project looks impressive. Would you be willing to collaborate with me? I’m new to Elixir development and would love to contribute to your project in whatever way possible.

cmkarlsson · March 6, 2018, 1:10am

Well done! This looks great. I’ve had the thought myself of implementing GitHub in Elixir.

For me, GitLab picked the completely wrong tool for the job. Just look at the stack! And every time they have a release, everyone complains about performance problems. No wonder!

I’ll definitely give this a go.

EDIT: Translated to Swedish the name is : Git God.

Linuus · March 6, 2018, 1:30pm

Nice! I planned something similar a year ago but had a baby instead and never got started.

OvermindDL1 · March 7, 2018, 7:13pm

Hah, that’s awesome! ^.^

Actually, have you read this:
https://web.archive.org/web/20120215094414/http://andrew.hijacked.us/by_keyword/328/egit

The original website is down but it exists on archive.org. He took egitd (the GitHub back-end git server, originally made in Erlang), which GitHub screwed up on pretty badly, and with a few minor changes he got it blazing fast (though by then GitHub had already rewritten it in C++ or whatever). ^.^

MarioFlach · March 7, 2018, 9:24pm

Thanks, have a look at the issues. PRs are also welcome.

MarioFlach · March 7, 2018, 9:28pm

I’m not familiar at all with GitLab, but the presentation is pretty old (2016); as a growing business, they might have changed things a lot in the last few years.

MarioFlach · March 7, 2018, 9:52pm

I assume it’s the GitHub repo here: mojombo/egitd.

I came across this during development but it does not implement the Git transfer protocol; it uses Ports to execute git and only supports the git:// transport protocol. It’s basically a wrapper around git.

My project is quite different because not only the transport (SSH/HTTP) is written in Elixir, but also the transfer protocol (where most implementations I came across call git-upload-pack and git-receive-pack and only pipe data in between).

I did quite a lot of code searching to find out how to handle several gotchas related to the Git protocol and the Packfile format. The only implementations I could find were in C, Haskell and OCaml.
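
For readers who have not looked at the transfer protocol itself, its framing layer (pkt-line) can be sketched in a few lines. This is a Python illustration under the rules in Git's protocol documentation, not code from GitGud: every packet starts with four hex digits giving the total length including those four bytes, and "0000" is a flush packet. The function names are made up for the example. It also shows one transport difference: the smart HTTP ref advertisement starts with a "# service=..." packet followed by a flush, which the SSH transport does not send.

```python
FLUSH_PKT = b"0000"
MAX_PKT_LEN = 65520  # 65516 payload bytes + 4 length bytes


def pkt_line(payload):
    """Frame payload as one pkt-line: 4 hex digits of total length + data."""
    length = len(payload) + 4
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def read_pkt_lines(buf):
    """Yield each payload found in buf; a flush packet is yielded as None."""
    pos = 0
    while pos < len(buf):
        length = int(buf[pos:pos + 4], 16)
        if length == 0:               # "0000" carries no payload
            yield None
            pos += 4
            continue
        if length < 4:
            # 0001-0003 are special packets in protocol v2, not handled here.
            raise ValueError("unsupported pkt-line length")
        yield buf[pos + 4:pos + length]
        pos += length


# Start of a smart HTTP response to GET .../info/refs?service=git-upload-pack
advert = pkt_line(b"# service=git-upload-pack\n") + FLUSH_PKT
packets = list(read_pkt_lines(advert))
```

Here advert is the byte string 001e# service=git-upload-pack followed by a newline and 0000; the actual ref list would follow as further pkt-lines.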

OvermindDL1 · March 7, 2018, 11:05pm

No I don’t think it was that one, that looks significantly different than what I think I remember seeing (though it’s been near-on a decade now)… >.>

But yes, implementing it internally to allow for maximum concurrency is definitely the way to go.

EDIT: Make sure to benchmark the speed. As an example, the transfer speed (of even a dead-simple git clone ...) is DREADFULLY slow for gitlab compared to github. Cloning some large projects on gitlab takes over an hour compared to ~5m for github. The linux kernel is a great test of cloning speeds if you want something to benchmark.

sorentwo · March 8, 2018, 4:40am

Thanks for the links, I’ve never seen those articles before and they look like a great read.

I was aware of the existence of egit and the Erlang history, though I wasn’t sure why they abandoned it. From what I recall they switched to straight Unicorn to handle all of the requests.

yurko · March 8, 2018, 9:14am

Thanks for sharing, that’s really impressive work! Do you have an idea of how long (in man-days) it took you to get it done?

MarioFlach · March 8, 2018, 1:55pm

> No I don’t think it was that one, that looks significantly different than what I think I remember seeing (though it’s been near-on a decade now)…

In the blog posts related to the archive.org link you shared, the author references the following repo: Vagabond/egitd (which is a fork of the repo I posted earlier).

> But yes, implementing it internally to allow for maximum concurrency is definitely the way to go.

I did not really implement it because of concurrency, but to have full control over the process and the possibility to add new features (hooks, fine-grained permissions, etc.) in the future.

> EDIT: Make sure to benchmark the speed.

This is definitely on my list.

MarioFlach · March 8, 2018, 2:26pm

Thanks.

Difficult to tell how much time it has taken me so far. I started the project in mid-October and was intensively committing code until the end of January.

Around 150 commits.
Around 4,100 lines of Elixir code.
Around 3,200 lines of C code (the majority of which is only forked and refactored).

But be aware that the project is far from finished!

Currently, only the very core functionality is implemented: Git transport/transfer, basic user and repo schemas, very basic authentication. In order to manage users and repositories, you have to use IEx, as most Phoenix controllers are missing (currently you can list user repos and browse their code). So it’s not really usable…

yurko · March 8, 2018, 2:41pm

Yes, first 90% done, other 90% to do, got it - still impressive

tme_317 · March 20, 2018, 5:34pm

Hi Mario,

What a fascinating project! Thanks for putting it out there. Beyond the GitHub-style functionality it has/will have, it helps people a great deal to study the code of well-architected Elixir/Phoenix OSS apps at scale.

I was looking through it and was curious why you decided to ditch the Vue SPA and rewrite the frontend with EEX rendered React? I’m guessing you concluded the SPA added unnecessary complexity and was slower to develop? Were there any other considerations?

I am going the opposite direction on my project rewriting from Vue+EEX -> React SPA right now but still don’t have a good feeling for the tradeoffs regarding development speed yet.

Thanks!

OvermindDL1 · March 20, 2018, 8:23pm

Oh something like this definitely should not be SPA! If the website is not entirely useful (if not progressively enhanced) without javascript then it has major issues… o.O

MarioFlach · March 22, 2018, 2:37pm

Thank you @tme_317. I really appreciate that people are taking a look at this.

I started the project for learning purposes only:

possibility to implement almost everything with Elixir/Erlang,
multiple network protocols involved,
Git transfer protocol being renowned for its complexity (I knew it would be a lot of binary pattern-match fun),
lots of C NIFs in order to use libgit2 in Elixir/Erlang,
modular design

It makes me happy that I got so much feedback on this. I published it mainly for people interested in the technical aspect.

As I currently have a lot of things going on (2nd baby arriving in a few weeks, house construction and lots of work to cover the costs for everything), I’m not sure if the project will be a usable product anytime soon.

Still, I want to provide a very basic user interface for basic actions such as registration, authentication, repository management and a tree browser. Also, more detailed documentation so that novice developers are able to experiment with the project without too much hassle (see #15).

> I was looking through it and was curious why you decided to ditch the Vue SPA and rewrite the frontend with EEX rendered React? I’m guessing you concluded the SPA added unnecessary complexity and was slower to develop? Were there any other considerations?

Indeed, I started with the idea of having a Vue SPA. I’m not really a front-end guy and wanted to give Vue a try. I later decided to go the EEX/HTML path because I was not really comfortable with writing that much Javascript. In the end, I will try to render most of the interface using server-side rendered HTML and React/Relay for complex components (branch/tag <select> for example).

> I am going the opposite direction on my project rewriting from Vue+EEX → React SPA right now but still don’t have a good feeling for the tradeoffs regarding development speed yet.

A while ago, I wrote a small SPA using React/Relay + GraphQL. I was pretty amazed how easy it was to implement once the backend was ready (Absinthe). But still, I’m not a Javascript person and for bigger projects, I prefer to stick to rendering plain boring HTML from the server…

MarioFlach · March 22, 2018, 2:55pm

> Oh something like this definitely should not be SPA! If the website is not entirely useful (if not progressively enhanced) without javascript then it has major issues… o.O

I’m completely on your side. Indeed I switched from a REST/Vue SPA architecture to a server-side rendered HTML + React/GraphQL for complex components.

Still, for some things, I could argue that it’s better to write specific components with Javascript (complex UX not possible to implement with CSS only, loading data only when required, etc.). GitHub has a bunch of stuff not working well with Javascript disabled.

OvermindDL1 · March 22, 2018, 3:26pm

Sure, as long as you can still ‘do’ everything without javascript, even if it is a few extra steps, that is what Progressive Enhancement is.

ShalokShalom · March 22, 2018, 6:30pm

Very tasty

Elm is something I can imagine for the UI.