The GitHub Blog: Engineering

Bioclipse (a Java-based, open source, visual platform for chemo- and bioinformatics) has scripting support and the community has developed a great method for sharing those scripts: Gist!

They create Gists then tag them on delicious as bioclipse+gist.

For example, here’s one that downloads itself:

[Embedded gist]

Check their blog post for more info. Nicely done, all!

(Seems like we could add a few features to Gist to make this sort of thing even easier.)

Compojure is a Clojure web framework similar to web.py or Sinatra.

[Embedded gist]

The project also has a Wikibook and mailing list. Looks cool.

We’re now appending the gist name at the end of its raw url. That means it’s dead-simple to serve greasemonkey (or greasekit) scripts directly from gist.github.com. I was able to write my first script and install it in less than five minutes:

[Embedded gist]

If you have greasemonkey installed and click the “view raw” link in the embedded gist above, your browser will ask you if you want to install the script.

This of course comes with a strong word of caution: pay attention to the scripts you’re installing. Your browser will not execute the JavaScript, because we serve it as text/plain, so feel free to hit cancel when the install dialog appears in order to read over it first.

Note: You won’t see the new raw url until your gist cache is updated, so you can either wait until it falls out of the cache, or just make a simple change to your gist to update it immediately.

Henrik has a great article explaining why and how to display Git’s dirty state status (along with the branch, of course) in your bash prompt.

topfunky prefers a skull and bones for his dirty state indicator.

Thanks guys!

People have asked for our delayed_job god config.

Welp, here it is:

First our new queue, and now this:

The site should be much faster – but it’s still not fast enough. We’re hard at work making things like git clones, tree browsing, and commit viewing much faster. As always, we’ll keep you in the loop.

Yesterday we moved to a new queue, Shopify’s delayed_job (or dj).

After trying a few different solutions in the early days, we settled on Ara Howard’s Bj. It was fine for quite a while, but some of the design decisions haven’t been working out for us lately. Bj allows you to spawn exactly one worker per machine – we want a machine dedicated to workers. Bj loads a new Rails environment for every job submitted – we want to load a new Rails environment one time only. Both of these decisions carry performance implications.

If we were to run one Bj per machine, we’d only have four workers running, as GitHub consists of four ultra-beefy app slices. Unlike most contemporary web apps, the fewer slices we have the better – it means fewer machines connected to our networked file system, and fewer machines create less network chatter and lock contention. As some of the jobs take a while to run (60+ seconds), four workers is a very low number. We want something like 20, but we’d settle for as few as 8.

We did hack Bj to allow multiple instances to run on a machine, but that ended up being counterproductive due to design decision #2: loading a new Rails environment for each job.

See, Rails takes a while to start up. Not only do you have to load all the associated libraries, but each require statement needs to look through the entire, massive load path – a load path that includes the Rails app, Rubygems, the Rails source code, and all of our plugins. Doing this over and over, multiple times a minute, burns a lot of CPU and takes a lot of time. In some cases, the Rails load time is 99% of the entire background job’s lifetime. Spawning a whole bunch of Bjs on a single machine meant we effectively DoS’d the poor CPU.

I started working on a solution, but it was at this point we realized we were doing something wrong. These are not flaws in Bj, they are design decisions – these two ideas make Bj a pleasure to work with and perfect for simple sites. It’s working great on FamSpam. We had simply outgrown it, and hacking Bj would have been error prone and time consuming. Luckily, we had seen people praising Dj in the past and a solid recommendation from technoweenie was all we needed.

The transition took about an hour and a half – from installing the plugin to successfully running Dj on the production site, complete with local and staging trial runs (and bug fixes). Because we had changed queues so many times in the past, we were using a simple interface for submitting jobs.

RockQueue.push 'post-receive', { :user => user, :repo => repo, :before => before, :after => after, :ref => ref }, :priority => 50

RockQueue meant we didn’t have to change any application code, just infrastructure. I highly recommend an abstraction like this for vendor-specific APIs that would normally be littered all throughout your app, as changing vendors can become a major pain.
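To make the idea concrete, here is a minimal sketch of what such a facade could look like. It is not RockQueue's actual code: the `JobQueue` module, the `MemoryAdapter` class, and the adapter interface are all names made up for illustration. Only the call shape mirrors the `push` call shown above.

```ruby
# Hypothetical sketch of a thin queue facade; not GitHub's actual RockQueue.
module JobQueue
  class << self
    # The adapter is any object that responds to push(name, args, options).
    attr_accessor :adapter

    # Application code only ever calls this method.
    def push(name, args = {}, options = {})
      adapter.push(name, args, options)
    end
  end

  # In-memory adapter standing in for a real backend such as Bj or Dj.
  class MemoryAdapter
    attr_reader :jobs

    def initialize
      @jobs = []
    end

    def push(name, args, options)
      @jobs << { :name => name, :args => args, :priority => options.fetch(:priority, 0) }
    end
  end
end

JobQueue.adapter = JobQueue::MemoryAdapter.new
JobQueue.push 'post-receive', { :user => 'pjhyett', :repo => 'github-services' }, :priority => 50
```

Swapping queues then means writing one new adapter and changing the `JobQueue.adapter =` line, while every call site stays the same.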

Anyway, Dj lets us spawn as many workers on a machine as we want. They’re just rake tasks running a loop, after all. It deals with locking and retries in a simple way, and works much like Bj. The queue is much faster now that we don’t have to pay the Rails startup tax.
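The loop-lock-retry idea can be sketched in a few lines. This is an illustration under assumptions, not Dj's implementation: Dj keeps its jobs in the database, while this sketch uses an in-memory array, and the `Job` struct, `Worker` class, and `MAX_ATTEMPTS` limit are all hypothetical. With several processes, the claim step would also need to be atomic.

```ruby
# Hypothetical in-memory stand-in for a jobs table.
Job = Struct.new(:handler, :attempts, :locked_by, :last_error)

class Worker
  MAX_ATTEMPTS = 3 # assumed retry limit for this sketch

  def initialize(name, jobs)
    @name = name
    @jobs = jobs
  end

  # Claim the first unlocked job so no other worker picks it up.
  def reserve
    job = @jobs.find { |j| j.locked_by.nil? && j.attempts < MAX_ATTEMPTS }
    job.locked_by = @name if job
    job
  end

  # Run one job; returns false when there is nothing left to claim.
  def work_one
    job = reserve
    return false unless job
    begin
      job.handler.call
      @jobs.delete(job)            # success: remove the job from the queue
    rescue StandardError => e
      job.attempts += 1            # failure: record it and release the lock
      job.last_error = e.message
      job.locked_by = nil
    end
    true
  end

  # Any expensive environment is loaded once, before this loop starts,
  # rather than once per job.
  def run
    loop { break unless work_one }
  end
end

calls = 0
flaky = lambda { calls += 1; raise 'transient failure' if calls < 2 }
jobs = [Job.new(flaky, 0, nil, nil)]
Worker.new('worker-1', jobs).run
```

The job above fails once, is released, and succeeds on the second attempt. A real worker would sleep and poll again instead of exiting when the queue is empty.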

We now have a single machine dedicated to running background tasks. We’re running 20 Dj workers on it with great success. There is no science behind this number.

Since people have already started asking “why didn’t you use queue X” or “you should use queue Y,” it seems reasonable to address that: we were very happy with Bj and wanted a similar system, albeit with a few different opinions. Dj is that system. It is simple, required no research beyond the short README, works wonderfully with Rails, is fast, is hackable, solves both the queue and the worker problems, and has no external dependencies. Also, it’s hosted on GitHub!

Dj is our 5th queue. In the past we’ve used SQS, ActiveMQ, Starling, and Bj. Dj is so far my favorite.

In a future post I’ll discuss the ways in which we use (and abuse) our queue. Count on it.

GitHub was created as a side project, but it seems to have struck a nerve and gained traction quickly. As such, a lot of the infrastructure decisions were made without anticipating this sort of growth.

One of the major pieces of the infrastructure is how we store the repositories. The way it was originally set up worked great for a while, but it wasn’t sustainable.

As an example, let’s take my github-services repository. Here’s where it was stored prior to yesterday:

/our-shared-drive/pjhyett/github-services.git

Straightforward and simple, with the added benefit of the repo being easily locatable in the file system if we needed to debug an issue.

That works well unless you have thousands of folders sitting in the same directory. GFS tried as best as it could, but with the amount of IO we do at GitHub writing to and reading from the file system, a change had to be made quickly.

After migrating last night, taking the same repository, this is where it’s currently stored:

/our-shared-drive/5/52/af/b5/pjhyett/github-services.git

Instead of every user sitting in one directory, we’ve sharded the repositories based on an MD5 of the username. It’s a large change to be sure, but after some number crunching, our very own Tom Preston-Werner told me that everyone on the planet could sign up twice and we still wouldn’t have to change the way we shard our repositories.

Another interesting point worth mentioning is that the first directory, ‘5’, was set up specifically so we could add multiple GFS mounts (we currently use just one) using a simple numbering system to help scale the data when we start bumping up against that wall again.
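For the curious, a path like the one above could be computed roughly as follows. This is a sketch under assumptions: the post only says the shards come from an MD5 of the username, so taking the first three hex pairs of the digest is a guess, and `sharded_repo_path` is a hypothetical helper. The output is not claimed to reproduce the exact `52/af/b5` shown earlier.

```ruby
require 'digest/md5'

# Assumed layout: <root>/<mount>/<aa>/<bb>/<cc>/<user>/<repo>.git, where
# aa, bb and cc are the first three hex pairs of MD5(username). Which
# digits are really used is not stated, so this choice is illustrative.
def sharded_repo_path(root, mount, user, repo)
  hex = Digest::MD5.hexdigest(user)
  shards = [hex[0, 2], hex[2, 2], hex[4, 2]]
  File.join(root, mount.to_s, *shards, user, "#{repo}.git")
end

path = sharded_repo_path('/our-shared-drive', 5, 'pjhyett', 'github-services')

# Three levels of two hex digits each give 256 * 256 * 256 leaf directories per mount.
buckets = 256**3
```

Under that assumption each mount has 16,777,216 leaf directories to spread users across, and the leading mount number multiplies that again.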

Now, the question you may all be asking is why we didn’t do this from the beginning. The simple answer is it would have taken more time and prevented us from launching when we did. We could have spent a couple of extra weeks in the beginning figuring out and preventing bottlenecks, but the site may not have taken off and then we would have built a scalable site that three people use.

Truth be told, it’s a great problem to have, and the site is humming along smoothly now. Now we can get back to doing fun things like building new features for you guys and gals. Keep an eye out for the big one we’re launching next week!

Over the past several weeks I’ve been working on a secret Erlang project that will allow us to grow GitHub in new and novel ways. The project is called egitd and is a replacement for the stock git-daemon that ships with git. If you’re not familiar, git-daemon is what has, until today, served all anonymous requests for git repositories. Any time you used a command like git clone git://github.com/user/repo.git you were being served those files by git-daemon.

The reason we need egitd is for flexibility and power. We need the flexibility to map the repo name that you specify on the command line to any disk location we choose. git-daemon is very strict about the file system mappings that you are allowed to do. We also need it so that we can distinguish and log first-time clones of repos. Keep an eye out (down the road) for statistics that show you exactly how many times your public repo has been cloned!
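egitd itself is written in Erlang, but the mapping idea is easy to sketch in Ruby. A git:// client opens the conversation with a single pkt-line: four hex digits of length (counting the prefix itself), then the command, the requested path, and a NUL-separated host parameter. The sketch below parses that request and maps the path to a disk location. `disk_location` and its flat `root/user/repo` layout are hypothetical, and a real server would have to validate the path before touching the disk.

```ruby
# Frame a payload as a pkt-line: four hex digits of total length, then the payload.
def pkt_line(payload)
  format('%04x', payload.bytesize + 4) + payload
end

# Parse the first pkt-line a git:// client sends, for example
#   "git-upload-pack /user/repo.git\0host=github.com\0"
def parse_git_request(data)
  length = data[0, 4].to_i(16)
  payload = data[4, length - 4]
  command_and_path, *params = payload.split("\0")
  command, path = command_and_path.split(' ', 2)
  { :command => command, :path => path, :params => params }
end

# Hypothetical mapping from the requested name to a location of our choosing.
# No validation is done here; a real server must reject paths such as "../".
def disk_location(request_path, root)
  user, repo = request_path.sub(%r{\A/}, '').split('/', 2)
  File.join(root, user, repo)
end

raw = pkt_line("git-upload-pack /user/repo.git\0host=github.com\0")
request = parse_git_request(raw)
location = disk_location(request[:path], '/our-shared-drive')
```

Because the server owns this mapping, the path on disk no longer has to mirror the path in the URL. A first-time clone could also be logged at the point where the request is parsed.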

Another benefit of coding our own git server is enhanced error messages. I can’t even begin to tell you how many people have come to us complaining about the following error which is caused by trying to push to the public clone address:

fatal: The remote end hung up unexpectedly

With egitd we can inject reasonable error responses into the response instead of just closing the connection and leaving the user bewildered. Behold!

fatal: protocol error: expected sha/ref, got '
*********'

You can't push to git://github.com/user/repo.git
Use git@github.com:user/repo.git

*********'

Still a little crufty, but until we can get something useful into core git, it’s the best we can do and should help many people as they learn git and get over some of the confusing aspects.
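Roughly, the trick is to answer a push attempt with a pkt-line whose payload is the human-readable message. The client expects a sha/ref line at that point, so it prints the payload inside its own protocol error. Here is a hedged Ruby sketch of that idea (egitd is Erlang, and its real code is not shown in this post). `push_rejection` and `respond_to_command` are hypothetical names, and the message text is only modeled on the output above.

```ruby
# Frame a payload as a git pkt-line: four hex digits of total length, then the payload.
def pkt_line(payload)
  format('%04x', payload.bytesize + 4) + payload
end

# Hypothetical reply for a push attempt on the anonymous git:// port.
def push_rejection(user, repo)
  message = "\n*********\n\n" \
            "You can't push to git://github.com/#{user}/#{repo}.git\n" \
            "Use git@github.com:#{user}/#{repo}.git\n\n" \
            "*********"
  pkt_line(message)
end

# git-receive-pack is the command a client sends when it wants to push.
# Returning nil here means "carry on serving the request as usual".
def respond_to_command(command, user, repo)
  command == 'git-receive-pack' ? push_rejection(user, repo) : nil
end

reply = respond_to_command('git-receive-pack', 'user', 'repo')
```

The server would write `reply` to the socket and then close the connection, so the user sees the hint instead of a bare hang-up.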

One of the slowest things you can do in Ruby is shell out to the operating system. As a contrived example, let’s open an empty file 1,000 times:

>> require 'benchmark'
>> `touch foo`
>> Benchmark.measure { 1000.times { `cat foo` } }.total
=> 4.51
>> Benchmark.measure { 1000.times { File.read('foo') } }.total
=> 0.04

The difference is clear – the very act of shelling out is expensive. And while 1,000 may seem high, we have plenty of content on GitHub with 30+ shell calls per page. It starts to add up.

The Problem with Grit

Our Grit library was written as an API to the git binary using, you guessed it, shell calls. In the past few weeks, as the site became slower and less stable, we knew we had to begin rewriting parts of our infrastructure. Response times and memory usage were both spiking. We began seeing weird out of memory errors and git segfaults.

Scott Chacon had been working on a pure Ruby implementation of Git for some time, which we’d been watching with interest. Instead of shelling out and asking the git binary for information, Scott’s library understands the layout of .git directories and uses methods like File.read to procure the requested information.
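As a small taste of the approach, here is a sketch that resolves HEAD to a commit SHA with nothing but File.read. It is not code from Scott's library: `resolve_head` is a hypothetical helper, it only handles loose refs (packed refs are ignored), and it builds a throwaway .git layout with a placeholder SHA so it runs without a real repository.

```ruby
require 'tmpdir'
require 'fileutils'

# Resolve HEAD to a commit SHA by reading files under the git directory.
# Only loose refs are handled; packed refs are ignored in this sketch.
def resolve_head(git_dir)
  head = File.read(File.join(git_dir, 'HEAD')).strip
  return head unless head.start_with?('ref: ')   # a detached HEAD holds a SHA
  ref = head.sub('ref: ', '')
  File.read(File.join(git_dir, ref)).strip
end

sha = 'a' * 40 # placeholder SHA, not a real commit

resolved = Dir.mktmpdir do |dir|
  git_dir = File.join(dir, '.git')
  FileUtils.mkdir_p(File.join(git_dir, 'refs', 'heads'))
  File.write(File.join(git_dir, 'HEAD'), "ref: refs/heads/master\n")
  File.write(File.join(git_dir, 'refs', 'heads', 'master'), "#{sha}\n")
  resolve_head(git_dir)
end
```

That is two file reads and no process spawn, which is where the speedup in the benchmark above comes from.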

Over the past few weeks we’ve been working with Scott to integrate his library into GitHub while he adds features and improves performance. Last night we rolled out a near-finished version of Scott’s library.

The result? Sweet, sweet speed.

Yep, we cut our average response time in half. (Lower numbers are better.)

Open Source

Scott will soon be merging the changes he made for us into his Grit fork. As a result, expect to see other Ruby-based Git hosting sites speed up in the next few weeks as they integrate the code we wrote.

We’re interested in funding the development of other Git related open source projects. If you’re working on something awesome that will drive Git adoption, please send us an email.

Future Enhancements

We’re still working to improve our architecture. As we roll out more changes, you’ll see them here. Everyone loves scaling.

We’ll have an hour or two of downtime tonight around midnight PST while the awesome dudes at Engine Yard upgrade our disk capacity. Thanks, see you on the flip side.

Scripting Bioclipse

Compojure: Clojure Web Framework

Gist for Greasemonkey

Dirty Git State in Your Prompt

dj.god

Speedy Queries

The New Queue

Scaling Lesson #23742

Supercharged git-daemon

Supercharged Ruby-Git

The Problem with Grit

Open Source

Future Enhancements

Downtime Tonight