Yesterday we moved to a new queue, Shopify’s delayed_job (or Dj).
After trying a few different solutions in the early days, we settled on Ara Howard’s Bj. It was fine for quite a while, but some of the design decisions haven’t been working out for us lately. Bj allows you to spawn exactly one worker per machine – we want a machine dedicated to workers. Bj loads a new Rails environment for every job submitted – we want to load a new Rails environment one time only. Both of these decisions carry performance implications.
If we were to run one Bj per machine, we’d only have four workers running, as GitHub consists of four ultra-beefy app slices. Unlike most contemporary web apps, the fewer slices we have the better – it means fewer machines connected to our networked file system, and fewer machines create less network chatter and lock contention. As some of the jobs take a while to run (60+ seconds), four workers is a very low number. We want something like 20, but we’d settle for as few as 8.
We did hack Bj to allow multiple instances to run on a machine, but that ended up being counterproductive due to design decision #2: loading a new Rails environment for each job.
See, Rails takes a while to start up. Not only do you have to load all the associated libraries, but each require statement needs to look through the entire, massive load path – a load path that includes the Rails app, Rubygems, the Rails source code, and all of our plugins. Doing this over and over, multiple times a minute, burns a lot of CPU and takes a lot of time. In some cases, the Rails load time is 99% of the entire background job’s lifetime. Spawning a whole bunch of Bjs on a single machine meant we effectively DoS’d the poor CPU.
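To put rough numbers on that startup tax, here is a back-of-the-envelope sketch. The figures are made up for illustration (they are not measurements from GitHub); they only mirror the case where boot time is 99% of a job's lifetime.

```ruby
# Illustrative figures only, chosen so boot is 99% of one job's lifetime.
boot_ms = 9_900   # assumed time to load the Rails environment
work_ms = 100     # assumed time spent doing the actual job
jobs    = 100     # assumed number of jobs to process

# Bj-style: every job pays the boot cost.
boot_per_job_total = jobs * (boot_ms + work_ms)

# Long-running worker: boot once, then only pay for the work.
boot_once_total = boot_ms + jobs * work_ms

puts "boot per job: #{boot_per_job_total} ms"
puts "boot once:    #{boot_once_total} ms"
```

Under these assumptions the long-running worker finishes the same batch in about a fiftieth of the CPU time.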
I started working on a solution, but it was at this point we realized we were doing something wrong. These are not flaws in Bj, they are design decisions – these two ideas make Bj a pleasure to work with and perfect for simple sites. It’s working great on FamSpam. We had simply outgrown it, and hacking Bj would have been error prone and time consuming. Luckily, we had seen people praising Dj in the past and a solid recommendation from technoweenie was all we needed.
The transition took about an hour and a half – from installing the plugin to successfully running Dj on the production site, complete with local and staging trial runs (and bug fixes). Because we had changed queues so many times in the past, we were using a simple interface for submitting jobs, RockQueue.
RockQueue meant we didn’t have to change any application code, just infrastructure. I highly recommend an abstraction like this for vendor-specific APIs that would normally be littered all throughout your app, as changing vendors can become a major pain.
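For readers who want a picture of that kind of abstraction, here is a minimal sketch. It is not RockQueue's actual API: `JobQueue`, `MemoryAdapter`, and the `push` signature are hypothetical names invented for this example.

```ruby
# Hypothetical queue facade. Names are illustrative, not RockQueue's real API.
module JobQueue
  class << self
    # The backend currently in use; swap this to change queue vendors.
    attr_accessor :adapter

    # Application code only ever calls this method.
    def push(job_name, *args)
      adapter.push(job_name, *args)
    end
  end

  # Stand-in backend that keeps jobs in memory. A real adapter would
  # translate push into whatever the underlying queue library expects.
  class MemoryAdapter
    attr_reader :jobs

    def initialize
      @jobs = []
    end

    def push(job_name, *args)
      @jobs << [job_name, args]
    end
  end
end

# Infrastructure picks the backend in one place...
JobQueue.adapter = JobQueue::MemoryAdapter.new

# ...and application code stays the same no matter which backend is set.
JobQueue.push(:send_email, 42)
```

Switching queues then means writing one new adapter class and changing the single line that assigns `JobQueue.adapter`.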
Anyway, Dj lets us spawn as many workers on a machine as we want. They’re just rake tasks running a loop, after all. It deals with locking and retries in a simple way, and works much like Bj. The queue is much faster now that we don’t have to pay the Rails startup tax.
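The "worker is just a loop" idea can be sketched in a few lines of plain Ruby. This is a toy, not Dj's implementation: the in-memory array queue, the `Job` struct, `run_worker`, and the retry limit of 3 are all assumptions made for the example, and locking between workers is left out entirely.

```ruby
# Toy job record: a name, a callable to run, and an attempt counter.
Job = Struct.new(:name, :work, :attempts)

# Pull jobs until the queue is empty, retrying failures up to max_attempts.
# (A real worker would keep polling instead of stopping when the queue drains.)
def run_worker(queue, max_attempts: 3)
  completed = []
  failed = []
  # Expensive setup, such as loading the app, would happen once here,
  # before the loop, rather than once per job.
  until queue.empty?
    job = queue.shift
    begin
      job.attempts += 1
      job.work.call
      completed << job.name
    rescue StandardError
      if job.attempts < max_attempts
        queue.push(job)      # put it back for another try
      else
        failed << job.name   # give up after too many attempts
      end
    end
  end
  [completed, failed]
end

calls = 0
flaky  = Job.new(:flaky, -> { calls += 1; raise "boom" if calls < 2 }, 0)
steady = Job.new(:steady, -> { :ok }, 0)
broken = Job.new(:broken, -> { raise "always fails" }, 0)

completed, failed = run_worker([flaky, steady, broken])
```

Running several such loops as separate processes is what makes it cheap to add workers: each process pays the setup cost once and then spends its time on jobs.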
We now have a single machine dedicated to running background tasks. We’re running 20 Dj workers on it with great success. There is no science behind this number.
Since people have already started asking “why didn’t you use queue X” or “you should use queue Y,” it seems reasonable to address that: we were very happy with Bj and wanted a similar system, albeit with a few different opinions. Dj is that system. It is simple, required no research beyond the short README, works wonderfully with Rails, is fast, is hackable, solves both the queue and the worker problems, and has no external dependencies. Also, it’s hosted on GitHub!
Dj is our 5th queue. In the past we’ve used SQS, ActiveMQ, Starling, and Bj. Dj is so far my favorite.
In a future post I’ll discuss the ways in which we use (and abuse) our queue. Count on it.