Zero-downtime Node.js reloads with clusters

Unlike PHP and other more traditional web applications, a Node.js server is just that: the server itself. When we write a web application in Node.js, we're creating a server application that is loaded into memory and handles requests from there. If the source files change, the running application does not, because it has already been loaded into memory. To update the application with new code, the server has to be restarted.

This becomes a little challenging when the goal is to run a server cleanly with no downtime. If we reboot the Node.js server, we kill all existing connections, suffer a period of downtime, and only then does a fresh instance start, ready to handle connections. That's not going to fly in production.

There are ways to prevent this, though. Up, for example, acts as an intermediary between your application and the request, which allows changes to be gracefully reloaded. It's nice, but a little heavier than I'd like, and doesn't play nicely with Node clusters. Besides, I thought figuring out how to handle this myself would make a good learning experience.

Fortunately, the cluster module lets us end a worker process gracefully: calling a worker's disconnect() method stops it from accepting new requests, and the process exits once its current requests are finished, emitting a disconnect event along the way. That's perfect.

Here's an example of what I cooked up. We'll start with a separate app.js file; it's important that the application code and clustering code be separate.

app.js

var http = require("http");

http.createServer(function (req, res) {
    res.writeHead(200);
    res.write("Hello, world!");
    res.end();
}).listen(8080);

This is our application. It can do whatever we want, but I'm keeping it simple for my testing purposes.

Next, we need a master process to handle the clustering and worker respawns.

runner.js

var cluster = require("cluster");

if (cluster.isMaster) {
    // this is the master control process
    console.log("Control process running: PID=" + process.pid);

    // fork as many times as we have CPUs
    var numCPUs = require("os").cpus().length;

    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    // handle unwanted worker exits
    // (a graceful disconnect exits with code 0, so it won't trigger this)
    cluster.on("exit", function (worker, code) {
        if (code !== 0) {
            console.log("Worker crashed! Spawning a replacement.");
            cluster.fork();
        }
    });

    // I'm using the SIGUSR2 signal to listen for reload requests
    // you could, instead, use file watcher logic, or anything else
    process.on("SIGUSR2", function () {
        console.log("SIGUSR2 received, reloading workers");

        // clear the master's cached copy of the app module
        // (workers are separate processes and load app.js fresh from disk
        // on every fork, so this only matters if the master required it)
        delete require.cache[require.resolve("./app")];

        // only reload one worker at a time
        // otherwise, we'll have a time when no request handlers are running
        var i = 0;
        var workers = Object.keys(cluster.workers);
        var f = function () {
            if (i === workers.length) return;

            console.log("Disconnecting " + workers[i]);

            var worker = cluster.workers[workers[i]];
            // attach the listener before initiating the disconnect
            worker.on("disconnect", function () {
                console.log("Shutdown complete");
            });
            worker.disconnect();

            var newWorker = cluster.fork();
            newWorker.on("listening", function () {
                console.log("Replacement worker online.");
                i++;
                f();
            });
        };
        f();
    });
} else {
    var app = require("./app");
}

Make sense? I'll break it down a little bit. We fork once per CPU, per the cluster example. In each non-master process (the forked children), we load our separate application. Then, if the master process receives a SIGUSR2 signal, we clear the master's cached copy of the application module (each worker is a separate process, so it loads app.js fresh from disk when forked anyway), retrieve a list of active worker ids, and disconnect them one by one. We do this to ensure that there is never a moment where we drop below our full complement of workers, or worse, reach a point where no workers are running at all. We then fork a replacement worker, wait for it to start listening, and repeat the process until all of the old workers have been told to disconnect.
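To trigger a reload from the outside, you send the master SIGUSR2 (for example, kill -USR2 with the PID that runner.js prints at startup). Here's a tiny self-contained sketch of the signal plumbing; for demonstration purposes the process signals itself, where in deployment the signal would come from your shell or a script:

```javascript
var got = false;

process.on("SIGUSR2", function () {
    got = true;
    console.log("SIGUSR2 received; a real master would reload workers here");
});

// in deployment this comes from outside: kill -USR2 <master pid>
process.kill(process.pid, "SIGUSR2");
```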

It's a little rough, but it works. To test it, I run the application server and hit it with ApacheBench, 100,000 requests at a concurrency of 1000 (ab -n 100000 -c 1000 http://localhost:8080/), to check for failed requests.

Beautiful. I can now integrate this into my post-update git hook for deployments.
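Such a hook could be as small as the following sketch. The paths, the pkill pattern, and the working-tree layout are assumptions about my setup, not something runner.js dictates:

```shell
#!/bin/sh
# .git/hooks/post-update on the deployment remote (hypothetical paths)
# check out the newly pushed code into the working tree...
GIT_WORK_TREE=/srv/myapp git checkout -f
# ...then ask the running master process to reload its workers
pkill -USR2 -f "node runner.js"
```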

Any questions, comments, ways to improve the code? Let me know. ;)