Posted by Seamus on Tuesday, December 07, 2010.

Zero-downtime deploys on the EngineYard AppCloud

One of the ways we maximize our uptime is by tag-teaming two full production clusters:

[Screenshot: EngineYard AppCloud dashboard showing two environments, red and blue]

Both “red” and “blue” can support 100% of our traffic, but only one of them is in charge of carbon.brighterplanet.com at a time. That way, we can make updates to the other one, test it at full production capacity, and “tag it in” when it’s ready (by changing DNS).

This is better for us than using staging environments because we’re not holding our breath for that “final” deploy to production. The tag-team approach lets us keep the old production environment running unchanged, ready to tag back in if the deploy process goes wrong.

It’s a strong rollback, in the sense that all the last-known-good instances are still running (at least until we’re totally comfortable with the new ones). It’s also graceful, in the sense that clients never see a maintenance page or a scheduled outage window.
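As a rough illustration of the tag-team idea (a sketch, not the tooling used for this deployment), the two clusters can be modeled as an active/standby pair where tagging in and rolling back are the same swap. The `TagTeam` class and its method names are hypothetical, and the real switch happens in DNS rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class TagTeam:
    """Tracks which of two full production clusters is serving traffic."""

    active: str = "red"    # cluster currently answering for the public hostname
    standby: str = "blue"  # cluster being updated and tested

    def tag_in(self) -> str:
        """Swap roles: the standby starts serving, the old active stays running."""
        self.active, self.standby = self.standby, self.active
        return self.active

    def roll_back(self) -> str:
        """Rolling back is just another swap, because the last-known-good
        cluster was never torn down."""
        return self.tag_in()


team = TagTeam()
team.tag_in()     # blue now serves traffic; red is kept as the fallback
team.roll_back()  # red serves traffic again
```

The point of the sketch is that a rollback needs no rebuild and no redeploy: it only changes which cluster is named as active.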

Fact: we have to store stuff offsite

If non-replicable data lived in the database master on red or blue, then we would have to export and import it every time we tagged in or out. To solve this, we make sure that all such data is stored offsite in our reference data web service, Amazon S3, hosted Mongo, etc.

Fact: we have to wait for DNS

When we switch carbon.brighterplanet.com from red to blue, we have to wait for the DNS change to propagate; until it does, some clients still reach the old environment. If we want to roll back, we might have to wait again. As long as the old production environment still works and is merely running old code, this is usually OK.
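One way to watch a DNS change take effect is to poll the hostname until it resolves to the new cluster's address. The following is a minimal sketch under a few assumptions: the function name `wait_for_dns` and its parameters are hypothetical, the tag-in is assumed to end in a known IP address, and the addresses in the usage note are documentation-only placeholders. It reports only what the local resolver sees; other clients may keep a cached answer for longer.

```python
import socket
import time


def wait_for_dns(hostname, expected_ip, resolve=socket.gethostbyname,
                 attempts=30, delay_seconds=10.0, sleep=time.sleep):
    """Poll until `hostname` resolves to `expected_ip`.

    Returns True as soon as the expected address is seen, or False if it
    never appears within `attempts` lookups.
    """
    for attempt in range(attempts):
        try:
            if resolve(hostname) == expected_ip:
                return True
        except OSError:
            # Lookup failed (socket.gaierror is an OSError); try again later.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A call might look like `wait_for_dns("carbon.brighterplanet.com", "192.0.2.20")`, where `192.0.2.20` stands in for the address of the cluster being tagged in. The `resolve` and `sleep` parameters exist so the loop can be exercised without touching the network.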

Fact: we pay for more compute hours

For a while before and after any deploy, we need both red and blue at full production capacity. That costs compute hours. We think it’s worth it to avoid a single point of failure.
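The extra cost is simple arithmetic: every instance in the second cluster is billed for each hour the two clusters overlap. The sketch below uses made-up numbers purely for illustration; the instance count, overlap window, and hourly rate are assumptions, not figures from this deployment.

```python
def overlap_cost(instances_per_cluster, overlap_hours, hourly_rate):
    """Extra spend from running a second full cluster during a deploy window."""
    extra_instance_hours = instances_per_cluster * overlap_hours
    return extra_instance_hours * hourly_rate


# Hypothetical example: 4 instances, 48 hours of overlap, 0.25 per instance-hour.
print(overlap_cost(4, 48, 0.25))  # prints 48.0
```

Shortening the overlap window is the main lever here, since the instance count is fixed by the requirement that either cluster can carry 100% of traffic.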

Fact: we rebuild from scratch more often

When we’re not preparing for a deploy, we may take red or blue down (whichever’s not “it”) to save money. When we prepare for a deploy, therefore, we need to rebuild the instances from scratch. Since we keep our build scripts up-to-date, this has not been a problem.

What blog is this?

Safety in Numbers is Brighter Planet's blog about climate science, Ruby, Rails, data, transparency, and, well, us.
