Introduction

If you write a small application and deploy it on a server, you can easily serve a couple of thousand users (depending on what it does). As your user base grows, you start to notice issues: requests failing, the server getting restarted (or going down permanently if you really dropped the ball). That's easy to solve, right? You just add more hardware.

The more hardware you add, the greater the probability of a failure happening – not because of too many requests, but because of a few boring computer science realities: unreliable networks, network topology, latency, security and bandwidth limits (read up on the fallacies of distributed computing). Not to mention the probability of at least one hardware component failing at any given time.

Now you have a million users every month (hey, great job on that idea - it's taking off), 16 servers (I was going to say 10, but I like round numbers), one master database and a couple of replicas. However, you are falling behind on shipping new features because you are too busy firefighting.

With a million users, if one server goes down and stays in traffic, 62,500 users get an annoying error response. If it doesn't stay in traffic, good job on the load balancer work; but if the error propagates, you can lose the whole cluster.

Now you have three options:

  1. Find the root cause and fix it.
  2. Keep firefighting, invest in an alerting system so you can respond faster, and diagnose and fix issues as soon as possible.
  3. Embrace the fact that your system will fail, and build in some resilience so your systems have a decent chance of recovering on their own without you having to do anything.

Now, 1 sounds like the logical thing to do. 2 sounds expensive, but still a rational way to go. 3 sounds like black magic. Given that you have limited experience with distributed systems (for the sake of the article), you may choose to go with 1. To be clear, neither 1 nor 2 is a bad option. However, a root cause for 1 may not even exist, and 2 doesn't work by itself (otherwise you'd stop reading here). 3 is the holy grail.

If you manage to do 3, there are no more sleepless nights: you have time for your family, and you can diagnose and fix issues in the morning. If you can isolate failures and keep your system responsive even when the backend infrastructure (your database and replicas) goes away, you can finally focus on delivering features and get some sleep at night.

Building resilience at Rentalcars.com

Our setup (a very simplified version of it)

Let's look at a simplified version of Rentalcars.com. In the diagram below, we have the website where users go to type things and click buttons. Let's imagine we also have these services on the backend:

  1. Search cars
  2. Translations
  3. Geo-location service
  4. Rating service
  5. Payment service

Each of these services is either:

  • a micro-service, with its own database and all those boring principles, or
  • just another monolithic service that uses shared resources, or
  • a mainstream database.

In this simplified set-up, a user hits www.rentalcars.com and the website asks the translation service for content written in the user's language. The user types some text in the search box, selects a location with the help of the geo-location service and hits search. The search cars service finds cars (with a spell called inveniet currus – yep, I'm throwing some Latin in there) and then queries the translation service to finalise the results. Now the website has a list of cars, and for each one it makes another query to the ratings service to fetch its rating.

Now the user has a screen with cars, a number that represents how much people like each car, and messages in their own language. Cool, right?

The user selects a car and finally submits payment details. Then the website uses the payment service to charge the user and confirm the booking.
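
To make that fan-out of calls concrete, here is a rough sketch of what the website's orchestration could look like. Every interface and name below is a made-up stand-in for the services in the diagram, not the real Rentalcars.com code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the backend services described above.
interface TranslationService { String translate(String key, String language); }
interface GeoLocationService { String resolve(String freeText); }
interface SearchCarsService  { List<String> findCarIds(String location); }
interface RatingService      { double ratingFor(String carId); }

public class SearchPage {

    private final TranslationService translations;
    private final GeoLocationService geo;
    private final SearchCarsService searchCars;
    private final RatingService ratings;

    public SearchPage(TranslationService translations, GeoLocationService geo,
                      SearchCarsService searchCars, RatingService ratings) {
        this.translations = translations;
        this.geo = geo;
        this.searchCars = searchCars;
        this.ratings = ratings;
    }

    // One user search fans out into: one geo-location call, one search call,
    // and then one translation call plus one ratings call per car found.
    public List<String> search(String freeText, String language) {
        String location = geo.resolve(freeText);
        List<String> rows = new ArrayList<>();
        for (String carId : searchCars.findCarIds(location)) {
            String name = translations.translate(carId + ".name", language);
            double rating = ratings.ratingFor(carId);
            rows.add(name + " (" + rating + ")");
        }
        return rows;
    }
}
```

The thing to keep in mind is the fan-out: a single search touches four different services, and the per-car calls multiply with the size of the result list.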

The interactions between services look like this:
sample_app_diagram

Now that we've drawn the lifecycle picture, let's see what can go wrong. Don't worry: we'll keep the list short for the purposes of this article, so no asteroids hitting anything or aliens abducting servers.

Things that can go wrong that you have no control over

As mentioned above, in a distributed system anything can go wrong (including the asteroid scenario). However, even if you've written perfect code, the most likely reason your systems fail is some kind of network issue on infrastructure you don't control (remember: a distributed system is one in which your computer fails because of a failure of another computer you didn't even know existed).

But enough chit chat... let's start the horror story.

A lot of users are happily using your service, booking cars, going on holiday and, in general, getting on with their lives. Then one user comes in, makes a search, and the loading logo spins for about a minute. They hit refresh. More users come along and do the same. How many users waiting for search results can your system handle? Well, fewer than you think.

Let's see what happens in more technical terms.

Users waiting for a service response:
long_queue

This is not a deadlock. This is not a bug in your code. It is happening either because your system doesn't enforce a timeout or, if it does enforce one, because your throughput is lower than the demand once everything starts timing out. This is a design flaw. What should your system do?
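
The very first step (and only a first step, for exactly the throughput reason above) is to stop callers from waiting forever. A minimal sketch using plain JDK concurrency, with a hypothetical blocking search client:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SearchClient {

    // A bounded pool: at most 20 in-flight calls to the search service.
    private final ExecutorService pool = Executors.newFixedThreadPool(20);

    public String search(String query) {
        Future<String> future = pool.submit(() -> callSearchService(query));
        try {
            // Fail fast: never let a user (and a server thread) wait for minutes.
            return future.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);           // try to free the worker thread
            return fallbackResults(query); // e.g. cached or empty results
        } catch (Exception e) {
            return fallbackResults(query);
        }
    }

    // Hypothetical blocking call to the search service.
    private String callSearchService(String query) throws Exception {
        // ... HTTP call that can hang when the service is struggling ...
        return "results for " + query;
    }

    private String fallbackResults(String query) {
        return "no results";
    }
}
```

This caps how long each user waits, but when the downstream service stays slow the bounded pool still fills up with doomed work, which is why the rest of this article is about cutting the calls off altogether.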

Let's take a step back (actually let's take a time machine and take several decades' worth of steps back).

In the early 20th century, households were rapidly being introduced to the power of electricity. However, connecting enough appliances in your (picking a decade at random) 1930s house would cause overheating and, given the right (well, wrong) conditions, you'd need a new house because that one had burst into flames. To avoid replacing houses every now and then, the initial solution was to use fuses. The idea is that the fuse burns much faster than the house, opening the circuit and thus enforcing a fail-fast(er) strategy.

There are a lot of details, but to speed things up: we now have circuit breakers in our houses, an invention patented in 1879 by Thomas Edison (welcome to the 70s - the 1870s). If you've heard about circuit breakers in software, you know where this is going (also, the title is a huge spoiler).

But I want to share another resilience story before I go into more implementation details.

In 1969, NASA landed Neil Armstrong and Buzz Aldrin on the moon (welcome to the 1960s). This would have been far from a success story if not (among other things) for the resilience logic built into the lunar module's computer. Due to an infrastructure issue (the misplacement of a switch), the Apollo 11 computer was being overloaded with tasks. Fortunately, the software, apart from detecting errors, was also able to discard low-priority tasks. It was therefore able to carry out a very important task called "landing on the freakin' moon". Job well done by Margaret H. Hamilton and team.

Now, these two stories might seem a bit disconnected at first, but they share the same goal: to interrupt functionality and avoid further damage or total disaster. They also take the fail-fast approach instead of waiting for someone to diagnose and fix the issue (can you really diagnose it before Apollo crashes or before the house catches fire?).

Let's go back to our system where we left it. If we detect a persistent issue when querying the search service and stop more requests from hitting it, we achieve the following:

  1. Users no longer wait for the timeout to get feedback about what's happening.
  2. The system gets resources back faster (no more long-running network connections).
  3. We avoid burning the house down (or crashing the lunar module), i.e. losing both the website server(s) and the search service server(s).
  4. We isolate errors individually, which (also) makes problems easier to diagnose. If your house is on fire, you have a huge problem called – sit tight – "my house is on fire", and you don't know what the initial issue was. If you have an open circuit breaker instead, guess what? You know one of your appliances overloaded the system and, depending on how smart your circuit breaker set-up is, you may also know which appliance.

Do 2 and 3 sound familiar now? How do we stop the requests? We use circuit breakers – the more flexible, software kind.

Circuit breakers with Hystrix

Hystrix is open-source software developed by Netflix that offers circuit breakers, thread pool management, caching, request collapsing and hotspot detection capabilities.
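
As a taste of the API, a command is just a class: the remote call goes in run() and the degraded behaviour in getFallback(). The ratings example below is illustrative (the class, group name and numbers are made up), not our production code.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// A minimal Hystrix command wrapping a single (hypothetical) ratings lookup.
public class GetRatingCommand extends HystrixCommand<Double> {

    private final String carId;

    public GetRatingCommand(String carId) {
        // The group key determines, by default, which thread pool the command uses.
        super(HystrixCommandGroupKey.Factory.asKey("RatingsGroup"));
        this.carId = carId;
    }

    @Override
    protected Double run() throws Exception {
        // The actual remote call: runs on a Hystrix thread, subject to the
        // configured timeout, and counted in the circuit breaker statistics.
        return callRatingsService(carId);
    }

    @Override
    protected Double getFallback() {
        // Used on timeouts, errors, thread pool rejections or an open circuit.
        return 0.0; // e.g. "no rating available"
    }

    // Hypothetical client call to the ratings service.
    private Double callRatingsService(String carId) {
        return 4.2;
    }
}
```

Calling new GetRatingCommand("some-car-id").execute() runs the lookup with the timeout, concurrency limit and circuit breaker behaviour attached to its group.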

To stay closer to the actual work done at Rentalcars.com, let's first look at the design we had before introducing Hystrix.

sample_app_diagram_with_cache

As shown, we already have a caching layer that offers an out-of-the-box way of partitioning the service calls. The caching layer consists of several cache buckets. For the payments service, the cache bucket just delegates all requests to the service without caching anything; the reasons for having a NoOp cache are outside the scope of this article, but having a NoOp anything is important.
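
For illustration only (the real caching layer is more involved than this), a cache bucket and its NoOp variant could look something like the following; the interface and class names are invented.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical shape of a cache bucket: given a key and a loader that knows
// how to call the backing service, return a value.
interface CacheBucket<K, V> {
    V get(K key, Function<K, V> loader);
}

// A normal bucket: serve from memory, call the service only on a miss.
class InMemoryCacheBucket<K, V> implements CacheBucket<K, V> {
    private final ConcurrentHashMap<K, V> store = new ConcurrentHashMap<>();

    @Override
    public V get(K key, Function<K, V> loader) {
        return store.computeIfAbsent(key, loader);
    }
}

// The NoOp bucket (used for payments): same interface, zero caching,
// every request is delegated straight to the service.
class NoOpCacheBucket<K, V> implements CacheBucket<K, V> {
    @Override
    public V get(K key, Function<K, V> loader) {
        return loader.apply(key);
    }
}
```

The point of the NoOp bucket is uniformity: payments goes through the same interface as everything else, which is what lets us hang circuit breaker behaviour off the caching layer in one place.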

Hystrix, through its API, provides concepts called the Hystrix command and the Hystrix command group; the latter groups multiple commands together.

We can then derive either of these relationships:

  • Each cached item, through a cache key, is one Hystrix command with its own command key. The whole cache bucket maps to a Hystrix command group. Or
  • Each cache bucket is one Hystrix command, and multiple cache buckets belong to a command group.

There are advantages and disadvantages to either choice. The first gives the highest granularity you can get in terms of configuring thread pools: you allow the server to do at most N concurrent requests (which implies N concurrent cache misses). In other words, you partition the resources by service. However, it doesn't help the circuit breaker detect problems and decide whether or not to trip, and it will be really complicated to configure different timeouts for each key. You are also in danger of overflowing memory with Hystrix command keys, not to mention that your monitoring screen will look a mess.

The second option is far better. No matter what the cache key is, you have one command key per bucket. You still create one Hystrix command per cache miss, but they all share the same command key if they share the same cache. You can also configure multiple groups for different request behaviours (e.g. if a subset of your services is slow, you can configure a group with a larger thread pool to run those requests, while the faster services get a small thread pool). What counts as big and small is up to you; some work is needed to gather data, create traffic scenarios and test different thread pool configurations, but it is the most reliable way to go.
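
A sketch of option 2, with made-up key names: each cache miss becomes a command whose command key is the bucket name, while the thread pool key is shared by buckets with a similar latency profile.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixThreadPoolKey;

// One command per cache miss; one command key per cache bucket;
// one thread pool per group of similarly behaving buckets.
public class CacheMissCommand extends HystrixCommand<String> {

    private final String cacheKey;

    public CacheMissCommand(String bucketName, String threadPoolName, String cacheKey) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("CacheLayer"))
                .andCommandKey(HystrixCommandKey.Factory.asKey(bucketName))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey(threadPoolName)));
        this.cacheKey = cacheKey;
    }

    @Override
    protected String run() throws Exception {
        return loadFromService(cacheKey);
    }

    @Override
    protected String getFallback() {
        return null; // let the caller decide what an empty result means
    }

    // Hypothetical loader that calls the backing service for the missed key.
    private String loadFromService(String cacheKey) {
        return "value for " + cacheKey;
    }
}
```

A translations miss might then run as new CacheMissCommand("translations", "fast-pool", "home.title").execute(), while a slow geo-location bucket would pass a different thread pool name.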

Picking option 2, we now have four additional configuration settings for our caching system:

  1. whether the circuit breaker functionality is enabled
  2. the Hystrix group the cache belongs to
  3. the timeout
  4. the circuit breaker sensitivity (failure thresholds)

If you recall the Apollo 11 use case, you can get some inspiration for configuring the above depending on how critical each service is to your system. For instance, you may want to set the timeout of the payment service queries generously – for example, at the 99.5th percentile of response times with some additional margin. You can also set its circuit breaker failure threshold higher than, say, that of the ratings service. And you may want to map critical and non-critical queries to different thread pools (bulkheading).
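
As an illustration of that per-criticality tuning (assuming a reasonably recent Hystrix 1.x, where the timeout property is called executionTimeoutInMilliseconds; older versions name it differently, and every number below is invented):

```java
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class CommandSettings {

    // Payments: critical, so a generous timeout (somewhere around the 99.5th
    // percentile plus margin) and a less trigger-happy circuit breaker.
    static HystrixCommandProperties.Setter paymentProperties() {
        return HystrixCommandProperties.Setter()
                .withCircuitBreakerEnabled(true)
                .withExecutionTimeoutInMilliseconds(3000)
                .withCircuitBreakerErrorThresholdPercentage(75)
                .withCircuitBreakerRequestVolumeThreshold(20);
    }

    // Ratings: nice to have, so time out quickly and trip the breaker sooner.
    static HystrixCommandProperties.Setter ratingsProperties() {
        return HystrixCommandProperties.Setter()
                .withCircuitBreakerEnabled(true)
                .withExecutionTimeoutInMilliseconds(500)
                .withCircuitBreakerErrorThresholdPercentage(40)
                .withCircuitBreakerRequestVolumeThreshold(20);
    }

    // Bulkheading: separate, differently sized thread pools for the two kinds of queries.
    static HystrixThreadPoolProperties.Setter criticalPool() {
        return HystrixThreadPoolProperties.Setter().withCoreSize(30);
    }

    static HystrixThreadPoolProperties.Setter nonCriticalPool() {
        return HystrixThreadPoolProperties.Setter().withCoreSize(10);
    }
}
```

These property Setters are attached to a command through Setter.andCommandPropertiesDefaults(...) and Setter.andThreadPoolPropertiesDefaults(...), so the tuning lives in one place per bucket.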

Caches with circuit breakers and thread pool grouping
sample_app_diagram_with_cache_and_threadpools

The darker the thread pools, the bigger they are.

Note: The size relationships are wild guesses of what I have in mind for this imaginary system... You need to investigate the proper sizes and timeouts for your own system.

Adding some monitoring gives us a screen like this:
hystrix_monitoring

A hint about monitoring: we configured our own Hystrix monitoring, using Consul for service discovery to auto-configure Netflix Turbine, which aggregates all the data.
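
The Consul and Turbine wiring is specific to our infrastructure, but the piece every service instance needs is the Hystrix metrics stream that Turbine (or the Hystrix dashboard) scrapes. A minimal sketch for a Servlet 3.0 container, using the servlet shipped in the hystrix-metrics-event-stream module (the /hystrix.stream path is just the usual convention):

```java
import com.netflix.hystrix.contrib.metrics.eventstream.HystrixMetricsStreamServlet;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

// Registers the Hystrix metrics stream so an aggregator can scrape this instance.
@WebListener
public class HystrixStreamRegistration implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent event) {
        event.getServletContext()
             .addServlet("HystrixMetricsStreamServlet", HystrixMetricsStreamServlet.class)
             .addMapping("/hystrix.stream");
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
        // nothing to clean up
    }
}
```

With each instance exposing its stream, the aggregator only needs a list of instances, which is where service discovery comes in.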

What have we achieved

  1. First, we now have circuit breaker support: if a service starts timing out or returning errors, we cut the wire for some time and try again later. Failing fast also lets us clean up connections and free resources faster.
  2. We have a cap on concurrent requests going out to services. This protects both ends.
  3. We enforce a timeout per service.
  4. We have fine-grained configuration depending on how critical the queries are to the booking process.
  5. Adding monitoring to the equation will immediately point us to the failing service when there is a problem. Without all this work, if you are a bit late detecting a problem, all services will report as failing (the house is on fire).
  6. By coupling the circuit breakers with the caching layer, we easily get partial functionality even with circuit breakers tripped (i.e. we can still serve cached data).

Results

The following graphs demonstrate the importance of circuit breakers – and they are real scenarios from live traffic here at Rentalcars.com.

Page load times during a live issue without circuit breakers
capacity_test_no_cb

Putting circuit breakers in the equation
capacity_test_with_cb

Blue: rejections due to a tripped circuit breaker; green: rejections due to timeouts.

Page load times during a live issue with circuit breakers
circuit_breakers

Summary

In summary, this is about embracing the fact that you cannot afford to diagnose and fix a live issue in real time. You want your systems to be resilient enough to isolate failures, avoiding a domino effect across your whole infrastructure.

You also need to make sure you have reporting and alerting in place because, empirically, the circuit breakers will start reporting issues before those issues escalate and affect other systems.

Some additional work is required on the load balancer side, so that services are taken out of traffic and/or restarted automatically. But that will be an article for another time.

Final note: the older your tech stack, the more difficult it will be to apply these concepts. The work behind this article was undertaken on a codebase that is over 10 years old. Newer frameworks and setups should allow you to apply caching, circuit breakers, load balancing and other goodies with just some configuration changes (example: https://spring.io/guides/gs/circuit-breaker/).

Further reading