Monday, March 14, 2011

The CAP Theorem Event Horizon

The CAP Theorem has become a convenient excuse for throwing data consistency under the bus. It is automatically assumed that every distributed system falls prey to CAP and therefore must sacrifice one of the three objectives, with consistency being the consistent fall guy. This automatic assumption is simply false. I am not debating the validity of the CAP Theorem, but instead positing that the onset of CAP limitations—what I call the CAP event horizon—does not start as soon as you move to a second master database node. Certain approaches can, in fact, extend the CAP event horizon.

Physics tells us that different properties apply at different scales. For example, quantum physics displays properties that do not apply at larger scale. We see similar nuances in scaling databases. For example, if you are running a master slave database, using synchronous replication with a single slave is no problem. Add nine more slaves and it slows the system. Add another ninety slaves and you have a real problem with synchronous replication. In other words, consistency at small scale is no problem, but at large scale it becomes impossible, because your latency goes parabolic.
If you break the database into master (read/write) and slave (read-only) functions. You can operate a handful of slaves using synchronous replication without crossing the CAP event horizon, at least among the slaves. However, the master does present a SPOF (single point of failure), undermining availability.

Using a shared-nothing architecture, as soon as you introduce more than a single master, you hit the CAP event horizon. However, shared-disk / shared-cache systems like Oracle RAC and ScaleDB extend the CAP event horizon. They don’t invalidate the CAP event horizon, they merely extend it, by addressing the CAP issues while maintaining low latency.

Shared-disk databases enable multiple nodes to share the same data. Internal processes ensure that the data remains consistent across all nodes, while providing availability and partition tolerance. These processes entail a certain overhead from inter-nodal messaging. There are techniques that can be applied to dramatically reduce this inter-nodal messaging, resulting in a system that delivers the advantages of shared-disk, while delivering a performance profile rivaling shared-nothing, but that will have to wait for a later post.

While shared-nothing databases cross the CAP event horizon as soon as you add a second master, shared-disk databases extend this event horizon well into a handful of database nodes. Optimizations can further extend this eventuality to address dozens of database nodes (all masters). As you move to web scale applications, you will certainly cross the CAP event horizon, but most OLTP type applications can operate quite effectively on ten or fewer database servers, and in that case there is no need to throw consistency under the bus, solely in the name of the CAP Theorem.

More details on the CAP Theorem here, here and here.


  1. Hi Mike,

    thanks for the write-up!

    I'm a but confused though: one the one hand, you seem to define the "CAP event horizon" as "the onset of CAP limitations". I took that to mean: *all* of the CAP limitations. But then you say that shared-disk/shared cache solutions can somehow extend that "horizon" better than shared nothing systems.

    I understand that one can build a shared disk or shared cache system that, like you mention, "provides" availability and partition tolerance. But as far as I can see, the CAP theorem remains completely in effect - if a network partition occurs, availability will suffer if the system is built to maintain consistency. And if the load on the system remains constant, this will somehow translate into increasing latencies, as there is less of the cluster available to handle requests. In other words, it seems that the claim "addressing the CAP issues while maintaining low latency" is not true - at least, not when a network partition has occurred.

    (Please correct me if I'm wrong...)


  2. Roland, Latency is a matter of degree/design. For example, you might design a system that provides acceptable performance with 5 nodes, but you run with 6 nodes to have a performance buffer by design. Thus, loss of a single node will not degrade latency beyond what is acceptable. Of course, loss of (or connection to) multiple nodes will have a proportionally larger effect, but then you might want to run a 7th node to accommodate this eventuality.

    Unfortunately Japan has taught us that the worst case scenario you plan for will not be the worst actual case. Murphy's Law strikes again, sadly.

  3. I'm with Roland on this--I don't get the notion of a CAP event horizon. You aren't really escaping CAP at all. It seems you are just choosing to sacrifice tolerance of network partitions to get consistency and availability. Whether that is a good choice depends on the requirements of particular applications. For example, it's not a good choice for online testing or credit card processing. These both tend to operate over multiple sites to avoid ever being down.

  4. Robert, your assertion that we sacrifice network partition [tolerance] is incorrect. IBM IMS (their original architect was our original architect) employs a shared-disk architecture and it is still the leading DB powering banks (your credit card example). Geo-distribution does introduce additional challenges, because it increases latency in some cases beyond the tolerance level, especially when trying to maintain consistency. But let me turn your example back to you. Say you sacrifice consistency in a credit card example. Now you have hackers around the world making fraudulent purchases in unison. Many get through, because the system doesn't have a consistent view of the data to know that (a) the credit limit has been exceeded; (b) the buyer cannot be simultaneously in several places buying all at once. The CAP horizon is the point at which latency for a clustered system exceeds that which is acceptable and then you must decide what concessions you are willing to make.

  5. In general it is reasonable to try to be consistent to avoid the kind of attacks you describe. At least some of the merchant systems I know behave this way--route transactions back to a single system of record for approval to make life harder for evildoers. Others don't cross sites to avoid the problem altogether.

    Nevertheless, let's say you build a credit card processing system that depends on a global view across sites and a backhoe digs up your Internet connection by accident. There's no way to communicate between sites. At this point you either make a decision to allow transactions to proceed based on possibly inconsistent information or stop answering questions at one of the sites. There isn't really another choice. Moreover, this case is not distinguishable from the case where a connection across sites simply takes a long time to respond. So you can get it without the backhoe.

    It seems your event horizon argument is really a speed-of-light question of how far you can separate before the latency required for consistency in the absence of failures becomes annoying. This is is a useful notion. However, it's not really the same as CAP, which concerns what happens when a member fails to respond or a network breaks.

  6. My CAP Event Horizon is based on a cluster in a single data center. There are things you can do in a geo-distributed environment, but the more you have latency (speed of light issues) between nodes, the sooner you'll hit the CAP event horizon. The C in CAP is amplified as latency increases. In these situations you can have a consistent or mostly consistent fail-over (e.g. with DRBD) but having a fully consistent system, with all nodes in active (versus fail-over only) mode, that is geo-distributed remains a pipe dream.