In the past, once we wanted to store more data or increase our processing power, the common option was to scale vertically (get more powerful machines) or further optimize the prevailing code base. However, with the advances in multiprocessing and distributed systems, it’s more common to expand horizontally, or have more machines to try to an equivalent task in parallel. We will already see a bunch of knowledge manipulation tools within the Apache project like Spark, Hadoop, Kafka, Zookeeper and Storm. However, so as to effectively pick the tool of choice, a basic idea of CAP Theorem is important. CAP Theorem may be a concept that a distributed database system can only have 2 of the 3: Consistency, Availability and Partition Tolerance.*WPnv_6sG9k4oG3S1A09MDA.jpeg

CAP Theorem is extremely important within the Big Data world, especially once we got to make trade off’s between the three, supported our unique use case. On this blog, I will be able to attempt to explain each of those concepts and therefore the reasons for the tradeoff. I will be able to avoid using specific examples as DBMS are rapidly evolving.

Partition Tolerance*qVoJNWH1osbrnOizRivF1A.png

This condition states that the system continues to run, despite the amount of messages being delayed by the network between nodes. A system that’s partition-tolerant can sustain any amount of network failure that doesn’t end in a failure of the whole network. Data records are sufficiently replicated across combinations of nodes and networks to stay the system up through intermittent outages. When handling modern distributed systems, Partition Tolerance isn’t an option. It’s a necessity. Hence, we’ve to trade between Consistency and Availability.

High Consistency*UnG2G7_h0kqI9IHtnUk3qg.png

This condition states that each one nodes see an equivalent data at an equivalent time. Simply put, performing a read operation will return the worth of the foremost recent write operation causing all nodes to return an equivalent data. A system has consistency if a transaction starts with the system during a consistent state, and ends with the system during a consistent state. During this model, a system can (and does) shift into an inconsistent state during a transaction, but the whole transaction gets rolled back if there’s a mistake during any stage within the process. Within the image, we’ve 2 different records (“Bulbasaur” and “Pikachu”) at different timestamps. The output on the third partition is “Pikachu”, the newest input. However, the nodes will need time to update and cannot be Available on the network as often.

High Availability*ABrjUrZAY6V1hEkFPYvC7A.png

This condition states that each request gets a response on success/failure. Achieving availability during a distributed system requires that the system remains operational 100% of the time. Every client gets a response, no matter the state of a person node within the system. This metric is trivial to measure: either you’ll submit read/write commands, otherwise you cannot. Hence, the databases are time independent because the nodes got to be available online in the least times. This suggests that, unlike the previous example, we don’t know if “Pikachu” or “Bulbasaur” was added first. The output might be either one. Hence why, high availability isn’t feasible when analyzing streaming data at high frequency.


Distributed systems allow us to realize a level of computing power and availability that were simply not available within the past. Our systems have higher performance, lower latency, and near 100% up-time in data centers that span the whole globe. Better of all, the systems of today are run on commodity hardware that’s easily obtainable and configurable at affordable costs. However, there’s a price. Distributed systems are more complex than their single-network counterparts. Understanding the complexity incurred in distributed systems, making the acceptable trade-offs for the task at hand (CAP), and selecting the proper tool for the work is important with horizontal scaling.