If you code for the web, you probably work with databases, which means you have to integrate them, set them up and, most importantly, understand how they work.
Let us say that at some point your project starts to draw attention, with many users connecting and consuming data at the same time. You will have to scale!
One of the secrets behind a scalable web application architecture is knowing which database to use, as well as when and how to use it. For that, you must understand it.
To acquire that kind of knowledge of the challenges and paradigms behind a project, it is interesting to look at the code and make sense of it.
Most of the time, this process of diving into the code is a demystifying adventure. The experience can give you a clear picture of its functionality.
Therefore, always remember:
I hear and I forget. I see and I remember. I do and I understand.
By creating your own, even if it is just for fun, you can understand the concepts and paradigms behind such a project.
Sound interesting? Keep scrolling!
What is a NoSQL database?
Before anything else, let us refresh our memory with some concepts.
Relational, or SQL, databases emerged in the 80s and they are still widely used.
By establishing a common query language and providing persistence, reporting, and support for transactions, relational databases grounded their success and became a reliable base for applications.
It is important to highlight that this success was closely tied to the application requirements of the time.
However, since their conception, relational databases have had several problems, and maybe the most important one is the difficulty of mapping real-world entities to a structured form.
In the 90s, object-oriented databases unsuccessfully attempted to solve this mapping problem by storing entire objects. Their failure is usually attributed to the fact that, at the time, relational databases were used as an integration interface between different applications.
If you have ever worked with multiple applications connecting to a single database, you probably know that the effort of changing the integration database is almost the same as completely rewriting the applications.
Later on, in the late 2000s, the exponential growth of the internet had a direct effect on the requirements for web applications, exposing design flaws in the architectures of the time. That drove big players, companies like Google and Amazon, to come up with their own solutions to scalability issues; Bigtable and Dynamo, for example, emerged at that time.
Those solutions had several factors in common: they were non-relational, cluster-friendly, schema-less, and most of them were open source.
The foundation of what is known today as NoSQL rests on those initiatives. The term “NoSQL” itself only appeared around 2009/2010, and it was meant to be more of a joke than anything else. The fact that it became such a buzzword for these modern kinds of databases was purely accidental, because it is supposed to mean “not only SQL”, not “no SQL”.
Unlike the usual SQL databases, which store data in a structured relational schema, NoSQL databases have no relational support for storing data whatsoever. You can emulate that behavior, but let us leave it for another time.
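To make the schema-less idea concrete, here is a minimal sketch (the collection and field names are invented for illustration): while a relational table forces every row to match a predefined schema, a document store happily accepts records with different shapes.

```python
# Toy illustration of a schema-less "collection": just a bag of documents keyed by id.
# Unlike a relational table, nothing forces every record to have the same fields.
users = {}

users["u1"] = {"name": "Alice", "email": "alice@example.com"}
users["u2"] = {"name": "Bob", "favorite_genres": ["sci-fi", "drama"]}  # different fields, still fine

# Reading a document back requires no schema knowledge, only the key.
print(users["u2"]["favorite_genres"])  # ['sci-fi', 'drama']
```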
The catch here is to understand that scalability for these modern databases relies upon distribution. It is well known that relational databases were not initially designed to be distributed: they could not handle consistency at the cluster level and were supposed to be scaled vertically only, meaning that if you wanted to handle more demand you would increase the server's processing power or add more memory.
That was not much of an option for companies like Google and Amazon, mostly because a single database server could not handle Google’s billions of requests. The solution for scaling those applications was to change the disposition of the servers: instead of increasing memory and processors on one machine, they started creating small, distributed clusters, what we know today as horizontal scaling.
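As a rough sketch of what distributing data across a cluster means (the node names and routing scheme below are invented for illustration, not how any particular database works), each key can be mapped to one of the cluster's nodes by hashing it:

```python
import hashlib

# Toy key-distribution sketch: route each key to one node in the cluster by hashing it.
nodes = ["node-a", "node-b", "node-c"]

def node_for(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

print(node_for("user:42"))    # the same key always lands on the same node
print(node_for("user:1337"))  # different keys spread across the cluster
```

Real systems use smarter schemes, such as consistent hashing, so that adding or removing a node does not reshuffle most of the keys, but the core idea of spreading the data over many small servers is the same.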
It is well known that horizontal scaling has several benefits: it makes it possible to easily increase or decrease the number of servers as demand changes, which has a huge impact on the cost of hosting such applications.
Imagine Netflix, for instance: let us say they usually have 100,000 people online simultaneously on a regular day, but when a new season is released that number can reach the millions.
Maybe they could build a huge server and keep it running all the time, but how much would it cost to maintain such an infrastructure? By scaling horizontally, they can grow the number of servers on demand to handle this temporary flood, and then shrink back to the normal baseline after the surge. They actually do this several times a day.
Since horizontal scaling relies upon several small units/servers, it is crucial to understand some concepts of distributed systems, such as the CAP theorem.
CAP
Eric Brewer, a well-known computer scientist, once stated that it is impossible for a distributed system to simultaneously provide the following three guarantees:

Consistency ― all servers see the same data at the same time
Availability ― every request receives a response, whether it succeeded or failed
Partition tolerance ― the system continues to operate despite arbitrary partitioning due to network failures
Usually, people will say that you have to choose two of these aspects. In the real world it is not a binary choice, because a database can favor consistency over availability for some operations and the other way around for others.
Even so, it is possible to group the well-known databases by their focus on two of the three guarantees, calling them CA, CP, or AP.
Note that this is not a strict rule; it is more of a way to understand each database's focus.
CA ― Consistency & Availability ― MySQL, PostgreSQL
These databases will always ensure consistency, so the data will always be reliable, and you can be sure of getting a response for every request.
However, since they do not account for partition failures, they will not perform well on clusters. It is also worth noting that they choose consistency over response time, which may be a cost you are not willing to pay.
CP ― Consistency & Partition Tolerance ― MongoDB, Redis, MemcacheDB, HBase
Such databases favor consistency and partition tolerance over availability. Since it is important to keep the data consistent across all nodes, if for some reason a node is not available, the system will not operate.
Again, response time can be directly affected by the consistency check across all the nodes.
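As a toy sketch of that CP behavior (purely illustrative; the class and method names are invented and do not reflect any real database's API), a write is only acknowledged when every replica confirms it, and it is refused outright if any replica is unreachable:

```python
class ReplicaUnavailable(Exception):
    pass

class CPStore:
    """Toy CP-style store: rejects writes unless every replica can acknowledge them."""

    def __init__(self, replicas):
        self.replicas = replicas  # each replica is a plain dict standing in for a node

    def put(self, key, value, reachable):
        # Sacrifice availability to preserve consistency: refuse the write
        # as soon as any node cannot be reached.
        if not all(reachable):
            raise ReplicaUnavailable("cannot reach every replica; write rejected")
        for replica in self.replicas:
            replica[key] = value

store = CPStore([{}, {}, {}])
store.put("user:1", "Alice", reachable=[True, True, True])   # accepted, all nodes agree
# store.put("user:2", "Bob", reachable=[True, False, True])  # would raise ReplicaUnavailable
```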
AP ― Availability & Partition Tolerance ― Cassandra, CouchDB, Voldemort, Riak
Finally, this kind of database focuses on availability: it will always be operating and responding to all requests. To afford that, it may relax consistency between the nodes; if a node is not reachable, the other nodes will skip the consistency check and perform the operation anyway.
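For contrast, here is the same toy exercise for the AP side (again with invented names, not a real API): the store writes to whichever replicas it can reach and acknowledges the request anyway, accepting that the unreachable node will be stale for a while.

```python
class APStore:
    """Toy AP-style store: always answers, even when some replicas are unreachable."""

    def __init__(self, replicas):
        self.replicas = replicas  # each replica is a plain dict standing in for a node

    def put(self, key, value, reachable):
        # Write to every node we can reach; the others fall temporarily out of date.
        written = 0
        for replica, up in zip(self.replicas, reachable):
            if up:
                replica[key] = value
                written += 1
        return written  # the request is acknowledged no matter how many nodes got it

store = APStore([{}, {}, {}])
acks = store.put("user:2", "Bob", reachable=[True, False, True])
print(acks)  # 2 -- the write went through even though one node missed it
```

Real AP databases eventually reconcile the stale replica, for example through mechanisms such as read repair or hinted handoff, which is why they are often described as eventually consistent.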