Module 4: ACID and Database Scalability Trade-offs

Module Overview

SQL or NoSQL? Why not both! Picking the right database for a situation can be a tricky problem, with many trade-offs. SQL gives us ACID (Atomicity, Consistency, Isolation, Durability) guarantees and powerful analytical queries, and NoSQL gives us huge scale and usually "good enough" guarantees. But innovation is happening all the time, with NewSQL proposing to give us the best of both worlds.

Learning Objectives

Objective 01 - Understand and explain the advantages and disadvantages of traditional SQL databases

Overview

When you only have a hammer, everything looks like a nail. When you only have SQL, must everything be relational?

For traditional SQL, you do need to fit your data into the relational paradigm. This is actually not a bad thing - most of the time, this structure is welcome. But it is important to understand the strengths and weaknesses of traditional databases in order to make informed decisions between alternatives.

We've been using SQL the whole week - what are the tradeoffs of it as a tool?

Follow Along

The immediate and most obvious cost of a relational database is the necessity of robustly specifying a schema, including the relations between relevant tables. Another cost that quickly becomes evident with a real-world application is that things change! And while you can use ALTER TABLE to update your schema, it can be pretty tricky to do so without losing existing data.
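For example, here is a minimal sketch of a schema change using Python's built-in sqlite3 module (the table and column names are hypothetical; the same idea applies to any SQL database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical example database

# Original schema, defined before requirements changed
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada'), ('Grace')")

# Requirements changed: add a column without losing existing rows.
# Existing rows pick up the default value for the new column.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT DEFAULT ''")
conn.commit()

print(conn.execute("SELECT * FROM users").fetchall())
# [(1, 'Ada', ''), (2, 'Grace', '')]
conn.close()
```

Adding a column is the easy case; renaming or dropping columns, or changing types, often means creating a new table and copying the data over, which is exactly where migrations get tricky.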

Another prominent cost, and the main motivation of “big data” proponents of NoSQL, is scalability. Traditional relational databases generally need to fit an index of all keys (and sometimes other attributes) in memory to perform well - this limits the ability to grow the data beyond a certain size (namely, the size where all of your keys fit in however much RAM you can afford).

This was highly important in the history of Google and several other major Internet companies, as they needed to figure out how to scale while using large quantities of relatively cheap commodity hardware. The solution is to distribute work across computers - but that requires structuring the work (and the end result) in a fairly clever way. More on this in the next sprint with Apache Spark!

But it's also important to note that, in the present day, RAM is actually pretty cheap, and your data is probably not as big as you think it is. Unless you literally have hundreds of millions of users (and maybe even then), with proper structure you can fit things in main memory and use a relational approach.

So, what are the advantages of a relational approach? ACID - Atomicity, Consistency, Isolation, Durability - a set of guarantees, provided by the use of transactions, that data is always in a valid state, even if, say, a query inserting new data is interrupted by a failure of any sort (including failures external to the system, such as a power outage).
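Here is a minimal sketch of atomicity in action, again using Python's sqlite3 (the accounts and amounts are made up). The connection's context manager wraps the statements in a transaction, so a failure partway through leaves no half-finished transfer behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical example database
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash before the matching credit runs")
        # (the UPDATE crediting bob would have gone here)
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was rolled back, not persisted.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 50)]
```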

That's a pretty nice set of guarantees! It should be clear why, for instance, financial data is often suited for the ACID paradigm. But in general, ACID just means “things work reliably as you'd expect”, and it's nice to have for any application.

If relational databases have ACID guarantees, what do non-relational approaches have? They usually give up or weaken one of the four qualities - for instance, many big data systems are “eventually consistent”. This means that there can be intermediate states where the data is not consistent, but over time it converges to consistency. The typical acronym for this weaker set of guarantees is BASE - Basically Available, Soft state, Eventually consistent.
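To build intuition, here is a toy model of eventual consistency - two replicas that accept writes independently and periodically exchange state (a simplified "anti-entropy" sync with last-write-wins conflict resolution; real distributed stores are far more sophisticated):

```python
import time

replica_a, replica_b = {}, {}  # two toy replicas of a key-value store

def write(replica, key, value):
    # Each value carries a timestamp so replicas can pick the newest write.
    replica[key] = (time.time(), value)

def sync(r1, r2):
    # Anti-entropy pass: for every key, both replicas keep the newest value
    # ("last write wins" - one simple conflict-resolution policy).
    for key in set(r1) | set(r2):
        newest = max(r1.get(key, (0, None)), r2.get(key, (0, None)))
        r1[key] = r2[key] = newest

write(replica_a, "cart", ["book"])         # one client writes to replica A
write(replica_b, "cart", ["book", "pen"])  # a later write lands on replica B

# Before syncing, the replicas disagree - an inconsistent intermediate state.
print(replica_a["cart"][1] == replica_b["cart"][1])  # False

sync(replica_a, replica_b)
print(replica_a["cart"][1] == replica_b["cart"][1])  # True: converged
```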

Challenge

Think of a situation (besides finance) where ACID guarantees are mission-critical, and another where it may be okay to weaken them. Describe both to a fellow student, and explain your reasoning.

Additional Resources

Objective 02 - Make informed decisions about alternative databases

Overview

In tech, we have lots of tools - this leads to the problem of choosing which tools to use when. Relational and non-relational - when should we use one over another, and can we get “the best of both worlds?”

The first generation of “big data” solutions, as exemplified by Google's MapReduce paradigm, made significant trade-offs in the name of scalability. Most notably, such systems weakened the ACID guarantees and provided only a subset of SQL's full querying functionality (operations such as global sorts, which must compare values across the entire dataset, are awkward to distribute). They not only allowed companies to scale with available and affordable technology, but also led many companies to develop solutions that were “bigger” and more complicated than they needed to be.

SQL is really just a standard for a querying language, albeit one tightly associated with the relational approach. But since NoSQL became the banner of the initial non-relational approaches, NewSQL has been coined as a term to characterize attempts at extreme scalability that still provide ACID guarantees.

It's also good to be aware of security issues - SQL injection is a common family of attacks where user-provided data causes undesired changes in the database. Despite the name, any database system that builds queries from user input is potentially vulnerable, and the defense is to never trust that input: use parameterized queries, where the driver binds user values as data, rather than concatenating them into query strings.
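Here is a small demonstration with Python's sqlite3 of both the vulnerable pattern and the parameterized fix (the table and payload are contrived for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical example database
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", 0), ("admin", 1)])
conn.commit()

user_input = "alice' OR '1'='1"  # a classic injection payload

# VULNERABLE: concatenating input lets the payload rewrite the WHERE clause.
query = "SELECT * FROM users WHERE name = '" + user_input + "'"
print(conn.execute(query).fetchall())
# [('alice', 0), ('admin', 1)] - the payload matched every row

# SAFE: a parameterized query binds the input as a single literal value.
print(conn.execute("SELECT * FROM users WHERE name = ?",
                   (user_input,)).fetchall())
# [] - no user is literally named "alice' OR '1'='1"
```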

Follow Along

A good way to understand the challenges of NoSQL is to envision the MapReduce paradigm. MapReduce provides a framework for a programmer to specify a job that is then distributed and performed by a potentially large number of worker nodes (horizontal scalability). At a high level, there are three steps to computation in MapReduce:

1. Map - each worker applies a function to its local chunk of the input data, emitting intermediate key-value pairs.
2. Shuffle - the framework groups the intermediate pairs by key, redistributing them so all values for a given key arrive at the same worker.
3. Reduce - each worker combines the values for its assigned keys into the final output.
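To make those steps concrete, here is a single-process sketch of the classic word-count example. A real framework runs the map and reduce calls on many machines; this toy version keeps the same data flow:

```python
from collections import defaultdict
from functools import reduce

documents = ["the cat sat", "the cat ran", "a dog ran"]

# Map: emit an intermediate (key, value) pair for each word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate values by key (the framework does this
# across machines; here a dictionary stands in for the network shuffle).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine each key's values into a final result.
counts = {word: reduce(lambda a, b: a + b, values)
          for word, values in grouped.items()}

print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 2, 'a': 1, 'dog': 1}
```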

MapReduce is no longer the most current approach to these problems, and for our purposes it is OK to understand it at a relatively high level. It was developed using principles from functional programming; start by reading about those general principles if you want to dig deeper into the topic.

Functional Programming

Functional programming treats functions as first-class citizens - objects in their own right that can be created and manipulated. This means you can also write higher-order functions (functions that take or return other functions), such as map (apply a function to all items in a collection, returning an equal-sized collection) and reduce (apply a function to combine items from a collection, returning a single item).
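In Python, these look like the built-in map and functools.reduce (a sketch of the general idea, not any particular framework's API):

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply a function to every item, returning an equal-sized collection
squares = list(map(lambda x: x * x, numbers))    # [1, 4, 9, 16]

# reduce: combine items pairwise into a single result
total = reduce(lambda acc, x: acc + x, squares)  # 30

# Functions are first-class: they can be passed around like any other value
print(squares, total)
```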

It is also more purely mathematical, reducing state (encouraging immutability) and making it easier to reason about work being split up and combined. This is a deep topic, but the important point for us as data scientists is that MapReduce gave us horizontal scalability (tackle bigger problems with more computers) as opposed to vertical scalability (tackle bigger problems with a single bigger computer), and economically, horizontal scaling is usually the superior approach. Horizontal scalability also has essentially no ceiling, whereas vertical scalability has a hard limit (though these days that limit is quite high - you can get a single server with a terabyte of RAM).

Because MapReduce splits data essentially arbitrarily (by output key) between workers, it is well suited to problems where computation order doesn't matter. For example, the mean can be calculated by summing values in any order and then dividing by the number of values. MapReduce simply has each worker sum its own distinct subset of the data (map), then combines those partial sums into the final total (reduce). In mathematical terms, addition is commutative and associative - order and grouping don't matter - so things work out.
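Here is that mean computation as a single-process sketch, with three lists standing in for three workers:

```python
from functools import reduce

values = list(range(1, 101))  # pretend this is too big for one machine

# Split the data arbitrarily across three "workers" - order doesn't matter.
chunks = [values[0::3], values[1::3], values[2::3]]

# Map: each worker computes a partial (sum, count) over its own chunk.
partials = [(sum(chunk), len(chunk)) for chunk in chunks]

# Reduce: combine the partials - because addition is commutative and
# associative, the result is the same no matter how the work was split.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)

print(total / count)  # 50.5, identical to sum(values) / len(values)
```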

How do we do better? It's an open problem! But there are several NewSQL approaches, including shared-nothing architecture (nodes are fully independent) as well as simply using heavily optimized SQL datastores (usually organizing data by column). An important caveat to all of the above - it's great to keep up on the latest and greatest, but many companies are still maintaining legacy codebases and existing solutions.

So it is important and valuable to understand all approaches, and be able to work with any of them to create effective solutions.

Challenge

Think about how you'd apply MapReduce to calculate gradient descent (the algorithm underpinning many statistical models). Try to sketch it at a conceptual level or in pseudocode, and then search for existing implementations and resources. Spend at least 20 minutes thinking about it before you look things up!

Additional Resources

Guided Project

In this guided project, we'll explore ACID properties and database scalability trade-offs, comparing SQL and NoSQL approaches. Open guided-project.md in the GitHub repository below to follow along with the guided project.

Module Assignment

For this assignment, you'll analyze ACID properties and database scalability trade-offs, comparing different database systems and their use cases.

Solution Video

Additional Resources

Database Theory