Early in my career as a Data Engineer, I spent the majority of my time in ETL hell. If you aren’t familiar with ETL, it stands for Extract, Transform, and Load. Basically, it’s the process Data Engineers use to give structure to unstructured data. In an IoT workload, imagine pulling data from a sensor that reports time, temperature, location, wind, etc. The data from our sensor would resemble a semi-structured log file, but to put this data into a traditional SQL database, some ETL would have to happen. That type of workflow is now old school: with NoSQL databases we can pare down the structure we impose on semi-structured and unstructured data. In this blog post, let’s go over the 4 types of NoSQL databases every Data Engineer should understand.
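To make the ETL step concrete, here is a minimal sketch in Python. The log line format, the field names, and the `readings` table are all assumptions made up for illustration; a real sensor feed will look different.

```python
import sqlite3

# Hypothetical semi-structured sensor log line (the format is an assumption).
RAW_LINE = "time=2019-01-01T12:00:00Z temp=21.5 location=Austin wind=12.0"


def extract(line):
    """Extract: split the raw line into key=value string pairs."""
    return dict(pair.split("=", 1) for pair in line.split())


def transform(record):
    """Transform: coerce strings into the types the table expects."""
    return (
        record["time"],
        float(record["temp"]),
        record["location"],
        float(record["wind"]),
    )


def load(conn, row):
    """Load: insert the typed row into a fixed-schema SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(time TEXT, temp REAL, location TEXT, wind REAL)"
    )
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, ?)", row)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_LINE)))
rows = conn.execute("SELECT * FROM readings").fetchall()
print(rows)
```

Every new sensor field means touching both `transform` and the table definition, which is the rigidity NoSQL databases try to relax.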
What Are NoSQL Databases?
NoSQL is commonly expanded as “not only SQL,” “non-SQL,” or “non-relational.” These databases are built for extremely high throughput, using key-value pairs instead of the dependencies between tables that relational databases rely on. With loose dependencies and quick indexes, NoSQL databases are well suited to Streaming Analytics and IoT applications because data can be stored and referenced quickly.
|Relational (SQL) databases|NoSQL databases|
|Relational dependencies|Loose dependencies|
|Updates to tables are time consuming|Updates to tables happen on demand|
|Performance depends on queries & indexes|Performance depends on hardware & network|
|Rigid scaling|Elastic scaling|
Types of NoSQL Databases
Not all NoSQL databases are the same! Let’s walk through the 4 types of NoSQL databases and their use cases.
The first type of NoSQL database is the columnar database, which is optimized for reading and writing columns of data as opposed to rows of data. Column-oriented storage for database tables can help drive down the input/output requirements for a database, because a query only has to read the columns it needs. One main feature of columnar databases is their ability to compress data: values in the same column tend to be similar, so they compress well and the overall storage footprint is lowered. Instead of data being written in the traditional row orientation, columnar databases use column orientation, and each column is associated with a column key. Check out this example from my HBase blog post.
See how everything is organized by columns? Sure makes for an adjustment for experienced SQL Data Engineers.
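If you want to see the idea in code, here is a small Python sketch of pivoting rows into columns and compressing one column with run-length encoding. This is a general illustration with made-up sensor data, not how HBase or any particular columnar database is implemented.

```python
# Hypothetical sensor readings in the familiar row orientation.
rows = [
    {"time": 1, "temp": 21.5, "location": "Austin"},
    {"time": 2, "temp": 21.5, "location": "Austin"},
    {"time": 3, "temp": 22.0, "location": "Austin"},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column key."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values):
    """Compress a column by collapsing runs of repeated values."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)
# Reading one column touches only that column's values, not whole rows.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
compressed_location = run_length_encode(columns["location"])
print(columns["temp"], avg_temp, compressed_location)
```

Averaging the temperature only reads the `temp` list, and the repeated `location` values collapse into a single run, which is the intuition behind both the lower I/O and the compression.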
Document databases store data as documents, but they are not meant for storing document files (I used to think this…). Data in a document database takes a semi-structured form such as JSON or XML. The schema is flexible, which gives the developer room to evolve and scale the applications the database supports. This design and the loose schema requirements allow for high throughput in document databases.
{
    "title": "example glossary",
    "GlossTerm": "Standard Generalized Markup Language",
    "Abbrev": "ISO 8879:1986",
    "para": "A meta-markup language, used to create markup languages such as DocBook.",
    "GlossSeeAlso": ["GML", "XML"]
}
Example JSON from http://json.org/example.html
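Here is a rough Python sketch of what a flexible schema means in practice. The `collection`, `insert`, and `find` names are hypothetical and only mimic the general shape of a document store; they are not the API of MongoDB, CouchDB, or any other product.

```python
import json

# A toy in-memory "collection": document id -> document (a dict).
collection = {}


def insert(doc_id, document):
    """Store a document; no schema is enforced on its fields."""
    # Round-trip through JSON to ensure the document is JSON-serializable.
    collection[doc_id] = json.loads(json.dumps(document))


def find(field, value):
    """Return ids of documents whose top-level field equals value."""
    return [
        doc_id
        for doc_id, doc in collection.items()
        if doc.get(field) == value
    ]


# Two documents with different shapes live in the same collection.
insert("sgml", {"title": "example glossary", "Abbrev": "ISO 8879:1986",
                "GlossSeeAlso": ["GML", "XML"]})
insert("sensor-1", {"title": "sensor reading", "temp": 21.5})

matches = find("title", "example glossary")
print(matches)
```

Notice that the second document adds a `temp` field without any table change; queries simply skip documents that lack the field.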
Graph databases focus on how data points relate to one another. A graph database stores directed links between data points, called edges. These edges give the data a graphical representation (hence the name). A SQL or other NoSQL database can act as the base layer for a graph database. For example, think of how an application recommends a movie to watch next. After a user rates a couple of movies, the relationships can show that if a person likes Star Wars, they might like Lost in Space as well.
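The movie recommendation idea can be sketched in a few lines of Python. The users, the extra movie, and the `add_like` and `recommend` helpers are all hypothetical; a real graph database would use its own query language for this kind of traversal.

```python
from collections import defaultdict

# Directed edges: user -> movie they rated highly (hypothetical data).
likes = defaultdict(set)


def add_like(user, movie):
    """Add a directed edge from a user node to a movie node."""
    likes[user].add(movie)


def recommend(movie):
    """Suggest movies liked by users who also liked the given movie."""
    counts = defaultdict(int)
    for user, movies in likes.items():
        if movie in movies:
            for other in movies - {movie}:
                counts[other] += 1
    # Most frequently co-liked movies first; ties broken alphabetically.
    return sorted(counts, key=lambda m: (-counts[m], m))


add_like("alice", "Star Wars")
add_like("alice", "Lost in Space")
add_like("bob", "Star Wars")
add_like("bob", "Lost in Space")
add_like("carol", "Star Wars")
add_like("carol", "Alien")

recommendations = recommend("Star Wars")
print(recommendations)
```

The recommendation comes purely from following edges: from the movie back to the users who liked it, then forward to the other movies those users liked.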
Key-value databases are built for read-heavy environments. These types of databases rely on compute and memory to speed up reads for the applications they support. Data stored in key-value databases follows an object-oriented schema style rather than a relational schema. Since the data has to be read very quickly, the hardware is typically memory or SSDs rather than traditional HDDs. Space in key-value databases is at a premium because storing the data tends to cost more than in the other NoSQL databases.
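As a minimal sketch, a key-value store boils down to a hash lookup on a key that returns a whole object. The `KeyValueStore` class and the `session:42` key below are hypothetical and only illustrate the access pattern, not any real product's client API.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store holding serialized objects."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store the whole object as an opaque blob under its key.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        # A read is a single hash lookup; no joins or scans.
        blob = self._data.get(key)
        return default if blob is None else json.loads(blob)

    def delete(self, key):
        # Deleting a key that does not exist is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["sensor", "cable"]})
session = store.get("session:42")
missing = store.get("session:99", default={})
print(session, missing)
```

Because everything lives in memory and each read is one lookup, reads are fast, but memory is also why capacity in this kind of store is comparatively expensive.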
Popular NoSQL Databases
Cassandra – One of the first NoSQL databases associated with Big Data. Right now the project boasts over 10 petabytes of data in production. Cassandra has support for Hadoop and Spark.
HBase – The top open source NoSQL choice on Hadoop. Facebook is both a heavy user of and a contributor to HBase. HBase is a columnar database.
MongoDB – One of the most popular names in NoSQL. MongoDB is a document database that uses a JSON-like document schema to store data.
BigTable – The brainchild of Google. HBase was developed out of the research in the whitepaper Google released on BigTable. It is now available through Google Cloud Platform. BigTable is a columnar database.
DynamoDB – Touted by AWS as the most popular cloud NoSQL database. DynamoDB is both a document and a key-value database (really cool stuff). As an AWS service, DynamoDB is an extremely customizable NoSQL database that can scale as the use case demands.
CouchDB – Another Big Data NoSQL database. CouchDB is a document database designed around ease and flexibility so developers can get running as quickly as possible.
CosmosDB – Azure’s offering for a globally distributed NoSQL database at scale. CosmosDB was built off the success of DocumentDB.