Thursday, July 10, 2014

NoSQL - A Brief Introduction

Intro

What is a NoSQL database? It's generally considered to be a data storage system that doesn't use relational tables and isn't accessed through Structured Query Language (SQL). For decades now, SQL and relational tables have been the standard for data storage. SQL storage, or more generally speaking relational storage, is driven by the concept of normalizing data to reduce redundancy. You have many tables representing a structure of objects/entities, and you can join those tables together to bring back the entire picture of your data and objects.
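To make the relational idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema (a chair table and a manufacturer table) and all of the names in it are made up for illustration. It shows a value being stored once and joined back in whenever the full picture is needed.

```python
import sqlite3

# Hypothetical schema for illustration only; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE manufacturer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE chair (
        id              INTEGER PRIMARY KEY,
        model           TEXT NOT NULL,
        manufacturer_id INTEGER NOT NULL REFERENCES manufacturer(id)
    );
""")
conn.execute("INSERT INTO manufacturer (id, name) VALUES (1, 'Acme Seating')")
conn.executemany(
    "INSERT INTO chair (id, model, manufacturer_id) VALUES (?, ?, ?)",
    [(1, "Office", 1), (2, "Stool", 1)],
)


def chairs_with_manufacturer(connection):
    """Join the two tables to rebuild the full picture of each chair."""
    rows = connection.execute(
        """
        SELECT chair.model, manufacturer.name
        FROM chair
        JOIN manufacturer ON manufacturer.id = chair.manufacturer_id
        ORDER BY chair.id
        """
    )
    return rows.fetchall()


# The manufacturer name lives in exactly one row, so a rename is one update.
conn.execute("UPDATE manufacturer SET name = 'Acme Chairs' WHERE id = 1")
print(chairs_with_manufacturer(conn))
```

Both chairs pick up the new name because neither of them stores it directly; they only point at the manufacturer row.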

Why has NoSQL Become Big News?

Some larger companies, such as Facebook and Twitter, have found that modeling their data with a relational database system just wasn't fast enough and didn't scale well with load. A relational model can also look quite complex to the casual observer. And, the part I've found most tiresome in my own work: a relational database enforces a rigid schema. Have you ever spent much time mapping a huge object tree to the tables and columns behind it? It's not terribly fun.

The big relational database systems have spent a lot of time and money optimizing their systems, but at a certain point they become constrained by the disk system behind them. Once your disks are your bottleneck, the only option is to buy faster and more expensive storage subsystems (i.e., faster drives or a faster/better SAN-type system). Plus, if you have a complicated object hierarchy with many levels and relations to model, you have more and more tables to maintain, which means more joining of tables and more processing/disk hits to get your data. There are coding frameworks for dealing with the mapping side of things, such as Entity Framework, but the complexity of the mapping is still there; the tedium is just automated for you.

Types of NoSQL Databases

NoSQL systems have been created with the idea of tackling these problems in various ways. Now this doesn't mean that relational databases are obsolete; rather, with the growing maturity and popularity of various NoSQL platforms, you have more options available to research and choose from. As with many technologies, the more you learn the better decisions you can make so I encourage you to keep reading and see if you can think of some ideas of how you might use NoSQL in your own projects.

There are many different types of NoSQL databases, but I will only give a brief overview of three of them:

  1. Document Store: The general concept behind a document store is that objects/entities are represented in the underlying storage mechanism with a "document". Generally speaking you are encouraged to store your entire object (and all its children, and all their children, etc.) in a single document. These documents are in various formats depending on the vendor. You might see JSON with one vendor, BSON with another, XML, etc. If you've ever exported an object to JSON, you can visualize a document store roughly as a collection of JSON exports (serialized objects) on a disk system that you can query with some sort of vendor-specific query language.
    1. Benefits: The perceived benefits of a document store over a relational store are, of course, dependent on how you use the system. Imagine you have a fairly simple base storage object (a Chair) with 7 child objects (5 wheels, a single bolt, and a single arm; not a comfortable chair, I know). Now mentally compare loading this from a relational database with loading it as a single JSON document. To load from a relational database you have to join together (or query separately) a table for each kind of object (chair, wheel, bolt, arm) and bring back 8 rows. This means several separate hits to your underlying disk storage system, probably more with indexing. To load the same object from a document store? The object and all of its children are stored in a single document, so in the simplest case you have 1 read. Oh, and did I mention that it almost eliminates object-to-storage mapping problems? Because you are, for the most part, just serializing your object to some form of document and saving that whole thing, you don't really care what's in that document. It's all taken care of for you. No mapping!
    2. Pitfalls: These can of course vary based on the vendor, but in general you have to have a good idea up front of what you are going to want to do with your data further down the road, or you will end up with a lot of duplicated data that is hard to maintain. Let's say, for example, that within the arm sub-object of the chair you store the name of the arm's manufacturer. If you have 20 chairs in your database, that's ~20 copies of the name of this manufacturer (assuming all 20 chairs use the same type of chair arm) in your storage system. What if the manufacturer changes its name? You now have to go and modify all 20 of the chairs/arms in your storage system. In a relational system with a normalized structure, you modify the one chair arm entry, as it's just a single row in a table of chair arms. So, a general rule with document stores is that they're best used when your top-level object is what you care about. If you care about treating lower-level objects as first-class citizens of your project ecosystem, you should consider other alternatives. (Tip: you can in fact have multiple types of collections of objects in a document store and relate them to each other, but that's a topic for another day.)
  2. Graph: The big idea behind a graph database is that it allows for a more efficient modeling of relationships between objects. Every element has a direct link to its adjacent elements (often called index-free adjacency), so following a relationship doesn't require looking up foreign keys in an index. Thus, lookups of relationships are quite fast. Where might this be useful? Social relationship mapping (I'm talkin' 'bout you, Facebook), transportation, etc.
    1. Benefits: Clearly the benefit here is for representing entities/objects that are tightly related and need to be referenced together with their relations. The specific improvement here is speed. A side effect of the better performance is that they will also scale better (continue to have good performance) even as the data set grows much larger.
    2. Pitfalls: Relational databases are, in general, going to be better at performing summary/group calculations and updates of large amounts of data at the same time. Be sure you're using the right tool for what your system needs to accomplish.
  3. Key-Value Store: As the name implies, a key-value store represents the stored entities as a list of key/value pairs. For those of you familiar with the .NET world, think of the Dictionary class. The key is some sort of value that uniquely identifies the data/value, and the value is whatever you want it to be: a serialized object, an int, whatever. A key-value store is almost like an even simpler version of a document store. With a document store the database doesn't care what the structure of your document is, but the document does have a structure and the database can (most of the time) query against it. With a key-value store, the database doesn't know or care what's in there and you can't query on it. You can only pull up a value by key, and that's it. Nothing fancy.
    1. Benefits: Very simple and very flexible in terms of what you can store. Because there is no structure whatsoever to the data, you can put whatever you feel like in the value portion of a key-value store.
    2. Pitfalls: Inflexible in terms of how you retrieve data. You'd better be sure you don't need to write any ad-hoc queries on your data, because you can only load a value or not; that's it.
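
To make the three models a bit more concrete, here is a rough Python sketch. It does not use any real NoSQL product: plain dictionaries stand in for the stores, and every name in it (make_chair, rename_arm_manufacturer, friends_of_friends, the sample data) is made up for illustration. Real databases have their own APIs and query languages.

```python
import json


def make_chair(number):
    """Build one chair with its 7 child objects as a nested structure."""
    return {
        "id": f"chair-{number}",
        "wheels": [{"position": n} for n in range(1, 6)],
        "bolt": {"size_mm": 8},
        "arm": {"manufacturer": "Acme Arms"},
    }


# Document store: each whole object tree is serialized as one document.
documents = {f"chair-{n}": json.dumps(make_chair(n)) for n in range(1, 21)}
loaded = json.loads(documents["chair-1"])  # one read brings back all children


def rename_arm_manufacturer(docs, old, new):
    """The duplication pitfall: every document holding the name is rewritten."""
    changed = 0
    for key, raw in list(docs.items()):
        doc = json.loads(raw)
        if doc["arm"]["manufacturer"] == old:
            doc["arm"]["manufacturer"] = new
            docs[key] = json.dumps(doc)
            changed += 1
    return changed


updated = rename_arm_manufacturer(documents, "Acme Arms", "Apex Arms")

# Key-value store: the value is an opaque blob and lookup by key is all you get.
kv_store = {"chair-1": json.dumps(make_chair(1)).encode("utf-8")}
blob = kv_store.get("chair-1")     # bytes; the store knows nothing about them
missing = kv_store.get("chair-2")  # None: you load a value or you don't

# Graph: each node holds direct references to its neighbours.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}


def friends_of_friends(graph, person):
    """Follow links two hops out, excluding the person and direct friends."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}
```

Notice that the rename has to touch all 20 chair documents, while the friends-of-friends lookup only ever follows links from one node to the next without consulting any central index.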


My Experience


I myself have very limited exposure to NoSQL databases. In fact, I've never used one in production, and I've only spent a few hours developing with them on personal projects. So far my experience with NoSQL has been limited to two specific document store databases, Couchbase and MongoDB.

In general, I have to say I have found both of them quite easy to develop with. Querying data from them is relatively straightforward, though if you're used to SQL you will have to learn new syntax and all that jazz; sorry. Setup is usually pretty easy, though due to a rather nasty bug at the time I was using Couchbase, MongoDB proved much easier to get up and running consistently in a Windows development environment.

The most useful piece of advice I can give regarding NoSQL databases is: read up about them through some google-fu, learn how they work, and research how others have used them with both good and bad results. Then see if and how you can apply them to what you do. They do have their uses, but be careful and ask for advice before implementing.


