After a session I recently gave at a conference, someone from the audience posed a question using a metaphor that caught my attention. Making the case for relational databases, he used the example of a music collection, saying that in order to find an album in the collection, it would have to be organized according to some method. He then compared it to data in a database, which he claimed could not be efficiently queried without a well defined schema.
I thought this was a great metaphor, and wanted to explore it further:
Let’s look at what it would mean to organize a music collection using a relational schema: We’d probably start with an Album table, which will include information such as title, artist, release year, etc., as well as some unique album identifier. We would then create a Track table, containing relevant information such as track title, length, etc., and a unique identifier as well. And we’d organize the track listings for each album using a table called Album_Track, which would map the identifiers of the albums to those of the tracks they contained – a typical way to manage many-to-many relationship (assuming the same track can be included in several albums, e.g. “greatest hits”).
This is by no means a complete model, but let’s stop here for a moment to examine it. Using this model, we can easily query the database to find albums by title, artist name, or any other information we stored. And we can also find songs by using the track titles, etc. So far so good.
Now let’s say we have some Classical music in our collection, perhaps a few piano concertos, and we’re trying to enter these albums into the database – who would be the artist for each? the composer? the soloist? the orchestra? the conductor?
Or how about a Jazz album, where we’d probably want to know who played each instrument?
Looks like we’d already need a major re-work of our model in order to fit new data we haven’t anticipated. That would mean dropping the schema, modifying it, and potentially re-writing some of the existing queries. It would also mean that each record will have to include all the new attributes we’d add, regardless of whether it actually had them, resulting in a “sparse” table structures.
If this were a mission-critical system rather than a simple example, the effort involved in accommodating new, un-anticipated forms of data would represent a major effort..
This is one of the reasons NoSQL has become such a popular alternative to relational databases today: it provides schema flexibility, which means the data doesn’t all need to conform to the same exact structure to be managed by the database. There are different underlying mechanisms to enable this, the most robust of which is a the document-oriented model.
A document-oriented database stores each record in the form of a hierarchical tree, which is expressed as document rather than a collection of rows and columns. Each document contains its own structure, which can be used when querying the database. But they don’t have to adhere to the same structure, making the management of diverse data quite simple.
Going back to our example, using document-orientation we can model the data much more intuitively: each album becomes a document, containing elements for artist, title, release year, etc., Tracks are also elements, containing sub elements for their own information.
Any new data we encounter, such as performer, conductor, soloist, drummer, bass player, producer, etc. just becomes another element for that particular album document, without needing to have all the albums contain every new element.
Again, this is a very simplified example, but I think it conveys the concepts rather well, and provides a good explanation for the usefulness and popularity of NoSQL.