Introduction to Azure DocumentDB
This is a basic introductory post about Microsoft’s new product: Azure DocumentDB, and how it compares to SQL Server.
What is DocumentDB?
DocumentDB is a new “Big Data” database engine created by Microsoft and managed as a service within their Azure Cloud framework (currently available in their Azure Preview Portal).
It is a NoSQL document-oriented database, which saves “Key-Value” pairs of records. It is comparable in principle to other NoSQL document-oriented databases such as MongoDB.
What’s a “document-oriented” database? A document-oriented database is a type of NoSQL database which is used for saving, retrieving and managing “semi-structured” data. “Semi-structured” means that each record (i.e. “document”) can have its own internal structure of fields, sub-fields and sub-records. This is in contrast to a relational database where every table is strongly defined by a specific list of columns and types.
The “Keys” in DocumentDB are auto-generated string identifiers which you can use to quickly find specific records (documents).
The “Values” in this system are strictly JSON documents. These are your “documents”. You can also save “attachments” per each document, which can be files of any type (including binary).
You can even use simplified SELECT queries to get your data! (very similar to what we already know from SQL Server, though more limited)
The Hierarchy of DocumentDB Objects
The hierarchy is also described graphically by Microsoft in the following illustration:
Besides providing quick access using the “keys”, DocumentDB also allows fast and easy querying of documents by utilizing “JSON” indexes that it creates automatically for each and every possible JSON query path.
You can change this behavior and turn off automatic indexing, however that would mean that queries won’t be able to find your data at all, until you index them (though this doesn’t mean that your data is gone).
You can manually create specific indexes however you want and you can even set specific documents/paths to be “excluded” from indexing (this “hides” the documents unless you query them specifically by key). This behavior perhaps can be likened to the familiar “Filtered Index” in SQL Server.
There are also different types of indexes: Hash indexes and Range indexes. And you can even set the update behavior of the indexes: Synchronous or Asynchronous (aka “Lazy”).
As you can figure out, there’s plenty of room for customizability here, but you can also leave everything as the default and everything will be automatically indexed.
Transactions and Consistency
DocumentDB is a transaction-based database. All queries are executed as an ACID compliant transaction. Collections form the transaction domains for the documents contained within them.
In addition, DocumentDB has a tunable consistency model with four modes:
Each mode defines the database consistency level by way of determining how fast writes are committed in all database replicas, and when they’re available for reading. Choosing a consistency mode will affect the availability, redundancy, and performance of your data. The levels range from the strongest but slowest (Strong) to weakest but fastest (Eventual). The default consistency is, as expected, somewhere in the middle (Session).
More on this topic in a future post.
Programming for DocumentDB
DocumentDB provides a RESTful interface over HTTP.
Their .NET library, which can be downloaded right into your project as a NuGet package, is an easy-to-use wrapper around the HTTP API. It requires an Authorization Key and an Endpoint URL – both of which are provided in the Azure web interface when you log into your account.
The DocumentDB .NET API also supports asynchronous querying using a LINQ provider.
Redundancy – Taking the Cloud to the Next Level
Being an Azure Cloud service, DocumentDB sits on top of a cluster of nodes (replicas), and this effectively allows the service to be both redundant as well as very fast in terms of performance, thanks to its usage of load balancing.
However, the setup of this cluster and all the nasty parts usually associated with it are almost completely invisible to you! There’s absolutely no need to install or manage complex distributed servers. Everything is done for you as part of the cloud service. All you do is click on a button!
In addition, you get levels of customizability by, for example, being able to manage the throughput at the collection level based on their specified “performance level”. This can be adjusted dynamically through the Azure portal or one of the SDKs.
SQL Server or DocumentDB?
So which database should you choose, SQL Server or DocumentDB? Obviously, it all depends on the type of data you need to save, and how you plan on querying it.
Being a document-oriented database, choosing DocumentDB is most optimal when you have no clear structure to your data (i.e. you can’t exactly predict the columns and data types you’re going to use, and/or the structure is expected to change often).
Implementing such a scenario in a relational database such as SQL Server can be a serious drag, not to mention the performance problems that can easily arise from such an implementation.
The most common (and obvious) implementation of a “document-oriented” database within SQL Server would be a table with an ID column, and an XML column (the XML obviously being the “document”). Or maybe the XML will actually be a long textual field like NVARCHAR(MAX). In either case, you get one “structured” column (the ID, or identifier), and one “unstructured” column (XML / NVARCHAR / VARBINARY). If you feel that this is the direction you’re being forced to go towards, then it’s definitely the time to consider a document-oriented database (DocumentDB), especially if part of your requirements include querying records based on data within the documents.
In other cases, where you know in advance the exact structure of the data (columns and data types), and this structure is expected to be mostly unchanging, then a relational database (SQL Server) is probably most suited for you.
Another important point to consider is data referential integrity, which obviously cannot be established between two documents, but can be established between two relational tables (hence the word “relational”) using foreign keys and primary keys. If referential integrity is very important in your database model, then DocumentDB is probably not your best choice (although in some cases you might be able to combine it with your relational database).
Combining DocumentDB with SQL Server
Bottom line is… You can’t. Well, not in an integrated, built-in sort of way.
You can only access DocumentDB using its .NET API, using RESTful HTTP, or using the web-based Azure interface.
This means that you either need to implement the integration within your own program by combining connections to both SQL Server and DocumentDB… Or you’ll need to implement CLR assemblies that will integrate SQL Server directly with DocumentDB (which will probably be rather obtuse, but not impossible).