Introduction to Big Data
So you’ve been working as a DBA or a BI developer for several years now. You know your stuff very well, and you consider yourself an expert in your field. And now there is this trend called “Big Data” that everyone seems to talk about. You’ve heard this term many times, and maybe even your boss told you one day that you should adopt big data solutions. But what does it mean? Is it a real thing or just a buzz? And should you seriously consider learning about big data and leaving your comfort zone? If you’re asking yourself these questions, then this post is for you. If not, then you probably have better things to do…
So what is Big Data?
Well, it’s data, and it’s big. How simple was that? OK, seriously now… It is very common to define big data based on 3 characteristics called the 3 Vs: Volume, Velocity and Variety. The internet is full of articles talking about the 3 Vs. Some talk about 4, 5 or even 6 Vs, adding Veracity, Validity and Volatility to the list. I even read an article once that mentioned the 7 Vs, although the 7th item on the list was “Complexity”. Go figure… But the first and most common Vs are the important ones. The rest seem to me a result of copywriters trying desperately to find more words that begin with V. So let’s ignore them.
Volume is about the amount of data. There is no magic number that distinguishes between small data and big data in terms of volume. It depends on the hardware and software used, the system architecture, the workload, the skillset of the people in charge of the data, and more. My definition of big data is when you need to manage data differently. If you are in charge of two systems, one stores 100GB of data and the other stores 2TB of data, and you need to apply different methods to manage the data (e.g. querying the data, taking backups, archiving the data, etc.), then I would consider the 2TB system to be a “Big Data” system. But if you use the same methods to manage the two systems, then in your case, 2TB is probably not big data, but maybe 20TB is.
The second “V” is Velocity. This is about the rate at which data moves through the system. It is common to measure the rate at which data enters your system from external sources. For example, you might have a web analytics system that collects clickstream data, such as page views and clicks, at an average rate of 300 events per second. If the system scales up, and now the rate is 3,000 events per second, but you can still cope with the velocity with the same architecture, tools and skillset, then this is still not big data for you. But as soon as the rate reaches 30,000 events per second, you might need to redesign your system to be able to handle this rate, and this means you’re in the big data zone.
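To get a feel for what those rates mean over time, here is a small back-of-the-envelope sketch. It only multiplies the three example rates above by the number of seconds in a day; the function name `events_per_day` is made up for this illustration, and a real system would of course see peaks and quiet periods rather than a constant rate.

```python
# Back-of-the-envelope: how many events per day does a constant rate produce?
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_day(events_per_second: int) -> int:
    """Daily event count, assuming the rate is constant (a simplification)."""
    return events_per_second * SECONDS_PER_DAY


# The three example rates from the clickstream scenario.
for rate in (300, 3_000, 30_000):
    print(f"{rate:>6,} events/sec -> {events_per_day(rate):>13,} events/day")
```

The point is not the exact numbers, but that each tenfold jump in velocity also multiplies what you need to ingest, store and query every single day.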
Big Data is not only about Volume and Velocity, but also about Variety. An important aspect of Big Data is the variety of data sources and data types. Traditional systems usually deal with structured data, which means the type of the data is known in advance, and there are no surprises. In the majority of cases, when people talk about structured data, they refer to the relational model, where data is organized in tables with rows and columns. When you handle many data sources, you might also have semi-structured data and unstructured data. Semi-structured data means that there is structure, but it’s not necessarily known in advance. A good example is a JSON document, which has a well-defined structure, but each document can contain different fields of different data types. Unstructured data means there is no predefined structure at all. Examples of unstructured data would be a complete web page or an image.
Does your data need to be “big” in all 3 Vs in order to be considered Big Data? Not necessarily. As soon as you need a different set of methodologies, technologies and skills to handle your data due to any of the 3 Vs, then you’re dealing with Big Data.
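To make the semi-structured case more concrete, here is a minimal Python sketch. The two JSON documents and all of their field names are invented for this illustration; they are not taken from any real system. The sketch parses both documents and shows that each one is well-formed on its own, yet they don't share the same fields or even the same data type for `user_id`.

```python
import json

# Two hypothetical clickstream events; the field names are made up for illustration.
raw_documents = [
    '{"event": "page_view", "url": "/home", "user_id": 42}',
    '{"event": "click", "element": "buy-button", "user_id": "u-42", "tags": ["promo", "mobile"]}',
]


def describe(document: str) -> dict:
    """Return a mapping of field name -> Python type name for one JSON document."""
    parsed = json.loads(document)
    return {field: type(value).__name__ for field, value in parsed.items()}


schemas = [describe(doc) for doc in raw_documents]
for schema in schemas:
    print(schema)

# Fields present in every document vs. fields that appear only in some of them.
common_fields = set.intersection(*(set(s) for s in schemas))
all_fields = set.union(*(set(s) for s in schemas))
optional_fields = all_fields - common_fields
print("common:", sorted(common_fields))
print("optional:", sorted(optional_fields))
```

In a relational table you would have to decide on the columns and their types up front. Here the structure is discovered only when you read each document, which is exactly the kind of surprise that structured data doesn't give you.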
Where do I begin?
Big Data is a big world. It’s full of buzzwords, like “NoSQL” and “MongoDB” and “Hadoop” and “Spark”. There are all kinds of data platform types, such as document databases and graph databases, and there are also all kinds of vendors, such as Cloudera, Hortonworks and DataStax. And not only are there so many different platforms and buzzwords, they are also changing all the time. New platforms emerge every few weeks, and the hot trend of today might be obsolete several months from now. I find it funny that the big data ecosystem actually behaves exactly like big data itself. There is a large volume of platforms and tools. The velocity at which platforms and tools are released and developed is very high. And there is also a variety of data platform types.
To get started, I’m preparing a full-day training about big data, especially for someone like you. Someone who has heard about big data, but doesn’t know exactly what it means and where to start. During this training day, I’m going to introduce big data, explain each of the data platform types and compare them, demonstrate all kinds of tools for handling big data, and present big data use cases. By the end of the day, you will have a good understanding of the big data ecosystem. You will know what a document database is, and when you should prefer Couchbase over MongoDB, for example.
I’m going to deliver this training day as a pre-conference session at the upcoming SQLBits event in Liverpool on May 4th. You can find more details here, and you can register here. I hope to see you there.