In the world of Big Data
- Processes, standards and quality
It is Friday, the 7th of September; 23:20. While in the next room people are having a party, we are stuck at work, trying to find out why our application doesn’t work properly. Although we have megabytes of logs and we’ve done all the performance tests we could think of, we still have no idea how to make everything run faster. The bottleneck seems to be somewhere in the communication between the infrastructure and the application logic, but that is the one thing we cannot really correct, not at this point in the application’s life cycle…
There was once a brilliant idea. We designed an amazing system, able to parse gigabytes of information in a really short time; a state-of-the-art system, even. Our system could answer any question the business users asked in mere milliseconds.
Then our paranoid boss started to use our system to monitor all the employees. He was afraid his key people would leave to join another company, so he decided to monitor every communication and every aspect of their work activities.
The system, although not really designed for that, could cope with the new demand. It was a state-of-the-art system, after all.
Then, the company started to change. A mere 40 employees turned into 200, then 1,000. People stopped using only PCs; mobile devices became a nightmare we had to deal with as well. There was no longer a single official messaging tool on the company intranet – people started using things like Facebook and Twitter, and our system had to synchronize all that with the information we had on the company’s SharePoint. The relational database underlying all of this went through several redesigns; we had to find some common keywords, some common ideas, to form a new structure.
And now it is the time to pay for our short-sightedness.
The once-brilliant and fast system everyone was proud of is now a thing of the past. Business users have to wait hours for their reports to generate and they are, understandably, complaining. Who would have thought that the amount of information generated by the employees (not only in the monitoring module, mind you) would increase a hundredfold? Who would have imagined that the company – and the core of the system we were working on – would change so radically?
When we made initial decisions about the shape of the system, we were right. But the circumstances changed so much that now we feel as if we were sitting on a giant bomb, waiting for the countdown to reach zero.
What we can do is simply shrug and decide we cannot help it. Nobody is able to predict every possible change. On top of that, in the hypothetical scenario above, there is simply too much data. And that means we will have to wait. Nothing can be done beyond optimizing what we can optimize.
The question to ask is: at which point did the system stop being “good enough”, and at which point should we, in the role of the hypothetical developers from the prologue, start searching for a better solution, one better tailored to the requirements of the customer?
Please note: the prologue contains a hint of what to consider the moment the ideas of “relational database” and “constantly changing data” get connected. The consequence of choosing a relational database – a good, tried solution – is the fact that we are, in a way, married to the data structure. Relational databases are not really designed to cope with mutable data types and structures; they operate on fixed, pre-designed tables. With this level of constantly changing requirements towards the data, this might have been an imperfect solution.
We live in a world of constantly changing requirements and fluid business models, which also means that sometimes we have to deal with a constantly changing data structure. Constantly changing requirements mean that we have to consider taking on risk, accepting some technical debt (which will absolutely have to be paid off in the future, unless we want the prologue to become reality) and investing some time. We may have to accept that a project might become one of those “adapt or die” projects in which the standard methods no longer work properly.
Not all projects which use a database require that level of consideration. But some do. And they are what this article is really about: the projects which have to deal with Big Data.
Act I – The world of Big Data:
The origins of Big Data:
While designing an application, we do not know everything. We make decisions about the shape of the application when we know the least about the domain and the future of the application. What our heroes from the prologue had to tackle was the problem of working with a lot of information of various types, coming from different sources: Twitter, Facebook, the company’s SharePoint, but also mobile devices, GPS devices, sensors of different types, etc. Those types of data are very different and it is extremely difficult – if not plainly impossible – to create a set of data type abstractions able to encompass every possible source in a future-proof way.
What is very important: not only do we have to gather the varied data, but we also have to be able to process it and make business decisions based on it. No matter what information we gather, unless we can use it and convert it into knowledge, it is useless from the business users’ point of view.
What we may not be aware of is that we ourselves are producers of endless amounts of data. With every internet service we use from various devices, we leave a trail. The whole internet is a sea of information coming from phones, laptops, bank systems, intelligent cameras, sensors of different types, etc. What is important is that, year by year, the amount of information keeps growing. To tackle the problem of using these kinds of information, the Big Data concept was born.
What is Big Data:
A common misconception is that Big Data is just a synonym for “a lot of data”. It is not that simple, as shall be explained in more detail later on.
Once, when creating an application, we were interested in a subset of all possible data: the data dedicated to the particular task realized by that particular application. This was actually the proper way to work with information – the business required only a particular subset of “everything” and we delivered only what directly contributed to business value and the domain. It is not that the “extra” data was harmful for the application; it was simply not necessary and, as such, discarded. As the business and the domain were more or less stable, we knew what type and kind of data we were supposed to work with, and we could form proper constraints to manage the complexity of the relations between the known data subsets. We had full control over the format of the data and thus we could add constraints which helped us manage the information. We were the masters of the standards we had created. This is the world which created relational databases – the current industry standard in data gathering and management. A relevant subset of all possible data, with a set of well-formed constraints and rules, was a good solution to the type of problems we used to tackle in those days. And, to be honest, to this day most applications face the same type of problems. Thus, the relational database is still the best possible solution for such applications.
Everything changed around the year 2000. The Internet became a ubiquitous tool, not the luxury it used to be. The powerful companies and strong players realized that the stream of information which people willingly put online might be the tipping point that would let them win customers from the competition. The new challenge which appeared with the .com boom was formulated as “how to manage and process a lot of very varied information”. The solutions which used to work in the past were helpless against the problem defined this way. The well-built systems of the past, which used to be an advantage over the competition, quickly became a potential hindrance.
Companies like Google or Amazon knew that to keep their competitive advantage, they had to create new solutions able to cope with a changing reality composed of quickly growing data of varied structure, coming from a rapidly growing number of users.
And then social media appeared on the scene, modifying the expectations even further.
This is the moment we are finally able to define what Big Data is. It is not a synonym for “a lot of data” – every banking system has a lot of data with complex relations, and that does not make those banking systems “Big Data systems”. Big Data cannot be solved by adding new hard drives to the computer racks. Big Data is a large amount of information which has a very varied and complex structure (very often: a structure and relations that we cannot really describe and model yet) and a varied time of appearance. As this sounds quite complicated, we will delve into the details in a moment.
Let’s look at Big Data from the point of view of:
- Volume (amount of information)
- Varied time of appearance in the system
- Varied structure
Several years ago, most applications were designed to handle 1,000 concurrent users, and at that time it was quite a lot. Nobody designed applications to handle 1 million users in the same day, which means nobody expected to have to store the information created by 1 million users per day. To be fair, the technological constraints were different at that time.
If a programmer had been hibernated 10 years ago and woke up today to look at the social media services, he would consider them to be under a constant overload attack. For example, Facebook generates about 500 TB of information every day. Every minute, about 100 hours of video are uploaded to YouTube.
Big Data is not only a social media phenomenon. For example, a single Boeing 737 generates about 240 TB of data during a single flight across the USA.
Experts estimate that by 2020 there will be about 40 ZB of digital information.
Modern cars have about 100 sensors. If we wanted to gather information from every single sensor, we would have to process about 1.00E+08 readings every minute – a rate of about 1.66E+06 readings per second.
This is quite a lot.
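The arithmetic above can be checked with a quick back-of-the-envelope script. Note that the fleet size of one million cars is an assumption introduced here purely to make the figures concrete:

```python
# Back-of-the-envelope ingest estimate for car sensors.
# Assumed figures: 1 million cars, 100 sensors per car,
# one reading per sensor per minute.
cars = 1_000_000
sensors_per_car = 100

readings_per_minute = cars * sensors_per_car    # 1.00E+08 readings/minute
readings_per_second = readings_per_minute / 60  # ~1.66E+06 readings/second

print(f"{readings_per_minute:.2e} readings/minute")
print(f"{readings_per_second:.2e} readings/second")
```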
In the case of online gaming, we can observe concurrent access by a million users, while a single user generates about 10–20 events per second.
The frequency of data appearance also varies with the ways different users use the system. For example, you can have “normal” users who appear from time to time, heavy users who are an ambient, constant presence on your servers, or occasional users who come, say, on Christmas or Valentine’s Day because there was a promotion.
This implies you need to be able not only to cope with the standard demand, but also to deal with possibly high spikes of data generation. And all of that without over-budgeting on hardware.
Big Data is an information stream which can contain anything of any type in any context. While relational databases operate on well-defined data types, Big Data has no defined data types. The stream is dependent on the ever-changing trends; it can be text, audio, video, spatial data, personal information, logs, service-specific information and many other things. Big Data is generated from many different sources like PCs, web browsers, applications, sensors, smartwatches or smartphones.
As you can see, Big Data is a very complex and unstructured phenomenon. We could try to visualize it as a large stream of water; something like a tsunami wave which would contain every type of information ever made by human and for a human.
Act II – The Retrospect:
We have managed to delve into the fascinating phenomenon known as Big Data, and it sure is a challenge. A question to answer is whether we have to tackle this phenomenon on our own. Fortunately, we can step onto the shoulders of the giants who have already dealt with many problems related to Big Data and who have designed tools available to the wider public (and so, to us too).
Our heroes from the prologue chose a relational database, a traditional solution to data storage problems. The question to pose at this point is whether a traditional RDBMS would be able to meet the expectations from the prologue. Let’s look at the problem of Big Data from the RDBMS perspective, then.
Tradition and Big Data:
The core of the traditional database is the relational model, which has strong types as a central feature (to the point that data which is very hard to associate with a type – like an audio or video file – is usually put into a so-called “blob” type). Strong typing in an RDBMS gives us strong cohesion and a very strong structure, which does have a drawback: it forces us to store data in a particular way and to transform the input data to match the database structure.
If we wanted to change the relational model to be able to work with unstructured data (and not just put the data into unstructured “blobs”), we would have to create additional relations. Although those relations would give us a more dynamic, adaptive structure (but only up to a particular level of dynamism and adaptation), they would also imply a visible performance hit. The more dynamic and adaptive the structure is, the more painful the performance hit – and the Big Data definition carries both the “variety” and the “velocity” part.
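To make the “additional relations” idea concrete, here is a minimal sketch of the classic entity-attribute-value (EAV) pattern in SQLite; the table and column names are purely illustrative. Every read has to reassemble an entity from its attribute rows, which is exactly where the performance hit comes from:

```python
# Entity-attribute-value (EAV): arbitrary attributes per row,
# at the cost of extra joins/queries on every read.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT)")
conn.execute("""CREATE TABLE attribute (
    entity_id INTEGER REFERENCES entity(id),
    name TEXT, value TEXT)""")

# Two readings with completely different shapes, stored in the same tables.
conn.execute("INSERT INTO entity VALUES (1, 'gps'), (2, 'temperature')")
conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)", [
    (1, "lat", "52.23"), (1, "lon", "21.01"),
    (2, "celsius", "21.5"), (2, "sensor", "cabin"),
])

# Reassembling one entity already needs an extra query; the cost grows
# with every attribute and every row.
rows = conn.execute(
    "SELECT name, value FROM attribute WHERE entity_id = ? ORDER BY name",
    (1,)).fetchall()
print(dict(rows))  # {'lat': '52.23', 'lon': '21.01'}
```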
What is more, relational databases were mostly designed to work on a single server. They scale vertically (addition of CPU/RAM/HDD…) very well, but it is unreasonable to assume eternal scaling to infinity and beyond. And with every change of the model we have to perform data migrations, rebuild structures and analyze how they will be used in the future (for performance).
Relational databases are also able to scale horizontally (addition of more servers in a cluster), but not as efficiently as solutions designed from the start to run in a cluster or cloud. Additionally, one of the main assumptions of relational databases is the ACID principle, and consistency (the C in ACID) becomes dramatically harder to maintain with every server added to the database cluster.
Let’s look at a Boeing 737 and see how an RDBMS would fare with the data from the aggregated sensors of all the Boeing planes at a single airport. Examples of business queries we would like answered would be as follows:
- Find all anomalies from a single flight of a single plane within a month
- Find all anomalies from a single day of all planes’ flights at this airport
A small reminder: a single Boeing generates about 10 TB of data in a 30-minute flight. During a single day, there are about 29,000 commercial flights over the USA. This implies a single day will generate hundreds of petabytes of data from the sensors alone.
As a traditional RDBMS can only scale so far, operating on an ever-growing set of data is possible only up to a point. After the critical mass has been reached, the queries will execute much more slowly, which will lead to user frustration. The requirement for well-structured data, and adapting the incoming data to the database structure, will lead to a performance hit as well. And any change in the data format – even in a well-known source of data – will force a data migration, a restructuring of the data (and we have a lot of it), or the definition of dynamic structures – yet another performance hit which does not scale very well.
Traditional database vendors have not thrown in the towel and are trying to introduce new solutions to match Big Data and the resulting problems. But this does not absolve us from searching for new solutions, simply because we have a project to work on right now.
Act III – What are we really searching for?
The solution matches the problem to solve:
To be able to truly resolve the problem – not accidentally, but truly resolve it – we need to understand what the problem really is. The market does supply many tools from different vendors, yet without understanding the problem we cannot choose well. Sometimes we do not need the most feature-rich solution; sometimes we should choose the solution which has only 20–30% of the “leading market solution’s” features but resolves exactly the problem we are having and expect to be having.
How do we identify the problem? The first step is asking a set of questions such as:
- Does a user waiting for a response to a query have to receive it in time below one second?
- …and under a heavy load?
- How large will the data set the user operates on be?
- …and in 3 years from now?
- How many users will concurrently work with our application?
- …and during a spike (say, Christmas for sales applications)?
- How many sources of information do we expect there to be?
- How often do the sources of information generate the data?
- …and now look at that set of data again. Is it still unchanged?
- Are we able to split the data into segments of any kind?
- Should our solution be scalable? To what extent?
- … if not, some of the questions above can be omitted
- Who will pay for all of that?
Having the answers to those questions and many similar ones, we can be confident of only one thing: all we have is a set of expectations. Time will test them anyway.
The modes of working with data:
The questions above should lead us to the answer to another question: “which mode of working with data are we most interested in?” The two basic modes are the Operational Mode and the Analytical Mode. Both sacrifice something, but both have advantages we should be aware of.
Operational Mode is the mode in which we are most interested in, well, operations. The most important thing is a real-time answer, possibly computed on only a subset of the data (thus sacrificing the completeness of the answer). Performance and response time matter most here; an acceptable speed is < 100 ms. Operational Mode is used when time is of the essence: the longer the user waits, the higher the chance he will simply go to our competitor. To meet this requirement, only a subset of the data is used, possibly even the data connected with a single user only. What is very important in this mode is online processing and allowing concurrent access for about 100k users, though in the case of Operational Mode the word ‘user’ can often be substituted with the phrase ‘data source’.

Analytical Mode is very often called the Retrospective Mode. It follows the rule of Catch First, Analyze Later. A very important ability in this mode is reading vast amounts of data for analysis, at the cost of real-time responses; unlike in the Operational Mode, we operate on a larger set of the data, if not all of it. As far as the cases I have researched go, it is very hard, if not impossible, to receive a response below 1 s, and sometimes the acceptable time is counted in minutes, even hours. This has an advantage, though: Analytical Mode allows us to answer pretty complex questions, far more interesting than simple ‘write’ and ‘read’ operations. This mode is used in so-called Business Intelligence applications.

The Operational Mode supports Online Big Data, while the Analytical Mode supports Offline Big Data.

Of course, those are the extremes of the scale. We are able to create a solution which supports both Online and Offline Big Data; such a solution is called a Hybrid Mode.
Such a solution acts as the Operational Mode when the key problem is real-time Online Big Data access; when we are dealing with a complex business query spanning a vast set of collected data, the Analytical Mode over Offline Big Data does the processing. The last step is the synchronization of the results produced by the Offline mode back into the Online, real-time tool.
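A minimal sketch of this Hybrid Mode in Python, under stated assumptions: the event shape, the in-memory “stores” and the fuel-usage query are all illustrative. Events are accepted instantly on the operational path, a slow batch job scans the full history, and a sync step pushes its results into the fast online view:

```python
# Hybrid Mode sketch: fast operational ingest + slow analytical batch job
# + a sync step that feeds offline results into the online view.
from collections import defaultdict

history = []       # "offline" store: every event ever collected
online_view = {}   # "online" store: precomputed answers, instant reads

def ingest(event):
    """Operational path: accept the event immediately, no heavy analysis."""
    history.append(event)

def offline_batch_job():
    """Analytical path: slow, full scan of all collected data."""
    totals = defaultdict(float)
    for event in history:
        totals[event["plane"]] += event["fuel_used"]
    return dict(totals)

def sync():
    """Last step of the Hybrid Mode: push offline results online."""
    online_view.update(offline_batch_job())

# Usage: events stream in, the batch job runs periodically, reads are instant.
ingest({"plane": "B737-A", "fuel_used": 120.0})
ingest({"plane": "B737-A", "fuel_used": 80.0})
ingest({"plane": "B737-B", "fuel_used": 95.5})
sync()
print(online_view["B737-A"])  # 200.0
```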
When working with Big Data, we should avoid a rigid, pre-set structure. Not only can we not really support it, it would also be contrary to the very idea of Big Data. In the case of Big Data, such a structure would become the real constraint: the limit which will eventually force us to migrate the data and change our approach yet again.
The question is: do we really need the data validation? Can we really support the validation at all?
The amount of data will simply grow, and at some point – exactly as with the traditional solutions – vertical scaling is going to become not only unfeasible but outright impossible. What we will have to do then is migrate the data to the next tool, one with stronger horizontal scaling, able to segregate the data more efficiently.
Act IV – The tools tailored to specific needs:
If we are dealing with a real Big Data problem, we have to consider using tools designed for this particular class of problems. Assuming we have managed to specify the problem we have (Act III dealt with that), we need to select the family of tools we are going to use:
- NoSQL (MongoDB, CouchDB)
- Massive Parallel Computing
Let’s consider the NoSQL support for Big Data. NoSQL is able to operate on unstructured data by its very nature. NoSQL databases do not strongly validate the data; very often, validation is omitted up to a point. There are four basic models of data storage in NoSQL databases (Key-Value, Document, Graph, and Column Family). What NoSQL solutions do very well is process information on a larger scale than the traditional ones; on top of that, they are designed for horizontal scaling from the very beginning. The data can easily be replicated to other servers in such a way that, in case of a problem with a single server, the user will be serviced by another one without ever knowing it. What is more, NoSQL databases also provide sharding: segmenting the data according to criteria such as location, so that the most often accessed data can be served faster in particular locations.
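As an illustration of how sharding and replication cooperate, here is a toy hash-based router in Python. The shard count, the replica placement rule and the key format are simplified assumptions for the sketch, not how any particular NoSQL database does it internally:

```python
# Toy hash-based sharding with replication: each key routes to a primary
# shard, and copies land on the next shard(s) so reads survive a failure.
import hashlib

N_SHARDS = 4
REPLICAS = 2
shards = [dict() for _ in range(N_SHARDS)]  # each shard is its own "server"

def shard_for(key: str) -> int:
    """Deterministically route a key to a shard."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

def put(key: str, doc: dict):
    """Write to the primary shard and to the next shard(s) as replicas."""
    primary = shard_for(key)
    for i in range(REPLICAS):
        shards[(primary + i) % N_SHARDS][key] = doc

def get(key: str, failed: frozenset = frozenset()) -> dict:
    """Read from the primary; fall back to a replica if it is down."""
    primary = shard_for(key)
    for i in range(REPLICAS):
        server = (primary + i) % N_SHARDS
        if server not in failed:
            return shards[server][key]
    raise KeyError(key)

put("plane:B737-A", {"status": "in-flight"})
# Even if the primary server fails, the user is serviced by a replica:
doc = get("plane:B737-A", failed=frozenset({shard_for("plane:B737-A")}))
print(doc)  # {'status': 'in-flight'}
```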
Let’s take a document NoSQL database (one whose data model is the document). This particular type of database allows us to store everything as a document with a changeable structure. But this is not a solution to our problems yet; having no rigid schema does not mean having no problems at all – it only means we do not have the particular set of problems discussed in the earlier parts of this paper. We cannot simply store everything incoming inside a JSON document; that by itself would not solve anything. For example, if we store every incoming piece of information in the database in real time as a JSON document, most operations we perform are Store operations. If we Store absolutely everything, will we even have an option to ever Read? It is pretty apparent that in this case we can read things only from time to time, which means real-time access becomes impossible; such a model will support only the Analytical Mode, and even that only if the application connected to the database is properly designed for it.
Big Data is not about storing data. Big Data is the problem of dealing with incoming information and processing it into knowledge in the context of the domain knowledge of the business solution we are working on. Each data stream, every Big Data problem is different, and focusing on the tool used to store information is not a good choice – we have to focus on the problem to be solved and select a tool which corresponds to this particular set of problems.
The Document in the example above is ultimately just an abstraction and a way to store the data, nothing more. We need to create our own model, which shall eventually be stored as a Document, but the model itself needs to correspond to our problem. Not a schema, not a rigid structure, but a model of storing the incoming data and processing it according to the problem we have, within a particular, selected solution in the form of (in this case) a Document NoSQL database.
Let’s get back to the Boeing 737 from the earlier example. If we had a sufficiently large number of sensors storing information to the central database, we would have on the order of 1.00E+08 writes per second. This means we have access to all the data we consider (or will consider) relevant, but the constant streaming of data into the database means the read operations are hard to perform.
What we could do is modify the application governing the sensors to aggregate sets of data before they are sent to the central database. This means far fewer writes (which allows us to perform more reads), but at the cost of losing a lot of information – we no longer operate on raw data, but on data aggregated per sensor over a specified unit of time. Of course, this can be tuned, and we can modify the stored document to add some information.
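A sketch of that sensor-side aggregation follows; the window length, the field names and the choice of summary statistics are illustrative assumptions. One document per sensor per time window replaces hundreds of raw writes:

```python
# Sensor-side aggregation: collapse a window of raw readings into a
# single summary document before it is sent to the central database.
from statistics import mean

def aggregate_window(sensor_id: str, window_start: int,
                     readings: list) -> dict:
    """Build the one document that will be written for this window."""
    return {
        "sensor": sensor_id,
        "window_start": window_start,   # e.g. a Unix timestamp
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }

# 600 raw readings (one per 100 ms over a minute) become a single write:
raw = [20.0 + (i % 5) * 0.1 for i in range(600)]
doc = aggregate_window("engine-temp-1", 1_600_000_000, raw)
print(doc["count"], doc["min"], doc["max"])  # 600 20.0 20.4
```

The raw data is gone after this step; whether that loss is acceptable is exactly the business tradeoff discussed above.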
These are the tradeoffs we have to consider. In the case of the black box of a Boeing 737, we can cautiously assume it is more important to write everything than to be able to read more often (as a black box is read very rarely), but in the case of a control tower receiving information from a dozen different planes, some form of aggregation would probably be preferred. Once again, it is the business problem that matters in making this selection, not the potential capabilities of the selected technology.
One thing we have to be aware of: there is a point after which even NoSQL solutions will not be able to cope with our demands. No matter which solution we select, there is always a point after which another solution needs to be selected, each with a different set of tradeoffs and innate problems.
This topic will be continued in the next paper from this series in which we will consider the Hadoop family of solutions.
The solutions we have briefly looked at do solve the problems we face today, exactly the way the solutions we used to have solved the problems of yesteryear. This implies there is no guarantee our solution will survive the test of time. It is pretty likely that in a few years we will have to look for something new, something which will solve problems we are not even aware of yet.
The businesses of this world are changing rapidly, evolving in ways we are not aware of and cannot really predict – and this is why Big Data is such a fascinating phenomenon.
Currently, we are able to manage the uncertainty and the Volume – Velocity – Variety of Big Data with many capable tools (many of them Open Source), such as Hadoop or NoSQL databases. What is more, we are able to build an application so that it combines the strengths of different tools, making them work together, and so we are able to tailor the solution to the needs of our particular business user.
At this moment, we are able to tame the wild, unstructured stream of information. We are also aware of the ever-changing nature of Big Data. Forewarned is forearmed, and therefore we can look to the future with optimism. No matter what happens, we will be able to adapt – because we always do.