Arush Sharma
6 min read · Apr 10, 2019


Big Data, Data Science and Process Mining


Today many people say that data is the new oil, and this illustrates the incredible amounts of data that we are collecting, and the corresponding value. Think about how much data was generated from prehistoric times until 2003. Today, we are able to generate that same amount of data in just ten minutes.

So, in ten minutes we now generate as much data as humanity did from prehistoric times until 2003. This illustrates the incredible growth of data.

So, what kind of data is this?

What kind of event data are we generating?

Well, we generate data when we buy a cup of coffee with our credit card. When we make a phone call, we generate data. When we get a speeding ticket, we generate data. There are many other examples, showing that we are generating data all the time.

Even while you are reading this article, you are generating data, because all kinds of things are being recorded, in all kinds of ways. Let's look at this Internet of Events in more detail.

What does it consist of?

One can talk about four different sources of event data. The first source of event data is the Internet of Content. This is the classical internet that we know from Google and Wikipedia, and when people talk about big data, they are typically talking about this internet. But next to this classical Internet of Content, we now also have an Internet of People: Twitter messages, Facebook, all kinds of social events that generate data.

Then we have the Internet of Things, another source of event data. Today many devices are already connected to the internet, and in the future many more will be. Your shaving device, your refrigerator, everything will eventually be connected to the internet, and this will generate large amounts of data. Last but not least, there is the Internet of Places.

When you are using your mobile phone, as I just illustrated, the phone contains all kinds of sensors that are recording where you are, and what you are doing. And this is another source of information. So this is why today many people talk about big data.

Incredible amounts of event data are being recorded.
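To make "event data" concrete, here is an illustrative sketch of the kind of record actions like buying a coffee or making a phone call might leave behind. The field names are invented for this sketch, not taken from any particular system, but the ingredients (who, what, when, where) are what most recorded events boil down to:

```python
from datetime import datetime

# An illustrative event record. The field names are invented for
# this sketch, but who/what/when/where is the essence of event data.
event = {
    "subject": "customer-4711",                 # who triggered the event
    "activity": "buy coffee",                   # what happened
    "timestamp": datetime(2019, 4, 10, 8, 30),  # when it happened
    "channel": "credit card terminal",          # where it was recorded
}

print(event["activity"], "at", event["timestamp"])
```

Each of the four internets above produces streams of records roughly like this one, differing mainly in what the subject and channel are.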

When people talk about big data, they typically talk about the exponential growth of data. A classic illustration is the exponential growth of the number of transistors on a chip. This was predicted by Gordon Moore, co-founder of Intel, and we can see that this exponential growth has continued over the last 40 years.

Every two years, the number of transistors on a chip doubles.

So that means a chip now holds one million times the number of transistors it did 40 years ago. We see this not only in the number of transistors, but also in computing speed, hard disk capacity, and the number of bytes you get for a dollar or a euro. This shows the incredible growth. If other fields had seen the same growth, we would see very surprising things.
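The arithmetic behind that "one million times" figure is worth a quick check: a doubling every two years over 40 years means 20 doublings, and 2 to the power 20 is just over a million.

```python
# Moore's law back-of-the-envelope: one doubling every two years
# over 40 years gives 40 / 2 = 20 doublings in total.
doublings = 40 // 2
growth_factor = 2 ** doublings

print(doublings)      # 20
print(growth_factor)  # 1048576, i.e. roughly one million
```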

These examples show the incredible growth of data, and why people talk about big data. The challenge today is not to generate more data, but to turn this data into real value, and this is a crucial topic.

People often talk about the four Vs of Big Data.

The first V is the V of volume, which I just explained: we are generating incredible amounts of data. But that's not the only challenge.

The second challenge is velocity. We are not only generating large amounts of data; new data is continuously being added, and things are changing very rapidly.

The third challenge is variety. It is not one type of data: we are confronted with many different types, ranging from text to images to other data trails. And we need to combine all these different sources of information.

Last, but not least, a problem of big data is veracity: you cannot be completely sure that what you have recorded is completely accurate. For example, your shaving device of the future will have an internet connection. Somebody has bought that device, and we record events describing how it is being used. But can we be sure that the person who purchased the shaving device is the person actually using it? That is the kind of uncertainty you see when you collect data on a very large scale, and you need to be able to deal with it.

I just spoke a lot about big data, but data doesn't have to be big to be challenging. Data analytics questions are everywhere, and that is why there is a very urgent demand for data scientists.

So, what do these data scientists do?

What is their profession?

What is their task?

Well, their goal is to collect, analyze and interpret data from a variety of sources, and I have already given you several examples. That is why this will become a very important profession in the future. The main goal is to turn data into value, and in this article we will focus on this particular theme.

If you look at data science, there are four generic data science questions, that you can ask in any situation.

The first data science question is the question: what happened? If we record event data, we can actually see that things such as bottlenecks and deviations have happened.

Then the second logical question is to ask yourself: why did it happen? Why was there this delay? Why did people deviate from the expected path? These questions are about the past. But of course data science also aims to answer questions about the future.

So if you look at the future, you also ask yourself the question: what will happen?

What can we learn from historic information to make predictions about what is happening at this point in time?

And then last but not least, the fourth question of data science is to ask yourself: what is the best that can happen?

So the importance of data science hopefully is obvious. It's also different from data mining: we are not interested in just isolated decisions or low-level patterns. We are interested in improving end-to-end processes.

That is a key thing.

So if we take this process-centric view on data science and return to the examples I gave before, we can see that in a hospital setting there are many process-related questions we can ask about care flows. If we look at an X-ray machine, we can see that many processes unfold inside such a machine; we would like to analyze and understand them, and they generate terabytes of data. So we have a rich source of information to do so.

Let me also sketch some use cases for process mining.

The first use case is to ask yourself the question, what is the process that people really follow?

What do they really do?

Not what they say they do, but what they really do?

What are the bottlenecks?

Where are they?

What is causing them?

Where and why do people or machines deviate from an expected, or an idealized process?

These are key questions, and these are just a few examples.

There are many more examples.

What are the highways in my process?

Which factors are influencing a bottleneck? And so on.
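As a sketch of what answering such questions can look like in practice (the event log below is invented, and the code is a minimal illustration rather than a full process mining tool), one can derive the directly-follows behavior ("what process do people really follow?") and average waiting times between steps ("where are the bottlenecks?") from a log of (case, activity, timestamp) records:

```python
from collections import defaultdict
from datetime import datetime

# A toy event log: each event is (case id, activity, timestamp).
# The cases and activities are invented for illustration.
log = [
    ("case-1", "register", datetime(2019, 4, 1, 9, 0)),
    ("case-1", "check",    datetime(2019, 4, 1, 9, 30)),
    ("case-1", "decide",   datetime(2019, 4, 1, 12, 0)),
    ("case-2", "register", datetime(2019, 4, 1, 10, 0)),
    ("case-2", "check",    datetime(2019, 4, 1, 10, 20)),
    ("case-2", "decide",   datetime(2019, 4, 1, 14, 0)),
]

# Group events per case, ordered by time.
traces = defaultdict(list)
for case, activity, ts in sorted(log, key=lambda e: (e[0], e[2])):
    traces[case].append((activity, ts))

# "What process do people really follow?" -> count how often one
# activity directly follows another, across all cases.
# "Where are the bottlenecks?" -> average waiting time per step.
follows = defaultdict(int)
waits = defaultdict(list)
for events in traces.values():
    for (a, t1), (b, t2) in zip(events, events[1:]):
        follows[(a, b)] += 1
        waits[(a, b)].append((t2 - t1).total_seconds() / 60)

for (a, b), n in follows.items():
    avg = sum(waits[(a, b)]) / n
    print(f"{a} -> {b}: {n} times, avg wait {avg:.0f} min")
    # register -> check: 2 times, avg wait 25 min
    # check -> decide: 2 times, avg wait 185 min
```

On this tiny log, the long average wait before "decide" is exactly the kind of bottleneck signal the questions above are asking about; real process mining tools apply the same idea to millions of events.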

So process mining is data science in action. We look at the dynamics of machines and business processes, and we try to learn from them and improve them.
