Data Veracity – Is it OK to Overlook?
In today’s Internet of Things (IoT), Big Data is the result of connected devices sending information to the cloud. Data is generated from a variety of sources, such as vehicle performance, web history and medical records. Together, these sources offer an opportunity to gain insight into trends. Data scientists break big data into four dimensions: volume, velocity, variety and veracity.
As a real-world metaphor, data is like water flowing through pipes. Before reaching our homes for use, huge volumes of water flow from different sources at a high velocity, carrying a variety of minerals depending on the source.
As long as pure water flows through all the pipes at every level until it reaches our homes, we continue to get safe drinking water for a healthy life. If one of the sources becomes contaminated, it affects the water quality (veracity), and the water has to be assessed and purified.
To me, data flows like water. In today’s world of integrated business systems, a variety of data flows between information systems at high velocity and in high volume. Many data scientists and big data practitioners analyze this data to derive intelligence for better business decisions or for autonomous devices. While we focus on solving big data problems, do we often overlook the veracity, or quality, of the data?
We are entering the era of autonomous devices. We are developing robots as our personal assistants and autonomous vehicles as our personal chauffeurs. We “train” these devices on big data to better meet our needs. What if the veracity of the training data is not guaranteed, and the devices are fed low-quality information? Imagine how these autonomous devices are going to behave!
Many organizations spend a lot of money on predictions based on historical data sets, using statistical and machine learning algorithms. It is much like the way we predict the weather or identify the likelihood of crime, theft or accidents. Do you think we can predict accurately if there are problems with the veracity of the historical data?
Take, for example, what poor data veracity could cost a delivery organization. If the data is of low quality (such as an incorrect, incomplete or illegible address), the delivery service has to spend time and money making corrections or returning the item to the sender, or it risks delivering the item to an unintended party. Again, the problem could be averted if the veracity of the data were high.
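To make the delivery example concrete, here is a minimal sketch of a completeness check on an address record. The field names, the `address_problems` helper and the assumption that postal codes follow the US ZIP or ZIP+4 format are all hypothetical, chosen only for illustration; a real delivery system would have its own schema and rules.

```python
import re

# Hypothetical schema: the fields a delivery record is assumed to need.
REQUIRED_FIELDS = ("name", "street", "city", "postal_code")

# Assumption: postal codes are US ZIP (12345) or ZIP+4 (12345-6789).
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def address_problems(record):
    """Return a list of data quality problems found in one address record."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Treat missing keys, None and blank strings as incomplete data.
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    postal = str(record.get("postal_code") or "").strip()
    # Only check the format when a postal code was actually supplied.
    if postal and not ZIP_PATTERN.fullmatch(postal):
        problems.append("malformed postal_code")
    return problems


good = {"name": "A. Jones", "street": "12 Main St",
        "city": "Springfield", "postal_code": "12345"}
bad = {"name": "A. Jones", "street": "",
       "city": "Springfield", "postal_code": "1234"}

print(address_problems(good))  # []
print(address_problems(bad))   # ['missing street', 'malformed postal_code']
```

A check like this catches incomplete or malformed records before a parcel is dispatched, but it cannot tell whether a well-formed address is actually the right one.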
Just as clean water is important for a healthy human body, “Data Veracity” is important for the good health of data-fueled systems.
When dealing with high volumes of data, it is practically impossible to validate the veracity of data sets using manual or traditional quality techniques. Instead, we can assess the veracity of high-volume data sets using data science techniques, such as clustering and classification, to identify data anomalies and improve the accuracy of data-fueled systems.
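As a rough illustration of the clustering idea, the sketch below groups one-dimensional readings by splitting the sorted values wherever the gap between neighbours exceeds a threshold (in one dimension this is equivalent to single-linkage clustering with a distance cutoff). Values that land in very small clusters are flagged as possible anomalies. The function names, the sample readings and the thresholds are assumptions made for this example, not a recommended configuration.

```python
def cluster_1d(values, max_gap):
    """Group values into clusters, starting a new cluster whenever the
    gap between consecutive sorted values exceeds max_gap."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] > max_gap:
            clusters.append([value])      # gap too large: new cluster
        else:
            clusters[-1].append(value)    # close enough: same cluster
    return clusters


def find_anomalies(values, max_gap, min_cluster_size):
    """Flag values that belong to clusters smaller than min_cluster_size."""
    anomalies = []
    for cluster in cluster_1d(values, max_gap):
        if len(cluster) < min_cluster_size:
            anomalies.extend(cluster)
    return anomalies


# Hypothetical sensor readings: most sit near 20, two do not.
readings = [20.1, 19.8, 20.4, 21.0, 20.7, 95.0, 19.9, 20.3, -40.0]
suspect = find_anomalies(readings, max_gap=2.0, min_cluster_size=3)
print(suspect)  # [-40.0, 95.0]
```

Flagged values are only candidates for review: an isolated reading may be a genuine rare event rather than bad data, so the thresholds need tuning against the data set in question.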
While we all appreciate that technology is evolving fast, we need specialists to extract intelligence from the data flowing between information systems across all industries. I highly recommend drawing on the skills of a Big Data practitioner or a Data Scientist to understand the importance of your Data Veracity, especially as we try to solve today’s problems in Big Data and autonomous devices.