What is Dirty Data and Why is it a Problem?
Dirty data, also known as bad data, is data that is inaccurate, incomplete, outdated, invalid, irrelevant, corrupt, duplicated, insecure, or inconsistent.
The term "dirty data" is not just an emphatic metaphor, but a matter of great concern, given the magnitude of adverse financial and non-financial impacts that can be associated with it.
Ironically, while we live in a data-driven age, with huge volumes of data being created every day, only 16 percent of business executives are certain and confident about the accuracy of the data that informs their decisions (RMIT Online). Experian reports that companies around the world believe that, on average, 26 percent of their data is inaccurate or corrupt.
Dirty data is indeed a widespread and growing problem in every industry, and there is a high cost associated with low-quality data. Hence, data needs to be of high quality, and data hygiene is a necessity. Ignoring data quality issues can have substantial adverse organizational impacts, including but not limited to financial, operational, and reputational damage.
How to Tackle the Dirty Data Problem
Given the vast amounts of data that organizations collect and store, assessing, improving, and maintaining the quality of that data feels like an overwhelming exercise. Hence, it is best to have robust processes to capture correct data in the first place. However, no matter how good an organization’s data strategy is or how robust its processes are, bad data is bound to creep into its digital ecosystem. This is because data travels between several systems with several touchpoints that can result in contamination.
There are several data quality solutions available that can help assess, improve, and monitor your data quality. The 2022 Gartner® Magic Quadrant™ for Data Quality Solutions recognised 16 data quality solution vendors and showed the leaders, challengers, niche players, and visionaries (See Figure 1).
As per Gartner, data quality (DQ) solutions are "a set of processes and technologies for identifying, understanding, preventing, escalating, and correcting issues in data that supports effective decision making and governance across all solutions available in the market and include a range of critical functions, such as data profiling, parsing, standardization, cleansing, matching, monitoring, rule creation, and analytics, as well as built-in workflow, knowledge bases, and collaboration business processes."
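To make two of the functions named in that definition more concrete, here is a minimal sketch of data profiling (how complete is each field?) and a very naive form of matching (which key values repeat once case and surrounding whitespace are ignored?). It uses only the Python standard library. The function name `profile_records`, the sample rows, and the field names are hypothetical and chosen for illustration; commercial data quality solutions do far more than this.

```python
from collections import Counter

def profile_records(rows, key_field):
    """Return simple data quality metrics for a list of dict records."""
    total = len(rows)
    fields = sorted({field for row in rows for field in row})

    # Completeness: share of rows where the field is present and non-empty.
    completeness = {}
    for field in fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        completeness[field] = filled / total if total else 0.0

    # Naive matching: normalize the key and count values that repeat.
    keys = [
        str(value).strip().lower()
        for value in (row.get(key_field) for row in rows)
        if value not in (None, "")
    ]
    duplicate_keys = {k: n for k, n in Counter(keys).items() if n > 1}

    return {
        "rows": total,
        "completeness": completeness,
        "duplicate_keys": duplicate_keys,
    }

# Hypothetical sample data with a missing email, an empty city, and a duplicate.
sample_rows = [
    {"id": 1, "email": "ann@example.com", "city": "Perth"},
    {"id": 2, "email": "ANN@example.com ", "city": ""},
    {"id": 3, "email": None, "city": "Oslo"},
    {"id": 4, "email": "bob@example.com", "city": "Lima"},
]
report = profile_records(sample_rows, key_field="email")
print(report)
```

A report like this only detects issues. Deciding which duplicate to keep, or what a missing value should be, still needs business rules and people who own the data.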
Are data quality solutions the complete cure for dirty data?
I put this very question to the Data Quality and Metadata Management group, a data-focused LinkedIn group with more than 6,000 data quality and data governance leaders, practitioners, and advocates across the globe, in a 7-day poll. The responses are shown in the screenshot below.
The poll closed with 92 votes from data quality and data governance professionals, and the details are as follows:
The vast majority (83%) chose the option "No," that is, data quality solutions are not the complete cure for dirty data. Data quality solutions can help detect and correct data quality issues and, in some cases, even prevent dirty data from entering the system, but they do not remove the need for adequate processes that keep data issues from creeping into an organization’s digital ecosystem in the first place.
13% chose the option "Yes," that is, that data quality solutions are the complete cure for dirty data. This response indicates that for these respondents, data quality solutions have been effective in attaining the desired level of data quality.
A very small 4% chose the "Other (please comment)" option.
Data Quality Solutions: Concluding Thoughts
As stated by Saharsh Jain, Manager, Customer Success at RoutineAI, in his recent LinkedIn article, The Classic Mistake of Measuring a Mountain's Height with a Tape:
However, just like the classic mistake of measuring the mountains' heights using a measuring tape, relying on manual and ad-hoc methods to measure data quality is not the right way.
While data quality solutions are not the complete cure for bad data, they are nonetheless needed to assess and correct data quality issues that have already crept in. Data quality solutions and tools should be chosen with care. There are a number of data quality solutions and tools available on the market from different vendors. You need to see which tool best suits your organization’s data landscape and business use cases and whether it has the features and capabilities to fulfil the requirements.
Also, while DQ solutions automate data profiling and correction, they come with a learning curve, take considerable time to implement, and require skilled professionals.
Saharsh Jain recently conducted a poll on LinkedIn with the question:
“What is the most significant drawback of existing data quality solutions according to you?”
57% of the respondents indicated long implementation time as the most significant drawback of existing data quality solutions. For details, refer to the article The Classic Mistake of Measuring a Mountain's Height with a Tape.
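The cost of dirty data is often summarized by the 1-10-100 rule: the ratio of prevention cost to correction cost to failure cost is 1:10:100. In other words, correcting a data issue after the fact costs about ten times as much as preventing it, and leaving it to cause a failure costs about a hundred times as much.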
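As a rough worked example of the 1:10:100 ratio, the sketch below compares the relative cost of handling the same number of bad records at each stage. The function name, the figure of 1,000 bad records, and the unit cost of 1.0 per prevented record are hypothetical and chosen purely for illustration; real costs vary by organization.

```python
# Relative cost weights taken from the 1:10:100 ratio.
COST_WEIGHTS = {"prevention": 1, "correction": 10, "failure": 100}

def total_cost(record_counts, unit_cost=1.0):
    """Return the total cost of handling bad records at each stage.

    record_counts maps a stage name to the number of bad records
    handled at that stage. unit_cost is the (hypothetical) cost of
    preventing a single bad record.
    """
    return sum(
        COST_WEIGHTS[stage] * count * unit_cost
        for stage, count in record_counts.items()
    )

# Hypothetical scenario: the same 1,000 bad records, handled three ways.
all_prevented = total_cost({"prevention": 1000})
all_corrected = total_cost({"correction": 1000})
all_failed = total_cost({"failure": 1000})

# A mixed case: most issues prevented, some corrected, a few missed.
mixed = total_cost({"prevention": 900, "correction": 90, "failure": 10})

print(all_prevented, all_corrected, all_failed, mixed)
```

Even in the mixed case, the ten records that slip through to failure cost more than the 900 that were prevented, which is why effort spent at the point of entry tends to pay off.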
Hence, the aim should be to identify and address the root cause(s) of the dirty data so that it does not enter an organization’s digital ecosystem in the first place. If that is not possible, processes should define validation rules that either stop bad data from entering the system or correct it at the source, as soon as it enters.
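A validation rule at the point of entry can be as simple as the sketch below, which checks a record before it is saved and returns the list of problems found. The record layout (`name`, `email`, `birth_date`), the function name, and the deliberately loose email pattern are assumptions made for illustration, not rules prescribed by any particular data quality solution.

```python
import re
from datetime import date

# Loose, illustrative pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_customer(record, today=None):
    """Return a list of validation errors; an empty list means the record passes."""
    today = today or date.today()
    errors = []

    # Completeness rule: name must be present and non-blank.
    name = (record.get("name") or "").strip()
    if not name:
        errors.append("name is required")

    # Validity rule: email must look like an email address.
    email = (record.get("email") or "").strip()
    if not EMAIL_PATTERN.match(email):
        errors.append("email is not well formed")

    # Plausibility rule: birth date must be a date and not in the future.
    birth_date = record.get("birth_date")
    if not isinstance(birth_date, date):
        errors.append("birth_date must be a date")
    elif birth_date > today:
        errors.append("birth_date is in the future")

    return errors

# Hypothetical records submitted through an entry form.
good = {"name": "Ann Lee", "email": "ann@example.com", "birth_date": date(1990, 5, 17)}
bad = {"name": "  ", "email": "ann.example.com", "birth_date": date(2999, 1, 1)}

good_errors = validate_customer(good, today=date(2024, 1, 1))
bad_errors = validate_customer(bad, today=date(2024, 1, 1))
print(good_errors)
print(bad_errors)
```

Returning every error at once, rather than stopping at the first one, lets the person entering the data fix the record in a single pass at the source.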
Last but not least, there is no one-size-fits-all approach to dealing with dirty data. Usually, data quality issues call for a hybrid approach that mixes proactive, preventive, and reactive measures.
To learn more about data quality, including how to measure data quality dimensions, how to implement methodologies for data quality management, and which data quality aspects to consider when undertaking data-intensive projects, read Data Quality: Dimensions, Measurement, Strategy, Management and Governance (ASQ Quality Press, 2019). This article draws significantly from the research presented in that book.
If you have any questions or any inputs you want to share, comment here or connect on LinkedIn.
Biography: Rupa Mahanti is a consultant, researcher, speaker, data enthusiast, and author of several books on data (data quality, data governance, and data analytics). You can connect with Rupa on LinkedIn or ResearchGate (ResearchGate has most of her published work, some of which can be downloaded for free).
One way to manage data quality is to implement a data quality management system. DAMA-NL defined one based on ISO 9001. You can find it on this wiki:
https://datamanagement.wiki/overview/overview_data_quality_management_system