The Dirty Data Problem
Poor-quality data is a huge problem. It leaves many companies trying to navigate the information age in the equivalent of a horse and buggy.
Bruce Rogers (Leopold 2017)
Dirty data, also known as bad data or poor-quality data, is data that is inaccurate, incomplete, outdated, invalid, irrelevant, corrupt, duplicated, insecure, or inconsistent. The volume of dirty data is mounting with the exponential growth in database sizes and the continued proliferation of information systems. Dirty data can have substantial adverse organizational impacts, including but not limited to financial, operational, and reputational damage.
According to an IBM survey in 2016, dirty data was estimated to cost the US economy over $3 trillion every year.
Poor data quality is a primary reason for 40 percent of all business initiatives failing to achieve their targeted benefits (Friedman and Smith 2011).
From executive-level decisions about merger and acquisition activity to a call-center representative making a split-second decision about customer service, the information an enterprise collects on virtually every aspect of the organization (customers, prospects, products, inventory, finances, assets, or employees) can have a significant effect on the organization’s ability to satisfy customers, reduce costs, improve productivity, or mitigate risks (Dorr and Murnane 2011; Mahanti 2014). Given the magnitude of the adverse financial and non-financial impacts that can be associated with dirty data, it is a matter of great concern.
Is Dirty Data the Cause or the Effect?
Is dirty data the symptom or the disease itself?
Is data the victim or the culprit?
Is dirty data a symptom of a problem that has nothing to do with data?
If bad data is the symptom or effect, what caused bad data in the first place?
I ran a two-week poll in the Data Quality and Metadata Management group, a data-focused LinkedIn group with more than 6,000 data quality and data governance leaders, practitioners, and advocates across the globe. The question was “Is Dirty Data the Cause or the Effect?” and the responses are shown in the screenshot below.
The poll closed with 58 votes from data quality and data governance professionals, and the details are as follows:
28% of the respondents chose the option "Cause,"
55% of the respondents chose the option "Effect,"
17% of the respondents chose the "Others (please comment)" option. Most of these respondents commented that dirty data is both the cause and the effect.
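As a quick sanity check on the figures above, the reported percentages can be converted back into approximate vote counts. The counts in this sketch are inferred by rounding, not reported by the poll, and the variable names are illustrative.

```python
TOTAL_VOTES = 58  # total votes reported for the poll
reported_percentages = {"Cause": 28, "Effect": 55, "Others": 17}

# Inferred (not reported) vote counts: percentage of the total, rounded.
implied_votes = {
    option: round(TOTAL_VOTES * pct / 100)
    for option, pct in reported_percentages.items()
}

# Convert the inferred counts back to whole percentages as a cross-check.
recomputed = {
    option: round(100 * votes / TOTAL_VOTES)
    for option, votes in implied_votes.items()
}

print(implied_votes)                # {'Cause': 16, 'Effect': 32, 'Others': 10}
print(sum(implied_votes.values()))  # 58
print(recomputed == reported_percentages)  # True
```

The inferred counts add up to the 58 votes reported, so the rounded percentages are internally consistent.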
Dirty data is a major cause of bad decisions, which originate from incorrect results and conclusions derived from analyzing and processing dirty data. In a few scenarios, bad data itself is the root cause, that is, the culprit. However, in most cases, while bad data might look like the immediate cause or culprit, digging deeper often reveals underlying causes behind the bad data, making it the symptom, effect, or victim. It is much like a cold and cough, which can be the disease itself or a symptom of a more serious disease like cancer.
What Are the Causes of Dirty Data?
We have all heard of GIGO (Garbage In, Garbage Out). But what caused the GI, that is, the “Garbage In,” in the first place?
Dirty data or data quality issues can be found at all levels and within all the components of information, including definition, content, and presentation (Scarisbrick-Hauser and Rouse 2007). Data issues can sneak into every phase of the data life cycle: initial data creation and procurement or collection through touch points such as the Internet, call centers, sensors, mailings, and branch offices; data processing, transfer, and storage; and archiving and purging. Some causes of dirty data are summarized as follows:
Manual data entry can give rise to data issues. Human beings are prone to errors, and typographical errors are common.
Data capture in many transactional systems is not always implemented with well-thought-out validation, resulting in dirty data entering the system.
Data aging, or the degradation of certain data over time, combined with a lack of data quality management practices and governance processes to keep data current, results in bad data quality. While time has no impact on data that remain constant, like date of birth or place of birth, other data, for example contact data, need regular maintenance. To ensure that data are up to date, it is important to set guidelines for how often each field should be updated.
Poorly designed and implemented business processes, which result in a lack of training, coaching, and communication in the use of the process and in unclear definitions of process ownership, roles, and responsibilities, have an adverse impact on data quality.
Data purging may accidentally affect the wrong data, purge relevant data, or purge more or less data than intended, resulting in data quality consequences.
Data cleansing programs, while resolving old data quality issues, might introduce new data quality issues.
Data migration, system upgrades, and data integration can introduce bad data.
Organizational changes like corporate mergers and acquisitions, globalization, restructuring or reorganization, or external changes also result in bad data.
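Two of the causes above, weak validation at data capture and data aging, lend themselves to simple automated checks. The following is a minimal Python sketch, not a definitive implementation: the CustomerRecord fields, the email pattern, and the one-year refresh interval are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical rules for illustration; real rules depend on the system.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAX_CONTACT_AGE = timedelta(days=365)  # assumed refresh interval for contact data


@dataclass
class CustomerRecord:
    name: str
    email: str
    date_of_birth: date         # constant over time, never goes stale
    contact_verified_on: date   # last time the contact details were confirmed


def validate_at_capture(record: CustomerRecord, today: date) -> list[str]:
    """Return the problems found; an empty list means the record passes."""
    problems = []
    if not record.name.strip():
        problems.append("name is missing")
    if not EMAIL_PATTERN.match(record.email):
        problems.append("email is not well formed")
    if record.date_of_birth > today:
        problems.append("date of birth is in the future")
    return problems


def is_contact_stale(record: CustomerRecord, today: date) -> bool:
    """Flag contact data not verified within the assumed refresh interval."""
    return today - record.contact_verified_on > MAX_CONTACT_AGE


today = date(2024, 1, 15)
typo_record = CustomerRecord(
    "Ann Lee", "ann.lee@example", date(1985, 3, 2), date(2023, 11, 1)
)
old_record = CustomerRecord(
    "Raj Rao", "raj.rao@example.com", date(1979, 7, 9), date(2021, 6, 30)
)

print(validate_at_capture(typo_record, today))  # ['email is not well formed']
print(is_contact_stale(old_record, today))      # True
```

In practice, the validation rules and refresh intervals would come from the organization's own data quality guidelines, with a separate interval set for each field that ages.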
Concluding Thoughts
While it is best to have robust processes to capture correct data, no matter how good or robust an organization’s processes are, bad data is bound to creep into its digital ecosystem. However, good-quality data are critical to good decisions. Hence, data must be carefully managed to ensure they are of adequate quality. Achieving this level of success requires concrete data management practices, including data quality management and data governance.
To learn more about data quality, including causes of bad data, how to measure data quality dimensions, implement methodologies for data quality management, and data quality aspects to consider when undertaking data intensive projects, read Data Quality: Dimensions, Measurement, Strategy, Management and Governance (ASQ Quality Press, 2019). This article draws significantly from the research presented in that book.
If you have any questions or any inputs you want to share, comment here or connect on LinkedIn.
Biography: Rupa Mahanti is a consultant, researcher, speaker, data enthusiast, and author of several books on data (data quality, data governance, and data analytics). You can connect with Rupa on LinkedIn or ResearchGate (ResearchGate has most of her published work, some of which can be downloaded for free).
References
Dorr, B., and R. Murnane. 2011. Using data profiling, data quality, and data monitoring to improve enterprise information. Software Quality Professional 13, no. 4:10-18.
Friedman, T., and M. Smith. 2011. Measuring the business value of data quality. Gartner ID# G00218962. Last available at: http://www.data.com/export/sites/data/common/assets/pdf/DS_Gartner.pdf.
Mahanti, R. 2014. Critical success factors for implementing data profiling: The first step toward data quality. Software Quality Professional 16.
Mahanti, R. 2019. Data Quality: Dimensions, Measurement, Strategy, Management and Governance. ASQ Quality Press, p. 526.