Is Data Accuracy the Most Important Data Quality Dimension?
Data, Data Quality and Data Quality Dimensions
Data quality has been defined as fitness for use or purpose for a given context or specific task at hand. Although this definition captures the principle of quality, it is abstract, which makes data quality hard to measure directly.
As the management guru Peter Drucker famously said,
“If you can't measure it, you can't manage it.”
Data quality (DQ) dimensions are intangible characteristics of data that can be measured or assessed against a set of defined standards in order to determine the quality of data. Assessing DQ dimensions helps in improving the quality of data. There are several data quality dimensions; the most common ones (as per Google research) are completeness, accuracy, consistency, uniqueness, timeliness, and validity.
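To make "measured against a set of defined standards" concrete, here is a minimal Python sketch that scores three of the dimensions above (completeness, uniqueness, and validity) as simple ratios. The records, field names, and the "birth date must not be in the future" rule are all hypothetical assumptions for illustration, and real definitions of each dimension vary between organizations.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "birth_date": date(1980, 5, 17)},
    {"id": 2, "email": None, "birth_date": date(1975, 1, 3)},
    {"id": 2, "email": "b@example.com", "birth_date": date(2090, 1, 1)},
    {"id": 4, "email": "c@example.com", "birth_date": None},
]


def completeness(rows, field):
    """Share of rows in which the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among the populated values of the field."""
    values = [r[field] for r in rows if r[field] is not None]
    return len(set(values)) / len(values) if values else 1.0


def validity(rows, field, rule):
    """Share of populated values that satisfy a validation rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(rule(v) for v in values) / len(values) if values else 1.0


# Fixed "today" so the example is reproducible (an assumption of this sketch).
today = date(2024, 1, 1)

email_completeness = completeness(records, "email")
id_uniqueness = uniqueness(records, "id")
birth_date_validity = validity(records, "birth_date", lambda d: d <= today)

print(email_completeness, id_uniqueness, birth_date_validity)
```

Note that a value can pass a validity rule and still be wrong: a plausible but incorrect birth date is valid yet inaccurate, which is why accuracy needs its own kind of check.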
Is Data Accuracy the Most Important DQ Dimension?
Data accuracy refers to how closely the data stored in a system reflect reality. It is the degree to which data correctly describe the characteristics of the real-world object, entity, situation, phenomenon, or event. Measuring data accuracy requires that an authoritative source of reference be identified and available to compare the data against. If the data shows that Brik Bird’s date of birth is June 12, 1962, but his actual date of birth is June 12, 1926, then the data is inaccurate. However, without an authoritative source of reference, such as an identification document that contains the date of birth (for example, a passport, birth certificate, or license), it is not possible to ascertain Brik Bird’s birth date.
A common misconception is that data quality is synonymous with data accuracy, or that data quality is only about data accuracy. This is not true, since there are several other data quality dimensions, but it made me ask:
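The dependence on an authoritative source can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the person identifiers, the `stored` and `reference` dictionaries, and the `assess_accuracy` helper are hypothetical. Records with no reference value are reported as unverifiable rather than counted as accurate or inaccurate.

```python
from datetime import date

# Dates of birth as stored in the system, keyed by a hypothetical person ID.
stored = {
    "P001": date(1962, 6, 12),
    "P002": date(1980, 3, 4),
    "P003": date(1990, 11, 30),
}

# Dates of birth taken from an authoritative source (e.g. a passport).
# No reference document is available for P003.
reference = {
    "P001": date(1926, 6, 12),
    "P002": date(1980, 3, 4),
}


def assess_accuracy(stored_values, reference_values):
    """Compare stored values with reference values, key by key."""
    accurate, inaccurate, unverifiable = [], [], []
    for key, value in stored_values.items():
        if key not in reference_values:
            unverifiable.append(key)  # nothing to compare against
        elif value == reference_values[key]:
            accurate.append(key)
        else:
            inaccurate.append(key)
    verified = len(accurate) + len(inaccurate)
    return {
        "accurate": accurate,
        "inaccurate": inaccurate,
        "unverifiable": unverifiable,
        # Accuracy rate is defined only over records that could be verified.
        "accuracy_rate": len(accurate) / verified if verified else None,
    }


result = assess_accuracy(stored, reference)
print(result)
```

Here P001 mirrors the transposed-year case described above, and P003 shows why a missing reference leaves accuracy unknown rather than confirmed.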
Is data accuracy the most important data quality dimension?
I ran a 7-day poll asking this very question in the Data Quality and Data Governance Leadership Forum, a data-focused LinkedIn group managed by the editorial team of Data Quality Pro. The group has approximately 15,000 data quality and data governance leaders, practitioners, and advocates across the globe. The responses are shown in the screenshot below.
The poll closed at 151 votes from data quality and data governance professionals and details are as follows:
The majority (84%) chose “depends on the business use case”; that is, whether data accuracy is the most important dimension depends on how the data will be used.
Another 13% chose “No”; that is, data accuracy is not the most important DQ dimension.
An even smaller percentage (3%) chose the “Other (please comment)” option.
The majority view is easy to illustrate. Time constraints are often extremely rigid for web data, so timeliness may be favored over other data quality dimensions. For instance, a list of courses published on a university website must be timely, though there could be accuracy or consistency errors, and some fields specifying courses could be missing (Batini et al. 2009). On the other hand, for billing and financial processes, the accuracy, completeness, and consistency dimensions are more important than timeliness.
While data accuracy is an important data quality dimension, a combination of DQ dimensions is usually required for a business case. For example, for reporting purposes, you would need data to be of the right granularity, valid, complete, and accurate. Given how frequent data breaches have become, data security and accessibility also need to be considered for sensitive data.
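One way to picture "depends on the business use case" is to treat each use case as its own set of required dimensions and minimum scores. The Python sketch below is purely illustrative: the use case names, the thresholds, the measured scores, and the `unmet_dimensions` helper are hypothetical assumptions, not figures from the poll or from the cited literature.

```python
# Hypothetical minimum scores (0 to 1) per use case. The numbers are
# illustrative assumptions, not recommendations.
REQUIREMENTS = {
    "web_course_listing": {"timeliness": 0.95},
    "billing": {"accuracy": 0.99, "completeness": 0.99, "consistency": 0.99},
}


def unmet_dimensions(scores, use_case):
    """Return the required dimensions whose measured score is too low.

    A dimension that was never measured is treated as unmet.
    """
    required = REQUIREMENTS[use_case]
    return sorted(
        dimension
        for dimension, minimum in required.items()
        if scores.get(dimension, 0.0) < minimum
    )


# Hypothetical measured scores for a single dataset.
measured = {
    "timeliness": 0.98,
    "accuracy": 0.93,
    "completeness": 0.90,
    "consistency": 0.95,
}

web_gaps = unmet_dimensions(measured, "web_course_listing")
billing_gaps = unmet_dimensions(measured, "billing")

print(web_gaps)
print(billing_gaps)
```

Under these assumed thresholds, the same dataset is fit for the web listing but not for billing, which is the sense in which quality is fitness for a particular use.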
Concluding Thoughts and Future Research
Data quality is about striking a balance among data quality dimensions. Depending on the context, the situation, the data themselves (e.g., master data, transactional data, reference data, and sensitive data), business needs, and the industry sector, different combinations of data quality dimensions need to be applied, and this goes beyond the common dimensions (completeness, accuracy, consistency, uniqueness, timeliness, and validity). Other DQ dimensions also need to be considered, such as, but not limited to, relevance, integrity, currency, accessibility, security, trustworthiness, and granularity.
Also, as data and data ecosystems evolve, data quality dimensions will need to evolve.
This is reinforced by Varun Pant, Director IT, Swati Consultancy Pty Ltd, and Ex-National President, DAMA Australia, in a statement he made on LinkedIn as part of a discussion on data quality dimensions:
“With data coming in real time such as streaming data or Internet of things (IoT), it becomes harder to establish the quality of the data, and with IoT going mainstream, the traditional data quality dimensions will need to evolve.”
My future research and articles will focus on other aspects of data quality dimensions, such as the lack of standardization of the dimensions themselves, their definitions, and the interrelationships between the different DQ dimensions.
To learn more about data quality and its myths, challenges, critical success factors, strategy, DQ dimensions, data profiling, and more, please read Data Quality: Dimensions, Measurement, Strategy, Management and Governance (Quality Press, 2019). The book also covers how to measure data quality dimensions, how to implement methodologies for data quality management, and which data quality aspects to consider when undertaking data-intensive projects. This article draws significantly from the research presented in that book.
References
Mahanti, Rupa. 2019. Data Quality: Dimensions, Measurement, Strategy, Management and Governance. ASQ Quality Press, p. 526.
Batini, Carlo, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. “Methodologies for Data Quality Assessment and Improvement.” ACM Computing Surveys 41 (3): 1–52. Available at https://dl.acm.org/citation.cfm?id=1541883.
If you have any questions or any inputs you want to share, leave a comment here or connect on LinkedIn.