What are the Common Dimensions of Data Quality?
“Data is like water: Neglect its quality and fear the indigestion.”
—Adrien Saint, Marketing World Explorer
Why are Data and Data Quality Important?
Data is the enabler and differentiator in today's digital age. Good-quality data form the foundation for every solid operational and strategic process. Good-quality data are essential for providing excellent customer service, operating efficiently, adhering to regulatory requirements, and making sound decisions.
Data quality has been defined as fitness for use or purpose for a given context or specific task at hand. Although fitness for use or purpose does capture the principle of quality, it is abstract, and hence it is a challenge to measure data quality using this definition. Data quality can, however, be measured objectively using data quality dimensions.
What are Data Quality Dimensions?
Data quality (DQ) dimensions are intangible characteristics of data that can be measured or assessed against a set of defined standards to determine, and help improve, the quality of the data. Data quality dimensions provide a means to quantify and manage the quality of data.
There are several dimensions of data quality, each assessing a unique aspect of the data. Some of the dimensions (for example, reputation, trustworthiness, and credibility) are subjective or qualitative, while others (for example, completeness, uniqueness, and validity) are objective and quantitative. Different researchers and organizations have approached and categorized the dimensions of data quality in different ways. In 1996, Wand and Wang noted that there was no general agreement on data quality dimensions; more than 25 years later, there is still no consensus.
Common Data Quality Dimensions
There are several data quality dimensions, the most common ones (as per Google research) being completeness, accuracy, consistency, uniqueness, timeliness, and validity.
Accuracy—Accuracy is the degree to which data correctly represent the real-world entity, situation, object, phenomenon, or event that they are intended to model. For example, in Table 1, which holds customer data, the data cells highlighted in blue are inaccurate.
* The values “A” and “B” in the State field are inaccurate.
* The value 1 in the Pin_Code field is inaccurate.
* The values “A” and “B” in the Country field are inaccurate.
While it has been relatively easy to detect these inaccurate values, it is not possible to determine whether addresses in the Address column are accurate and belong to the corresponding customers mentioned in the Customer_Name column. In general, measuring data accuracy requires that an authoritative source of reference be identified and available to compare the data against.
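As a rough illustration of comparing data against an authoritative source, here is a minimal Python sketch. The customer IDs, the reference mapping, and the sample records are hypothetical and are not the contents of Table 1; a real check would use whatever trusted source the organization has identified.

```python
# Hypothetical authoritative reference: customer ID -> verified state.
reference_states = {101: "NSW", 102: "VIC", 103: "QLD", 104: "WA"}

# Hypothetical customer records to be assessed (not the contents of Table 1).
records = [
    {"Customer_ID": 101, "State": "NSW"},
    {"Customer_ID": 102, "State": "A"},
    {"Customer_ID": 103, "State": "QLD"},
    {"Customer_ID": 104, "State": "B"},
]


def accuracy_ratio(rows, key, field, reference):
    """Share of rows whose `field` value matches the reference value for `key`."""
    if not rows:
        return 0.0
    matches = sum(
        1
        for row in rows
        if row[key] in reference and reference[row[key]] == row[field]
    )
    return matches / len(rows)


state_accuracy = accuracy_ratio(records, "Customer_ID", "State", reference_states)
print(f"State accuracy: {state_accuracy:.0%}")
```

Records with no entry in the reference are counted as not accurate here; whether to count them that way or exclude them from the denominator is a design choice that depends on the context.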
Completeness—Completeness is the extent to which the applicable data are present or absent. Sometimes, values such as “unknown”, “not known”, “to be decided”, or “not applicable” are also used to represent missing data. For example, in Table 1, the cells highlighted in yellow represent missing values in the customer data set.
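One way to measure completeness is sketched below in Python. It treats placeholder values such as "unknown" as missing, as described above. The sample records and the exact set of placeholder markers are assumptions made for illustration.

```python
# Placeholder strings treated as missing; this set is an assumption and
# should be adapted to the conventions used in the data set at hand.
MISSING_MARKERS = {"", "unknown", "not known", "to be decided", "not applicable"}


def is_missing(value):
    """Return True for None, empty strings, and known placeholder values."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_MARKERS


def completeness_ratio(rows, field):
    """Share of rows in which `field` holds a real (non-missing) value."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if not is_missing(row.get(field)))
    return present / len(rows)


# Hypothetical customer records (not the contents of Table 1).
records = [
    {"Customer_Name": "Customer 1", "Address": "12 Example Street"},
    {"Customer_Name": "Customer 2", "Address": None},
    {"Customer_Name": "Customer 3", "Address": "Unknown"},
    {"Customer_Name": "Customer 4", "Address": "7 Sample Road"},
]

address_completeness = completeness_ratio(records, "Address")
print(f"Address completeness: {address_completeness:.0%}")
```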
Validity—Validity is the extent to which data elements comply with a set of internal or external standards, guidelines, or standard data definitions, including data type, size, format, and other features. For instance, in the customer data set in Table 1, the Country field is supposed to hold the name of the country and not just a country code. Hence, the values “AU” and “AUS,” which represent the ISO 2-character and ISO 3-character country codes for the country “Australia,” are invalid.
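The Country rule above can be sketched as a simple check against a list of allowed values. This is only an illustration in Python: the allowed-name list is a hypothetical excerpt, and the sample values are made up rather than taken from Table 1.

```python
# Hypothetical excerpt of the allowed country names; a real rule would use
# the organization's full standard list.
ALLOWED_COUNTRY_NAMES = {"Australia", "India", "United States"}


def invalid_values(values, allowed):
    """Return the values that do not comply with the allowed-value rule."""
    return [value for value in values if value not in allowed]


def validity_ratio(values, allowed):
    """Share of values that comply with the allowed-value rule."""
    if not values:
        return 0.0
    return 1 - len(invalid_values(values, allowed)) / len(values)


# Hypothetical Country column: two full names and two ISO codes.
countries = ["Australia", "AU", "AUS", "India"]

country_validity = validity_ratio(countries, ALLOWED_COUNTRY_NAMES)
rejected = invalid_values(countries, ALLOWED_COUNTRY_NAMES)
print(f"Country validity: {country_validity:.0%}, invalid values: {rejected}")
```

Note that a valid value is not necessarily accurate: "India" passes this rule even if the customer actually lives in Australia.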
Uniqueness—Uniqueness is the extent to which an entity is recorded only once and there are no repetitions. Duplication is the inverse of uniqueness. In Table 1, each customer record appears to be unique.
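A minimal Python sketch of a uniqueness measure follows. It assumes that a customer is identified by name and address and that values should be compared after trimming whitespace and ignoring case; both are illustrative assumptions, as are the sample records.

```python
def uniqueness_ratio(rows, key_fields):
    """Share of distinct entities among all rows, based on normalized keys."""
    if not rows:
        return 0.0
    keys = [
        tuple(str(row[field]).strip().lower() for field in key_fields)
        for row in rows
    ]
    return len(set(keys)) / len(keys)


# Hypothetical records; the third row duplicates the first apart from
# letter case and trailing whitespace.
records = [
    {"Customer_Name": "Customer 1", "Address": "12 Example Street"},
    {"Customer_Name": "Customer 2", "Address": "7 Sample Road"},
    {"Customer_Name": "customer 1 ", "Address": "12 Example Street"},
]

uniqueness = uniqueness_ratio(records, ["Customer_Name", "Address"])
print(f"Uniqueness: {uniqueness:.0%}")
```

Real-world duplicate detection usually needs fuzzier matching than this exact comparison of normalized keys.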
Consistency—Consistency is the extent to which the same data are equivalent across different data tables, sources, or systems.
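As an illustration, the Python sketch below compares the same attribute for the same customers in two systems. The system names (`crm`, `billing`), the customer IDs, and the values are all hypothetical.

```python
# Hypothetical extracts from two systems: customer ID -> state.
crm = {101: "NSW", 102: "VIC", 103: "QLD"}
billing = {101: "NSW", 102: "QLD", 103: "QLD", 104: "WA"}


def consistency_ratio(source_a, source_b):
    """Share of shared keys whose values agree in both sources."""
    shared_keys = source_a.keys() & source_b.keys()
    if not shared_keys:
        return 0.0
    agreeing = sum(1 for key in shared_keys if source_a[key] == source_b[key])
    return agreeing / len(shared_keys)


state_consistency = consistency_ratio(crm, billing)
print(f"State consistency across systems: {state_consistency:.0%}")
```

Only customers present in both systems are compared here; customer 104, which exists only in the hypothetical billing extract, is better treated as a completeness question for the other system.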
Timeliness—Timeliness is the time expectation of the availability of data for consumption. If the data are available when they are expected and needed, then the timeliness dimension is achieved.
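A timeliness check can be sketched in Python as a comparison between when the data became available and when they were expected. The timestamps and the 06:00 deadline below are hypothetical.

```python
from datetime import datetime


def is_timely(available_at, expected_by):
    """True if the data were available at or before the expected time."""
    return available_at <= expected_by


# Hypothetical daily deliveries: (time data became available, time expected).
deliveries = [
    (datetime(2024, 1, 1, 5, 30), datetime(2024, 1, 1, 6, 0)),
    (datetime(2024, 1, 2, 6, 45), datetime(2024, 1, 2, 6, 0)),
    (datetime(2024, 1, 3, 6, 0), datetime(2024, 1, 3, 6, 0)),
]

timely_count = sum(1 for available, expected in deliveries if is_timely(available, expected))
timeliness = timely_count / len(deliveries)
print(f"Timely deliveries: {timely_count} of {len(deliveries)}")
```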
Concluding Thoughts and Future Research
In this article, we discussed the commonly used data quality dimensions. However, depending on the context, situation, the data themselves (e.g., master data, transactional data, reference data, and sensitive data), business needs, and the industry sector, different permutations and combinations of data-quality dimensions would need to be applied, and this goes beyond the common data quality dimensions.
Measurement of other DQ dimensions, such as, but not limited to, relevance, coverage, integrity, volatility, currency, accessibility, security, trustworthiness, and granularity, might need to be taken into consideration. For instance, with compliance, regulatory, and security requirements, data accessibility and data security are on their way to becoming common DQ dimensions.
As data and data ecosystems evolve, data quality dimensions will also need to evolve, and new data quality dimensions will come into play.
What data quality dimensions do you commonly use? Looking forward to hearing from you.
While high-quality data is an enterprise asset, low-quality data is an enterprise liability. Hence, high-quality data is a “must-have” requirement. While measurement is an integral part of the data quality journey, not all data quality dimensions need to be measured, nor do all data elements need to be subject to assessment and measurement. Only those data elements that drive significant benefits should be measured for quality purposes.
My future articles will focus on different aspects of data quality and related topics.
To learn more about data quality and its myths, challenges, critical success factors, strategy, data profiling, and more, including how to measure data quality dimensions, implement methodologies for data quality management, and which data quality aspects to consider when undertaking data-intensive projects, please read Data Quality: Dimensions, Measurement, Strategy, Management and Governance (Quality Press, 2019). This article draws significantly from the research presented in that book.
References
Mahanti, Rupa. Data Quality: Dimensions, Measurement, Strategy, Management and Governance. ASQ Quality Press, 2019, p. 526.
Mahanti, Rupa. Data Quality and Data Quality Dimensions. Software Quality Professional, Volume 22, Issue 1, November 2019, pp. 4-8.
If you have any questions or any inputs you want to share, leave a comment here or connect on LinkedIn.