Not everything that can be counted counts.
We live in the digital age, and we are drowning in an ocean of data. Organizations hold a large number of data entities and data elements, along with a large volume of data for each, and they amass more with each passing day. With so much data coming in, it’s important to know which data are “quality” data and which aren’t.
Data entities, elements, dimensions…
Before we continue, let me explain a little data terminology as it pertains to databases, or data storage. “Data entities” are the real-world objects, concepts, events, and phenomena about which we collect data. “Data elements” are the different attributes that describe the data entity. Thus, a data entity serves as the container that comprises all the data elements that describe it.
Consider a machine shop that has many types of machines: CNCs, lathes, presses, and the like. A “machine” would be the data entity representing a physical object sitting on the shop floor, and the data elements might be machine type (e.g., CNC), machine ID, machine name, machine make, machine location, machine uptime, and so forth, which store attribute values for the different machines.
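One way to picture the relationship is as a record type: the data entity is the type, and each data element is a field on it. The sketch below is only an illustration; the class name, field names, and sample values are assumptions made for this example, not part of any real system.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Machine:
    """Data entity: one physical machine on the shop floor."""

    machine_id: str    # data element: identifier for the machine
    machine_type: str  # data element: e.g., "CNC", "lathe", "press"
    name: str          # data element: machine name
    make: str          # data element: machine make
    location: str      # data element: where the machine sits
    uptime_hours: Optional[float] = None  # None means no value was recorded


# Hypothetical record, for illustration only.
cnc = Machine("M-001", "CNC", "Mill 1", "ExampleMake", "Bay 3", 412.5)

# The data elements of the entity are simply its fields.
element_names = [f.name for f in fields(Machine)]
print(element_names)
```

Each instance of Machine corresponds to one record of the machine data entity, and each field holds the attribute value for one data element.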
Another term is “data-quality dimensions.” These are the characteristics that define the quality of a data element. For the machine in our example, they would cover whether each data element holds a useful value in each record of the machine data entity, whether the data are available in a timely manner, whether they are accurate, whether values are duplicated, and so on. Data-quality dimensions are what give you insight into the quality of the data.
What are quality data?
Data are considered of high quality if they are fit for their intended use. In other words, data quality can be defined as an evaluation of whether those data serve a purpose in a given context. Although data quality is an abstract concept and cannot be measured directly, it has several aspects that can be measured; these are the data-quality dimensions. Examples include completeness (i.e., whether values are present or absent), uniqueness (the extent to which the data relating to an entity are not duplicated), and accuracy (the data values’ closeness to reality).
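Two of these dimensions, completeness and uniqueness, are simple enough to sketch in a few lines. The code below is a minimal illustration under stated assumptions: the records are invented, a missing value is represented by None, and each metric uses one simple definition among several possible ones. Accuracy is left out because it requires a trusted reference to compare against.

```python
# Hypothetical machine records; None marks a missing value.
records = [
    {"machine_id": "M-001", "machine_type": "CNC", "uptime_hours": 412.5},
    {"machine_id": "M-002", "machine_type": "lathe", "uptime_hours": None},
    {"machine_id": "M-002", "machine_type": "lathe", "uptime_hours": 96.0},
    {"machine_id": "M-003", "machine_type": None, "uptime_hours": 88.0},
]


def completeness(rows, element):
    """Fraction of records in which the element has a value present."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(element) is not None)
    return present / len(rows)


def uniqueness(rows, key):
    """Distinct key values divided by record count (1.0 means no duplicates)."""
    if not rows:
        return 0.0
    distinct = {row.get(key) for row in rows}
    return len(distinct) / len(rows)


print(completeness(records, "machine_id"))    # 1.0: every record has an ID
print(completeness(records, "uptime_hours"))  # 0.75: one of four values is missing
print(uniqueness(records, "machine_id"))      # 0.75: "M-002" appears twice
```

Scores like these say nothing about whether the data are fit for use until they are compared with what the intended use requires.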
In the CNC machine example referred to earlier, if our purpose is to track total equipment utilization in our factory, then the machine elements of machine type, uptime, and location would be necessary data for that use and would need to be accurate, complete, and free of duplicates. The color of the machine would not be necessary data.
A data-quality dimension for the uptime data element might be the frequency with which data are collected. If we recorded machine uptime for one day out of the year, that wouldn’t be very useful. But if we recorded machine uptime every day, the data would be far more useful for tracking utilization, and thus quality data.
Are the data important? It depends.
Ensuring the quality of all an organization’s data is an expensive and resource-intensive exercise. However, not all data have the same level of importance. Some data elements are critical, and organizations must ensure that they are of high quality, and that they fit their intended use. On the other hand, some data elements might not be of any value and assessing their quality is a waste of time, money, and effort.
For example, many data values are captured and stored for dubious reasons, such as being part of a purchased data model, or retained from a data migration project, but they may not be necessary to achieve any business objectives. Assessing the quality of such data is a waste of time and effort.
Consider a data-profiling exercise that involves measuring the quality of data required for the company’s direct marketing campaign. The question to answer here is: What data does one need to execute a direct marketing campaign? It would essentially require customer contact data, such as names, addresses, email addresses, and so forth. The right data source containing customer contact data, and the right data elements within it (the fields holding the customer names, addresses, and email addresses), should be selected. However, fields such as those recording comments and job titles, although part of the customer contact data, have no business value for executing the marketing campaign and need not be taken into consideration.[2]
Impact of data on the bottom line
A critical data element can be defined as a data element that supports enterprise obligations or critical business functions or processes, and will cause customer dissatisfaction, pose a compliance risk, or have a direct financial impact if the data quality is not up to the mark along one or more data-quality dimensions.
Customer dissatisfaction and regulatory impact can have an adverse effect on finances. For example, a failure to comply with regulations may force businesses to pay penalty charges. Disgruntled customers may take their business elsewhere, causing loss of revenue. In general, financial impact may include penalty costs, lost-opportunity costs, increased expenses, or decreased revenue and profit. Thus, the cost associated with a data element, group of data elements, or data entity with respect to different data-quality dimensions can be used to determine criticality.
For example, inaccurate name and address data elements in most customer-centric organizations like financial services, telecommunications, utilities, or retail companies can result in huge mailing costs. Hence, for them, address data are critical.
One way to go about understanding the critical data entities and related data elements is by considering the important enterprise obligations that depend on data quality and mapping the data dependencies, i.e., the critical data entities and associated data elements needed to obtain information for each obligation. Data elements that are critical for one enterprise obligation may not be critical for another enterprise obligation.
Enterprise obligations in a retail company, for example, may include sales reporting and consumer-behavior trend reporting. While customer age, annual income, and occupation might be critical data elements for consumer behavior trend reporting, they are not critical data elements for sales reporting.
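The mapping exercise described above can be sketched as a simple lookup from each obligation to the data elements it depends on. This is an illustrative sketch only: the three elements for consumer-behavior trend reporting come from the retail example, while the element names listed for sales reporting and the shared customer_id element are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical dependency map: enterprise obligation -> data elements it needs.
dependencies = {
    "consumer_behavior_trend_reporting": {
        "customer_id", "customer_age", "annual_income", "occupation",
    },
    "sales_reporting": {
        "customer_id", "product_id", "sale_amount", "sale_date",  # assumed
    },
}

# Invert the map: data element -> obligations that depend on it.
used_by = defaultdict(set)
for obligation, elements in dependencies.items():
    for element in elements:
        used_by[element].add(obligation)


def is_critical_for(element, obligation):
    """True if the obligation depends on the given data element."""
    return element in dependencies.get(obligation, set())


print(is_critical_for("customer_age", "consumer_behavior_trend_reporting"))  # True
print(is_critical_for("customer_age", "sales_reporting"))                    # False
print(sorted(used_by["customer_id"]))  # both obligations depend on it
```

Inverting the map makes it easy to see which data elements serve several obligations at once and which serve only one.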
On the other hand, there are data elements that might be critical for most enterprise obligations. Enterprise obligations might vary by industry sectors or types of business. The following factors can be used to determine the criticality of data elements:
• Number of enterprise obligations for which the data elements are used
• Cost associated with the data elements
• Risks associated with the data elements
• Number of departments, teams, or users using the data
In addition to the above, certain data and information are extremely sensitive and can be classified as critical from the perspective of data security. Examples of such data and information are social security numbers, debit card numbers, credit card numbers, security PINs, pass codes, and passport numbers. Sometimes a data element alone might not be deemed sensitive but becomes sensitive when combined with other data elements. Personally identifiable information is an example of this scenario.
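The four factors listed above, together with the sensitivity consideration, could be combined into a rough ranking score. The sketch below is one possible approach, not a prescribed method: the weights, the 0-to-1 scales for cost and risk, the rule that sensitive elements always receive the top score, and all sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ElementProfile:
    name: str
    obligations: int  # number of enterprise obligations using the element
    cost: float       # estimated cost of poor quality, scaled 0 to 1
    risk: float       # estimated risk, scaled 0 to 1
    consumers: int    # number of departments, teams, or users
    sensitive: bool = False  # critical from a data-security perspective


# Assumed weights; they sum to 1.0 so scores stay between 0 and 1.
WEIGHTS = {"obligations": 0.3, "cost": 0.3, "risk": 0.2, "consumers": 0.2}


def criticality(p, max_obligations, max_consumers):
    """Weighted score in [0, 1]; sensitive elements are always scored 1.0."""
    if p.sensitive:
        return 1.0
    return (
        WEIGHTS["obligations"] * p.obligations / max(1, max_obligations)
        + WEIGHTS["cost"] * p.cost
        + WEIGHTS["risk"] * p.risk
        + WEIGHTS["consumers"] * p.consumers / max(1, max_consumers)
    )


# Hypothetical profiles, for illustration only.
profiles = [
    ElementProfile("job_title", obligations=1, cost=0.1, risk=0.1, consumers=1),
    ElementProfile("customer_address", obligations=4, cost=0.9, risk=0.6, consumers=5),
    ElementProfile("social_security_number", 1, 0.2, 0.9, 1, sensitive=True),
]

max_obl = max(p.obligations for p in profiles)
max_con = max(p.consumers for p in profiles)

# Highest-scoring (most critical) elements first.
ranked = sorted(
    profiles,
    key=lambda p: criticality(p, max_obl, max_con),
    reverse=True,
)
for p in ranked:
    print(f"{p.name}: {criticality(p, max_obl, max_con):.3f}")
```

In practice the weights and scales would need to be agreed on with the business stakeholders who own the obligations; the point of the sketch is only that the factors can be made explicit and compared.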
Determining and prioritizing critical data elements is one of the first steps an organization must carry out before it can assess the quality of its data against the relevant data-quality dimensions. Trying to measure and manage the quality of all data can be an overwhelming and financially infeasible exercise that is bound to fail. Hence, when you think of assessing and improving the quality of data, remember the comment often attributed to renowned physicist Albert Einstein:
“Not everything that can be counted counts, and not everything that counts can be counted.”
To learn more about data quality, including how to measure data-quality dimensions, how to implement methodologies for data-quality management, and which data-quality aspects to consider when undertaking data-intensive projects, read Data Quality: Dimensions, Measurement, Strategy, Management and Governance (ASQ Quality Press, 2019). This article draws significantly from the research presented in that book.
References
1. Mahanti, Rupa. Data Quality: Dimensions, Measurement, Strategy, Management and Governance. ASQ Quality Press, 2019, p. 526.
2. Mahanti, Rupa. “Data Profiling Project Selection and Implementation: The Key Considerations.” Software Quality Professional, vol. 17, no. 4, pp. 44–52.
Note: This article was first published on QualityDigest.com in March 2020 and republished on Medium in September 2021. Another version of this article was published on LightsonData.com.