
There is a lot of talk about data quality, but I often have the impression that it is discussed almost as a way of exorcising it: its importance is acknowledged without delving into its complexities or its real meaning. This can lead to poor decisions.
Diving into Data Quality
The extensional component of data quality, the recorded values themselves, can be treated fairly easily, since it is not difficult to decide whether a measurement or value is correct. The same cannot be said of the intensional component, the conceptual definition of what the data is meant to represent: here one has to determine whether the data adequately represents the object being measured and whether it is relevant to the purposes for which it will be used. Using data that does not adequately represent what one wants to study, however good the quality of its extension, is worse than using data that is a perfect conceptualization but whose extension is lacking. In other words, to insert a screw into a piece of wood, a one-euro screwdriver is better than a hundred-euro hammer. Before worrying about the value of a piece of data, we should ask whether it is the right data to use at all. Reducing data quality to the values that make up its extension is not only reductive but dangerous, because it can make us believe we are using something perfect when in fact it was the wrong choice from the start.
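To make the asymmetry concrete, here is a minimal sketch in Python (the column name, plausibility range, and sample records are purely hypothetical): the extensional side reduces to mechanical checks, while no comparable function can decide whether the column is the right data for the question at hand.

```python
# Minimal sketch (hypothetical column name and rules): extensional checks are
# mechanical; intensional adequacy has no equivalent test.

def extensional_checks(records, column, min_value, max_value):
    """Return simple extensional quality counts for one column."""
    values = [r.get(column) for r in records]
    missing = sum(1 for v in values if v is None)
    out_of_range = sum(
        1 for v in values
        if v is not None and not (min_value <= v <= max_value)
    )
    return {"total": len(values), "missing": missing, "out_of_range": out_of_range}

# Hypothetical records: one missing value, one implausible measurement.
records = [
    {"daily_avg_temperature": 21.5},
    {"daily_avg_temperature": None},
    {"daily_avg_temperature": 73.0},
]
print(extensional_checks(records, "daily_avg_temperature", -50.0, 60.0))
# -> {'total': 3, 'missing': 1, 'out_of_range': 1}

# No such check can tell us whether average daily temperature is the right
# variable for the question being studied: that judgement is intensional.
```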
Even if the breadth of data quality is well understood, what is less understood is what can be done to improve it, how far that effort can go, and what we can expect from the process. On these questions I believe there is a substantial difference between the intensional and extensional components. For the extensional component there are well-tested techniques: replacing missing values with values generated from the statistical properties of the known ones, for example, or using synthetic data to increase the sample size or correct imbalances.
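As a hedged illustration of those extensional techniques, here is a minimal, standard-library-only sketch in Python; the mean-imputation strategy and the naive oversampling below are only examples, since the right method depends on the statistical properties of the actual dataset.

```python
# Minimal sketch of two well-tested extensional techniques: imputation of
# missing values and oversampling to correct a class imbalance.
import random
import statistics

def impute_with_mean(values):
    """Replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    return [mean if v is None else v for v in values]

def oversample_minority(samples, labels, minority_label, seed=42):
    """Duplicate minority-class samples until the classes are balanced."""
    rng = random.Random(seed)
    minority = [s for s, l in zip(samples, labels) if l == minority_label]
    majority = [s for s, l in zip(samples, labels) if l != minority_label]
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))
    balanced = [(s, minority_label) for s in minority] + \
               [(s, l) for s, l in zip(samples, labels) if l != minority_label]
    rng.shuffle(balanced)
    return balanced

print(impute_with_mean([4.0, None, 6.0, None, 5.0]))
# -> [4.0, 5.0, 6.0, 5.0, 5.0]

balanced = oversample_minority(["a", "b", "c", "d"], [0, 0, 0, 1], minority_label=1)
print(balanced)  # the single minority sample "d" is duplicated to 3 copies
```

Note that none of this touches the intensional question: the techniques repair the extension of whatever data was chosen, not the choice itself.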
The intensional component is much more complex, because here the subjective dimension clearly prevails[1]: the definition of the data is inevitably shaped by the role played by those who curate it. Moreover, those who model something do so in relation to its primary use, the very use that made the modeling necessary in the first place, and, as Ludwig Wittgenstein reminds us when he writes that “meaning is use”[2], this use is always “here and now.” Nothing guarantees that the same person, at a later time, will not call into question what has already been modeled simply because the use they want to make of it has changed.
The difficulty of acting on the intensional component, compared with the extensional one, therefore emerges forcefully, and it is aggravated by the fact that we, as human beings, are splendid simplifying machines. On the one hand, this has enabled us to adapt and evolve over time with a minimum expenditure of energy. On the other, it leads us to use what is already available rather than examining it with a critical eye to verify whether it is precisely what we need, and this greatly increases the risk of incorrect conclusions: not because the values of the chosen data are of poor quality, but because the chosen data is only an approximate representation of what one wants to investigate. In the well-known phrase “garbage in, garbage out,” the garbage always refers to the accuracy of the measurement rather than to the semantic adequacy of what one is using.
Achieving a Balance
But how can we keep these two components in balance when assessing data quality, especially since each of our individual assessments is always a choral expression? It must harmonize with the corporate assessment, which is ideally objective, being driven towards a common goal, while still accommodating different points of view.
Since we cannot eliminate subjectivity (not only is it impossible, it would not even make sense to do so if it were possible[3]), we must ask ourselves: what is the best way to accept this fundamental undecidability of data quality, and how can we bring that acceptance into our daily activities?
We could start by elevating the value of semantic diversity and making it a key criterion every time new data is modeled. If reinventing the wheel doesn’t make much sense, it may well make sense to keep improving it, because each wheel has its ideal terrain on which to roll.
Ultimately, this deeper sense of data quality has to be carried into the cultural, organizational, and technological domains:
- The cultural domain, because everyone must free themselves from that innate tendency to always start from scratch, believing that their needs are unique and that what already exists is never suitable;
- The organizational domain, because a collective way of working must be spread and supported, one that respects the needs of individuals while bringing them into an overall conceptual harmony: it starts from the use of what is already available, but does not preclude defining new data where there is a proven need;
- And finally, the technological domain, because any system for managing data is doomed to failure if it cannot manage the underlying complexity, hide it from users, and give them simplicity of use, through seamless data integration and an overarching semantic layer that lets them concentrate on what needs to be done rather than on how to do it. It should offer features for evaluating the extensional quality of data and correcting it whenever necessary, as described above, but above all it should establish a digital place, such as a data catalog or a data marketplace, in which the lineage of all data can be freely explored and understood (a minimal sketch of what such a catalog entry might record follows this list).
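Purely as a sketch of that kind of digital place (the field names below are assumptions, not the schema of any particular catalog product), a catalog entry could keep the intensional context, the extensional metrics, and the lineage side by side, so that both components of quality remain visible to whoever reuses the data:

```python
# Illustrative sketch of a catalog entry that records intensional context,
# extensional metrics, and lineage together; all field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    intended_use: str                  # intensional: why the data was modeled
    modeled_by: str                    # intensional: who made the modeling choices
    known_limitations: list[str]       # intensional: what the model does not capture
    extensional_metrics: dict[str, float] = field(default_factory=dict)
    lineage: list[str] = field(default_factory=list)  # upstream sources and transformations

entry = CatalogEntry(
    name="daily_avg_temperature",
    intended_use="long-term climate trend analysis",
    modeled_by="environmental data team",
    known_limitations=["not suitable for intraday heat-stress studies"],
    extensional_metrics={"completeness": 0.97, "out_of_range_rate": 0.01},
    lineage=["weather_station_raw", "daily_aggregation_job"],
)
print(entry.intended_use, entry.extensional_metrics)
```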
Harmony in Motion
With the right technology, we can make the link between the two components come alive. In the end, choosing the right data will always be a dance, in which whoever initiates it brings intensionality, and whoever receives it applies extensionality, and if this does not happen with harmony and grace, the stumble will always be just around the corner.
[1] The ontological objectivity defined by John Searle does not help here, because we are not talking about the concept itself, but rather about its representation, and it is precisely this aspect that introduces an inevitable subjectivity. For example, according to Searle, mountains are ontologically objective, in the sense that they exist independently of a perceiving subject, but their representation within a data model or ontology will be the work of a specific subject or group of subjects and, inevitably, it will tend to capture some characteristics to the detriment of others.
[2] Ludwig Wittgenstein, Philosophische Untersuchungen (Philosophical Investigations), 1953. The full form is “the meaning of a word is its use in language.”
[3] One might think of adopting an autocratic organizational model, where a group of elected officials have the power to impose what they believe to be the meaning of each piece of data, but I doubt that such a choice leads anywhere other than self-annihilation.