Controlled Vocabularies, Taxonomies, Thesauri and Ontologies for Knowledge Management: A Primer

One of the key benefits of this type of metadata is harmonization of terminology across systems

By typical estimates, developing a new small-molecule drug involves preparing and testing more than a thousand candidate molecules and results in a New Drug Application of over 100,000 pages. The data supporting such projects exists in diverse sets that vary greatly in size, complexity and structure. Creating this data represents an enormous investment of time, labor and money. Yet efficiently finding, sharing and reusing it across an organization remains a significant challenge, and knowledge management initiatives have a high failure rate. Describing data with an accurate and complete set of metadata is the first step in realizing its full value.

The table below contains a small sample from a master data repository of manufacturing lot information for a drug candidate. It illustrates a common occurrence when integrating data generated by multiple systems or users. Without a common terminology, different ad hoc “formats” and definitions emerge, and data harmony breaks down: the same 20 mg film-coated tablet is described four different ways, with the strength split inconsistently across columns.

Description                  Amount     Size
Film-coated tablet           20         mg
FC tablet                    20 mg      (blank)
20 mg tablet, film coated    (blank)    (blank)
Red film coated tablet       (blank)    20 mg

Controlled vocabularies, taxonomies, thesauri and ontologies are forms of metadata collectively known as “vocabularies.” They differ in the complexity of the information they represent and in how they are expressed. However, all of them identify and categorize digital content and provide contextual information about that content. One of the key benefits of this type of metadata is harmonization of terminology across systems.
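To make the idea of harmonization concrete, here is a minimal Python sketch that maps the four lot descriptions from the table onto a single preferred term and a parsed strength. The vocabulary contents, the `harmonize` function and the output field names are all assumptions made for illustration; they are not taken from any real repository or from a LabAnswer tool.

```python
import re

# Hypothetical controlled vocabulary: one preferred term and the
# variant spellings that should resolve to it.
DOSAGE_FORM_VOCAB = {
    "film-coated tablet": [
        "film-coated tablet",
        "film coated tablet",
        "fc tablet",
        "tablet, film coated",
    ],
}

# Invert the vocabulary into a variant -> preferred-term lookup.
VARIANT_TO_PREFERRED = {
    variant: preferred
    for preferred, variants in DOSAGE_FORM_VOCAB.items()
    for variant in variants
}

# A number followed by a mass unit, e.g. "20 mg" or "20mg".
STRENGTH_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|mcg)\b", re.IGNORECASE)


def harmonize(description, amount="", size=""):
    """Return the preferred dosage-form term and the strength for one row."""
    # The strength may sit in any of the three columns, so search them all.
    text = " ".join(part for part in (description, amount, size) if part)
    match = STRENGTH_PATTERN.search(text)
    strength = (float(match.group(1)), match.group(2).lower()) if match else None

    # Try longer variants first so the most specific wording wins.
    lowered = description.lower()
    dosage_form = None
    for variant in sorted(VARIANT_TO_PREFERRED, key=len, reverse=True):
        if variant in lowered:
            dosage_form = VARIANT_TO_PREFERRED[variant]
            break
    return {"dosage_form": dosage_form, "strength": strength}


rows = [
    ("Film-coated tablet", "20", "mg"),
    ("FC tablet", "20 mg", ""),
    ("20 mg tablet, film coated", "", ""),
    ("Red film coated tablet", "", "20 mg"),
]
harmonized = [harmonize(*row) for row in rows]
```

All four rows resolve to the same record. Note that this sketch silently drops the color in the last row; a real vocabulary would need to decide whether color is a separate attribute worth capturing.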

A useful model to understand the different types of vocabulary and their respective applications is the semantic spectrum. The semantic spectrum describes the logical rigor of a vocabulary’s underlying knowledge representation system. This, in turn, informs the capabilities and limitations of the vocabulary when it is used by consuming systems.

Increasing logical rigor enables powerful applications, particularly in terms of machine-based manipulation of data, but at the cost of a vocabulary that requires significantly greater skill and effort to build and maintain.

Enumerated lists: Glossaries and controlled vocabularies
Taxonomies: Concepts are organized into a hierarchical structure of broader and narrower meanings
Thesauri: Concepts are also linked by non-hierarchical relationships such as equivalence (for example, synonyms, acronyms and abbreviations) and association
Ontologies: Rigorous models that strictly follow rules of description logic and are encoded in a formal ontology language
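
The first three levels of the spectrum can be sketched as plain Python data structures, each adding a kind of relationship to the one before. Every term and relationship below is a made-up example, and the helper names (`ancestors`, `resolve`, `is_a`) are hypothetical. An ontology is deliberately left out: it would normally be written in a formal ontology language such as OWL and processed by a reasoner rather than by hand-written lookups.

```python
# 1. Enumerated list: the set of permitted terms, nothing more.
CONTROLLED_TERMS = {
    "solid oral dosage form",
    "tablet",
    "film-coated tablet",
    "capsule",
}

# 2. Taxonomy: each term points to its broader term.
BROADER = {
    "film-coated tablet": "tablet",
    "tablet": "solid oral dosage form",
    "capsule": "solid oral dosage form",
}

# 3. Thesaurus: adds equivalence (synonyms) and association (related terms).
SYNONYMS = {
    "fc tablet": "film-coated tablet",
    "film coated tablet": "film-coated tablet",
}
RELATED = {
    "film-coated tablet": {"coating material"},
}


def ancestors(term):
    """Walk up the taxonomy and return broader terms, nearest first."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def resolve(label):
    """Map a free-text label to its preferred term, or None if unknown."""
    key = label.strip().lower()
    key = SYNONYMS.get(key, key)
    return key if key in CONTROLLED_TERMS else None


def is_a(label, broader_term):
    """True if the label resolves to broader_term or to one of its narrower terms."""
    term = resolve(label)
    if term is None:
        return False
    return term == broader_term or broader_term in ancestors(term)
```

With only the enumerated list, a system can check whether a term is allowed. The taxonomy lets a search for "tablet" also find film-coated tablets, and the thesaurus lets it find records labeled "FC tablet". Each step buys more capability at the cost of more relationships to curate.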

Finally, it’s worth pointing out that vocabularies in isolation provide little value. Their value is realized when they are used to describe an organization’s data assets. In conjunction with other technologies, this enables data integration, data lifecycle management, search and other capabilities that create business value. An effective metadata development and governance strategy takes into account both the generation and consumption of data and the specific needs and constraints of the relevant systems and users. This guides modeling at the appropriate level of complexity to meet current needs and provides a structure that can support change as business and technology evolve.

This blog was penned by John Tulinsky, PhD, Senior Consultant at LabAnswer. LabAnswer is a proud supporter of Denodo DataFest.

LabAnswer

LabAnswer is the world’s premier laboratory informatics consulting firm. They partner with laboratory and research organizations to strategically develop and implement systems that transform their business. LabAnswer's laboratory and scientific data analytics experts give businesses the ability to derive real intelligence from all of the information and data they collect. www.labanswer.com
