Including external data in many forms of analytics and reporting can provide significant enrichment. New business insights can be discovered that would be impossible to unearth by working exclusively with internal data. The good news is that external data is available in abundance. Numerous commercial and government organizations can provide valuable external data.
WORKING WITH EXTERNAL DATA IS COMPLEX
Unfortunately, most analysts rarely leverage external data, and when they do, they do so in an ad-hoc manner. First, they must determine whether using the external data has a cost. Next, they must copy the data from the source. Then, they need to study the technical structure and format of the data and determine how to store it in a form that is easy to analyze. Most importantly, they need to study the data to make sure that they understand it and can interpret it correctly. Finally, they must work out how the external data can best be joined with the relevant internal data.
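As a rough illustration of what such an ad-hoc solution tends to look like, the sketch below parses a small external extract and joins it with internal records. Everything in it is an assumption made for illustration: the CSV layout, the column names, the region codes, and the sample values are made up and do not come from any real provider.

```python
import csv
import io

# Hypothetical external extract (assumed CSV layout, made-up values).
EXTERNAL_CSV = """region_code,population_millions
R1,10.0
R2,20.0
"""

# Hypothetical internal sales records keyed by the same region code.
INTERNAL_SALES = [
    {"region_code": "R1", "revenue": 1200.0},
    {"region_code": "R2", "revenue": 5400.0},
    {"region_code": "R3", "revenue": 3100.0},
]


def load_external(text):
    """Parse the external CSV and convert the numeric column to float."""
    rows = csv.DictReader(io.StringIO(text))
    return {r["region_code"]: float(r["population_millions"]) for r in rows}


def join_with_internal(sales, population):
    """Left join: keep every internal row, add external data where it exists."""
    joined = []
    for row in sales:
        pop = population.get(row["region_code"])
        joined.append({
            **row,
            "population_millions": pop,
            # Leave the derived value empty when no external match exists.
            "revenue_per_million": row["revenue"] / pop if pop else None,
        })
    return joined


result = join_with_internal(INTERNAL_SALES, load_external(EXTERNAL_CSV))
```

Each decision here (the join key, the type conversion, how unmatched rows are handled) lives in one analyst's script, which is exactly what makes this kind of solution hard to share.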
THE DRAWBACKS OF ISOLATED USES OF EXTERNAL DATA
This approach of working with external data has several drawbacks:
- These ad-hoc solutions are not always shared with other analysts who may be interested in the same external data source. This results in countless isolated solutions within an organization. In other words, analysts are reinventing the wheel over and over again. This is time-consuming and decreases overall productivity.
- Due to the isolated uses, some analysts may use and/or interpret the external data incorrectly. This can lead to incorrect conclusions.
- Owners of the external data may change the way the data is delivered, how it is structured, and even its meaning. With this ad-hoc approach, each analyst must deal with these changes individually, which costs time and increases the risk of incorrect processing.
- This complex, time-consuming process of working with external data may also deter analysts from working with it, which can result in missed business opportunities.
More and more organizations understand the value of analytics and the use of external data. To avoid the above drawbacks, access to this external data needs to be streamlined.
STREAMLINING EXTERNAL DATA ACCESS WITH DATA VIRTUALIZATION
Data virtualization servers support several features to streamline access to external data:
- The technical aspects of extracting external data from a source need to be defined only once within a data virtualization server, by a data engineer, and can then be reused by many analysts. This is true even when analysts want to use different technical interfaces to access the data and when the data must be delivered in different forms or formats.
- Within a data virtualization server, descriptive metadata can be added to the external data to describe what the data means and how it must be interpreted. This increases the chance that the external data will be interpreted correctly.
- When the owner of an external data source changes how the data needs to be extracted, the format in which the data is delivered, or the meaning of the data, this change only needs to be implemented once within the data virtualization server, rather than by all the analysts in their isolated solutions. All the changes made in the data virtualization server are hidden from the analysts unless the meaning of the data is changed, at which point the analysts can be easily informed.
- With data virtualization, new data sources can be quickly made available to analysts. In fact, a full catalog of external data sources can be created, with structured tags that describe the characteristics of these sources. For example, the tags for each source may indicate its trustworthiness, timeliness, owner, or data quality. This type of information helps analysts determine whether the source can be used for their analysis and provides guidance on how to do so.
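The ideas in this list can be sketched in a few lines of Python. This is a conceptual toy, not how Denodo or any other data virtualization server is implemented: the `ExternalSource` and `Catalog` classes, the tag names, and the sample source are all hypothetical. It only shows the shape of the idea, with one extraction definition per source, descriptive metadata and tags attached to it, and delivery in whichever format the analyst asks for.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExternalSource:
    """One catalog entry: how to extract the data, plus descriptive tags."""
    name: str
    extract: Callable[[], list]  # defined once, e.g. by a data engineer
    description: str = ""
    tags: dict = field(default_factory=dict)


class Catalog:
    def __init__(self):
        self._sources = {}

    def register(self, source):
        self._sources[source.name] = source

    def find(self, **required_tags):
        """Return the names of sources whose tags match all given values."""
        return sorted(
            name for name, s in self._sources.items()
            if all(s.tags.get(k) == v for k, v in required_tags.items())
        )

    def describe(self, name):
        s = self._sources[name]
        return {"description": s.description, **s.tags}

    def fetch(self, name, fmt="rows"):
        """Deliver the same extraction in the format the analyst asks for."""
        rows = self._sources[name].extract()
        if fmt == "rows":
            return rows
        if fmt == "json":
            return json.dumps(rows)
        if fmt == "csv":
            if not rows:
                return ""
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
            return buf.getvalue()
        raise ValueError(f"unsupported format: {fmt}")


def extract_case_counts():
    # Stand-in for a real download or API call; the values are made up.
    return [{"region": "R1", "cases": 10}, {"region": "R2", "cases": 25}]


catalog = Catalog()
catalog.register(ExternalSource(
    name="case_counts",
    extract=extract_case_counts,
    description="Made-up case counts per region.",
    tags={"owner": "example-agency", "timeliness": "daily",
          "trustworthiness": "high"},
))
```

An analyst could then call `catalog.find(timeliness="daily")` to discover suitable sources and `catalog.fetch("case_counts", "csv")` to retrieve the data. If the owner changes the delivery format, only `extract_case_counts` needs to change, and every consumer keeps working.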
AN EXAMPLE OF EXTERNAL DATA ACCESS WITH DATA VIRTUALIZATION
To demonstrate how well data virtualization can streamline external data access, Denodo developed a public website that allows anyone to study trustworthy, up-to-date global data on COVID-19 (Coronavirusdataportal.com). This portal brings together external data on COVID-19 from disparate global data sources, curates the data, and provides the results to data consumers, such as data scientists, analysts, and researchers, to support their search for solutions to this deadly disease.
EASY DOES IT
External data can play an important role in analytics and data science. However, leaving the extraction, interpretation, and processing of that data solely to analysts could lead to poor productivity and incorrect results. Data virtualization streamlines access to external data, ensuring easy and structured use.