If you are in the data management world, you are probably helping your company redefine its data analytics architecture, especially in the context of cloud adoption. At some point in the process, someone might have stared at the myriad boxes and acronyms in your architecture and said something like: “Now that we are redefining all this, couldn’t we simply consolidate all of the data in a single system like a data lake, or a data lakehouse?”
This is a natural question. Managing different systems, which use different technologies, is complex and costly from many points of view, so a single platform that takes care of all of our data analytics needs makes intuitive sense. And some vendors in the market are telling companies that this is now possible because cloud storage is cheap and their products can cover almost every analytics style.
Let’s Take a Step Back
But this is an idea with a complicated history. Not even in the much simpler world of the 90s were we able to store all data relevant for analytics in a single data warehouse. So there is some reason for skepticism, now that data volumes are hundreds of times bigger and data is more distributed than ever. And we heard a very similar story about data lakes (often from the same companies) about 10 years ago. As tends to happen, reality then showed that data lake technology was great for some tasks but not so great for others.
Will this time be any different? I am skeptical, because heterogeneity and widely distributed data were never driven by storage costs. They come about because organizations are composed of autonomous units that make independent decisions. They also come about because different departments use the best tool for each task, because one size never fits all.
That is why most large companies today maintain several different data warehouses and data lakes on-premises and in the cloud. That is why they also have operational data stores, NoSQL repositories, cloud object stores, and SaaS applications. They would probably like to decommission a few of these systems, but cannot do so without costly migrations. In many other cases, however, these systems need to stay simply because they are the best way to meet a particular need.
As Gartner said recently, “a single data persistence tier and type of processing is inadequate when trying to meet the full scope of modern data and analytics demands.” And if you look closely at the reference architectures of cloud providers, you will see that even if you are able to move all your data to the cloud (and that is a very big “if”), every cloud provider offers different processing engines for different tasks and data types.
Of course, it is still true that the power provided by this diversity of systems comes at a price. If data is distributed across different systems, integrating that data can be slow and costly. Also, users no longer have a single access point to the available data. Security and governance also become more difficult because you need to ensure that consistent policies are applied across all systems.
A Better Way
The solution that major market analysts like Gartner and Forrester are proposing is a distributed architecture, such as a logical data warehouse or a data fabric (see also Gartner’s “Demystifying the Data Fabric”). A key component in these architectures is the data virtualization layer, which bridges the gap between the data processing engines and the data consumers.
This layer provides the following crucial features:
- Agile integration of data distributed across different systems, much faster and cheaper than traditional alternatives based on data replication (Gartner estimates savings at 45% when using data virtualization)
- A unified semantic layer, enabling organizations to create virtual models that present data to each type of consumer in the shape, format, and structure that each one needs, and enabling stakeholders to organize virtual models in layers, to encourage the reusability of semantic definitions and data transformations
- The ability for consumers to access data using the technology of their choice, such as SQL, REST, OData, or GraphQL; data APIs can be created in minutes, with zero code
- A single point from which to apply semantic, security, and governance policies across all data processing engines, and a single source of truth for data consumers
- The ability to implement “data marketplaces,” where business users can go to find and get access to relevant data, no matter where it is located
- A data-abstraction layer, enabling organizations to move data from one location or system to another, without affecting data consumers
- Intelligent caching and acceleration, which enable the selective replication of small data subsets to speed up queries on slow data sources, as well as the acceleration of distributed queries (see this post for the technical details)
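To make a few of these features concrete (a single access point, a virtual model that combines sources, and a centrally applied policy), here is a minimal Python sketch. It is a toy under stated assumptions, not how any data virtualization product is implemented: `VirtualLayer`, `register_view`, `query`, the sample data, and the email-masking rule are all hypothetical names invented for this illustration. An in-memory SQLite database stands in for a data warehouse and a plain list stands in for a SaaS application.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class VirtualLayer:
    """Toy virtualization layer (hypothetical): one access point over many sources."""

    def __init__(self) -> None:
        self._views: Dict[str, Callable[[], Iterable[Row]]] = {}
        self._masked: Dict[str, set] = {}

    def register_view(self, name: str, fetch: Callable[[], Iterable[Row]],
                      masked_columns: Iterable[str] = ()) -> None:
        # A view is only a function; no data is copied when it is registered.
        self._views[name] = fetch
        self._masked[name] = set(masked_columns)

    def query(self, name: str, role: str = "analyst") -> List[Row]:
        # The masking policy is applied here, once, for every underlying source.
        hidden = set() if role == "admin" else self._masked[name]
        return [
            {col: ("***" if col in hidden else val) for col, val in row.items()}
            for row in self._views[name]()
        ]


def make_warehouse() -> sqlite3.Connection:
    """Stand-in for a data warehouse (assumed sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Acme", "ops@acme.example"), (2, "Globex", "it@globex.example")],
    )
    return conn


# Stand-in for order records held in a SaaS application (assumed sample data).
SAAS_ORDERS: List[Row] = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 1, "total": 80.0},
    {"customer_id": 2, "total": 45.5},
]


def customer_revenue(conn: sqlite3.Connection, orders: List[Row]) -> List[Row]:
    """Virtual model: combines both sources at query time, without replication."""
    totals: Dict[object, float] = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + float(order["total"])
    rows = conn.execute("SELECT id, name, email FROM customers ORDER BY id").fetchall()
    return [
        {"id": cid, "name": name, "email": email, "revenue": totals.get(cid, 0.0)}
        for cid, name, email in rows
    ]


warehouse = make_warehouse()
layer = VirtualLayer()
layer.register_view(
    "customer_revenue",
    lambda: customer_revenue(warehouse, SAAS_ORDERS),
    masked_columns=["email"],
)

analyst_rows = layer.query("customer_revenue")              # email is masked
admin_rows = layer.query("customer_revenue", role="admin")  # email is visible
```

Consumers only ever see the view name `customer_revenue`. In this sketch, moving the customer table to a different system would mean changing `make_warehouse` and nothing on the consumer side, which is the abstraction point made in the list above.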
My advice would be simple: learn from history, assume you will never have a magic bullet that solves all of your analytics needs, and plan your data management architecture accordingly.