The Data Lakehouse Myth
Reading Time: 2 minutes

The data lakehouse attempts to combine the best parts of the data warehouse with the best parts of data lakes while avoiding all of the problems inherent in both. However, the data lakehouse is not the last word in data management.

Best of Both Worlds?

It is easy to consume data from data warehouses, but because they require a highly structured format, getting the data into them in the first place takes effort, usually in the form of transformation.

Data lakes have the opposite advantages and disadvantages: it is easy to get data into them, because they accept structured or unstructured data, but it is difficult to consume data from them, because the data must first be formatted before it can be of any use.

The data lakehouse addresses both problems: it enables the ingestion of both structured and unstructured data, and it supports the consumption of that data in a structured fashion. Beyond that, it supports schema-on-read, so that the data is automatically transformed into the needed format at the point of consumption.
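
To make schema-on-read concrete, here is a minimal sketch in Python using pandas; the article does not prescribe any particular tool, and the file name and fields below are purely illustrative. Raw records are stored exactly as they arrive, and column names and types are imposed only when a consumer reads the data.

```python
# Minimal schema-on-read sketch (illustrative only; not tied to any
# specific lakehouse product). Raw records land in the "lake" untouched;
# a schema is applied only when the data is read.
import json
import pandas as pd

# Ingest: store the raw, semi-structured events exactly as they arrive.
raw_events = [
    '{"user": "a123", "amount": "19.99", "ts": "2024-05-01T10:00:00"}',
    '{"user": "b456", "amount": "5.00", "ts": "2024-05-01T10:05:00", "promo": "SPRING"}',
]
with open("events.jsonl", "w") as f:
    f.write("\n".join(raw_events))

# Consume: column names and types are imposed at read time, not at load time.
def read_events_with_schema(path: str) -> pd.DataFrame:
    with open(path) as f:
        records = [json.loads(line) for line in f]
    df = pd.DataFrame.from_records(records)
    df["amount"] = df["amount"].astype(float)
    df["ts"] = pd.to_datetime(df["ts"])
    return df

print(read_events_with_schema("events.jsonl").dtypes)
```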

The “Collect” Principle

The challenge with the data lakehouse concept is that it attempts to be the one central repository in which to store all enterprise data. However, this has proven to be not only ultimately impossible, but also inadvisable.

If you look at today’s data landscape, it is an environment in which data is distributed, and will always be distributed. Despite our best efforts, it is still populated by multiple data warehouses, data lakes, applications, SaaS systems, and myriad databases, and they are not going away anytime soon, for a variety of economic, cultural, and legal reasons. All of these systems are out there collecting information, and it would be extremely time-consuming and expensive to copy all of that data into one centralized repository. This approach of copying everything into a single, central repository is known as the “collect” principle.

A Better Way Than the Data Lakehouse

A better approach, the “connect” principle, is to leave the data wherever it resides, connect to it logically as needed, and make a unified view of that information available to consumers, without physically moving any data. Applying the “connect” principle to the traditional, physically based data warehouse transforms it into a logical data warehouse.
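
As a rough illustration of the “connect” principle, the sketch below (in Python with pandas; the source names and schemas are hypothetical, not taken from any product) assembles a unified view only when a consumer requests it, querying each source in place rather than copying it into a central store.

```python
# Rough "connect, don't collect" sketch with hypothetical sources.
# Each source stays where it is and is queried live; the unified view
# is assembled on demand and nothing is copied into a central repository.
import sqlite3
import pandas as pd

def unified_order_view() -> pd.DataFrame:
    # Source 1: a relational database (hypothetical), queried at request time.
    with sqlite3.connect("orders.db") as conn:
        orders = pd.read_sql(
            "SELECT order_id, customer_id, total FROM orders", conn
        )

    # Source 2: a file maintained by another team (hypothetical), read on demand.
    customers = pd.read_csv("customers.csv")

    # The virtual integration step: join in memory and hand back one view.
    return orders.merge(customers, on="customer_id", how="left")

# Consumers call unified_order_view() whenever they need current data;
# the result always reflects the sources as they are right now.
```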

A logical data warehouse is an even better solution than a data lakehouse because it provides data to consumers in real time. And because consumers connect to the data rather than collect it, the data is never unnecessarily duplicated and never falls out of sync with the sources.

Logical data warehouses are powered by a data integration and data management technology called data virtualization, which provides the underlying “connect” rather than “collect” capability. Data virtualization establishes an enterprise-wide data-access layer that provides real-time views of disparate data sources.

With a logical data warehouse powered by data virtualization, business users can maintain uninterrupted, real-time access to data sources, even if those sources are in the process of being moved, as the data virtualization layer abstracts consumers from such complexities. In addition, by maintaining the connections between data consumers and data sources, the data virtualization layer provides a seamless foundation for enterprise-wide data security and governance protocols.
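
The abstraction such a layer provides can be pictured with a small, hypothetical sketch (again in Python; the catalog, function, and paths are illustrative, not any product’s API): consumers request data by a logical name, and a catalog maps that name to whatever physical location currently holds it, so relocating a source only changes the catalog entry, not the consumers.

```python
# Hypothetical sketch of how an access layer can hide source relocation.
# Consumers ask for data by logical name; a catalog resolves that name to
# the current physical location. Migrating a source changes one mapping,
# while every consumer keeps calling read_dataset("sales.orders") unchanged.
import pandas as pd

# Logical dataset name -> current physical location (paths are illustrative).
CATALOG = {
    "sales.orders": "/mnt/old_warehouse/orders.parquet",
}

def read_dataset(logical_name: str) -> pd.DataFrame:
    return pd.read_parquet(CATALOG[logical_name])

# After the orders data is migrated, only the catalog entry is updated:
CATALOG["sales.orders"] = "/mnt/new_lakehouse/orders.parquet"
```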

In Good Company

The logical data warehouse configuration was first proposed by Gartner in the first decade of the 2000s, and many companies have since leveraged it successfully through data virtualization. Reach out to learn more.

Ravi Shankar