First we had data warehouses, then came data lakes, and now the new kid on the block is the data lakehouse. But what is a data lakehouse and why should we develop one?
In a way, the name describes what it is. The name is a portmanteau, a blending of two words, data lake and data warehouse. It is an architecture that attempts to combine the best elements of the two popular architectures.
Best of Both Worlds?
Currently, in most organizations, data lakes and data warehouses operate side-by-side, as they have been developed for different use cases. Data warehouses support traditional forms of data consumption, such as reporting, dashboarding, and analytics, whereas data lakes are developed to support more experimental and investigative forms of analytics, such as data science, statistics, and quantitative analysis.
Often, both user groups require the same data. Therefore, data warehouses and data lakes contain data copied from the same source systems. The result is an undesirable situation in which data is stored in two different systems, developed with different storage technologies, managed by different specialists, copied with different ETL programs, secured twice, and so on.
To solve these problems, in a data lakehouse the two architectures are blended by storing the data once, using one technology that can be maintained by one group of specialists and can be accessed via both use cases: traditional analytics and investigative data science.
To enable this, data lakehouses store data so that they can support both use cases. The data is stored on low-cost cloud storage systems using open formats, making them accessible for any type of use case. Examples of such storage forms are HDFS with the Parquet file format, Apache Kudu or Apache Delta Lake. In other words, data is not stored in, for example, a database that can only be accessed with one specific database server. Many database servers use a storage format for data that only they understand, making it only accessible by that database server. So, if a database server is not suited to support additional new use cases, the data must first be copied to another data storage platform.
Storing it in an open format makes it accessible to all kinds of tools, including data science tools, Spark, SQL-on-everything engines, data virtualization servers, and all kinds of programming languages, including Java, Python, and R.
Required: A Modern Query Engine
This openness of the data is very useful for an organization. Unfortunately, many tools developed for traditional forms of analytics will have problems accessing those files directly. They need SQL. They require a query engine that operates between the analytical tools and the files. With such a query engine it can look as if the data stored in these open files is stored in a SQL database.
It is important that such a query engine would not only support a SQL interface on the data, but also all the requirements that apply to analytics, including the following:
- In addition to delivering data, the query engine must be able to deliver technical and business metadata and must support a search interface on that metadata for self-service business users.
- It must be able to show the lineage of data.
- It must be able to anonymize and pseudonymize data to comply with data privacy regulations.
- It must be able to exploit the full power of the data storage technology, which means it must support query pushdown.
- It must be optimized for a typical analytical workload, consisting of many concurrent users executing many complex queries that may involve joins of tens of tables.
- It must be able to present data with table structures that fit the analytical workload; this means, for example, that it must be able to transform the physical data structure into a data structure such as a star schema, using dimension and fact tables.
- It must be able to integrate data from the data lakehouse with other systems (without physical copying); nowadays, more and more data used for analytics does not come from a data warehouse, but directly from the source systems or a master data system.
The two primary categories that can offer a mature SQL interface on those open formats are SQL-on-everything engines and data virtualization servers. Note that these two product categories are blending more and more, in the same way that the two well-known architectures are blending to become a data lakehouse.
The data lakehouse is an interesting architecture. Therefore, it is worth checking out if you haven’t already. And if you do decide to develop such an architecture, make sure you select a query engine that supports the features above; a plain SQL interface alone is not sufficient.