Data Virtualization: The Key to a Successful Data Lake


If you’ve decided to implement a data lake, keep Gartner’s assessment in mind: about 80% of all data lake projects will fail. Naturally, you want to be in the 20% that succeed. But how do you get there?

Diving into Data Lakes

Roughly five years ago, all data lakes were based on Hadoop. Architects’ very first thought was about how to stand up the Hadoop cluster, and their very next was how best to load data into the lake. They never thought about how data was actually going to be consumed.

As a result, only the most technically minded individuals could access and use the data in these Hadoop-based data lakes. Inside the data lake were separate data silos, accessible only to the most sophisticated data analysts or data scientists. This did not provide an impressive return on investment. Companies would spend $5 million or $10 million to build out these data lakes, then discover that just 5% of their users could get any benefit from them.

Recently, data lakes have moved to the cloud, and organizations have been expanding their scope, so they’re not just for the most technical users.

Why Do Data Lakes Fail?

Opening up data lakes to “all users” added considerable complexity to the data lake and its supporting infrastructure. When multiple users from different departments access data, companies need data governance, security, and reliable ways for each of these individuals to easily find the data they need. These are the additional burdens that often trip up data lake architects. Data lake projects fail when organizations lack governance, security, and easy access, all of which are necessary for every user to benefit from the data lake.

The Data Virtualization Angle

Data virtualization is a technology that can easily overcome these limitations, dramatically improving the chances of a data lake project’s success. Data virtualization is a data integration technology, but rather than physically replicating data, data virtualization provides real-time views of the data, on demand, in its existing location, be it on premises or in the cloud.
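To make the contrast with physical replication concrete, here is a minimal sketch in Python. It is an illustration only, not how the Denodo Platform is implemented: the `VirtualView` class and the `orders` table are hypothetical, and an in-memory SQLite database stands in for a real source system. The view stores a query rather than a copy of the data, so every read reflects the current state of the source.

```python
import sqlite3

# Hypothetical source system: an in-memory SQLite database standing in
# for a real data store (assumption, for illustration only).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.execute("INSERT INTO orders VALUES (1, 10.0)")


class VirtualView:
    """Holds a query, not a copy of the data; rows are fetched on demand."""

    def __init__(self, connection, sql):
        self.connection = connection
        self.sql = sql

    def rows(self):
        # Each call runs against the source in its existing location.
        return self.connection.execute(self.sql).fetchall()


orders_view = VirtualView(source, "SELECT id, amount FROM orders ORDER BY id")
before = orders_view.rows()

# A change in the source is visible on the next read, with no reload step.
source.execute("INSERT INTO orders VALUES (2, 25.5)")
after = orders_view.rows()
```

Because nothing is copied, there is no replicated data set to refresh or keep in sync; the trade-off is that every read depends on the source being reachable at query time.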

Data virtualization helps companies focus on how different individuals can easily find data in the enterprise data lake. The Denodo Data Catalog, part of the Denodo Platform, provides a visual interface that enables users to browse through the available data, and even take a sample of it, regardless of where or how it is stored within the data lake.

Companies can establish this unified data access layer, enabled by data virtualization, right on top of the data lake. This layer would be above the data stores within the data lake, providing a single place to find the needed data.
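The idea of a single layer sitting above the individual data stores can be sketched as a small registry. This is a simplified illustration under stated assumptions: the `AccessLayer` class, its `register`, `search`, and `sample` methods, and the sample data sets are all hypothetical names, and plain Python lists stand in for the stores inside the lake.

```python
class AccessLayer:
    """One place to find data, regardless of which store holds it."""

    def __init__(self):
        self._datasets = {}

    def register(self, name, description, fetch):
        # `fetch` is a callable that reads from the underlying store on demand.
        self._datasets[name] = {"description": description, "fetch": fetch}

    def search(self, keyword):
        # Match the keyword against dataset names and descriptions.
        kw = keyword.lower()
        return sorted(
            name
            for name, entry in self._datasets.items()
            if kw in name.lower() or kw in entry["description"].lower()
        )

    def sample(self, name, limit=2):
        # Return the first few rows so a user can preview the data.
        return self._datasets[name]["fetch"]()[:limit]


# Hypothetical stores inside the lake (assumptions, for illustration only).
weblogs = [
    {"path": "/home", "hits": 3},
    {"path": "/pricing", "hits": 1},
    {"path": "/docs", "hits": 7},
]
warehouse_sales = [{"region": "EMEA", "revenue": 120.0}]

layer = AccessLayer()
layer.register("weblogs", "Raw web server logs in object storage", lambda: weblogs)
layer.register("sales", "Curated sales figures in the warehouse", lambda: warehouse_sales)

found = layer.search("logs")
preview = layer.sample("weblogs", limit=2)
```

Users search and sample through the layer alone; where and how each data set is stored stays hidden behind the registered `fetch` callable.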

Data virtualization also enables all users to access the data using the tools that they’re most familiar with. If I’m “Paul, the pivot-table wizard,” and I love using Excel, I can actually connect my Excel spreadsheet to the data virtualization layer, and it provides the data I need from Redshift, EMR, Athena, etc., without my ever having to consciously connect to the cloud.

Because data virtualization establishes a unified data access layer, it also provides a single place from which to manage data governance and security protocols across a company’s diverse data holdings. This enables companies to easily set universal access controls to secure the data, so that people can only see the data they are allowed to see.
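As a rough sketch of what managing access in one place can look like, the Python below defines column-level rules once and applies them to every read. The `POLICIES` table, the role names, and the `read_as` function are hypothetical and are not Denodo’s actual security model; real deployments would typically also cover row-level rules, masking, and auditing.

```python
# Hypothetical central policy: which columns each role may see.
POLICIES = {
    "marketing": {"customer_id", "product", "usage_minutes"},
    "finance": {"customer_id", "invoice_total"},
}


def read_as(role, rows):
    """Return rows with only the columns the given role is allowed to see."""
    if role not in POLICIES:
        raise PermissionError(f"no access policy defined for role {role!r}")
    allowed = POLICIES[role]
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


# Sample data standing in for a result from any underlying store.
rows = [
    {"customer_id": 1, "product": "CAD", "usage_minutes": 90, "invoice_total": 500.0},
]

marketing_rows = read_as("marketing", rows)
finance_rows = read_as("finance", rows)
```

Because every request passes through the same function, a rule changed in `POLICIES` takes effect for all sources at once instead of being reconfigured store by store.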

For maximum architectural flexibility, the Denodo Platform also enables companies to easily integrate data that is outside of a data lake, such as an on-premises data warehouse.
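Combining lake data with an external source at query time can be sketched as a small federated join. Everything here is an assumption made for illustration: the `clickstream` list stands in for data in the lake, an in-memory SQLite database stands in for an on-premises warehouse, and `usage_by_customer` is a hypothetical function name.

```python
import sqlite3

# Stand-in for data stored in the lake (assumption, for illustration only).
clickstream = [
    {"customer_id": 1, "clicks": 40},
    {"customer_id": 2, "clicks": 5},
]

# Stand-in for an on-premises data warehouse outside the lake.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)


def usage_by_customer():
    """Join lake and warehouse data at query time, without copying either."""
    names = {
        cid: name
        for cid, name in warehouse.execute("SELECT id, name FROM customers")
    }
    return [
        {"name": names[row["customer_id"]], "clicks": row["clicks"]}
        for row in clickstream
        if row["customer_id"] in names
    ]


combined = usage_by_customer()
```

The caller sees one combined result and does not need to know that the two halves live in different systems.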

The Autodesk Story

Autodesk, the developer of 3D drafting software and related products, built a data lake on Amazon Web Services (AWS), where the company stores large volumes of complex, multi-structured data from weblogs, click streams, and other sources. Leveraging the Denodo Platform, Autodesk seamlessly integrates this data with data coming from SAP and the company’s on-premises data warehouse. As a result, Autodesk’s marketing team and other business units gain timely insights into how customers are using Autodesk products, helping them reduce churn and meet other important business goals.

Data Lakes, Reimagined

There are clear benefits to leveraging data virtualization in conjunction with your next data lake project. It enables different people to easily access the data they need, whether it sits inside a data silo within the data lake or is completely external to it. With improved data governance, security, and access, organizations stand a much better chance of their data lake projects being among the 20% that succeed.

Author

Paul Moxon

Paul is Senior VP of Data Architectures and Chief Evangelist at Denodo, responsible for product management and solution architecture. He has over 20 years of experience with leading integration companies such as Progress Software, BEA Systems, and Axway.