Why 80% of data lakes fail and what you can do about it.
The issues and implications of Brexit are all over the news … regionalism vs. globalism, independence vs. shared governance and regulation, closed vs. open borders … and they don’t need to be debated again here. This blog explores the strong parallels in designing data lake architectures, and how to avoid the perils of isolation amid the enthusiasm for building data lakes.
According to Gartner, “through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” Gartner also predicts, “Through 2018, 70% of Hadoop deployments will fail to meet cost savings and revenue generation objectives due to skills and integration challenges.” In other words, 80% of data lakes will fail without a strong semantic layer and integration with other data silos.
The solution is simple – every data lake project should include data virtualization in its architecture. This creates a logical data warehouse (for analytics), a virtual data services layer (for operations and self-service information access), or both. This combination works. It enables data lakes to achieve low-cost storage and analytics of large unstructured data volumes by leveraging open source technologies such as Hadoop; at the same time, it allows business users to understand and use meaningful information, combining traditional and big data effectively for business goals.
You might ask: why not simply move all of my data into a data lake and use that, instead of adding data virtualization? There are many reasons why data virtualization is a good idea. Here are the top four that our customers cite as the most beneficial:
1. Semantic and Security Layers on top of a Data Lake
Semantic and security layers on top of a data lake increase its value and benefit to a wide range of users. Many data lake projects focus only on ingestion and storage and don’t think through how the data will be understood or accessed. This limits usage to a few data scientists rather than the broader audience that was originally envisioned. Data virtualization provides several capabilities that enhance the data lake as experienced by a broad range of users:
- Semantic layer – makes it easy to organize and access the data as canonical business views.
- Security layer – several big data technologies lack robust security. Data virtualization maintains control of the data through authentication, auditing, and logging of all access.
- Discovery layer – the semantic layer in turn facilitates self-service discovery of data assets by entity, name, attribute, etc.
- Sharing layer – large data sets can be repurposed for sharing in smaller chunks and in different formats, such as REST, RSS feeds, etc.
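To make the layering concrete, here is a minimal sketch of a semantic plus security layer over raw lake data. All names here (the roles, grants, and column mappings) are hypothetical illustrations, not the API of any particular data virtualization product:

```python
# Hypothetical sketch: a virtual "business view" over raw lake rows,
# with role-based access control and an audit log of every request.

RAW_ROWS = [
    {"cust_id": 1, "tx_amt_usd": 120.5, "ssn": "***"},
    {"cust_id": 2, "tx_amt_usd": 75.0, "ssn": "***"},
]

# Semantic layer: map cryptic source columns to canonical business names.
SEMANTIC_MAP = {"cust_id": "customer_id", "tx_amt_usd": "transaction_amount"}

# Security layer: which roles may see which business attributes.
ROLE_GRANTS = {"analyst": {"customer_id", "transaction_amount"}}

AUDIT_LOG = []  # every access attempt is recorded, granted or denied


def query_view(role):
    """Return the canonical business view, filtered by the caller's role."""
    allowed = ROLE_GRANTS.get(role, set())
    AUDIT_LOG.append({"role": role, "granted": bool(allowed)})
    if not allowed:
        raise PermissionError(f"role {role!r} has no grants on this view")
    return [
        {SEMANTIC_MAP[k]: v for k, v in row.items()
         if k in SEMANTIC_MAP and SEMANTIC_MAP[k] in allowed}
        for row in RAW_ROWS
    ]
```

Note that the raw `ssn` column never appears in the output: the semantic map doubles as an allow-list, which is one reason putting such a layer in front of the lake is safer than handing users raw files.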
2. Minimize Data Movement using Logical Data Warehouse (LDW)
While it may seem feasible to move all data to a data lake (since Hadoop storage is cheaper), it is still not recommended to put ALL your data there. Every time you replicate data you add both direct and indirect costs, and worse, you reduce your agility. Instead, use the LDW to provide an integrated view of structured and unstructured data across silos while delivering high performance. Today’s leading data virtualization platforms use dynamic performance optimization to move processing to where the data resides, leveraging the power of the underlying big data platforms. This frees you up to put only the data you need into big data lakes, leave the rest where it best belongs … historical, time series, unstructured, operational, external data, etc. … and learn to leverage data “where it is”.
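The pushdown idea can be sketched in a few lines. This is a deliberately simplified illustration (the `Source` class and its counters are invented for this example); real platforms rewrite SQL and ship it to Hadoop or the database, but the principle is the same: the filter runs inside each source, so only matching rows cross the wire:

```python
# Hypothetical sketch of query pushdown in a logical data warehouse.
# The two sources below stand in for a Hadoop cluster and a relational DB.

class Source:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.rows_shipped = 0  # proxy for data moved out of the source

    def query(self, predicate):
        """Execute the predicate inside the source; only matches leave it."""
        out = [r for r in self.rows if predicate(r)]
        self.rows_shipped += len(out)
        return out


def federated_query(sources, predicate):
    """Push the predicate down to each source, instead of pulling every
    row to the virtualization layer and filtering there."""
    results = []
    for src in sources:
        results.extend(src.query(predicate))  # filter runs at the source
    return results
```

Without pushdown, a query touching a 20-year history table would ship all 20 years to the middle tier; with pushdown, only the requested slice moves.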
3. Accommodate Dynamic Business Requirements
Dynamic business requirements are met faster with data virtualization’s capabilities to integrate new sources and new technologies.
- New sources of data – much faster to include in the LDW than having to move them first to a data lake. This provides a time-to-market advantage. It also supports temporary or transient use cases more responsively.
- New technologies – your data lake may be built on one flavor of big data technology today, but that can change rapidly as new technologies emerge. Data virtualization abstracts these changes from the user and allows the big data / analytics stack to evolve while delivering business value throughout the cycle. Significantly, it also avoids vendor lock-in and provides the flexibility to leverage pay-as-you-go cloud models.
4. Accelerate Transition from Raw Data to Actionable Insights
Data virtualization facilitates exploration of raw data assets, operationalization of analytics functions, and democratization of analytics for process change.
- Data Scientists – exploration phase: their job is to comb through vast amounts of data to find new insights. Access to an enterprise data catalog with rich metadata facilitates a better understanding of where to find those insights during the experimental stage of big data analytics. Data virtualization also helps create multiple sandboxes for data scientists.
- Business Analysts – operationalization phase: when insights have the potential to impact the business positively, time to market on operationalizing them becomes important. Data virtualization helps to operationalize and govern the execution of analytic functions and processes in real time, exploiting data lakes alongside other data stores as a data processing environment. This is accomplished through dynamic query pushdown, leveraging either in-database or in-cluster processing.
- Business Users – realization phase: in this phase the analytics – reactive, prescriptive, and predictive – are made actionable by democratizing the analytics output for multiple personas and downstream applications, enabling broader consumption in various output formats, including mobile-friendly formats such as REST and OData.
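As a final illustration, delivering analytics output in a consumable format can be as simple as wrapping result rows in a JSON envelope. The sketch below uses a minimal OData-style shape (the `value` array and `@odata.nextLink` paging hint are real OData v4 conventions, but real OData responses also carry `@odata.context` metadata and type annotations, omitted here):

```python
import json


def to_odata_payload(rows, next_link=None):
    """Wrap query results in a minimal OData-style JSON envelope so
    downstream and mobile apps can consume them over REST.
    Simplified: no @odata.context or per-property type annotations."""
    payload = {"value": rows}
    if next_link:
        payload["@odata.nextLink"] = next_link  # server-driven paging
    return json.dumps(payload)
```

The same virtual view can thus feed a BI dashboard as a SQL result set and a mobile app as a REST/JSON payload, with no extra copy of the data.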
Big data lakes and data virtualization go together and avoid the perils of a “data lakexit” down the road. A Logical Data Warehouse strategy is highly recommended by experienced analysts and consultants, and has proven effective at several leading companies.