Unlocking the Potential of Machine Learning in a Data Lake

With data fueling the intelligence of every organization, regardless of size or sector, it has become crucial to harness that data to achieve the best results, make the most informed decisions and improve productivity. However, every action, reaction and interaction produces a fresh load of data, resulting in an avalanche of information.

Managing the Avalanche

It is therefore essential to store and manage all the data of interest – both structured and unstructured – in one central repository. This repository, more commonly referred to as a data lake, has become the principal data management architecture for data scientists.

The benefits of a data lake are threefold:

  • They make data discovery easier
  • They reduce the time data scientists spend on selection and integration
  • They provide massive computing power, allowing data to be efficiently transformed and combined to meet the needs of any process that requires it.

A recent analyst report confirmed the success of the data lake, finding that organizations employing this architecture outperformed their peers by 9% in organic revenue growth.

Perhaps the main advantage of the data lake, especially for organizations keen to get ahead of the competition, is its machine learning capabilities. By using machine learning to analyze the historical data stored in the lake, businesses can glean enough intelligence and insight to forecast likely outcomes and work out how to achieve the best results for employee productivity, processes and so on.
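As a rough illustration (not from the original article), the sketch below trains a simple model on historical records pulled from the lake; the file path, column names and model choice are all hypothetical.

```python
# Minimal sketch: learning from historical data stored in a lake.
# The path, columns and target below are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Historical records as they might sit in the lake (e.g. a Parquet file).
history = pd.read_parquet("s3://lake/sales/history.parquet")  # hypothetical path

X = history[["units_sold", "discount", "day_of_week"]]  # hypothetical features
y = history["next_week_revenue"]                        # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# A forecast of "likely outcomes", scored against held-out history.
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```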

The Downside

Here’s the but… despite all these benefits, businesses continue to struggle with certain aspects of data delivery and integration. In fact, research shows that data scientists can spend up to 80 per cent of their time on these tasks – not the most efficient way of working!

So why are they struggling? First, unfortunately, storing data in its original form does not remove the need to adapt it later for machine learning processes, and this adaptation can become really complex. Over the last few years, data preparation tools have emerged specifically to make simple integration tasks more accessible to data scientists. These tools, however, are limited: they cannot help data scientists with more complex tasks that require a more advanced skill set. In these instances, an organization’s IT department is often called upon to create new data sets in the lake specifically for machine learning purposes, which of course slows down progress.

Furthermore, having all your data in the same physical place doesn’t exactly make the discovery part easy. Think about it: it’s the modern-day, digital equivalent of finding a needle in a haystack. In addition, big companies today have hundreds of repositories distributed across on-premise platforms, data centers, cloud providers and so on. It is therefore not surprising that only a small subset of all relevant data is actually copied to the lake.

So, What’s the Solution?

Ultimately, these issues with delivery and integration need to be addressed for organizations to unlock the full benefits of the data lake. Step forward, data virtualization.

Regardless of where your data is located or the format it is in, data virtualization provides a single access point by stitching together data abstracted from various underlying sources and delivering it to the consuming applications in real time. This way, even data that has not yet been copied to the lake is available to data scientists.
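To make the single-access-point idea concrete, here is a minimal sketch assuming the virtualization layer exposes a standard ODBC SQL endpoint, as most such products do; the DSN and view names are hypothetical.

```python
# Minimal sketch: one connection, one SQL dialect, many underlying sources.
# The DSN and the view names are hypothetical placeholders.
import pyodbc

# A single access point: the virtual layer, not the individual sources.
conn = pyodbc.connect("DSN=virtual_layer")  # hypothetical DSN

# To the consumer this is one ordinary join; behind the scenes, "customers"
# might live in a cloud warehouse and "clickstream" in the data lake.
rows = conn.cursor().execute("""
    SELECT c.customer_id, c.segment, SUM(k.page_views) AS total_views
    FROM customers c
    JOIN clickstream k ON k.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""").fetchall()

for row in rows:
    print(row.customer_id, row.segment, row.total_views)
```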

It also helps to address other challenges faced by data scientists:

  • Data discovery: Data virtualization provides a single point from which to expose all available data to consumers. It is user-friendly, too, especially tools with data cataloging capabilities that allow data scientists to search and browse every available data set. The technology liberates users and organizations alike by democratizing data and providing a fast, cost-effective way to access it
  • Data integration: The data is organized according to a consistent data representation and query model, meaning that regardless of where the data is originally stored, data scientists can view all of it as if it were stored in the same place. Reusable logical data sets can be created and then adapted to the needs of each individual machine learning process, taking the pain out of data integration and preparation (see the sketch after this list).
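As a rough sketch of that second point, and assuming the same hypothetical ODBC endpoint as above, a reusable logical view can be consumed directly and then shaped for one specific machine learning process; the view name and columns are invented for illustration.

```python
# Sketch: consuming a reusable logical data set from the virtual layer
# and adapting it for one particular ML process. Names are hypothetical.
import pandas as pd
import pyodbc
from sklearn.preprocessing import StandardScaler

conn = pyodbc.connect("DSN=virtual_layer")  # same hypothetical endpoint as above

# "customer_360" stands in for a reusable logical view that already joins
# and cleans data from several sources; this process selects only the
# columns it needs and applies its own preparation on top.
df = pd.read_sql(
    "SELECT tenure_months, monthly_spend, churned FROM customer_360", conn
)

features = StandardScaler().fit_transform(df[["tenure_months", "monthly_spend"]])
labels = df["churned"]
```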

Improving the Productivity of Data Scientists

The machine learning market is expected to grow by 44 per cent over the next four years, as companies seek ever more meaningful insight. As businesses continue to look to modern analytics and machine learning as a means of improving their operational efficiency, the need for technologies like data virtualization will also grow.

By enabling data scientists to discover and integrate data with ease, data virtualization can support them in exposing the results of machine learning analysis, opening the door to a whole new world of possibilities for driving real business value from a wealth of data.


Alberto Pan