The business value of a data lake for data scientists is beyond dispute. Everyone understands that bringing all the data together in one place makes access to data easy and quick for data scientists. Studies have shown that data scientists spend around 80% of their time on data preparation, and a large part of that time is wasted on gathering the data they need for analytics. A data lake reduces this waste of time and enables data scientists to start sooner with their real work: data analysis.
But must a data lake be a physical data lake? According to the original definition of a data lake, the answer is yes. Data needed by the data scientists is copied from its original data source to the physical data lake. This is reflected clearly in how James Serra defines the data lake: “A data lake is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.”
Copying and moving all the data physically to one centralized environment can lead to a wide range of insurmountable problems and challenges:
- Big data can be too big to move and too costly to store twice
- Company politics can prohibit copying of data owned by divisions or departments to a centralized environment
- Data privacy and protection regulations can prohibit storage of specific types of data together
- Data in a data lake is stored outside its original security realm
- Metadata describing the data is commonly not copied along with the data and therefore not available to the data scientists
- Some data sources, such as old mainframe databases, can be hard to copy and to keep in their original format
- Technical and organizational management of a data lake is required
Data scientists need quick and easy access to all the data they need, but must the solution be based on one centralized physical environment? Compare this to business users of a BI environment. They don’t ask for data warehouses or data marts, they ask for reports and dashboards that show data in a form that helps them with their decision making. The data warehouse or data mart are just implementations or solutions. The same applies for data lakes. Data lakes are not what the data scientists ask for, they ask for easy and quick data access. The data lake is a solution, one possible solution.
A different and more practical solution to fulfill the needs of data scientists is the logical data lake: a system that behaves as if all the data is stored in one centralized environment while, in fact, it can leave the data in its original sources. The goal of a logical data lake is to allow data scientists to get to the data easily and quickly, while hiding where the data is physically stored and whether or not it has been copied.
Logical data lakes can be developed with data virtualization servers such as the Denodo Platform. The Denodo Platform allows a heterogeneous set of data sources to be presented as one logical database.
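The core idea of presenting heterogeneous sources as one logical database can be sketched in a few lines, here using SQLite views purely as a stand-in for a data virtualization server. The table names and data are hypothetical; a real platform such as Denodo would wrap live external sources rather than in-memory tables.

```python
import sqlite3

# Two independent "sources" (hypothetical names), each kept in its
# own native shape: a CRM table and an ERP table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crm_customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO crm_customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.execute("CREATE TABLE erp_customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO erp_customers VALUES (?, ?)",
                 [(3, "Carol")])

# The "logical" layer: one view presents both sources as a single
# table, hiding where each row physically lives.
conn.execute("""CREATE VIEW all_customers AS
                SELECT id, name FROM crm_customers
                UNION ALL
                SELECT id, name FROM erp_customers""")

rows = conn.execute(
    "SELECT id, name FROM all_customers ORDER BY id").fetchall()
print(rows)  # one query, rows drawn from two separate sources
```

The data scientist only ever queries `all_customers`; whether the rows come from one system or five is invisible at the query level, which is exactly the abstraction a logical data lake provides.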
Developing a logical data lake doesn’t imply that the original data sources are always accessed when the data scientists run their queries. It means that copying the original data is not the default approach to making data available for analytics. Copying is only used when technical or organizational reasons dictate that a copy must be created, and if copies are made, a data virtualization server makes them under its own control. So, whereas copying and physically storing the data twice is the default approach for the physical data lake, it’s optional for the logical data lake. This makes it easier for a logical data lake to deal with the problems and challenges described above, while retaining the ease with which data scientists access data.
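The "copying is optional and controlled" behavior can be illustrated with a minimal sketch: queries go to the original source by default, and a copy is only materialized when explicitly requested. All names here are illustrative, not a real Denodo API.

```python
class VirtualTable:
    """Sketch of a virtualization layer: fetch from the source by
    default, materialize a controlled copy only on request."""

    def __init__(self, fetch_from_source):
        self._fetch = fetch_from_source  # callable hitting the source
        self._copy = None                # optional materialized copy

    def materialize(self):
        # Copying is an explicit, controlled decision, not the default.
        self._copy = self._fetch()

    def query(self):
        # Serve from the copy if one exists, otherwise go to the source.
        return self._copy if self._copy is not None else self._fetch()

source_trips = []
def source():
    source_trips.append(1)  # count round trips to the original source
    return [("Alice", 100), ("Bob", 200)]

t = VirtualTable(source)
t.query(); t.query()   # no copy yet: each query hits the source
t.materialize()        # copy created under the layer's control
t.query(); t.query()   # now served from the copy, no source access
print(len(source_trips))  # 3: two direct queries plus the one copy
```

The data scientist's query is identical before and after `materialize()`; only the virtualization layer knows whether a physical copy is involved.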
To summarize, the logical data lake offers the best of both worlds: access to original data without copying when possible, and access to copied data when needed. The physical data lake only offers the second option. The goal of the logical data lake is to offer data scientists easy and quick access to data, not to create a large and complex data storage environment.