Data Virtualization and the Hadoop Ecosystem

The Hadoop ecosystem is here to stay. Having been one of the most important big data enablers in recent years, and positioned to continue being so in the years to come, Hadoop is nowadays one of the key target data sources for general data integration systems such as data virtualization platforms. But Hadoop isn’t just a database or a piece of software. It is a complex ecosystem of highly heterogeneous software living in a distributed data environment – almost an entire operating system by itself. So integrating Hadoop as a data source introduces a series of challenges that most other systems don’t, and we can start by asking ourselves what integrating Hadoop as a data source actually means.

Let’s approach the scenario by defining a series of discrete integration points, which we will classify into two groups: basic and specialized. These integration points will give the data virtualization platform the ability to use specific parts of a Hadoop installation as separate data sources, each with its own specificities and capabilities.

Basic integration points

One of the two most important components of Hadoop’s core is the Hadoop Distributed File System (HDFS). HDFS is a file system separate from the host operating system’s, maintained so that all files stored in it are distributed among the different nodes of the Hadoop cluster. HDFS is extremely relevant to data virtualization (DV) platforms because it is where all data is stored, independently of the specific Hadoop-enabled pieces of software that may be making use of that data. If we have custom MapReduce tasks being executed, they will output their results into HDFS files; if we use Hive or HBase, they will be storing their data on HDFS; if we make use of any Hadoop scripting languages, they will be acting on HDFS files. That’s where the data is, and we might need to access it one way or another, just as we might need to directly access files in local or remote folders of our DV platform’s host system, whatever the format or the piece of software that authored those files.

Besides a command line, HDFS offers a binary API library that can help data virtualization platforms access the data inside it. This API provides support for the different types of files in the file system, as well as the operations that can be performed on them. But it is a binary API, so by using it we are establishing a hard link, a dependency, between our software and Hadoop’s own APIs. This is not a big issue when developing in-house, custom-tailored solutions, but it poses an important challenge when creating generic data integration tools that should work in a variety of different scenarios out of the box. Also, data virtualization systems are normally deployed remotely from the Hadoop cluster, so the benefits of using high-performance binary libraries like these are often limited compared to the use of more standard interfaces.
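As an illustration, the following is a minimal sketch of what that binary access looks like through Hadoop’s Java FileSystem API. The NameNode address and file path are hypothetical placeholders; a real deployment would normally pick up this configuration from the cluster’s core-site.xml.

```java
// Minimal sketch: reading an HDFS file through Hadoop's Java FileSystem API.
// The cluster URI and path below are hypothetical placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; usually taken from core-site.xml instead
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

        Path file = new Path("/data/output/part-00000"); // example path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Note how the code compiles directly against Hadoop’s client classes: this is exactly the coupling to specific Hadoop versions mentioned above.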

Most Hadoop installations also allow accessing HDFS and its operations via REST APIs, mainly by means of two interfaces called WebHDFS and HttpFS. There are conceptual differences between the two (e.g. WebHDFS will redirect clients to the specific nodes where the data lives, whereas HttpFS can act as a single-server proxy), and this might make us prefer one over the other depending on the specific scenario. In general, however, they are interoperable, they offer very good performance and, most importantly, they provide external/remote data integration software, like DV, with a standard and decoupled interface for accessing data on top of the HTTP protocol.
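For comparison, a similar read over WebHDFS needs nothing beyond a plain HTTP client. The sketch below relies on WebHDFS’s documented URL pattern, where an OPEN operation redirects the client to the DataNode holding the data; the host, port and path are assumptions, and security is left out for brevity.

```java
// Minimal sketch: reading a file through the WebHDFS REST API using only the JDK.
// Host, port and path are placeholders; assumes an unsecured cluster (no SPNEGO).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsReadExample {
    public static void main(String[] args) throws Exception {
        // OPEN operation: the NameNode answers with a redirect to the DataNode holding the data
        URL url = new URL(
            "http://namenode.example.com:9870/webhdfs/v1/data/output/part-00000?op=OPEN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // follow the redirect transparently

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```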

So, once access to the file system is sorted, what other basic/core integrations might we need? Security, of course!

Security in Hadoop is almost an industry by itself. There is a wide range of solutions for data encryption, authentication and authorization in Hadoop systems, as there should be, since we are talking about protecting large amounts of potentially sensitive data and the processes that deal with it. Over the years, the different enterprise-grade Hadoop distribution vendors have been adding their own contributions to the ecosystem, and security is one of the fields where this innovation (and heterogenization) has been most visible.

Nevertheless, from the standpoint of a data virtualization platform aiming to integrate Hadoop (or parts of Hadoop) as a data source, we normally approach Hadoop security as external/remote clients, and therefore most of the encryption and authorization mechanisms should be transparent to us. This means we can focus on authentication, and specifically on the most widespread authentication mechanism across almost every Hadoop service: Kerberos.

By setting up a Key Distribution Center (KDC), Kerberos is able to protect all user passwords in a Hadoop installation in a centralized manner. But this requires Hadoop clients to be prepared to speak the Kerberos protocol during authentication, that is, to obtain and manage specific authentication artifacts (tickets) and send them to the kerberized services inside Hadoop. This in turn requires the presence of Kerberos client software, which the data virtualization system will integrate (or at least talk to) in order to establish secure communication channels with Hadoop services.
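In a JVM-based client, this handshake is often handled through Hadoop’s UserGroupInformation helper. The sketch below shows a keytab-based login; the principal, realm and keytab path are hypothetical, and a real setup would also need krb5.conf and the cluster’s *-site.xml files available to the client.

```java
// Minimal sketch: authenticating a Hadoop client via Kerberos using a keytab,
// through Hadoop's UserGroupInformation helper. Principal and keytab path are
// hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");

        UserGroupInformation.setConfiguration(conf);
        // Obtain a ticket from the KDC using the keytab (no interactive password prompt)
        UserGroupInformation.loginUserFromKeytab(
            "dvuser@EXAMPLE.COM", "/etc/security/keytabs/dvuser.keytab");

        // From here on, Hadoop client calls (HDFS, HBase, ...) run as the kerberized user
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}
```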

In the case of HTTP REST APIs, like the WebHDFS and HttpFS interfaces already mentioned, a specific mechanism called Kerberos SPNEGO is usually available and can be used from the data virtualization platform to access these services.
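As a rough illustration, the hadoop-auth client classes can perform the SPNEGO negotiation for us. The snippet below assumes a valid Kerberos ticket is already available to the JVM (for example via kinit or a keytab login as sketched above); the host, port and path are placeholders.

```java
// Minimal sketch: calling a SPNEGO-protected WebHDFS endpoint with the hadoop-auth
// client classes. Assumes a Kerberos ticket is already available to the JVM;
// host, port and path are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.security.authentication.client.AuthenticatedURL;
import org.apache.hadoop.security.authentication.client.KerberosAuthenticator;

public class SpnegoWebHdfsExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://namenode.example.com:9870/webhdfs/v1/data?op=LISTSTATUS");

        // The token caches the negotiated session so follow-up calls skip the handshake
        AuthenticatedURL.Token token = new AuthenticatedURL.Token();
        HttpURLConnection conn =
            new AuthenticatedURL(new KerberosAuthenticator()).openConnection(url, token);

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // JSON listing of the directory
            }
        }
    }
}
```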

With HDFS and security, we have our basics covered: we can access data in a Hadoop installation in a secure manner. But actually, most Hadoop installations don’t simply operate on custom-developed MapReduce tasks that output files onto HDFS. Instead, more complex Hadoop-enabled software runs on top of the Hadoop core, software that can perform data storage, querying and analysis in more efficient ways and that data virtualization platforms can use as data sources instead of directly accessing HDFS. That’s the point at which we leave the Hadoop core and start talking about specialized integration points, i.e. integration with specific data services running on the Hadoop system.

Specialized integration points

Hadoop is a very healthy and prolific ecosystem, and there is a very large number of different data-oriented tools that can run on top of the Hadoop core. From the standpoint of data virtualization platforms, many of these tools could be used as data sources, but their heterogeneity is such that each of them has to be studied separately. From the data consumer’s standpoint, there is no such thing as “Integration with Hadoop”, but rather “Integration with Hadoop’s X service”.

Let’s comment briefly on the two most popular Hadoop data services: Apache HBase and Apache Hive.

Apache HBase is a NoSQL data store that runs on top of HDFS. Its key aspect is that it offers random, real-time access to data living in HDFS (something HDFS cannot provide by itself). It has the form of a key-value data store modeled after Google’s BigTable design and offers a variety of access methods to its data depending on the specific Hadoop distribution, from binary API libraries to REST interfaces and other options, making use of diverse security mechanisms (mainly based on Kerberos authentication).

From the standpoint of a data virtualization platform, accessing HBase will normally mean remote access. For this, REST APIs could be a good, standard and decoupled choice, but for performance or architectural reasons we might prefer to go for the binary APIs. In that case, as with direct HDFS access, we will be facing high coupling of our code with specific versions of these binary APIs, so we might in fact be trading ease of maintenance for a bit of performance. Note that HBase is not a relational data store, so it doesn’t (directly) offer any standard SQL-based interface that we can access remotely via standard APIs like JDBC or ODBC.
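To give an idea of what the binary route involves, here is a minimal sketch of a random read against an HBase table with the standard Java client. The table name, row key, column family and ZooKeeper quorum are all hypothetical, and on a secured cluster the configuration would also carry the Kerberos settings.

```java
// Minimal sketch: a random read ("Get") against an HBase table using the binary
// Java client API. Table, row key, column names and ZooKeeper quorum are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com"); // hypothetical quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) {

            Get get = new Get(Bytes.toBytes("row-0001"));       // random access by row key
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```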

Apache Hive is a data query and analysis tool for large datasets. It works on top of HDFS and provides a SQL-like interface over it, which is ideal for easy querying from a data virtualization platform. Furthermore, Hive doesn’t only work directly on HDFS-stored datasets: it can also make use of existing HBase infrastructure, bringing Hive’s powerful and flexible data analytics capabilities to data already stored and/or processed in HBase. Data virtualization platforms can easily access Hive services through standard JDBC or ODBC drivers, authenticating via Kerberos and integrating data coming from the Hadoop installation almost as if it were any other relational DBMS.
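As a sketch of how simple that access can be, the snippet below opens a JDBC connection to a hypothetical HiveServer2 endpoint and runs a query; the host, database, table and Kerberos principal are placeholders, and a kerberized connection additionally requires a valid ticket in the client’s context.

```java
// Minimal sketch: querying Hive through the standard HiveServer2 JDBC driver.
// URL, table and Kerberos principal are hypothetical placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; newer driver jars also auto-register via ServiceLoader
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // The "principal" part is only needed on kerberized clusters
        String url = "jdbc:hive2://hive.example.com:10000/default;"
                   + "principal=hive/_HOST@EXAMPLE.COM";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT name, total FROM sales LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getLong("total"));
            }
        }
    }
}
```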

However, HBase and Hive are only two (very popular) examples. The number of Hadoop data services that can be used as DV data sources is large, and there are many other popular software packages, like Apache Phoenix, Cloudera Impala, Pivotal HAWQ, MapR-DB, etc. In almost every case, we will find the DV system acting as a remote client and retrieving data through binary libraries, REST-based APIs or SQL APIs like JDBC, using Kerberos authentication. The chosen interfaces will, in each case, determine the maintainability, performance and development effort of the integration, and it will be a matter of adopting, for each scenario, the methods and architectures that help the data virtualization platform extract the most value out of the Hadoop big data system.

Denodo Labs