Reading Time: 2 minutes

Data caching modes – when and how to use them.

Intelligent data caching is a key capability of an enterprise-class Data Virtualization platform and, working in tandem with real-time query optimization and scheduled batch operations, is one of the key features that differentiates the best-of-breed platforms from standard data federation platforms masquerading as data virtualization offerings. Caching can provide the right combination of high performance, low information latency, minimal source impact, and reduced cost by avoiding needless replication.

The Data Virtualization platform cache should support multiple modes of operation to provide the flexibility that is required for different data source loads, latencies, and volumes coupled with different consumption styles and SLAs. No two Data Virtualization scenarios are alike, so the cache needs to provide the flexibility to deal with any given scenario.

An enterprise-class Data Virtualization cache should support full and partial caching modes. Full caching is when all of the necessary data from the data source is cached and all subsequent queries are fulfilled by the cache rather than the original data source. A typical use of full caching mode is to cache data from a particularly slow data source prior to the start of the business day. Another use of the full cache mode is to protect operational data sources from additional loads. If the data source is sensitive to any additional query load, the data can be queried during a quiet period and cached in the Data Virtualization platform – all subsequent queries use the cached data rather than querying the operational data source.
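To make the full cache mode concrete, here is a minimal illustrative sketch in Python – not any vendor's actual API. The `FullCache` class, the in-memory dictionary, and the sample order data are all invented for illustration; a real platform would bulk-load from a database and persist the cache.

```python
# Toy sketch of full caching mode (assumed names, not a real product API).
# The entire source is copied once during a quiet period; after that,
# every query is answered by the cache and the source is never touched.

class FullCache:
    def __init__(self, source):
        self.source = source   # stand-in for a slow or load-sensitive data source
        self.cache = {}

    def refresh(self):
        """Pre-load all data, e.g. before the start of the business day."""
        self.cache = dict(self.source)   # one bulk read of the source

    def query(self, key):
        # Served entirely from the cache, protecting the operational source.
        return self.cache[key]

orders = {"A1": 100, "A2": 250}   # hypothetical slow data source
c = FullCache(orders)
c.refresh()                        # scheduled batch load during a quiet period
print(c.query("A2"))               # fulfilled from the cache: 250
```

The key property is in `query`: once `refresh` has run, no query ever reaches the original source, which is exactly the protection described above.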

Partial caching mode, on the other hand, does not require you to have all of the data from the data source in the cache. When a query is executed, the Data Virtualization platform checks whether the cache contains the data required to answer the query and, if it does not, the query is made against the original data source. Partial cache mode can be used in the same situations as full cache mode, but where having all of the data in the cache is either impossible or impractical. Using the partial cache mode, you can pre-load the most important or frequently used data into the cache, and queries requiring this data will be fulfilled from the cache. Other queries (i.e. those that cannot be satisfied from the cache) will be made against the original data sources. In fact, it is not necessary to know what data is the most important or most frequently requested: the cache can be configured to store the results of queries fulfilled by the original data source, thereby populating itself with previously requested data. As common queries are repeated, they will be fulfilled from the cache instead of going to the original data source.
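The partial cache behavior described above can be sketched in a few lines of Python. Again, this is an illustrative toy (the `PartialCache` class and sample data are assumptions, not a real platform's API); it shows both optional pre-loading of hot data and miss-driven population from the source.

```python
# Toy sketch of partial caching mode (assumed names, not a real product API).
# Hot data can be pre-loaded; any other query falls through to the source,
# and the result is kept so repeated queries stop hitting the source.

class PartialCache:
    def __init__(self, source):
        self.source = source
        self.cache = {}
        self.source_hits = 0   # counts queries the source had to serve

    def preload(self, keys):
        """Optionally seed the cache with the most frequently used data."""
        for k in keys:
            self.cache[k] = self.source[k]

    def query(self, key):
        if key not in self.cache:                # cache miss
            self.source_hits += 1
            self.cache[key] = self.source[key]   # populate cache from result
        return self.cache[key]

orders = {"A1": 100, "A2": 250, "A3": 75}   # hypothetical data source
c = PartialCache(orders)
c.preload(["A1"])     # pre-load known-hot data
c.query("A1")         # served from the cache
c.query("A3")         # miss: goes to the source once...
c.query("A3")         # ...then served from the cache
print(c.source_hits)  # 1
```

Note that the self-populating behavior in `query` means you do not need to know the hot data in advance: the second request for `"A3"` never reaches the source.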

The cache within a Data Virtualization platform serves many purposes: managing real-time performance across disparate data sources with varying latencies, minimizing the movement of data based on frequent query patterns, reducing or managing the impact of data virtualization on source systems, and protecting against intermittent source system availability. Its flexibility in addressing these different requirements comes from the variety of operating modes and options that allow you to configure the caching to suit your needs. Careful use of the cache within your Data Virtualization platform can dramatically affect the performance and scalability of both the platform and your underlying source systems.

Paul Moxon