1.4 About Storage Indexes

For data stored in HDFS, Oracle Big Data SQL maintains Storage Indexes automatically, which is transparent to Oracle Database. Storage Indexes contain the summary of data distribution on a hard disk for the data that is stored in HDFS. Storage Indexes reduce the I/O operations cost and the CPU cost of converting data from flat files to Oracle Database blocks. You can think of a storage index as a "negative index". It tells Smart Scan that data does not fall within a block of data, which enables Smart Scan to skip reading that block. This can lead to significant I/O avoidance.

Storage Indexes can be used only for external tables that are based on HDFS and are created using either the ORACLE_HDFS driver or the ORACLE_HIVE driver. Storage Indexes cannot be used for external tables based on object stores or external tables that use StorageHandlers, such as Apache HBase and Oracle NoSQL.

A Storage Index is a collection of in-memory region indexes, and each region index stores summaries for up to 32 columns. There is one region index for each split. The content stored in one region index is independent of the other region indexes. This makes them highly scalable, and avoids latch contention.

Storage Indexes maintain the minimum and maximum values of the columns of a region for each region index. The minimum and maximum values are used to eliminate unnecessary I/O, also known as I/O filtering. The cell XT granule I/O bytes saved by the Storage Indexes statistic, available in the V$SYSSTAT view, shows the number of bytes of I/O saved using Storage Indexes.

Queries using the following comparisons are improved by the Storage Indexes:

  • Equality (=)

  • Inequality (<, !=, or >)

  • Less than or equal (<=)

  • Greater than or equal (>=)

  • IS NULL

  • IS NOT NULL

Storage Indexes are built automatically after Oracle Big Data SQL service receives a query with a comparison predicate that is greater than the maximum or less than the minimum value for the column in a region.

Note:

  • The effectiveness of Storage Indexes can be improved by ordering the rows in a table based on the columns that frequently appear in the WHERE query clause.

  • Storage Indexes work with any non-linguistic data type, and works with linguistic data types similar to non-linguistic index.

Example 1-1 Elimination of Disk I/O with Storage Indexes

The following figure shows a table and region indexes. The values in column B in the table range from 1 to 8. One region index stores the minimum 1, and the maximum of 5. The other region index stores the minimum of 3, and the maximum of 8.

For a query such as the one below, only the first set of rows match. Disk I/O is eliminated because the minimum and maximum of the second set of rows do not match the WHERE clause of the query.
SELECT * 
FROM TABLE 
WHERE B < 2;

Example 1-2 Using Storage Indexes and Bloom Filters

Using Storage Indexes allows table joins to skip unnecessary I/O operations. For example, the following query would perform an I/O operation and apply a Bloom filter to only the first block of the fact table. Bloom filters are the key to improved join performance. In the example, a predicate is on the dimension table - not the fact table. The Bloom Filter is created based on "dim.name=Hard drive" and this filter is then applied to the fact table. Therefore, even though the filter is on the dimension table, you are able to filter the data at its source (i.e. Hadoop) based on the results of the dimension query. This also enables optimizations like Storage Indexes to engage.

SELECT count(*) 
FROM fact, dimension dim  
WHERE fact.m=dim.m and dim.product="Hard drive";

The I/O for the second block of the fact table is completely eliminated by Storage Indexes as its minimum/maximum range (5,8) is not present in the Bloom filter.