1.6 About Pushdown of Character Large Object (CLOB) Processing

Queries against Hadoop data may involve processing large objects with potentially millions of records. It is inefficient to return these objects to Oracle Database for filtering and parsing. Oracle Big Data SQL can provide significant performance gains by pushing CLOB processing down to its own processing cells on the Hadoop cluster. Filtering in Hadoop reduces the number of rows returned to Oracle Database. Parsing reduces the amount of data returned from a column within each filtered row.

Customers can disable or re-enable CLOB processing pushdown to suit their own needs.

In the current release, this functionality currently applies only to JSON expressions returning CLOB data. The eligible JSON filter expressions for storage layer evaluation include simplified syntax, JSON_VALUE, and JSON_QUERY.

The same support will be provided for other CLOB types (such as substr and instr) as well as for BLOB data in a future release.

Oracle Big Data SQL can push processing down to Hadoop for CLOBs within these size constraints:

  • Filtering for CLOB columns up to 1 MB in size.

    The actual amount of data that can be consumed for evaluation in the storage server may vary, depending upon the character set used.

  • Parsing for columns up to 32 KB.

    This limit refers to the select list projection from storage for the CLOB datatype.

Processing falls back to the Oracle Database only when column sizes exceed these two values.

Example 1-3 JSON Document Processing

For queries into large JSON documents, pushdown of CLOB processing to Oracle Big Data SQL processing cells in Hadoop can be highly effective. Consider the following example, where purchase orders information is stored in JSON. Assume that this record could be up to 25K in size and several millions of such records must processed.
{"ponumber":9764,"reference":"LSMITH-20141017","requestor":"Lindsey Smith","email”: “Lindsey@myco.com”, “company”:”myco” …}
You can create the external table to access this data as follows. Notice there is a single CLOB column.
CREATE TABLE POS_DATA
  ( pos_info CLOB )
  ORGANIZATION EXTERNAL
  ( TYPE ORACLE_HDFS
    DEFAULT DIRECTORY DEFAULT_DIR
    LOCATION ('/data/pos/*')
  )
 REJECT LIMIT UNLIMITED;
You can then query the data with this simple syntax:
SELECT p.pos_info.email, p.pos_info.requestor
FROM POS_DATA p
WHERE p.pos_info.company=’myco’

The query example above engages two data elimination optimizations:

  • The data is filtered by the Oracle Big Data SQL cells in the Hadoop cluster. Only records pertaining to the company “myco” are parsed (and after parsing only selected data from these records is returned to the database).

  • The Oracle Big Data SQL cells in the cluster parse the filtered set of records and from each record only the values for the two attributes requested (p.pos_info.email and p.pos_info.requestor) are returned to the database.

The table below shows some other examples where CLOB processing pushdown is supported. Remember that projections (references on the select side of the CLOB column) are limited to 32 KB of CLOB data, while predicate pushdown is limited to 1 MB of CLOB data.

Query Comment
SELECT count(*) FROM pos_data p WHERE pos_info is json; In this case, the predicate ensures that only columns which comply with JSON format are returned.
SELECT pos_info FROM pos_data p WHERE pos_info is json; The same predicate as in the previous case, but now the CLOB value is projected.
SELECT json_value(pos_info, '$.reference') FROM pos_data p WHERE json_value(pos_info, '$.ponumber') > 9000 Here, the predicate is issued on a field of the JSON document, and we also execute a JSON value to retrieve field "reference" on top of the projected CLOB JSON value.
SELECT p.pos_info.reference FROM pos_data p WHERE p.pos_info.ponumber > 9000; This is functionally the same query as the previous example, but expressed in simplified syntax.
SELECT p.pos_info.email FROM po_data p WHERE json_exists(pos_info, '$.requestor') and json_query(pos_info, '$.requestor') is not null; This example shows how json_exists and json_query can also be used as predicates.