3 Use Oracle Big Data SQL to Access Data
Oracle Big Data SQL enables you to query diverse data sources using the full power of Oracle SQL SELECT statements.
You can create Oracle Big Data SQL enabled external tables over data from Hive, Hadoop, Apache Kafka, Oracle NoSQL Database, and object stores.
3.1 About Creating External Tables
Oracle Big Data SQL enables you to query external data through Oracle Big Data SQL enabled external tables from the Oracle Database using the full power of Oracle SQL SELECT statements. It also enables you to write queries that join Oracle tables with external data, leverage robust Oracle Database security features, and take advantage of advanced SQL capabilities like analytic functions, JSON handling, and others.
The CREATE TABLE statement has a clause specifically for creating external tables. The information that you provide in this clause enables the access driver to read data from an external source and prepare the data for the external table.
Note:
If you create Oracle external tables with mismatched column names for a data source file that has its own metadata (for example, Parquet files), the query returns NULL for those columns. For example, if one Parquet file has columns (colA, colB, colC), a second Parquet file has columns (colA, colB, colD), and you create Oracle external tables with columns (colA, colB, colC, colXX), then the colXX column always returns NULL for Oracle queries. Any query with colXX as a predicate (for example, where colXX = 'YES') returns 0 rows, unless you specify the predicate "where colXX is NULL".
3.1.1 Basic Create Table Syntax
The following is the basic syntax of the CREATE TABLE statement for external tables:
CREATE TABLE table_name (column_name datatype, column_name datatype[,...]) ORGANIZATION EXTERNAL (external_table_clause);
You specify the column names and data types the same as for any other table. ORGANIZATION EXTERNAL identifies the table as an external table.
The external_table_clause identifies the access driver and provides the information that it needs to load the data. See "About the External Table Clause".
3.1.2 About the External Table Clause
The CREATE TABLE ORGANIZATION EXTERNAL statement takes the external table clause as its argument. The external table clause has the following subclauses:
TYPE
The TYPE clause identifies the access driver. The type of access driver determines how the other parts of the external table definition are interpreted.
Specify one of the following values for Oracle Big Data SQL:
- ORACLE_HDFS: Accesses files in an HDFS directory.
- ORACLE_HIVE: Accesses a Hive table.
- ORACLE_BIGDATA: Accesses files in an object store.
Note:
The ORACLE_DATAPUMP and ORACLE_LOADER access drivers are not associated with Oracle Big Data SQL.
DEFAULT DIRECTORY
The DEFAULT DIRECTORY clause identifies an Oracle Database directory object. The directory object identifies an operating system directory with files that the external table reads and writes.
ORACLE_HDFS, ORACLE_BIGDATA, and ORACLE_HIVE use the default directory solely to write log files on the Oracle Database system.
LOCATION
The LOCATION clause identifies the data source.
REJECT LIMIT
The REJECT LIMIT clause limits the number of conversion errors permitted during a query of the external table before Oracle Database stops the query and returns an error.
Any processing error that causes a row to be rejected counts against the limit. The reject limit applies individually to each parallel query (PQ) process. It is not the total of all rejected rows for all PQ processes.
ACCESS PARAMETERS
The ACCESS PARAMETERS clause provides information that the access driver needs to load the data correctly into the external table.
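To see how these subclauses fit together, here is a minimal sketch of a complete statement over delimited text in HDFS. The table name, columns, and HDFS path are illustrative assumptions rather than values from this guide.

CREATE TABLE movie_facts
  (movie_id  NUMBER,
   title     VARCHAR2(200),
   rating    NUMBER)
ORGANIZATION EXTERNAL
  (TYPE ORACLE_HDFS                  -- access driver
   DEFAULT DIRECTORY DEFAULT_DIR     -- directory object used for log files
   ACCESS PARAMETERS
   (
     com.oracle.bigdata.rowformat=DELIMITED FIELDS TERMINATED BY ','
   )
   LOCATION ('hdfs:/data/movies/*')  -- data source
  )
  REJECT LIMIT UNLIMITED;            -- no cap on conversion errors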
3.2 Create an External Table for Hive Data
You can leverage Hive metadata when creating your Oracle Big Data SQL external tables.
External table creation is a technique for accessing not only data in HDFS, but also data in other storage types, including Oracle NoSQL Database, Apache Kafka, HBase, and object stores.
To enable Oracle Big Data SQL to query Hive data, you must first define an Oracle external table for your Hive data. There are a number of tools available to help you create the Oracle external table definition. These tools leverage the underlying Hive metadata, making it easier to create the tables.
Oracle Big Data SQL now supports erasure coding (EC) with CDH 6.1 (and above), HDP, CDP and Hive transactional tables.
The external table provides a level of abstraction. The underlying partitions or file type may change, but the Oracle external table definition can remain the same. It automatically picks up these changes when you query the table.
As part of an external table definition, you specify the table columns and their data types as well as a pointer to the source table in Hive. The rest of the metadata is derived at query execution time, including the data location, the file type and partitioning information.
- DBMS_HADOOP
  DBMS_HADOOP is a PL/SQL package that contains the CREATE_EXTDDL_FOR_HIVE procedure. This procedure generates the DDL to create an Oracle external table for a given Hive table. You can optionally edit the text of the generated DDL before execution in order to customize the external table properties. This procedure enables you to easily automate the definition of many tables at one time.
- The Big Data SQL wizard in Oracle SQL Developer
  The most recent versions of the free Oracle SQL Developer tool include a Big Data SQL wizard that guides you easily through the process of creating Oracle external table definitions.
If you have a configured Hive connection in Oracle SQL Developer, then in the Connections navigator, drill down from the connection entry point to a Hive table and do the following:
- Right-click on the table icon and select Use in Oracle Big Data SQL...
- When prompted, select an Oracle Database connection for the import of the Hive table.
- Select an Oracle Big Data SQL-enabled target database.
- In the Create Table dialog, check over the current configuration for columns, external table properties, and storage. Modify as needed. You can also preview the text of the DDL that will be generated.
- Click OK when you are satisfied with the table definition. The wizard will create the external table at the designated location.
- The Oracle SQL Developer Data Modeler
  This is a free graphical design tool that you can use to connect to a Hive metastore and generate an external table. You can select and import one or multiple Hive tables, modify table properties as needed, and then generate the DDL that you can copy into an SQL Worksheet and then run in order to create an Oracle external table. Although the Data Modeler is a more complex tool to use than the other options, its advantage is that you can use it to work on multiple Hive tables.
See Oracle SQL Developer & Data Modeler Support for Oracle Big Data SQL in the Oracle Blog space for a demonstration of how to use the Data Modeler.
See Also:
For instructions on how to install Oracle SQL Developer and connect to Hive in order to create external tables, see Use Oracle SQL Developer to Connect to Hive.
3.2.1 Obtain Information About a Hive Table
The DBMS_HADOOP PL/SQL package contains a function named CREATE_EXTDDL_FOR_HIVE. It returns the data definition language (DDL) to create an external table for accessing a Hive table. This function requires you to provide basic information about the Hive table:
- Name of the Hadoop cluster
- Name of the Hive database
- Name of the Hive table
- Whether the Hive table is partitioned
You can obtain this information by querying the ALL_HIVE_TABLES data dictionary view. It displays information about all Hive tables that you can access from Oracle Database.
This example shows that the current user has access to an unpartitioned Hive table named RATINGS_HIVE_TABLE in the default database. A user named JDOE is the owner.
SQL> SELECT cluster_id, database_name, owner, table_name, partitioned
     FROM all_hive_tables;

CLUSTER_ID   DATABASE_NAME  OWNER    TABLE_NAME          PARTITIONED
------------ -------------- -------- ------------------- --------------
hadoop1      default        jdoe     ratings_hive_table  UN-PARTITIONED
3.2.2 Use the CREATE_EXTDDL_FOR_HIVE Function
With the information from the data dictionary, you can use the CREATE_EXTDDL_FOR_HIVE function of DBMS_HADOOP. This example specifies a database table name of RATINGS_DB_TABLE in the current schema. The function returns the text of the CREATE TABLE command in a local variable named DDLout, but does not execute it.
DECLARE
   DDLout VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(
      CLUSTER_ID=>'hadoop1',
      DB_NAME=>'default',
      HIVE_TABLE_NAME=>'ratings_hive_table',
      HIVE_PARTITION=>FALSE,
      TABLE_NAME=>'ratings_db_table',
      PERFORM_DDL=>FALSE,
      TEXT_OF_DDL=>DDLout
   );
   dbms_output.put_line(DDLout);
END;
/
When this procedure runs, the PUT_LINE function displays the CREATE TABLE command:
CREATE TABLE ratings_db_table (
   c0 VARCHAR2(4000),
   c1 VARCHAR2(4000),
   c2 VARCHAR2(4000),
   c3 VARCHAR2(4000),
   c4 VARCHAR2(4000),
   c5 VARCHAR2(4000),
   c6 VARCHAR2(4000),
   c7 VARCHAR2(4000))
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS (
       com.oracle.bigdata.cluster=hadoop1
       com.oracle.bigdata.tablename=default.ratings_hive_table
    )
   )
PARALLEL 2 REJECT LIMIT UNLIMITED
You can capture this information in a SQL script, and use the access parameters to change the Oracle table name, the column names, and the data types as desired before executing it. You might also use access parameters to specify a date format mask.
The ALL_HIVE_COLUMNS view shows how the default column names and data types are derived. This example shows that the Hive column names are C0 to C7, and that the Hive STRING data type maps to VARCHAR2(4000):
SQL> SELECT table_name, column_name, hive_column_type, oracle_column_type
     FROM all_hive_columns;

TABLE_NAME            COLUMN_NAME  HIVE_COLUMN_TYPE  ORACLE_COLUMN_TYPE
--------------------- ------------ ----------------- -------------------
ratings_hive_table    c0           string            VARCHAR2(4000)
ratings_hive_table    c1           string            VARCHAR2(4000)
ratings_hive_table    c2           string            VARCHAR2(4000)
ratings_hive_table    c3           string            VARCHAR2(4000)
ratings_hive_table    c4           string            VARCHAR2(4000)
ratings_hive_table    c5           string            VARCHAR2(4000)
ratings_hive_table    c6           string            VARCHAR2(4000)
ratings_hive_table    c7           string            VARCHAR2(4000)

8 rows selected.
3.2.3 Use Oracle SQL Developer to Connect to Hive
Oracle SQL Developer provides methods to connect to a Hive metastore and create Oracle external tables over Hive.
Follow these steps to set up Oracle SQL Developer to work with Oracle Big Data SQL.
- Install Oracle SQL Developer
- Download the Hive JDBC Drivers
- Add the new Hive JDBC Drivers to Oracle SQL Developer
- Create a database connection to Hive.
Installing Oracle SQL Developer
Install Oracle SQL Developer 4.2 or greater. Starting with this version, support is included for Copy To Hadoop, a useful Oracle Big Data SQL tool for off-loading Oracle Database tables to HDFS.
The installation is simple. Just download the package and extract it.
- Go to the Oracle SQL Developer download site on the Oracle Technology Network (OTN).
- Accept the license agreement and download the version that is appropriate for your platform. For most users, Windows 64-bit with JDK 8 included is the correct choice.
- Extract the downloaded ZIP file to your local drive. You can extract to any folder name.
See Installing and Getting Started with SQL Developer in the Oracle SQL Developer User’s Guide for further installation and configuration details.
Downloading and Installing the Hive JDBC Drivers for Cloudera Enterprise
To connect Oracle SQL Developer to Hive in the Hadoop environment, you need to download and install the Hive JDBC drivers for Cloudera Enterprise. These drivers are not included in the Oracle SQL Developer download package.
Note for HDP Users:
At this time, SQL Developer 4.2 requires the Cloudera JDBC drivers for Hive. However, these drivers appear to work against Hortonworks clusters as well. HDP users should test to determine if these drivers meet their needs.
- Download the latest Cloudera JDBC drivers for Hive from the Cloudera website to any local directory. You can search for “cloudera hive jdbc drivers download” on the Cloudera website to locate the available driver packages. You are prompted to select the driver version, OS, and OS version (32/64 bit). At this time, the latest driver version is 2.5.18. You can choose the newest version available.
- Unzip the archive:
  unzip hive_jdbc_<version>.zip
- View the extracted content. Notice that under the top-level folder there are multiple ZIP files. Each is for a different JDBC version. For this setup, only JDBC 4.0 is usable. Select the JDBC4_ ZIP file (JDBC4_<version>.zip).
  Important:
  Choose only the JDBC4_ ZIP file, which contains the drivers for JDBC 4.0. This is the only compatible version. The drivers in other packages, such as JDBC41_*, are not compatible with SQL Developer 4.2 and will return errors upon connection attempts.
- Unzip the JDBC4 archive to a target directory that is accessible to Oracle SQL Developer, for example, /home/oracle/jdbc:
  # unzip Cloudera_HiveJDBC4_<version>.zip -d /home/oracle/jdbc/
The extracted content should be similar to this:
Cloudera_HiveJDBC4_2.5.18.1050\Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf
Cloudera_HiveJDBC4_2.5.18.1050\Cloudera-JDBC-Driver-for-Apache-Hive-Release-Notes.pdf
Cloudera_HiveJDBC4_2.5.18.1050\commons-codec-1.3.jar
Cloudera_HiveJDBC4_2.5.18.1050\commons-logging-1.1.1.jar
Cloudera_HiveJDBC4_2.5.18.1050\HiveJDBC4.jar
Cloudera_HiveJDBC4_2.5.18.1050\hive_metastore.jar
Cloudera_HiveJDBC4_2.5.18.1050\hive_service.jar
Cloudera_HiveJDBC4_2.5.18.1050\httpclient-4.1.3.jar
Cloudera_HiveJDBC4_2.5.18.1050\httpcore-4.1.3.jar
Cloudera_HiveJDBC4_2.5.18.1050\libfb303-0.9.0.jar
Cloudera_HiveJDBC4_2.5.18.1050\libthrift-0.9.0.jar
Cloudera_HiveJDBC4_2.5.18.1050\log4j-1.2.14.jar
Cloudera_HiveJDBC4_2.5.18.1050\out.txt
Cloudera_HiveJDBC4_2.5.18.1050\ql.jar
Cloudera_HiveJDBC4_2.5.18.1050\slf4j-api-1.5.11.jar
Cloudera_HiveJDBC4_2.5.18.1050\slf4j-log4j12-1.5.11.jar
Cloudera_HiveJDBC4_2.5.18.1050\TCLIServiceClient.jar
Cloudera_HiveJDBC4_2.5.18.1050\zookeeper-3.4.6.jar
Add the new Hive JDBC Drivers to Oracle SQL Developer
Next, start up SQL Developer and copy all of the extracted driver files into “Third Party JDBC Drivers” in the Preferences window.
- Start SQL Developer.
- In the SQL Developer menu bar, select Tools > Preferences.
- In the file explorer of the Preferences window, expand Database and then click Third Party JDBC Drivers.
- Click Add Entry.
- Navigate to the folder where you extracted the files from Cloudera_HiveJDBC4_<version>.zip. Copy all of the JAR files from the ZIP extraction into this window and then click OK.
- Restart Oracle SQL Developer.
Create a Database Connection to Hive
After the drivers are installed, you can create a connection to Hiveserver2.
If you are creating a Kerberos-secured connection, you will need a user ID, the Kerberos connection parameters, and the number of the port where Hiveserver2 is running on the Hadoop system (typically, port 10000). A keytab must exist for the user.
If you're not using Kerberos, you will need a user ID (the oracle user or a user with equivalent privileges), the account password, and the Hiveserver2 port number.
See Create/Edit/Select Database Connection in the Oracle SQL Developer User’s Guide for an explanation of the fields in the Oracle and Hive tabs in the New/Select Database Connection dialog.
3.2.4 Develop a CREATE TABLE Statement for ORACLE_HIVE
Whichever method you use to create an Oracle external table over Hive (DBMS_HADOOP, Oracle SQL Developer Data Modeler, the Big Data SQL wizard in Oracle SQL Developer, or manual coding), you may need to set some access parameters to modify the default behavior of ORACLE_HIVE.
Note:
Do not include the LOCATION clause with ORACLE_HIVE. It raises an error. The data is stored in Hive, and the access parameters and the metadata store provide the necessary information.
3.2.4.1 Use the Default ORACLE_HIVE Settings
The following statement creates an external table named ORDER to access Hive data:
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    description VARCHAR2(100),
                    order_total NUMBER (8,2))
   ORGANIZATION EXTERNAL (TYPE oracle_hive);
Because no access parameters are set in the statement, the ORACLE_HIVE access driver uses the default settings to do the following:
- Connects to the default Hadoop cluster.
- Uses a Hive table named order. An error results if the Hive table does not have fields named CUST_NUM, ORDER_NUM, DESCRIPTION, and ORDER_TOTAL.
- Sets the value of a field to NULL if there is a conversion error, such as a CUST_NUM value longer than 10 bytes.
3.2.4.2 Override the Default ORACLE_HIVE Settings
You can set properties in the ACCESS PARAMETERS clause of the external table clause to override the default behavior of the access driver. The following clause includes the com.oracle.bigdata.overflow access parameter. When this clause is used in the previous example, it truncates data in the DESCRIPTION column that is longer than 100 characters, instead of throwing an error:
(TYPE oracle_hive
 ACCESS PARAMETERS (
    com.oracle.bigdata.overflow={"action":"truncate", "col":"DESCRIPTION"}
 ))
The next example sets most of the available parameters for ORACLE_HIVE:
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
  (TYPE oracle_hive
   ACCESS PARAMETERS (
      com.oracle.bigdata.tablename:  order_db.order_summary
      com.oracle.bigdata.colmap:     {"col":"ITEM_CNT", \
                                      "field":"order_line_item_count"}
      com.oracle.bigdata.overflow:   {"action":"TRUNCATE", \
                                      "col":"DESCRIPTION"}
      com.oracle.bigdata.erroropt:   [{"action":"replace", \
                                       "value":"INVALID_NUM", \
                                       "col":["CUST_NUM","ORDER_NUM"]}, \
                                      {"action":"reject", \
                                       "col":"ORDER_TOTAL"}]
   )
);
The parameters make the following changes in the way that the ORACLE_HIVE access driver locates the data and handles error conditions:
- com.oracle.bigdata.tablename: Handles differences in table names. ORACLE_HIVE looks for a Hive table named ORDER_SUMMARY in the ORDER_DB database.
- com.oracle.bigdata.colmap: Handles differences in column names. The Hive ORDER_LINE_ITEM_COUNT field maps to the Oracle ITEM_CNT column.
- com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
- com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.
See Also:
The section CREATE TABLE ACCESS PARAMETERS Clause provides the complete list of access parameters for ORACLE_HIVE, ORACLE_HDFS, and ORACLE_BIGDATA.
3.2.5 Hive to Oracle Data Type Conversions
When the access driver loads data into an external table, it verifies that the Hive data can be converted to the data type of the target column. If they are incompatible, then the access driver returns an error. Otherwise, it makes the appropriate data conversion.
Hive typically provides a table abstraction layer over data stored elsewhere, such as in HDFS files. Hive uses a serializer/deserializer (SerDe) to convert the data as needed from its stored format into a Hive data type. The access driver then converts the data from its Hive data type to an Oracle data type. For example, if a Hive table over a text file has a BIGINT column, then the SerDe converts the data from text to BIGINT. The access driver then converts the data from BIGINT (a Hive data type) to NUMBER (an Oracle data type).
Performance is better when one data type conversion is performed instead of two. The data types for the fields in the HDFS files should therefore indicate the data that is actually stored on disk. For example, JSON is a clear text format, therefore all data in a JSON file is text. If the Hive type for a field is DATE, then the SerDe converts the data from string (in the data file) to a Hive date. Then the access driver converts the data from a Hive date to an Oracle date. However, if the Hive type for the field is string, then the SerDe does not perform a conversion, and the access driver converts the data from string to an Oracle date. Queries against the external table are faster in the second example, because the access driver performs the only data conversion.
The table below identifies the data type conversions that ORACLE_HIVE can make when loading data into an external table.
Table 3-1 Supported Hive to Oracle Data Type Conversions
Hive Data Type | VARCHAR2, CHAR, NCHAR2, NCHAR, CLOB | NUMBER, FLOAT, BINARY_NUMBER, BINARY_FLOAT | BLOB | RAW | DATE, TIMESTAMP, TIMESTAMP WITH TZ, TIMESTAMP WITH LOCAL TZ | INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND |
---|---|---|---|---|---|---|
INT SMALLINT TINYINT BIGINT | yes | yes | yes | yes | no | no |
DOUBLE FLOAT | yes | yes | yes | yes | no | no |
DECIMAL | yes | yes | no | no | no | no |
BOOLEAN | yes (Footnote 1) | yes (Footnote 2) | yes (Footnote 2) | yes | no | no |
BINARY | yes | no | yes | yes | no | no |
STRING | yes | yes | no | no | yes | yes |
TIMESTAMP | yes | no | no | no | yes | no |
STRUCT ARRAY UNIONTYPE MAP | yes | no | no | no | no | no |
Footnote 1
FALSE maps to the string FALSE, and TRUE maps to the string TRUE.
Footnote 2
FALSE maps to 0, and TRUE maps to 1.
3.3 Create an External Table for Oracle NoSQL Database
You can use the ORACLE_HIVE access driver to access data stored in Oracle NoSQL Database. However, you must first create a Hive external table that accesses the KVStore. Then you can create an external table in Oracle Database over it, similar to the process described in "Create an External Table for Hive Data".
Note:
Oracle Big Data SQL 4.1.1, by default, is configured for Oracle NoSQL Database 20.2, which supports CDH 6. Customers who install Oracle Big Data SQL 4.1.1 on CDH 5 clusters and require Oracle NoSQL Database integration need to configure Oracle NoSQL Database 18.x. See Oracle NoSQL Database Documentation.
See Also:
- Hive to Oracle Data Type Conversions provides details on Hive to Oracle Database data type mappings.
- Predicate pushdown in Oracle Big Data SQL requires that certain mappings between Hive data types and Oracle data types be present. See About Predicate Push Down.
- For further information on integrating Oracle Big Data SQL with Oracle NoSQL Database, see Part III - Integration with Big Data SQL in Oracle NoSQL Database Integrations Guide.
3.3.1 Create a Hive External Table for Oracle NoSQL Database
To provide access to the data in Oracle NoSQL Database, you create a Hive external table over the Oracle NoSQL table. Oracle Big Data SQL provides a StorageHandler named oracle.kv.hadoop.hive.table.TableStorageHandler that enables Hive to read the Oracle NoSQL Database table format.
The following is the basic syntax of a Hive CREATE TABLE statement for a Hive external table over an Oracle NoSQL table:
CREATE EXTERNAL TABLE tablename colname coltype[, colname coltype,...]
   STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
   TBLPROPERTIES (
      "oracle.kv.kvstore" = "database",
      "oracle.kv.hosts" = "nosql_node1:port[, nosql_node2:port...]",
      "oracle.kv.hadoop.hosts" = "hadoop_node1[,hadoop_node2...]",
      "oracle.kv.tableName" = "table_name");
Hive CREATE TABLE Parameters
- tablename
  The name of the Hive external table being created.
  This table name will be used in SQL queries issued in Oracle Database, so choose a name that is appropriate for users. The name of the external table that you create in Oracle Database must be identical to the name of this Hive table.
  Table, column, and field names are case insensitive in Oracle NoSQL Database, Apache Hive, and Oracle Database.
- colname coltype
  The names and data types of the columns in the Hive external table. See About Oracle NoSQL to Oracle Database Type Mappings for the data type mappings between Oracle NoSQL Database and Hive.
Hive CREATE TABLE TBLPROPERTIES Clause
- oracle.kv.kvstore
  The name of the KVStore. Only upper- and lowercase letters and digits are valid in the name.
- oracle.kv.hosts
  A comma-delimited list of host names and port numbers in the Oracle NoSQL Database cluster. Each string has the format hostname:port. Enter multiple names to provide redundancy in the event that a host fails.
- oracle.kv.hadoop.hosts
  A comma-delimited list of all host names in the Hadoop cluster with Oracle Big Data SQL enabled.
- oracle.kv.tableName
  The name of the table in Oracle NoSQL Database that stores the data for this Hive external table.
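Putting these properties together, here is a hedged sketch of such a Hive external table. The KVStore name, host names, ports, and the Oracle NoSQL table and columns are assumptions for illustration only.

CREATE EXTERNAL TABLE mykv_movies (id INT, title STRING, rating DOUBLE)
   STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
   TBLPROPERTIES (
      "oracle.kv.kvstore" = "kvstore",
      "oracle.kv.hosts" = "nosql01:5000,nosql02:5000",
      "oracle.kv.hadoop.hosts" = "hadoop01,hadoop02,hadoop03",
      "oracle.kv.tableName" = "movies");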
See Also:
- Integration with Apache Hive in Oracle NoSQL Database Integrations Guide
- Apache Hive Language Manual DDL
3.3.2 Create the Oracle Database Table for Oracle NoSQL Data
Use the following syntax to create an external table in Oracle Database that can access the Oracle NoSQL data through a Hive external table:
CREATE TABLE tablename(colname colType[,colname colType...])
  ORGANIZATION EXTERNAL
    (TYPE ORACLE_HIVE DEFAULT DIRECTORY directory
     ACCESS PARAMETERS
         (access parameters)
    )
    REJECT LIMIT UNLIMITED;
In this syntax, you identify the column names and data types. For more about this syntax, see "About Creating External Tables".
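Continuing the illustration above, a hedged sketch of the matching Oracle Database external table might look like the following. The cluster name hadoop1 and the Hive table default.mykv_movies are assumptions; the Oracle table name matches the Hive table name, as required.

CREATE TABLE mykv_movies (id NUMBER, title VARCHAR2(200), rating NUMBER)
  ORGANIZATION EXTERNAL
    (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
     ACCESS PARAMETERS
        (
          com.oracle.bigdata.cluster=hadoop1
          com.oracle.bigdata.tablename=default.mykv_movies
        )
    )
    REJECT LIMIT UNLIMITED;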
3.3.3 About Oracle NoSQL to Oracle Database Type Mappings
When Oracle Big Data SQL retrieves data from Oracle NoSQL Database, the data is converted twice:
- First, to a Hive data type when the data is read into the columns of the Hive external table.
- Second, to an Oracle data type when the data is read into the columns of an Oracle Database external table.
In order to execute a Big Data SQL query against data stored in an Oracle NoSQL Database table, a Hive external table must first be created with a schema mapped from the schema of the desired Oracle NoSQL Database table. See Mapping RDBMS Data Model in Oracle NoSQL Database Integrations Guide for the data types Oracle NoSQL Database supports and their mappings to Hive.
Note:
To complete this mapping a corresponding Oracle Database external table must be created with a schema mapped from the schema of the Hive table.
Also note that some Hive data types are not applicable to the mapping of Oracle NoSQL data types to Oracle Database data types. See Mapping RDBMS Data Model in Oracle NoSQL Database Integrations Guide.
See Also:
- Hive to Oracle Data Type Conversions provides details on Hive to Oracle Database data type mappings.
- Predicate pushdown in Oracle Big Data SQL requires that certain mappings between Hive data types and Oracle data types be present. See About Predicate Push Down.
3.3.4 Example of Accessing Data in Oracle NoSQL Database
For additional discussion of how Oracle NoSQL Database interacts with Oracle Big Data SQL, as well as for detailed example programs and utility classes, see Part III - Integration with Oracle Big Data SQL in Oracle NoSQL Database Integrations Guide.
3.4 Create an Oracle External Table for Apache HBase
You can also use the ORACLE_HIVE access driver to access data stored in Apache HBase. However, you must first create a Hive external table that accesses the HBase table. Then you can create an external table in Oracle Database over it. The basic steps are the same as those described in "Create an External Table for Oracle NoSQL Database".
3.4.1 Creating a Hive External Table for HBase
To provide access to the data in an HBase table, you create a Hive external table over it. Apache provides a storage handler and a SerDe that enable Hive to read the HBase table format.
The following is the basic syntax of a Hive CREATE TABLE statement for an external table over an HBase table:
CREATE EXTERNAL TABLE tablename colname coltype[, colname coltype,...]
   ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
   WITH SERDEPROPERTIES (
      'serialization.format'='1',
      'hbase.columns.mapping'=':key,<column_family>:<qualifier>[,...]');
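As a hedged illustration only, the template might be filled in as follows. The HBase table name, the info column family, and the columns are assumptions.

CREATE EXTERNAL TABLE hbase_customers (rowkey STRING, name STRING, city STRING)
   ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
   WITH SERDEPROPERTIES (
      'serialization.format'='1',
      'hbase.columns.mapping'=':key,info:name,info:city')
   TBLPROPERTIES ('hbase.table.name'='customers');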
See Also:
- Apache Hive Language Manual DDL
- Hive HBase Integration at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-StorageHandlers
- Class HBaseSerDe in the Apache Hive Javadocs at https://hive.apache.org/javadoc.html
3.4.2 Creating the Oracle Database Table for HBase
Use the following syntax to create an external table in Oracle Database that can access the HBase data through a Hive external table:
CREATE TABLE tablename(colname colType[,colname colType...])
  ORGANIZATION EXTERNAL
    (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
     ACCESS PARAMETERS
         (access parameters)
    )
    REJECT LIMIT UNLIMITED;
In this syntax, you identify the column names and data types. To specify the access parameters, see "About Creating External Tables".
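For illustration, an Oracle external table over the hypothetical hbase_customers Hive table from the previous topic might be defined as follows; the cluster name is an assumption.

CREATE TABLE hbase_customers (rowkey VARCHAR2(100), name VARCHAR2(200), city VARCHAR2(200))
  ORGANIZATION EXTERNAL
    (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
     ACCESS PARAMETERS
        (
          com.oracle.bigdata.cluster=hadoop1
          com.oracle.bigdata.tablename=default.hbase_customers
        )
    )
    REJECT LIMIT UNLIMITED;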
3.5 Create an Oracle External Table for HDFS Files
The ORACLE_HDFS access driver enables you to access many types of data that are stored in HDFS, but which do not have Hive metadata. You can define the record format of text data, or you can specify a SerDe for a particular data format.
You must create the external table for HDFS files manually, and provide all the information the access driver needs to locate the data and parse the records and fields. The following are some examples of CREATE TABLE ORGANIZATION EXTERNAL statements.
Oracle Big Data SQL now supports erasure coding (EC) with CDH 6.1 (and above), and HDP.
3.5.1 Use the Default Access Parameters with ORACLE_HDFS
The following statement creates a table named ORDER to access the data in all files stored in the /usr/cust/summary directory in HDFS:
CREATE TABLE ORDER (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_total NUMBER (8,2))
   ORGANIZATION EXTERNAL
   (TYPE oracle_hdfs
    DEFAULT DIRECTORY DEFAULT_DIR
    LOCATION ('hdfs:/usr/cust/summary/*')
   );
Because no access parameters are set in the statement, the ORACLE_HDFS access driver uses the default settings to do the following:
- Connects to the default Hadoop cluster.
- Reads the files as delimited text, and the fields as type STRING.
- Assumes that the number of fields in the HDFS files matches the number of columns (three in this example).
- Assumes the fields are in the same order as the columns, so that CUST_NUM data is in the first field, ORDER_NUM data is in the second field, and ORDER_TOTAL data is in the third field.
- Rejects any records in which the value causes a data conversion error: if the value for CUST_NUM exceeds 10 characters, the value for ORDER_NUM exceeds 20 characters, or the value of ORDER_TOTAL cannot be converted to NUMBER.
See Also:
The section CREATE TABLE ACCESS PARAMETERS Clause provides the complete list of access parameters for ORACLE_HIVE, ORACLE_HDFS, and ORACLE_BIGDATA.
3.5.2 ORACLE_HDFS LOCATION Clause
The LOCATION clause for ORACLE_HDFS contains a comma-separated list of file locations. The files must reside in the HDFS file system on the default cluster.
A location can be any of the following:
- A fully qualified HDFS directory name, such as /user/hive/warehouse/hive_seed/hive_types. ORACLE_HDFS uses all files in the directory.
- A fully qualified HDFS file name, such as /user/hive/warehouse/hive_seed/hive_types/hive_types.csv
- A URL for an HDFS file or a set of files, such as hdfs:/user/hive/warehouse/hive_seed/hive_types/*. It is invalid to use the directory name alone.
The file names can contain any pattern-matching character described in Table 3-2.
Table 3-2 Pattern-Matching Characters
Character | Description |
---|---|
? | Matches any one character |
* | Matches zero or more characters |
[abc] | Matches one character in the set {a, b, c} |
[a-b] | Matches one character in the range {a...b}. The character must be less than or equal to b. |
[^a] | Matches one character that is not in the character set or range {a}. The carat (^) must immediately follow the left bracket, with no spaces. |
\c | Removes any special meaning of c. The backslash is the escape character. |
{ab\,cd} | Matches a string from the set {ab, cd}. The escape character (\) removes the meaning of the comma as a path separator. |
{ab\,c{de\,fh}} | Matches a string from the set {ab, cde, cfh}. The escape character (\) removes the meaning of the comma as a path separator. |
3.5.3 Override the Default ORACLE_HDFS Settings
You can use many of the same access parameters with ORACLE_HDFS as with ORACLE_HIVE.
3.5.3.1 Access a Delimited Text File
The following example is equivalent to the one shown in "Override the Default ORACLE_HIVE Settings". The external table accesses a delimited text file stored in HDFS.
CREATE TABLE taxis (
   dispatching_base_num varchar2(100),
   pickup_date varchar2(100),
   location_id varchar2(100)
)
ORGANIZATION EXTERNAL (
   TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS (
      com.oracle.bigdata.fileformat=TEXTFILE
      com.oracle.bigdata.rowformat=DELIMITED FIELDS TERMINATED BY ','
   )
   LOCATION ('/data/taxi-trips/')
)
REJECT LIMIT UNLIMITED;
Note that there is no colmap field, since this source contains no metadata that describes columns. The only access parameters needed in this case are fileformat and rowformat.
Tip:
Instead of a colon (:) as the delimiter between fields and values, it is better to use the equal sign (=). This is because certain tools prompt you for a parameter value when they encounter the colon delimiter.
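Once created, the table can be queried like any other Oracle table. A minimal sketch against the taxis table defined above follows; the date value is an assumption about the data.

SELECT dispatching_base_num, COUNT(*) AS trip_count
FROM   taxis
WHERE  pickup_date LIKE '2015-01%'
GROUP  BY dispatching_base_num;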
3.5.3.2 Access Avro Container Files
The next example uses a SerDe to access Avro container files.
CREATE TABLE order (cust_num    VARCHAR2(10),
                    order_num   VARCHAR2(20),
                    order_date  DATE,
                    item_cnt    NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
   ORGANIZATION EXTERNAL
   (
    TYPE oracle_hdfs
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS (
       com.oracle.bigdata.rowformat: \
          SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
       com.oracle.bigdata.fileformat: \
          INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
       com.oracle.bigdata.colmap: {"col":"item_cnt", \
                                   "field":"order_line_item_count"}
       com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                     "col":"DESCRIPTION"}
    )
    LOCATION ('hdfs:/usr/cust/summary/*')
   );
The access parameters provide the following information to the ORACLE_HDFS access driver:
- com.oracle.bigdata.rowformat: Identifies the SerDe that the access driver needs to use to parse the records and fields.
- com.oracle.bigdata.fileformat: Identifies the Java classes that can extract records and output them in the desired format.
- com.oracle.bigdata.colmap: Handles differences in column names. ORACLE_HDFS matches ORDER_LINE_ITEM_COUNT in the HDFS files with the ITEM_CNT column in the external table.
- com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.
3.5.3.3 Access JSON Data
Oracle Big Data SQL uses functionality built into Oracle SQL to parse data in JSON format.
Oracle SQL can parse JSON data accessible in columns (which may be external data or data stored inside the database).
For example, here is a JSON file called station_information.json, stored in HDFS.
{
"station_id":"72","name":"W 52 St & 11 Ave","short_name":"6926.01",
"lat":40.76727216,"lon":-73.99392888,"region_id":71,"rental_methods":["CREDITCARD","KEY"],
"capacity":39,"rental_url":"http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=72",
"eightd_has_key_dispenser":false}{"station_id":"79","name":"Franklin St & W Broadway",
"short_name":"5430.08","lat":40.71911552,"lon":-74.00666661,"region_id":71,
"rental_methods":["CREDITCARD","KEY"],
"capacity":33,"rental_url":"http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=79",
"eightd_has_key_dispenser":false}{"station_id":"82",
"name":"St James Pl & Pearl St","short_name":"5167.06","lat":40.71117416,
"lon":-74.00016545,"region_id":71,"rental_methods":["CREDITCARD","KEY"],
"capacity":27,"rental_url":"http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=82",
"eightd_has_key_dispenser":false
}
To query this data, do the following.
- First create an HDFS external table over this data. Add a single text column of type VARCHAR2. (For large JSON objects, you can use CLOB instead.)
CREATE TABLE bikes.stations_ext (
   doc varchar2(4000)
)
ORGANIZATION EXTERNAL (
   TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   LOCATION ('/data/bike-stations')
)
REJECT LIMIT UNLIMITED;
- Then query the external table. Note the use of Oracle SQL's JSON_VALUE function.

select s.doc.station_id,
       s.doc.name,
       s.doc.lon as longitude,
       s.doc.lat as latitude,
       s.doc.capacity,
       s.doc.eightd_has_key_dispenser,
       s.doc.rental_methods,
       json_value(doc, '$.rental_methods[0]'),
       json_value(doc, '$.rental_methods[1]')
from stations_ext s;
3.6 Create an Oracle External Table for Kafka Topics
Oracle Big Data SQL has two options for working with Kafka data: using the Hive storage handler for Kafka, and using Oracle SQL access to Kafka.
- Using the Hive storage handler for Kafka
  The ORACLE_HIVE access driver can access Kafka topics using the Hive storage handler that is part of the Oracle Big Data SQL install. You first create a Hive external table that accesses the Kafka topics, and then create an Oracle Big Data SQL table over it. See Use Oracle's Hive Storage Handler for Kafka to Create a Hive External Table for Kafka Topics.
- Using Oracle SQL Access to Kafka
  Oracle SQL Access to Kafka enables Oracle SQL to access Kafka topics, without using Hive or Hadoop. The ORA_KAFKA PL/SQL package contains functions and procedures to enable this. You register a Kafka cluster in your database schema using ORA_KAFKA.REGISTER_CLUSTER, and then create views using the ORA_KAFKA.CREATE_VIEWS procedure. This procedure creates external tables under the covers; you do not have to explicitly create external tables. See Oracle SQL Access to Kafka.
Comparison between the above two options:
Hive Storage Handler for Kafka | Oracle SQL Access to Kafka |
---|---|
Use Big Data SQL syntax | Use ORA_KAFKA functions and procedures. |
Requires Hive and Hadoop | Does not require Hive and Hadoop; requires a Java process to run on the database system. |
Uses predicate pushdown when answering queries: Best option for queries that require predicate pushdown. For example, queries that are searching through large volumes of data in Kafka. Predicate pushdown is supported for conjunctive queries (a AND b AND c) defined on the "partitionid", "topic", "offset" and "timestamp" columns. | Uses Kafka offset and Kafka timestamp management to read messages by offset or timestamp: Best option for queries looking for the most recent Kafka messages (for example, messages from 5 seconds ago until current time), or a sequence of messages starting from a specific offset or timestamp (for example, 1000 messages starting from timestamp t or offset o). |
Supports messages in Avro (schema-less or schema-full) formats | Supports messages in CSV and JSON formats |
Enables Hive query of Kafka topics | Enables multiple database views to query Kafka, each application can have its own set of views and offset management so that each application can read a different part of the Kafka stream. |
3.6.1 Use Oracle's Hive Storage Handler for Kafka to Create a Hive External Table for Kafka Topics
The Hive storage handler for Kafka enables Hive and Oracle Big Data SQL to query Kafka topics.
To provide access to Kafka data, you create a Hive external table over the Kafka topics. The Oracle Big Data SQL storage handler that enables Hive to read the Kafka data format is oracle.hadoop.kafka.hive.KafkaStorageHandler
.
You can use this storage handler to create external Hive tables backed by data residing in Kafka. Big Data SQL can then query the Kafka data through the external Hive tables.
The Hive DDL is demonstrated by the following example, where topic1 and topic2 are two topics in Kafka broker whose keys are serialized by Kafka's String serializer and whose values are serialized by Kafka's Long serializer.
CREATE EXTERNAL TABLE test_table
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties('oracle.kafka.table.key.type'='string',
'oracle.kafka.table.value.type'='long',
'oracle.kafka.bootstrap.servers'='nshgc0602:9092',
'oracle.kafka.table.topics'='topic1,topic2');
The example below shows the resulting Hive table. The Kafka key, value, offset, topic name, and partitionid are mapped to Hive columns. You can explicitly designate the offset for each topic/partition pair through a WHERE clause in your Hive query.
hive> describe test_table;
OK
topic string from deserializer
partitionid int from deserializer
key string from deserializer
value bigInt from deserializer
offset bigint from deserializer
timestamptype smallInt from deserializer
timestamp timestamp from deserializer
Time taken: 0.084 seconds, Fetched: 7 row(s)
hive> SELECT * \
FROM test_table \
WHERE (topic="topic1" and partitionid=0 and offset > 199) \
OR (topic="topic1" and partitionid=1 and offset > 198) \
OR (topic="topic2" and partitionid=0 and offset > 177) \
OR (topic="topic2" and partitionid=1 and offset > 176);
You need to keep track of the offsets for all topics and partitions. For example, you can use an Oracle table to store these offsets. A more convenient way to keep track of new data is using the timestamp column. You can query data after a specific time point using the following query:
hive> SELECT * FROM test_table WHERE timestamp > '2017-07-12 11:30:00';
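For example, a hedged sketch of such an offset-tracking table on the Oracle Database side follows; the table and column names are assumptions, not part of the product.

-- Hypothetical bookkeeping table for the last offset read per topic/partition
CREATE TABLE kafka_offsets (
   topic       VARCHAR2(249),
   partitionid NUMBER,
   last_offset NUMBER,
   CONSTRAINT kafka_offsets_pk PRIMARY KEY (topic, partitionid)
);

-- After each load, record the highest offset that was processed
MERGE INTO kafka_offsets t
USING (SELECT 'topic1' topic, 0 partitionid, 199 last_offset FROM dual) s
ON (t.topic = s.topic AND t.partitionid = s.partitionid)
WHEN MATCHED THEN UPDATE SET t.last_offset = s.last_offset
WHEN NOT MATCHED THEN INSERT (topic, partitionid, last_offset)
                      VALUES (s.topic, s.partitionid, s.last_offset);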
See the Property Reference section below for descriptions of all table properties.
Property Reference
Table 3-3 Table Properties of Hive Storage Handler for Kafka
Property Name | Requirement | Description |
---|---|---|
oracle.kafka.table.topics | Required | A comma-separated list of Kafka topics. Each Kafka topic name must consist of only letters (uppercase and lowercase), numbers, .(dot), _(underscore), and -(minus). The maximum length for each topic name is 249. These topics must have the same serialization mechanisms. The resulting Hive table consists of records from all the topics listed here. A Hive column “topic” will be added and it will be set to the topic name for each record. |
oracle.kafka.bootstrap.servers | Required | This property will be translated to the “bootstrap.servers” property for the underlying Kafka consumer. The consumer makes use of all servers, irrespective of which servers are specified here for bootstrapping. This list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... |
oracle.kafka.table.key.type | Optional | The key type for your record. If unset, then the key part of the Kafka record will be ignored in the Hive row. Only values of “string”, “integer”, “long”, “double”, “avro”, “avro_confluent” are supported. “string”, “integer”, “double” and “long” correspond to the built-in primitive serialization types supported by Kafka. If this property is one of these primitive types, then the Kafka key for each record will be mapped to one single Hive column. If this property is set to “avro” or “avro_confluent”, then oracle.kafka.table.key.schema is required. The difference between “avro” and “avro_confluent” is that the wire format for the serialization is slightly different. For “avro”, the entire bytes array of the key consists of the bytes of avro serialization. For “avro_confluent”, the bytes array consists of a magic byte, a version number, then the bytes of avro serialization of the key. |
oracle.kafka.table.value.type | Optional | The value type of your record. If unset, then the value part of the Kafka record will be ignored in the Hive row. Use of this property is similar to use of oracle.kafka.table.key.type. |
oracle.kafka.table.key.writer.schema | Optional | An optional writer schema for the Kafka key’s Avro serialization. It’s required when the reader schema for the key is different from the schema in which the keys are written to Kafka brokers. It must be the exact schema in which Kafka keys are serialized. |
oracle.kafka.table.key.schema | Required when “oracle.kafka.table.key.type” is “avro” or “avro_confluent” | The JSON string for the Kafka key's Avro reader schema. It doesn't need to be exactly the same as the Kafka key's writer Avro schema. As long as the reader schema is compatible with the Kafka key or the converted object from the converter, it is valid. This enables you to rename Hive columns and choose what fields to keep from the Kafka key in the Hive row. If the schema in this property is different from the schema in which the Kafka keys are serialized, then oracle.kafka.table.key.writer.schema is required. |
oracle.kafka.table.value.writer.schema | Optional | An optional writer schema for the Kafka value’s Avro serialization. Its use is similar to oracle.kafka.table.key.writer.schema. |
oracle.kafka.table.value.schema | Required when “oracle.kafka.table.value.type” is “avro” or “avro_confluent” | The JSON string for the Kafka value's Avro reader schema. Its use is similar to oracle.kafka.table.key.schema. |
oracle.kafka.table.extra.columns | Optional, default to “true” | A boolean flag to control whether to include the extra Kafka columns in the Hive table. |
oracle.kafka.chop.partition | Optional, default to false | A Boolean flag to control whether to chop Kafka partitions into smaller chunks. This is useful when the number of Kafka partitions is small and the size of each Kafka partition is large. |
oracle.kafka.partition.chunk.size | Optional | When oracle.kafka.chop.partition is true, this property controls the number of Kafka records in each partition chunk. It should be set to a value estimated by (ideal size of a split)/(average size of a Kafka record). For example, if the ideal size of a split is 256 MB and the average size of a Kafka record is 256 bytes, then this property should be set to 1000000. |
3.6.2 Create an Oracle Big Data SQL Table for Kafka Topics
Big Data SQL can use the ORACLE_HIVE access driver to query the Kafka source that is described using Hive metadata.
After you create a Hive table over Kafka data by using the Hive storage handler for Kafka, there are no special procedures for generating a Big Data SQL table from the resulting Hive table. The ORACLE_HIVE settings for Kafka sources are identical to other Hive sources. Below is an example that shows how to create the Oracle external table. Once created, you can query it as you would other Oracle tables.
CREATE TABLE test_table(
topic varchar2(50),
partitionid integer,
key varchar2(50),
value integer,
offset integer,
timestamptype integer,
timestamp timestamp
)
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.cluster=hadoop1
com.oracle.bigdata.tablename=default.test_table
)
) PARALLEL 2 REJECT LIMIT UNLIMITED
Some Common Questions
-
Is Oracle Big Data SQL access to Kafka Brokers parallelized? For example, if I have six nodes running Oracle Big Data SQL, will all six nodes participate in a single query to Kafka so that we have to create a consumer group across all nodes to read the Kafka topic? Or, will only one node be used for a single SELECT statement to query from Kafka?
Like any Big Data SQL query, a query to Kafka will engage all of the nodes where Oracle Big Data SQL is installed.
-
In a Kafka query, how can we accommodate new incoming data? Can I have a query that waits (for a specified timeout) for new data to come into Kafka?
To pick up new data, you can run the Oracle Big Data SQL query periodically and filter by offset and timestamp to only retrieve the new rows (rows since the last read).
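For instance, a hedged sketch of such an incremental query against the test_table external table defined in the previous topic; the timestamp literal is a placeholder.

SELECT topic, partitionid, offset, key, value
FROM   test_table
WHERE  timestamp > TO_TIMESTAMP('2017-07-12 11:30:00', 'YYYY-MM-DD HH24:MI:SS')
ORDER  BY topic, partitionid, offset;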
See Also:
The following section of the Big Data SQL Quick Start blog provides more information on accessing Kafka through Oracle Big Data SQL – Big Data SQL Quick Start. Big Data SQL over Kafka – Part 23
3.7 Create an Oracle External Table for Object Store Access
The ORACLE_BIGDATA access driver enables you to create an external table over data stored in object stores. Oracle Big Data SQL currently supports access to Oracle Object Store, Amazon S3, and Azure Blob Storage.
ORACLE_BIGDATA is primarily intended to support queries against object stores. It uses Smart Scan and Oracle Big Data SQL cells for scale out and performance against these stores. You can also use this driver to query local data, which is useful for testing and smaller data sets.
The ORACLE_BIGDATA driver is similar to the ORACLE_HDFS driver in that you create tables over the raw files. It does not use a metadata store like Hive. You specify the metadata as part of the table definition.
However, unlike ORACLE_HDFS, ORACLE_BIGDATA does not use Java drivers and the standard Hadoop mechanisms for data access (SerDes, InputFormats, etc.). ORACLE_BIGDATA uses optimized C-drivers for all data access. It supports the text, Avro, ORC and Parquet file types. The text file type support is robust. For example, you can specify parameters for delimited text. You can also utilize Oracle's extensive JSON processing capabilities.
Steps for Accessing Data in Object Store
- Create a credential object
A credential object stores object store credentials in an encrypted format. The identity specified by the credential must have access to the underlying data in the object store.
- Create an Oracle Big Data SQL external table whose type is ORACLE_BIGDATA. The create table statement must reference the credential object, which provides authentication against the object store. It also requires a LOCATION clause, which provides the URI to the files within the object store.
Creating the Credential Object
execute dbms_credential.create_credential(
credential_name => '<my_credential>',
username => '<username>',
password => '<password>'
);
execute dbms_credential.create_credential(
credential_name => '<my_credential>',
username => '<user_ocid>',
password => '',
key => '{"tenancy_ocid":"<tenancy_ocid>","private_key":"<private_key>","fingerprint":"<fingerprint>"}');
Then specify the credential object name in the com.oracle.bigdata.credential.name parameter.
See Also:
The Oracle Database PL/SQL Packages and Types Reference describes how to use DBMS_CREDENTIAL.CREATE_CREDENTIAL.
Defining the LOCATION Clause
- For native Oracle Cloud Infrastructure Object Storage, the URI format is:
  location ('https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/<bucket>/o/<objectname>')
- For Oracle Cloud Infrastructure Object Storage, the URI format for files is:
  location ('https://swiftobjectstorage.<region>.oraclecloud.com/v1/<namespace>/<bucket>/<filename>')
- For an Amazon S3 location, see https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html#access-bucket-intro.
The hosted-style URI format is as follows:
location ('https://<bucket>.<host>/<objectname>')
The path-style URI format is as follows:
  location ('https://<host>/<bucket>/<objectname>')
A possible path-style example is:
  location ('https://s3-us-west-2.amazonaws.com/adwc/<filename>')
- For an Azure Blob Storage location, the URI format is:
  location ('https://<host>:<port>/<container>/<blob>')
A possible example is:
  location ('https://myaccount.blob.core.windows.net/mycontainer/myblob')
The credential object is required for object store access only. If the credential parameter is omitted, then the object must be in a public bucket. The user id associated with this credential must have access to read the data from object storage.
If you are testing access for data in object storage using local storage, you need to specify an Oracle directory object in the location - similar to what you do for ORACLE_LOADER data sources.
Setting Access Parameters
Specify the file format in the com.oracle.bigdata.fileformat access parameter:
com.oracle.bigdata.fileformat={textfile|avro|parquet|orc}
For delimited text files, the rowformat parameter is also required.
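For instance, the ACCESS PARAMETERS clause for delimited text in an object store might look like the following sketch; MY_CRED is an assumed credential object name, and the field terminator parameter follows the local-file example later in this chapter.

ACCESS PARAMETERS (
   com.oracle.bigdata.credential.name=MY_CRED
   com.oracle.bigdata.fileformat=textfile
   com.oracle.bigdata.csv.rowformat.fields.terminator = ','
)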
See Also:
ORACLE_BIGDATA Access Parameters in this guide provides tables that list common access parameters as well as those specific to each file type.
Note:
You can use ORACLE_BIGDATA to access local files for testing purposes or simple querying. In this case, the LOCATION field value is the same as what you would use for ORACLE_LOADER. You can use an Oracle directory object followed by the name of the file in the LOCATION field. For local files, a credential object is not required. However, you must have privileges on the directory object in order to access the file.
3.7.1 Create Table Example for Object Store Access
This section describes how to build a CREATE TABLE statement for object store access.
As shown in the example below, the required components are as follows:
- The schema for the external table.
- The credential object.
- The fileformat parameter and any access parameters that are particular to the file format. For example, delimited text files require the rowformat parameter. Parquet and Avro require only the fileformat parameter. Note that the default file format is Parquet.
- The correct LOCATION clause syntax for the particular object store.
- Use of a DEFAULT DIRECTORY clause with a LOCATION that is local rather than in an object store.
See Also:
ORACLE_BIGDATA Access Parameters for the full set of available parameters for each file type.
Example: Accessing a File in an Object Store
Note:
Remember that for object store access you must first use DBMS_CREDENTIAL.CREATE_CREDENTIAL in the DBMS_CREDENTIAL PL/SQL package to create the credential object.
exec dbms_credential.create_credential(
credential_name => '<my_credential_object_name>',
username => '<username>',
password => '<password>'
);
Then, within the ACCESS PARAMETERS clause of the statement, assign the name of the credential object to com.oracle.bigdata.credential.name as shown in the statement. A Parquet file in an object store is the target in this example.
CREATE TABLE tkexbaseballtab
(date1 date,
date2 timestamp,
name varchar2(30),
nationality varchar2(20),
age number,
team varchar2(20),
active char(1),
average float,
payroll char(1),
exid VARCHAR2(20))
ORGANIZATION EXTERNAL
(TYPE ORACLE_BIGDATA
ACCESS PARAMETERS
(
com.oracle.bigdata.debug=TRUE
com.oracle.bigdata.credential.name=MY_CRED
com.oracle.bigdata.fileformat=parquet
)
location ('https://<domain>.com/BIGDATA_PARQUET/<filename>.parquet')
) REJECT LIMIT UNLIMITED;
3.7.2 Access a Local File through an Oracle Directory Object
You can also use the ORACLE_BIGDATA driver to create an external table over a local file.
The statement below creates an external table over a text file in a local directory. For a JSON file, set the file format to textfile. No other access parameters are needed. You do not need to supply a credential when accessing local files. Do the same for a delimited text file, but if the field terminator and/or line terminator used in the file are other than the default values, define those as well. In this example, we set the field terminator to the vertical bar (‘|’). The statement names the directory object where the file is located – DEF_DIR1 – both in the DEFAULT DIRECTORY clause and in the LOCATION clause.
CREATE TABLE b19724081
( CONTRACT_STATUS_CODE VARCHAR2(1),
CONTRACT_STATUS_DESCRIPTION VARCHAR2(200),
POSTING_AGENT VARCHAR2(50),
DATA_ORIGIN VARCHAR2(50),
WAREHOUSE_POSTING_TIMESTAMP DATE,
WAREHOUSE_UPDATE_TIMESTAMP DATE
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY "DEF_DIR1"
ACCESS PARAMETERS
(
com.oracle.bigdata.fileformat=textfile
com.oracle.bigdata.csv.rowformat.fields.terminator = '|'
)
location ("DEF_DIR1":"<'filename>.csv')
)
REJECT LIMIT UNLIMITED;
3.7.3 Conversions to Oracle Data Types
Oracle Big Data SQL supports conversion from Parquet, ORC, and Avro file data types to Oracle data types, but for scalar fields only. Non-scalar data types are converted to VARCHAR2(4000).
Complex types, such as arrays, structs, and so on, map to VARCHAR2 data types and are represented by a JSON string in a VARCHAR2 field.
Note:
If data for a column encounters a conversion error, for example, the target column is not large enough to hold the converted value, the value for the column will be set to NULL.
Table 3-4 ORACLE_BIGDATA Mappings From Parquet, ORC and Avro to Oracle Data Types
Type Description | Parquet | ORC | Avro | Supported Conversions to Oracle Data Types |
---|---|---|---|---|
decimal: arbitrary-precision signed decimal number of the form unscaled × 10^(-scale) | decimal (int32, int64, fixed_len_byte_array, byte_array) (Footnote 1) | decimal (int64, int128) (Footnote 2) | decimal (bytes, fixed) (Footnote 3) | number(p), number(p,s) |
UTF8: UTF-8 encoded character string | string/utf8 | char/string/varchar | string | varchar2 |
byte array/binary | byte_array (Footnote 4) | binary | bytes | blob |
byte array with fixed length | fixed_len_byte_array | - | fixed | blob |
UUID | UUID (fixed_len_byte_array) | - | - | blob |
Json | JSON (byte_array) | - | - | varchar2 |
Bson | Bson (byte_array) | - | - | - |
date: number of days from the Unix epoch, 1 January 1970 | date (int32) | date | date (int) | date |
time (millis): number of milliseconds after midnight, 00:00:00.000 | time-millis (int32) | - | time-millis (int) | timestamp(3) |
time (micros): number of microseconds after midnight, 00:00:00.000000 | time-micros (int64) | - | time-micros (long) | timestamp(6) |
time (nanos): number of nanoseconds after midnight, 00:00:00.000000 | time-nanos (int64) | - | - | timestamp(9) |
time (millis) UTC | time-millis-UTC (int32) | - | - | timestamp(3) |
time (micros) UTC | time-micros-UTC (int64) | - | - | timestamp(6) |
time (nanos) UTC | time-nanos-UTC (int64) | - | - | timestamp(9) |
timestamp (millis) | timestamp-millis (int64) | - | - | timestamp(3) |
timestamp (micros) | timestamp-micros (int64) | - | - | timestamp(6) |
timestamp (nanos) | timestamp-nanos (int64) | timestamp | - | timestamp(9) |
timestamp (millis) UTC: number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC | timestampmillis-UTC (int64) | - | timestampmillis (long) | timestamp(3) |
timestamp (micros) UTC: number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC | timestampmicros-UTC (int64) | - | timestampmicros (long) | timestamp(6) |
timestamp (nanos) UTC: number of nanoseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC | timestampnanos-UTC (int64) | - | - | timestamp(9) |
duration: number of months, days and milliseconds | interval (fixed, size: 12) | - | duration(fixed,size:12) | - |
8-bit signed integer | int_8 (int32) | - | - | number(3), tinyint |
16-bit signed integer | int_16 (int32) | - | - | number(5), smallint |
32-bit signed integer | int_32 (int32) | - | int | number(10), int |
64-bit signed integer | int_64 (int64) | - | long | number(20), bigint |
8-bit unsigned integer | uint_8 (int32) | byte | - | number(3), tinyint |
16-bit unsigned integer | uint_16 (int32) | short | - | number(5), smallint |
32-bit unsigned integer | uint_32 (int32) | int | - | number(10), int |
64-bit unsigned integer | uint_64 (int64) | long | - | number(20), bigint |
96-bit signed integer | int96 (5) | - | - | number(29), timestamp(9) |
IEEE 32-bit floating point | float | float | float | binary_float |
IEEE 64-bit floating point | double | double | double | binary_double |
boolean | boolean | boolean | boolean | number(1) |
null/no value | null | - | null | varchar2(1) |
enum | enum(int) | - | enum(int) | varchar2 |
1. In Parquet, a decimal can be stored in int_32, int_64, as a fixed-length byte array, or as a byte array whose size is determined by the precision.
2. In ORC, decimals are stored in 128-bit integers.
3. In Avro, decimals are internally stored as byte arrays (fixed or not). Depending on the Avro writer, some store the string representation of the decimal, while others store the unscaled value. To avoid presenting ambiguous data to the user, it is recommended that you use the access parameter com.oracle.bigdata.avro.decimaltpe to declare explicitly which representation is used in the file. If this parameter is not specified, the unscaled representation of the data is assumed to be stored in the decimal columns of the file. See ORACLE_BIGDATA Access Parameters.
4. The binary type for Parquet is only available as a BLOB when com.oracle.bigdata.prq.binary_as_string is set to FALSE. See ORACLE_BIGDATA Access Parameters.
5. The int96 type for Parquet is only available as a number when the com.oracle.bigdata.prq.int96_as_timestamp access parameter is set to FALSE. See ORACLE_BIGDATA Access Parameters. (A sketch showing items 4 and 5 in an external table definition follows this list.)
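As a sketch of how items 4 and 5 above apply in practice, the access parameters below ask the driver to expose Parquet binary data as a BLOB and int96 data as a number rather than a timestamp. The table name, column names, and object-store URL are hypothetical; the credential object MY_CRED is reused from the earlier example, and ORACLE_BIGDATA Access Parameters remains the authoritative reference.
CREATE TABLE parquet_raw_ext
  (payload   BLOB,          -- Parquet binary column surfaced as BLOB (item 4)
   legacy_id NUMBER(29))    -- Parquet int96 column surfaced as a number (item 5)
ORGANIZATION EXTERNAL
  (TYPE ORACLE_BIGDATA
   ACCESS PARAMETERS
   (
     com.oracle.bigdata.credential.name=MY_CRED
     com.oracle.bigdata.fileformat=parquet
     com.oracle.bigdata.prq.binary_as_string=FALSE
     com.oracle.bigdata.prq.int96_as_timestamp=FALSE
   )
   location ('https://<domain>.com/BIGDATA_PARQUET/<filename>.parquet')
  )
REJECT LIMIT UNLIMITED;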
3.7.4 ORACLE_BIGDATA Support for Compressed Files
ORACLE_BIGDATA driver support for access to compressed files in object stores is as follows:
- Compressed Text, Parquet, Avro, and ORC files
The driver can read from text, Parquet, Avro, and ORC files compressed with gzip, bzip2, or zlib. The compression format can be detected automatically or specified explicitly; no additional parameters are required in the external table CREATE statement when automatic detection is used.
The following ORACLE_BIGDATA access parameter defines the compression type:
com.oracle.bigdata.compressiontype=detect|gzip|zlib|bzip2
Here is an example of defining a gzip compression type (a complete table definition using this parameter is sketched below):
com.oracle.bigdata.compressiontype=gzip
See ORACLE_BIGDATA Access Parameters for further details.
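For example, a minimal sketch of an external table over a gzip-compressed delimited text file in an object store might look as follows. The bucket path, file name, and columns are hypothetical, and the credential object MY_CRED is reused from the earlier example.
CREATE TABLE sales_csv_gz_ext
  (sale_date DATE,
   amount    NUMBER(10,2))
ORGANIZATION EXTERNAL
  (TYPE ORACLE_BIGDATA
   ACCESS PARAMETERS
   (
     com.oracle.bigdata.credential.name=MY_CRED
     com.oracle.bigdata.fileformat=textfile
     com.oracle.bigdata.compressiontype=gzip
   )
   location ('https://<domain>.com/BIGDATA_TEXT/<filename>.csv.gz')
  )
REJECT LIMIT UNLIMITED;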
3.8 Query External Tables
Users can query external tables using the SQL SELECT
statement, the same as they query any other table.
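For example, an external table can participate in joins with ordinary Oracle tables. The sketch below reuses the tkexbaseballtab external table created earlier and assumes a hypothetical Oracle dimension table named teams_dim.
SELECT d.division,
       COUNT(*) AS players
FROM   tkexbaseballtab b
       JOIN teams_dim d ON d.team_name = b.team   -- teams_dim is a hypothetical Oracle table
GROUP  BY d.division;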
Note:
The MODIFY EXTERNAL clause is not allowed for any external table created through the ORACLE_BIGDATA driver.
3.8.1 Grant User Access
Users who query the data on a Hadoop cluster must have READ
access in Oracle Database to the external table and to the database directory object that points to the cluster directory. See "About the Cluster Directory".
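A minimal sketch of such grants, using hypothetical object and user names, might look like this:
-- Hypothetical names: bda1_cl_1_dir is the directory object for the cluster
-- directory, hive_sales_ext is a Big Data SQL external table, analyst1 is a user.
GRANT READ ON DIRECTORY bda1_cl_1_dir TO analyst1;
GRANT READ ON hive_sales_ext TO analyst1;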
3.8.2 About Error Handling
By default, a query returns no data if an error occurs while the value of a column is calculated. Processing continues after most errors, particularly those thrown while the column values are calculated.
Use the com.oracle.bigdata.erroropt
and com.oracle.bigdata.overflow parameters to determine how errors are handled.
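As an illustrative sketch only, these parameters are supplied in the ACCESS PARAMETERS clause of the external table definition. The column name and the JSON value syntax shown here are assumptions; confirm the exact syntax in ORACLE_BIGDATA Access Parameters.
-- Sketch: set the name column to NULL on conversion errors and truncate
-- values that overflow the target column. Values are illustrative only.
ACCESS PARAMETERS
(
  com.oracle.bigdata.erroropt={"action":"setnull", "col":"name"}
  com.oracle.bigdata.overflow={"action":"truncate", "col":"name"}
)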
3.8.4 About File Readers
3.8.4.1 Using Oracle's Optimized Parquet Reader for Hadoop Sources
For reading Parquet files, you have the option of using the custom Parquet reader for Hadoop sources. This proprietary driver improves performance and makes more efficient use of cluster resources.
Disabling or Re-Enabling the Custom Parquet Reader
com.oracle.bigdata.useOracleParquet=false
You can add this setting to the cluster properties file to disable the optimization for all Parquet-based external tables. Remove the setting to return to the default.
Compatibility with Previously Created Parquet Format Data
Use of the custom reader requires no changes to data format. However, for best performance, the format must provide min and max values for each column for each Parquet block. These values are used by the standard Hadoop Parquet InputFormat, as well as the custom Parquet reader, to optimize the query. The resulting optimization significantly improves query performance with both Hive and Oracle Big Data SQL.
Note that Parquet files created by Impala do not include min and max values for each column for each Parquet block.
To ensure that min and max values are available, it is recommended that you write Parquet files with Hive or other tools that generate output in the standard Hadoop Parquet InputFormat, such as PrestoDB and Spark.
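For instance, a hypothetical Hive statement along these lines (run in Hive, not in Oracle Database) rewrites an existing staging table into Parquet through the standard Hadoop Parquet output path which, per the guidance above, provides the min and max values. The table names sales_staging and sales_parquet are assumptions for illustration.
-- HiveQL, not Oracle SQL; table names are hypothetical.
CREATE TABLE sales_parquet STORED AS PARQUET AS
SELECT * FROM sales_staging;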
To check whether a file includes these values, run the parquet-tools command with the meta option, as shown in the following example. On a CDH Hadoop distribution, the parquet-tools command may already be configured in your path.
parquet-tools meta sales.parq
The resulting output should contain, for each column, a min, a max, and a num_nulls value:
c_quantity : ... : [min: 0, max: 109572, num_nulls: 55]
c.total_sales : ... : [min: 0, max: 109571, num_nulls: 41]
3.9 About Oracle Big Data SQL on the Database Server (Oracle Exadata Machine or Other)
This section explains the changes that the Oracle Big Data SQL installation makes to the Oracle Database system (which may or may not be an Oracle Exadata Machine).
3.9.1 About the bigdata_config
Directory
The directory bigdata_config
contains configuration information that is common to all Hadoop clusters. This directory is located on the Oracle Database system under $ORACLE_HOME/bigdatasql
. The oracle
file system user (or whichever user owns the Oracle Database instance) owns bigdata_config
. The Oracle Database directory ORACLE_BIGDATA_CONFIG
points to bigdata_config
.
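To confirm where ORACLE_BIGDATA_CONFIG points on a given system, you can query the data dictionary; a minimal sketch:
SELECT directory_path
FROM   all_directories
WHERE  directory_name = 'ORACLE_BIGDATA_CONFIG';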
3.9.2 Common Configuration Properties
The installation stores these files in the bigdata_config directory under $ORACLE_HOME/bigdatasql:
- bigdata.properties
- bigdata-log4j.properties
The Oracle DBA can edit these configuration files as necessary.
3.9.2.1 bigdata.properties
The bigdata.properties
file in the common directory contains property-value pairs that define the Java class paths and native library paths required for accessing data in HDFS.
The following list describes all properties permitted in bigdata.properties. Properties marked Required must be set.
bigdata.properties
Property | Description |
---|---|
bigdata.cluster.default | The name of the default Hadoop cluster. The access driver uses this name when the access parameters do not specify a cluster. Required. Changing the default cluster name might break external tables that were created previously without an explicit cluster name. |
bigdata.cluster.list | A comma-separated list of Hadoop cluster names. Optional. |
java.classpath.hadoop | The Hadoop class path. Required. |
java.classpath.hive | The Hive class path. Required. |
java.classpath.oracle | The path to the Oracle JXAD Java JAR file. Required. |
java.classpath.user | The path to user JAR files. Optional. |
java.libjvm.file | The full file path to the JVM shared library (such as libjvm.so). |
java.options | A comma-separated list of options to pass to the JVM. Optional. This example sets the maximum heap size to 2 GB and enables verbose logging for Java Native Interface (JNI) calls: -Xmx2048m,-verbose=jni |
java.options2 | A space-delimited list of options to pass to the JVM. Optional. The delimiter must be a space character, not a tab or other whitespace character. This example sets the maximum heap size to 2 GB and enables verbose logging for Java Native Interface (JNI) calls: -Xmx2048m -verbose=jni Note: java.options is comma-delimited, while java.options2 is space-delimited. The two properties can coexist in the same bigdata.properties file. |
LD_LIBRARY_PATH | A colon-separated (:) list of directory paths to search for the Hadoop native libraries. Recommended. If you set this option, then do not set java.library.path in java.options. |
Example 3-1 shows a sample bigdata.properties
file.
Example 3-1 Sample bigdata.properties File
# bigdata.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#   bigdata.properties - Big Data Properties File
#
# DESCRIPTION
#   Properties file containing parameters for allowing access to Big Data
#   Fixed value properties can be added here
#
java.libjvm.file=$ORACLE_HOME/jdk/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=$ORACLE_HOME/hadoopcore/jlib/*:$ORACLE_HOME/hadoop/jlib/hver-2/*:$ORACLE_HOME/dbjava/lib/*
java.classpath.hadoop=$HADOOP_HOME/*:$HADOOP_HOME/lib/*
java.classpath.hive=$HIVE_HOME/lib/*
LD_LIBRARY_PATH=$ORACLE_HOME/jdk/jre/lib
bigdata.cluster.default=hadoop_cl_1
3.9.2.2 bigdata-log4j.properties
The bigdata-log4j.properties
file in the common directory defines the logging behavior of queries against external tables in the Java code. Any log4j
properties are allowed in this file.
Example 3-2 shows a sample bigdata-log4j.properties
file with the relevant log4j
properties.
Example 3-2 Sample bigdata-log4j.properties File
# bigdata-log4j.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
#   bigdata-log4j.properties - Big Data Logging Properties File
#
# DESCRIPTION
#   Properties file containing logging parameters for Big Data
#   Fixed value properties can be added here
#
bigsql.rootlogger=INFO,console
log4j.rootlogger=DEBUG, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.logger.oracle.hadoop.sql=ALL, file
bigsql.log.dir=.
bigsql.log.file=bigsql.log
log4j.appender.file.File=$ORACLE_HOME/bigdatalogs/bigdata-log4j.log
3.9.3 About the Cluster Directory
The cluster directory contains configuration information for a Hadoop cluster. Each cluster that Oracle Database accesses using Oracle Big Data SQL has a cluster directory. This directory is located on the Oracle Database system under $ORACLE_HOME/bigdatasql/clusters/. For example, cluster bda1_cl_1 would have a directory $ORACLE_HOME/bigdatasql/clusters/bda1_cl_1, and under $ORACLE_HOME/bigdatasql/clusters/bda1_cl_1/config would be the client configuration files for accessing the cluster:
- bigdata.hosts (not editable by customers)
- core-site.xml
- hdfs-site.xml
- hive-site.xml
- mapred-site.xml (optional)
- log4j property files (such as hive-log4j.properties)
$ORACLE_HOME/bigdatasql/databases/<database name>/bigdata_config/default_cluster
is a soft link to the directory of the default cluster.
A database directory object points to the cluster directory. Users who want to access the data in a cluster must have read access to the directory object.
3.9.4 About Permissions
On the Oracle database server, ensure that the oracle
user (or whatever user owns the Oracle Database installation directory) has READ/WRITE access to the database directory that points to the log directory.
On the Hadoop side, running Database Acknowledge (# ./jaguar databaseack [config file]) creates an account for the database owner and grants the required permissions.