Using Query Processing Engines to Generate Code in Different Languages

6.1 Query Processing Engines Supported by Oracle Data Integrator

Hadoop provides a framework for parallel data processing in a cluster. There are different languages that provide a user front-end. Oracle Data Integrator supports the following query processing engines to generate code in different languages:

Hive

The Apache Hive warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
Pig

Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin.
Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop Input Format.

To generate code in these languages, you need to set up Hive, Pig, and Spark data servers in Oracle Data Integrator. These data servers are to be used as the staging area in your mappings to generate HiveQL, Pig Latin, or Spark code.

Generate Code in Different Languages with Oracle Data Integrator

6.2 Setting Up Hive Data Server

To set up the Hive data server:

Click the Topology tab.
In the Physical Architecture tree, under Technologies, right-click Hive and then click New Data Server.
In the Definition tab, specify the details of the Hive data server.

See Hive Data Server Definition for more information.
In the JDBC tab, specify the Hive data server connection details.

See Hive Data Server Connection Details for more information.
Click Test Connection to test the connection to the Hive data server.

6.2.1 Hive Data Server Definition

The following table describes the fields that you need to specify on the Definition tab when creating a new Hive data server.

Note: Only the fields required or specific for defining a Hive data server are described.

Table 6-1 Hive Data Server Definition

Field	Description
Name	Name of the data server that appears in Oracle Data Integrator.
Data Server	Physical name of the data server.
User/Password	Hive user with its password.
Metastore URI	Hive Metastore URIs: for example, `thrift://BDA:10000`.
Hadoop Data Server	Hadoop data server that you want to associate with the Hive data server.
Additional Classpath	Additional classpaths.

Setting Up Hive Data Server

Configuring Big Data technologies using the Big Data Configurations Wizard

6.2.2 Hive Data Server Connection Details

The following table describes the fields that you need to specify on the JDBC tab when creating a new Hive data server.

Note: Only the fields required or specific for defining a Hive data server are described.

Table 6-2 Hive Data Server Connection Details

Field Description

Field	Description
JDBC Driver	`Apache Hive DataDirect Driver` Use this JDBC driver to connect to the Hive Data Server.
JDBC URL	`jdbc:weblogic:hive://<host>:<port>[; property=value[;...]]` For example, `jdbc:weblogic:hive://localhost:10000;DatabaseName=default;User=default;Password=default` Kerberized: `jdbc:weblogic:hive://<host>:<port>;DatabaseName=<value>;AuthenticationMethod=kerberos;ServicePrincipalName=<value>` For example, `jdbc:weblogic:hive://localhost:10000;DatabaseName=default;AuthenticationMethod=kerberos;ServicePrincipalName=hive`

JDBC Driver

Apache Hive DataDirect Driver

Use this JDBC driver to connect to the Hive Data Server.

JDBC URL

jdbc:weblogic:hive://<host>:<port>[; property=value[;...]]

For example, jdbc:weblogic:hive://localhost:10000;DatabaseName=default;User=default;Password=default

Kerberized: jdbc:weblogic:hive://<host>:<port>;DatabaseName=<value>;AuthenticationMethod=kerberos;ServicePrincipalName=<value>

For example, jdbc:weblogic:hive://localhost:10000;DatabaseName=default;AuthenticationMethod=kerberos;ServicePrincipalName=hive

Setting Up Hive Data Server

6.3 Creating a Hive Physical Schema

Create a Hive physical schema using the standard procedure, as described in the Creating a Physical Schema section in Administering Oracle Data Integrator.

Create for this physical schema a logical schema using the standard procedure, as described in the Creating a Logical Schema section in Administering Oracle Data Integrator and associate it in a given context.

Setting Up Hive Data Server

6.4 Setting Up Pig Data Server

To set up the Pig data server:

Click the Topology tab.
In the Physical Architecture tree, under Technologies, right-click Pig and then click New Data Server.
In the Definition tab, specify the details of the Pig data server.

See Pig Data Server Definition for more information.
In the Properties tab, add the Pig data server properties.

See Pig Data Server Properties for more information.
Click Test Connection to test the connection to the Pig data server.

6.4.1 Pig Data Server Definition

The following table describes the fields that you need to specify on the Definition tab when creating a new Pig data server.

Note: Only the fields required or specific for defining a Pig data server are described.

Table 6-3 Pig Data Server Definition

Field	Description
Name	Name of the data server that will appear in Oracle Data Integrator.
Data Server	Physical name of the data server.
Process Type	Choose one of the following: Local Mode Select to run the job in local mode. In this mode, pig scripts located in the local file system are executed. MapReduce jobs are not created. MapReduce Mode Select to run the job in MapReduce mode. In this mode, pig scripts located in the HDFS are executed. MapReduce jobs are created. Note: If this option is selected, the Pig data server must be associated with a Hadoop data server.
Hadoop Data Server	Hadoop data sever that you want to associate with the Pig data server. Note: This field is displayed only when the MapReduce Mode option is set to Process Type.
Additional Classpath	Specify additional classpaths. Add the following additional classpaths: Local Mode `/<dir name>/pig/pig.jar` MapReduce Mode `/etc/hbase/conf` `/usr/lib/pig/lib` `/usr/lib/pig/pig-0.12.0-cdh<version>.jar` Replace <version> with the Cloudera version you have. For example, `/usr/lib/pig/pig-0.12.0-cdh5.10.0.jar`. `/usr/lib/hive-hcatalog/share/hcatalog` `/usr/lib/hbase/lib` `/usr/lib/hbase` For pig-hcatalog-hive, add the following classpath in addition to the ones mentioned above: `/usr/lib/hive-hcatalaog/share/hcatalog`
User/Password	Pig user with its password.

Setting Up Pig Data Server

Configuring Big Data technologies using the Big Data Configurations Wizard

6.4.2 Pig Data Server Properties

The following table describes the Pig data server properties that you need to add on the Properties tab when creating a new Pig data server.

Table 6-4 Pig Data Server Properties

Key	Value
hive.metastore.uris	`thrift://bigdatalite.localdomain:9083`
pig.additional.jars	`/usr/lib/hive-hcatalog/share/hcatalog/*.jar:/usr/lib/hive/`
hbase.defaults.for.version.skip	Set to true to skip the hbase.defaults.for.version check. Set this boolean to true to avoid seeing the RuntimException issue.
hbase.zookeeper.quorum	Quorum of the HBase installation. For example, `localhost:2181`.

Setting Up Pig Data Server

6.5 Creating a Pig Physical Schema

Create a Pig physical schema using the standard procedure, as described in the Creating a Physical Schema section in Administering Oracle Data Integrator.

Create for this physical schema a logical schema using the standard procedure, as described in the Creating a Logical Schema section in Administering Oracle Data Integrator and associate it in a given context.

Setting Up Pig Data Server

6.6 Setting Up Spark Data Server

To set up the Spark data server:

Click the Topology tab.
In the Physical Architecture tree, under Technologies, right-click Spark Python and then click New Data Server.
In the Definition tab, specify the details of the Spark data server.

See Spark Data Server Definition for more information.
In the Properties tab, specify the properties for the Spark data server.

See Spark Data Server Properties for more information.
Click Test Connection to test the connection to the Spark data server.

Note:
The test connection button is disabled because Spark and Pig are not testable.

6.6.1 Spark Data Server Definition

The following table describes the fields that you need to specify on the Definition tab when creating a new Spark Python data server.

Note: Only the fields required or specific for defining a Spark Python data server are described.

Table 6-5 Spark Data Server Definition

Field	Description
Name	Name of the data server that will appear in Oracle Data Integrator.
Master Cluster (Data Server)	Physical name of the master cluster or the data server.
User/Password	Spark data server or master cluster user with its password.
Hadoop DataServer	Hadoop data server that you want to associate with the Spark data server. Note: This field appears only when you are creating the Spark Data Server using the Big Data Configurations wizard.
Additional Classpath	The following additional classpaths are added by default: `/usr/lib/spark/` `/usr/lib/spark/lib/` If required, you can add more additional classpaths. Note: This field appears only when you are creating the Spark Data Server using the Big Data Configuration wizard.

Setting Up Spark Data Server

Configuring Big Data technologies using the Big Data Configurations Wizard

6.6.2 Spark Data Server Properties

The following table describes the properties that you can configure on the Properties tab when defining a new Spark data server.

Note: Other than the properties listed in the following table, you can add Spark configuration properties on the Properties tab. The configuration properties that you add here are applied when mappings are executed. For more information about the configuration properties, refer to the Spark documentation available at the following URL:

http://spark.apache.org/docs/latest/configuration.html

Table 6-6 Spark Data Server Properties

Property	Description
archives	Comma separated list of archives to be extracted into the working directory of each executor.
deploy-mode	Whether to launch the driver program locally (client) or on one of the worker machines inside the cluster (cluster).
driver-class-path	Classpath entries to pass to the driver. Jar files added with --jars are automatically included in the classpath.
driver-cores	Number of cores used by the driver in Yarn Cluster mode.
driver-java-options	Extra Java options to pass to the driver.
driver-library-path	Extra library path entries to pass to the driver.
driver-memory	Memory for driver, for example, 1000M, 2G. The default value is 512M.
executor-cores	Number of cores per executor. The default value is 1 in YARN mode, or all available cores on the worker in standalone mode.
executor-memory	Memory per executor, for example, 1000M, 2G. The default value is 1G.
jars	Comma-separated list of local jars to include on the driver and executor classpaths.
num-executors	Number of executors to launch. The default value is 2.
odi-execution-mode	ODI execution mode, either SYNC or ASYNC.
properties-file	Path to a file from which to load extra properties. If not specified, this will look for `conf/spark-defaults.conf`.
py-files	Additional python file to execute.
queue	The YARN queue to submit to. The default value is default.
spark-home-dir	Home directory of Spark installation.
spark-web-port	Web port of Spark UI. The default value is 1808.
spark-work-dir	Working directory of ODI Spark mappings that stores the generated python file.
supervise	If configured, restarts the driver on failure (Spark Standalone mode).
total-executor-cores	Total cores for all executors (Spark Standalone mode).
yarn-web-port	Web port of yarn, the default value is 8088.
principal	Kerberized User name.
keytab	Kerberized Password.

Setting Up Spark Data Server

Configuring Big Data technologies using the Big Data Configurations Wizard

6.7 Creating a Spark Physical Schema

Create a Spark physical schema using the standard procedure, as described in the Creating a Physical Schema section in Administering Oracle Data Integrator.

Create for this physical schema a logical schema using the standard procedure, as described in the Creating a Logical Schema section in Administering Oracle Data Integrator and associate it in a given context.

Setting Up Spark Data Server

6.8 Generating Code in Different Languages

Oracle Data Integrator can generate code for multiple languages. For Big Data, this includes HiveQL, Pig Latin, Spark RDD, and Spark DataFrames. The style of code is primarily determined by the choice of the data server used for the staging location of the mapping.

Before you generate code in these languages, ensure that the Hive, Pig, and Spark data servers are set up.

For more information see the following sections:

Setting Up Hive Data Server

Setting Up Pig Data Server

Setting Up Spark Data Server

To generate code in different languages:

Open your mapping.
To generate HiveQL code, run the mapping with the default staging location (Hive).
To generate Pig Latin or Spark code, go to the Physical diagram and do one of the following:
1. To generate Pig Latin code, set the Execute On Hint option to use the Pig data server as the staging location for your mapping.
2. To generate Spark code, set the Execute On Hint option to use the Spark data server as the staging location for your mapping.
Execute the mapping.

Query Processing Engines Supported by Oracle Data Integrator

Generate Code in Different Languages with Oracle Data Integrator