3 Setting Up the Environment for Integrating Big Data
This chapter includes the following sections:
3.1 Configuring Big Data technologies using the Big Data Configurations Wizard
The Big Data Configurations wizard provides a single entry point to set up multiple Hadoop technologies. You can quickly create data servers, physical schemas, and logical schemas, and set a context for Hadoop technologies such as Hadoop Distributed File System (HDFS), HBase, Oozie, Spark, Hive, and Pig.
The default metadata for different distributions, such as properties, host names, and port numbers, and default values for environment variables are pre-populated for you. This helps you easily create the data servers, along with the physical and logical schemas, without in-depth knowledge of these technologies.
After all the technologies are configured, you can validate the settings against the data servers to test the connection status.
Note:
If you do not want to use the Big Data Configurations wizard, you can set up the data servers for the Big Data technologies manually using the information mentioned in the subsequent sections.
To run the Big Data Configurations Wizard:
3.1.1 General Settings
The following table describes the options that you need to set on the General Settings panel of the Big Data Configurations wizard.
Table 3-1 General Settings Options
| Option | Description |
| --- | --- |
| Prefix | Specify a prefix. This prefix is attached to the data server name, logical schema name, and physical schema name. |
| Distribution | Select a distribution, either Manual or Cloudera Distribution for Hadoop (CDH) <version>. |
| Base Directory | Specify the directory location where CDH is installed. This base directory is automatically populated in all other panels of the wizard. Note: This option appears only if the distribution is other than Manual. |
| Distribution Type | Select a distribution type, either Normal or Kerberized. |
| Technologies | Select the technologies that you want to configure. Note: Data server creation panels are displayed only for the selected technologies. |
3.1.2 HDFS Data Server Definition
The following table describes the options that you must specify to create an HDFS data server.
Note:
Only the fields required or specific for defining an HDFS data server are described.
Table 3-2 HDFS Data Server Definition
| Option | Description |
| --- | --- |
| Name | Type a name for the data server. This name appears in Oracle Data Integrator. |
| User/Password | HDFS currently does not implement User/Password security. Leave this option blank. |
| Hadoop Data Server | Hadoop data server that you want to associate with the HDFS data server. |
| Additional Classpath | Specify additional JAR files for the classpath, if needed. |
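After the HDFS data server is defined, you can sanity-check connectivity from the agent machine before testing the connection in the wizard. A minimal sketch using the standard Hadoop client CLI; the NameNode hostname and port are placeholder values to replace with those of your cluster:

```bash
# List the HDFS root to confirm the NameNode is reachable.
# namenode.example.com:8020 is a placeholder; use your cluster's values.
hdfs dfs -ls hdfs://namenode.example.com:8020/
```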
3.1.3 HBase Data Server Definition
The following table describes the options that you must specify to create an HBase data server.
Note: Only the fields required or specific for defining an HBase data server are described.
Table 3-3 HBase Data Server Definition
| Option | Description |
| --- | --- |
| Name | Type a name for the data server. This name appears in Oracle Data Integrator. |
| HBase Quorum | ZooKeeper quorum address, as specified in hbase-site.xml. |
| User/Password | HBase currently does not implement User/Password security. Leave these fields blank. |
| Hadoop Data Server | Hadoop data server that you want to associate with the HBase data server. |
| Additional Classpath | Specify any additional classes/JAR files to be added. The classpath entries built from the Base Directory value are pre-populated. |
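For reference, the quorum value corresponds to the hbase.zookeeper.quorum property in hbase-site.xml. A minimal sketch of that fragment, with hypothetical hostnames:

```xml
<!-- Fragment of hbase-site.xml; the hostnames are placeholders. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zkhost1.example.com,zkhost2.example.com,zkhost3.example.com</value>
</property>
```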
3.1.4 Kafka Data Server Definition
The following table describes the options that you must specify to create a Kafka data server.
Note:
Only the fields required or specific for defining a Kafka data server are described.
Table 3-4 Kafka Data Server Definition
| Option | Description |
| --- | --- |
| Name | Type a name for the data server. |
| User/Password | User name with its password. |
| Hadoop Data Server | Hadoop data server that you want to associate with the Kafka data server. If Kafka is not running on the Hadoop server, there is no need to specify a Hadoop data server. This option is useful when Kafka runs on its own server. |
| Additional Classpath | Specify any additional classes/JAR files to be added. The classpath entries built from the Base Directory value are pre-populated. If required, you can add more classpath entries. If Kafka is not running on the Hadoop server, specify the absolute path of the Kafka libraries in this field. Note: This field appears only when you are creating the Kafka data server using the Big Data Configurations wizard. |
3.1.5 Kafka Data Server Properties
The following table describes the Kafka data server properties that you need to add on the Properties tab when creating a new Kafka data server.
Table 3-5 Kafka Data Server Properties
| Key | Value |
| --- | --- |
| metadata.broker.list | Comma-separated list of Kafka metadata brokers. Each broker is defined in the form hostname:port. |
| oracle.odi.prefer.dataserver.packages | Packages whose classes are loaded preferentially from the data server classpath, used to retrieve topics and messages from the Kafka server. Value: scala, kafka, oracle.odi.kafka.client.api.impl, org.apache.log4j. |
| security.protocol | Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, and SASL_SSL. |
| zookeeper.connect | Specifies the ZooKeeper connection string in the form hostname:port, where hostname and port are the host and port of a ZooKeeper server. Multiple hosts can be specified as a comma-separated list. |
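Put together, the Properties tab of a Kafka data server might carry values like the following sketch; the broker and ZooKeeper hostnames are hypothetical:

```properties
# Hypothetical example values; substitute your own hosts and ports.
metadata.broker.list=broker1.example.com:9092,broker2.example.com:9092
oracle.odi.prefer.dataserver.packages=scala,kafka,oracle.odi.kafka.client.api.impl,org.apache.log4j
security.protocol=PLAINTEXT
zookeeper.connect=zkhost1.example.com:2181,zkhost2.example.com:2181
```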
3.2 Creating and Initializing the Hadoop Data Server
Configure the Hadoop data server definition and properties to create and initialize the Hadoop data server.
To create and initialize the Hadoop data server:
3.2.1 Hadoop Data Server Definition
The following table describes the fields that you must specify on the Definition tab when creating a new Hadoop data server.
Note: Only the fields required or specific for defining a Hadoop data server are described.
Table 3-6 Hadoop Data Server Definition
| Field | Description |
| --- | --- |
| Name | Name of the data server that appears in Oracle Data Integrator. |
| Data Server | Physical name of the data server. |
| User/Password | Hadoop user with its password. If no password is provided, only simple authentication is performed, using the user name, on HDFS and Oozie. |
| Authentication Method | Select one of the following authentication methods. Note: The following link helps determine if the Hadoop cluster is secured: |
| HDFS Node Name URI | URI of the HDFS NameNode. |
| Resource Manager/Job Tracker URI | URI of the Resource Manager or the Job Tracker. |
| ODI HDFS Root | Path of the ODI HDFS root directory. |
| Additional Class Path | Specify additional classpaths. |
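As an illustration, the URI fields conventionally take values like the sketch below. The hostname is hypothetical; ports 8020 (HDFS NameNode) and 8032 (YARN ResourceManager) are common defaults that may differ on your cluster:

```text
HDFS Node Name URI:                hdfs://namenode.example.com:8020
Resource Manager/Job Tracker URI:  namenode.example.com:8032
ODI HDFS Root:                     /user/odi/odi_home
```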
3.2.2 Hadoop Data Server Properties
The following table describes the properties that you can configure in the Properties tab when defining a new Hadoop data server.
Note:
By default, only the oracle.odi.prefer.dataserver.packages property is displayed. Click the + icon to add the other properties manually.
Table 3-7 Hadoop Data Server Properties Mandatory for Hadoop and Hive
| Property Group | Property | Description/Value |
| --- | --- | --- |
| General | HADOOP_HOME | Location of the Hadoop home directory. |
| User Defined | HADOOP_CONF | Location of Hadoop configuration files such as core-default.xml, core-site.xml, and hdfs-site.xml. |
| Hive | HIVE_HOME | Location of the Hive home directory. |
| User Defined | HIVE_CONF | Location of Hive configuration files such as hive-site.xml. |
| General | HADOOP_CLASSPATH | Additional classpath entries for Hadoop and Hive client classes. |
| General | HADOOP_CLIENT_OPTS | Extra Java options passed to the Hadoop client JVM. |
| Hive | HIVE_SESSION_JARS | JAR files added to each Hive session, for example through the Hive ADD JAR call. |
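As a sketch, on a CDH-style layout these properties often resolve to paths like the following; treat them as assumptions to verify against your installation:

```properties
# Hypothetical example values for a CDH-style installation.
HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF=/etc/hadoop/conf
HIVE_HOME=/usr/lib/hive
HIVE_CONF=/etc/hive/conf
```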
Table 3-8 Hadoop Data Server Properties Mandatory for HBase (In addition to base Hadoop and Hive Properties)
| Property Group | Property | Description/Value |
| --- | --- | --- |
| HBase | HBASE_HOME | Location of the HBase home directory. |
| General | HADOOP_CLASSPATH | Additional classpath entries for the HBase client libraries. |
| Hive | HIVE_SESSION_JARS | $HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar Note: Follow the steps for Hadoop security models, such as Apache Sentry, to allow the Hive ADD JAR call used inside ODI Hive KMs. |
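For context, the ODI Hive KMs register the HIVE_SESSION_JARS entries through Hive's ADD JAR call, which is what security models such as Sentry must permit. A minimal sketch of the equivalent manual invocation, with a placeholder JAR path:

```bash
# Launch the Hive CLI and add the HBase handler JAR to the session.
# The JAR path is a placeholder; use the entries from HIVE_SESSION_JARS.
hive -e "ADD JAR /usr/lib/hive/lib/hive-hbase-handler.jar;"
```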
Table 3-9 Hadoop Data Server Properties Mandatory for Oracle Loader for Hadoop (In addition to base Hadoop and Hive properties)
| Property Group | Property | Description/Value |
| --- | --- | --- |
| OLH/OSCH | OLH_HOME | Location of the OLH installation. |
| OLH/OSCH | OLH_FILES | Files to be made available to Oracle Loader for Hadoop jobs. |
| OLH/OSCH | ODCH_HOME | Location of the OSCH installation. |
| General | HADOOP_CLASSPATH | Additional classpath entries for the OLH/OSCH client classes. |
| OLH/OSCH | OLH_JARS | Comma-separated list of all JAR files required for custom input formats, Hive, Hive SerDes, and so forth, used by Oracle Loader for Hadoop. All file names have to be expanded without wildcards. |
| OLH/OSCH | OLH_SHAREDLIBS (deprecated) | Shared (native) libraries used by Oracle Loader for Hadoop. |
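Because OLH_JARS must list fully expanded file names, a small shell one-liner can build the comma-separated value; the directory below is a placeholder:

```bash
# Expand all JARs in a hypothetical directory into a comma-separated
# list suitable for OLH_JARS (no wildcards allowed in the final value).
ls /usr/lib/hive/lib/*.jar | paste -sd, -
```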
Table 3-10 Hadoop Data Server Properties Mandatory for SQOOP (In addition to base Hadoop and Hive properties)
| Property Group | Property | Description/Value |
| --- | --- | --- |
| SQOOP | SQOOP_HOME | Location of the Sqoop home directory. |
| SQOOP | SQOOP_LIBJARS | Location of the Sqoop library JAR files. |
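A sketch of typical values on a CDH-style layout; both paths are assumptions to verify against your installation:

```properties
# Hypothetical example values for a CDH-style installation.
SQOOP_HOME=/usr/lib/sqoop
SQOOP_LIBJARS=/usr/lib/sqoop/lib
```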
3.3 Creating a Hadoop Physical Schema
Create a Hadoop physical schema and an associated logical schema using the standard procedures.
Create the Hadoop physical schema using the standard procedure, as described in the Creating a Physical Schema section in Administering Oracle Data Integrator.
Then create a logical schema for this physical schema using the standard procedure, as described in the Creating a Logical Schema section in Administering Oracle Data Integrator, and associate it with the physical schema in a given context.
3.4 Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs
You must configure the Oracle Data Integrator agent to execute Hadoop jobs.
For information on creating a physical agent, see the Creating a Physical Agent section in Administering Oracle Data Integrator.
To configure the Oracle Data Integrator agent:
3.5 Configuring Oracle Loader for Hadoop
If you want to use Oracle Loader for Hadoop, you must install and configure Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.
Oracle Loader for Hadoop is an efficient and high-performance loader for fast loading of data from a Hadoop cluster into a table in an Oracle database.
To install and configure Oracle Loader for Hadoop:
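Once installed, an Oracle Loader for Hadoop job is conventionally submitted through the hadoop CLI. A minimal sketch assuming the standard OLH layout; myconf.xml is a hypothetical job configuration file:

```bash
# Submit an OLH job; oraloader.jar and the OraLoader class ship with OLH.
# myconf.xml is a placeholder for your job configuration file.
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf myconf.xml
```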
3.6 Configuring Oracle Data Integrator to Connect to a Secure Cluster
To run the Oracle Data Integrator agent on a Hadoop cluster that is protected by Kerberos authentication, you must perform additional configuration steps.
To use a Kerberos-secured cluster:
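In practice, the operating system user running the agent needs a valid Kerberos ticket before Hadoop jobs can be submitted. A minimal sketch using the standard Kerberos client tools; the principal is a placeholder:

```bash
# Obtain a Kerberos ticket for the agent user (the principal is hypothetical).
kinit odi_user@EXAMPLE.COM
# Verify the ticket cache.
klist
```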
3.7 Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent
To execute Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the configuration steps in Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs, with the following change:
Copy the following Hadoop client JAR files to the local machine.
/usr/lib/hadoop/*.jar
/usr/lib/hadoop/lib/*.jar
/usr/lib/hadoop/client/*.jar
/usr/lib/hadoop-hdfs/*.jar
/usr/lib/hadoop-mapreduce/*.jar
/usr/lib/hadoop-yarn/*.jar
/usr/lib/oozie/lib/*.jar
/usr/lib/hive/*.jar
/usr/lib/hive/lib/*.jar
/usr/lib/hbase/*.jar
/usr/lib/hbase/lib/*.jar
Add the above classpaths to the additional_path.txt file under the userlib directory.
For example:
Linux: $USER_HOME/.odi/oracledi/userlib
Windows: C:\Users\<USERNAME>\AppData\Roaming\odi\oracledi\userlib
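As a sketch, if the client JAR files were copied to a hypothetical local directory such as /opt/hadoop-client-jars, the additional_path.txt entries would mirror that layout, following the wildcard pattern used in the list above:

```text
/opt/hadoop-client-jars/hadoop/*.jar
/opt/hadoop-client-jars/hadoop/lib/*.jar
/opt/hadoop-client-jars/hive/*.jar
/opt/hadoop-client-jars/hive/lib/*.jar
```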