2 Oracle Shell for Hadoop Loaders
Oracle Big Data Connectors are a powerful toolset for data interchange between Hadoop and Oracle Database. Oracle Shell for Hadoop Loaders (OHSH) is an easy-to-use command-line interface to Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and the Copy to Hadoop feature of Big Data SQL.
What is Oracle Shell for Hadoop Loaders?
Oracle Shell for Hadoop Loaders is a helper shell that provides an easy-to-use command line interface to Oracle Loader for Apache Hadoop, Oracle SQL Connector for HDFS, and the Copy to Hadoop feature of Big Data SQL. It has basic shell features such as command line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environmental substitution in the command line.
The core functionality of Oracle Shell for Hadoop Loaders includes the following:
- Defining named external resources with which Oracle Shell for Hadoop Loaders interacts to perform loading tasks.
- Setting default values for load operations.
- Running load commands.
- Delegating simple pre- and post-load tasks to the operating system, HDFS, Hive, and Oracle. These tasks include viewing the data to be loaded, and viewing the data in the target table after loading.
See Also:
The examples directory in the OHSH kit contains many examples that define resources and load data using Oracle Shell for Hadoop Loaders. Unzip <OHSH_KIT>/examples.zip and see <OHSH_KIT>/examples/README.txt for a description of the examples and instructions on how to run OHSH load methods.
Note:
The HADOOP_CLASSPATH environment variable should be cleared before invoking Oracle Shell for Hadoop Loaders.
Configure Oracle Shell for Hadoop Loaders (OHSH)
Note:
For instructions on configuring OHSH to run on Oracle Big Data Service, see Use Big Data Connectors and Copy to Hadoop to Copy Data Between Big Data Service and a Database Instance in Using Oracle Big Data Service.
For OHSH on a Hadoop cluster or edge node, or on the Oracle Database system, confirm the location of the OHSH install and set OHSH_HOME.
For example:
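The path below is illustrative; substitute the actual location of your OHSH kit:
$ export OHSH_HOME=/opt/oracle/ohsh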
Get Started with Oracle Shell for Hadoop Loaders
Starting an OHSH Interactive Session
To start an interactive session, enter ohsh on the command line. Remember to clear HADOOP_CLASSPATH before invoking OHSH. This brings you to the OHSH shell (if you have OHSH in your path):
$ ohsh
ohsh>
You can execute OHSH commands in this shell (using the OHSH syntax). You can also execute commands for Beeline/Hive, Hadoop, Bash, and SQL*Plus. For non-OHSH commands, you add a delegation operator prefix ("%") to the name of the resource used to execute the command. For example:
ohsh> %bash0 ls -l
Scripting OHSH
You can also script the same commands that work in the CLI.
The ohsh command provides three parameters for working with scripts.
- ohsh -i <filename>.ohsh
  The -i parameter tells OHSH to initialize an interactive session with the commands in the script before the prompt appears. This is a useful way to set up the required session resources and automate other preliminary tasks before you start working within the shell (see the sample script after this list).
  $ ohsh -i initresources.ohsh
- ohsh -f <filename>.ohsh
  The ohsh command with the -f parameter starts a non-interactive session and runs the commands in the script.
  $ ohsh -f myunattendedjobs.ohsh
- ohsh -i <filename>.ohsh -f <filename>.ohsh
  You can use -i and -f together to initialize a non-interactive session and then run another script in the session.
  $ ohsh -i mysetup.ohsh -f myunattendedjobs.ohsh
- ohsh -c
  This command dumps all Hadoop configuration properties that an OHSH session inherits at start up.
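As an illustration of the -i pattern, a small initialization script can create session resources up front. The following initresources.ohsh is a sketch only; the host, port, and service name are placeholders, and the commands are the resource-creation commands described in the next section:
create sqlplus resource sql0 connectid="myhost.example.com:1521/orcl"
create jdbc resource jdbc0 connectid="myhost.example.com:1521/orcl"
Running ohsh -i initresources.ohsh then starts the shell with sql0 and jdbc0 already defined.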
Working With OHSH Resources
A resource is a named entity that OHSH interacts with. For example, a Hadoop cluster is a resource, as is a JDBC connection to an Oracle database, a Hive database, a SQL session with an Oracle database, and a Bash shell on the local OS.
OHSH provides two default resources at start up: hive0 (to connect to the default Hive database) and hadoop0.
- Using the hive0 resource to execute a Hive command:
  ohsh> %hive0 show tables
  You can create additional Hive resources to connect to other Hive databases (see the sketch after this list).
- Using the hadoop0 resource to execute a Hadoop command:
  ohsh> %hadoop0 fs -ls
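The following sketch creates such an additional Hive resource; the syntax follows the pattern of the other create resource commands in this section, and the database name and HiveServer2 URL are illustrative:
ohsh> create hive resource hive_mydb connectionurl="jdbc:hive2://myhost.example.com:10000/mydb"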
Within an interactive or scripted session, you can create instances of additional resources, such as SQL*Plus, JDBC, and SSL resources. You need to create SQL*Plus and JDBC resources in order to connect to Oracle Database through OHSH.
- Creating a SQL resource:
  ohsh> create sqlplus resource sql0 connectid="bigdatalite.localdomain:1521/orcl"
  Note: At the prompt, enter the database username and password.
- Creating a JDBC resource:
  ohsh> create jdbc resource jdbc0 connectid=<database connection URL>
  Note: At the prompt, enter the database username and password.
- Showing resources:
  ohsh> show resources
  This command lists the default resources and any additional resources created within the session.
Creating Resources Using JDBC SSL to Connect to Oracle Database
You need to create these resources in order to connect to Oracle Database using SSL.
- Creating an SSL resource. See Using JDBC SSL to download the SSL client wallet and complete the configuration steps.
- Creating a SQL resource:
  ohsh> create sqlplus resource sql_ssl connectid="<tns entry name for the SSL connection>"
  For example:
  ohsh> create sqlplus resource sql_ssl connectid="inst1_ssl"
  inst1_ssl is the TNS entry for the JDBC SSL connection (a sample entry appears after this list).
  Note: At the prompt, enter the database username and password.
- Creating a JDBC resource:
  ohsh> create jdbc resource jdbc_ssl connectiondir=<SSL wallet directory location> connectid="<tns entry name for the SSL connection>"
  For example:
  ohsh> create jdbc resource ora_mydatabase_cs connectiondir=/home/oracle/ssl_client_wallet connectid="inst1_ssl"
  inst1_ssl is the TNS entry for the JDBC SSL connection.
  Note: At the prompt, enter the database username and password.
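For reference, inst1_ssl refers to an entry in tnsnames.ora. A sketch of what such an entry for an SSL (TCPS) connection might look like, with illustrative host, port, and service name:
inst1_ssl =
  (DESCRIPTION=
    (ADDRESS=(PROTOCOL=TCPS)(HOST=dbhost.example.com)(PORT=2484))
    (CONNECT_DATA=(SERVICE_NAME=orcl.example.com)))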
Creating Resources Using Secure Java KeyStore
You can store passwords in the secure Java KeyStore. If you use the Java KeyStore, then you won't be prompted for the username and password. You can also add this to the scripts you develop to load data.
- Creating a Java KeyStore. See Using Secure External Java KeyStore and Hadoop credential command. (A sample credential command appears after this list.)
- Creating a SQL resource using a Java KeyStore:
  ohsh> create sqlplus resource sql_ssl_cs user=<username> passwordalias=<password alias> provider="<provider_path>" connectid="<tns entry name for the SSL connection>"
  For example:
  ohsh> create sqlplus resource sql_ssl_cs user=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectid="inst1"
- Creating a SQL resource using a Java KeyStore with a JDBC SSL connection:
  ohsh> create sqlplus resource sql_ssl_cs user=<username> passwordalias=<password alias> provider="<provider_path>" connectiondir=<SSL wallet directory location> connectid="<tns entry name for the SSL connection>"
  For example:
  ohsh> create sqlplus resource sql_ssl_cs user=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectiondir=/home/oracle/ssl_client_wallet connectid="inst1_ssl"
- Creating a JDBC resource using a Java KeyStore:
  ohsh> create jdbc resource jdbc_ssl_cs user=<username> passwordalias=<password alias> provider="<provider_path>" connectid="<tns entry name for the SSL connection>"
  For example:
  ohsh> create jdbc resource jdbc_ssl_cs user=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectid="inst1"
- Creating a JDBC resource using a Java KeyStore with a JDBC SSL connection:
  ohsh> create jdbc resource jdbc_ssl_cs user=<username> passwordalias=<password alias> provider="<provider_path>" connectiondir=<SSL wallet directory location> connectid="<tns entry name for the SSL connection>"
  For example:
  ohsh> create jdbc resource jdbc_ssl_cs user=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectiondir=/home/oracle/ssl_client_wallet connectid="inst1_ssl"
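For reference, a password alias such as oracle_passwd in the examples above can be created with the Hadoop credential command; the alias and provider path are illustrative, and you are prompted for the password to store:
$ hadoop credential create oracle_passwd -provider jceks://file/home/oracle/passwd.jceks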
Getting Help
The OHSH shell provides online help for all commands.
To get a list of all OHSH commands:
ohsh> help
To get help on a specific command, enter help, followed by the command:
ohsh> help show
The table below describes the help categories available.
| Help Command | Description |
|---|---|
| help load | Describes load commands for Oracle and Hadoop tables. |
| help set | Shows help for setting defaults for load operations. It also describes which load methods are impacted by a particular setting. |
| help show | Shows help for inspecting default settings. |
| help shell | Shows shell-like commands. |
| help resource | Shows commands for creating and dropping named resources. |
Load an Oracle Database Table
You can use the OHSH load command to load data from HDFS or Hive into an Oracle Database table using OLH or OSCH, or to load data from an Oracle Database table into Hive using Copy to Hadoop.
The following load options are available for loading from HDFS or Hive into Oracle Database:
- jdbc (the default for Oracle tables that are not partitioned): load from HDFS or Hive using the OLH JDBC Output Format.
- directpath (the default for partitioned Oracle tables): load from HDFS or Hive using the OLH OCI Direct Path Output Format.
- exttab: load from HDFS or Hive using OSCH.
- etl: load CSV content in HDFS using OLH and OSCH together. OLH extracts and transforms the delimited content in HDFS files into datapump files in HDFS. OSCH then loads the datapump files from HDFS into a target table in Oracle Database.
- etl deferred: use OLH to transform the delimited content in HDFS files into datapump files in HDFS. The datapump files are left there for OSCH to load at a later time. The OHSH script for doing the load is generated in the local example directory and can be manually executed to complete the load.
The following load options are available for loading from an Oracle Database table into Hive:
- directcopy (default): uses the Copy to Hadoop feature of Oracle Big Data SQL to connect to Oracle Database and copy data from an Oracle table into a datapump file stored in HDFS.
- stage: connects to the Oracle Database and generates datapump format files containing the table data and metadata, copies the datapump files from the database to HDFS, and then uses the Copy to Hadoop SerDes to enable Hive to read the datapump files.
See Using Oracle Shell for Hadoop Loaders With Copy to Hadoop in the Oracle Big Data SQL User's Guide for more information about using OHSH with Copy to Hadoop.
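As a sketch of the Hive direction, the following command copies an Oracle table into Hive using the default directcopy method; ora_mydatabase is assumed to be a JDBC resource created as shown earlier:
ohsh> load hive table hive0:<Hive table name> from oracle table ora_mydatabase:<Oracle table name> using directcopy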
- Use the load command to load files from HDFS into a target table in the Oracle database.
  The following command loads data from a delimited text file in HDFS <HDFS path> into the target table in Oracle Database using the direct path option.
  ohsh> load oracle table ora_mydatabase:<target table in the Oracle database> from path hadoop0:/user/<HDFS path> using directpath
  Note: The default direct path method is the fastest way to load a table. However, it requires a partitioned target table. Direct path is always recommended for partitioned tables. Use the JDBC option to load into a non-partitioned target table.
  If the command does not explicitly state the load method, then OHSH automatically uses the appropriate method. If the target Oracle table is partitioned, then by default OHSH uses direct path (that is, Oracle OCI). If the Oracle table is not partitioned, it uses JDBC.
- After loading, check the number of rows. You can do this conveniently from the OHSH command line (see the session sketch after this list):
  ohsh> %sql0 select count(*) from <target table in Oracle Database>
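Putting the two steps together, a minimal session sketch; the target table name and HDFS path are illustrative, and ora_mydatabase and sql0 are resources created as shown earlier:
ohsh> load oracle table ora_mydatabase:MOVIE_FACT from path hadoop0:/user/oracle/moviedata using directpath
ohsh> %sql0 select count(*) from MOVIE_FACT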
Loading a Hive Table Into an Oracle Database Table
You can use OHSH to load a Hive table into a target table in an Oracle database. The command below shows how to do this using the direct path method.
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name>
Note that if the target table is partitioned, then OHSH uses direct path automatically. You do not need to enter using directpath explicitly in the command.
If the target table is non-partitioned, then specify the JDBC method instead:
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using jdbc
Note:
The load command assumes that the column names in the Hive table and in the Oracle Database table are identically matched. If they do not match, then use an OHSH loadermap.
Using OHSH Loadermaps
The simple load examples in this section assume the following:
- Where we load data from a text file in Hadoop into an Oracle Database table, the declared order of columns in the target table maps correctly to the physical ordering of the delimited text fields in the file.
- Where we load Hive tables into Oracle Database tables, the Hive and Oracle Database column names are identically matched.
However, in less straightforward cases where the column names (or the order of column names and delimited text fields) do not match, use the OHSH loadermap construct to correct these mismatches. You can also use a loadermap to specify a subset of target columns to load, or, in the case of a load from a text file, to specify the format of a field in the load.
Loadermaps are not covered in this introduction.
Performance Tuning Oracle Loader for Hadoop in OHSH
Aside from network bandwidth, two factors can have significant impact on Oracle Loader for Hadoop performance. You can tune both in OHSH.
- Degree of parallelism
  The degree of parallelism affects performance when Oracle Loader for Hadoop runs in Hadoop. For the default method (direct path), parallelism is determined by the number of reducer tasks. The higher the number of reducer tasks, the faster the performance. The default value is 4. To set the number of tasks:
  ohsh> set reducetasks 18
  For the JDBC option, parallelism is determined by the number of map tasks, and the optimal number is determined automatically. However, remember that if the target table is partitioned, direct path is faster than JDBC.
- Load balancing
  Performance is best when the load is balanced evenly across reduce tasks. The load distribution is determined by sampling. Sampling is always enabled by default for loads using the JDBC and the default copy method.
Debugging in OHSH
Several OHSH settings control the availability of debugging information:
- outputlevel
  The outputlevel is set to minimal by default. Set it to verbose in order to return a stack trace when a command fails:
  ohsh> set outputlevel verbose
- logbadrecords
  ohsh> set logbadrecords true
  This is set to true by default.
These log files are informative for debugging:
- Oracle Loader for Hadoop log files:
  /user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh
- Log files generated by the map and reduce tasks.
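For example, you can inspect the OLH log directory without leaving the shell by delegating to the hadoop0 resource (the schema, table, and job ID are placeholders):
ohsh> %hadoop0 fs -ls /user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh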
Other OHSH Properties That are Useful for Oracle Loader for Hadoop
You can set these properties on the OHSH command line or in a script. These values persist for the current session only.
- dateformat
  ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"
  The syntax for this command is dictated by the Java SimpleDateFormat class (see Learning Resources).
- rejectlimit
  The number of rows that can be rejected before the load of a delimited text file fails.
- fieldterminator
  The field terminator in loads of delimited text files.
- hadooptnsadmin
  The location of an Oracle TNS admin directory in the Hadoop cluster.
- hadoopwalletlocation
  The location of the Oracle Wallet directory in the Hadoop cluster.
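A sketch of setting several of these properties at the start of a session; the values and paths are illustrative and follow the same set command pattern as dateformat above:
ohsh> set rejectlimit 10
ohsh> set fieldterminator ","
ohsh> set hadooptnsadmin /etc/oracle/tnsadmin
ohsh> set hadoopwalletlocation /etc/oracle/wallet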
Using the exttab (External Table) Method to Load Data
A third option for loading data from Hadoop into Oracle Database is exttab.
Note:
The exttab option is available in on-premises deployments of OHSH only. It is not available in Oracle cloud services.
In the exttab method, data is loaded via external tables. OHSH creates the external table using Oracle SQL Connector for HDFS, and then uses a CREATE TABLE AS SELECT statement to load the data into the target table:
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using exttab
Learning Resources
These OHSH blog entries can help you get started.
See the Java™ Platform, Standard Edition 7 API Specification for documentation on the SimpleDateFormat class.