1 Big Data Integration with Oracle Data Integrator
This chapter includes the following sections:
- Overview of Hadoop Data Integration
- Big Data Knowledge Modules Matrix
1.1 Overview of Hadoop Data Integration
Oracle Data Integrator, combined with Hadoop, can be used to design integration flows that process huge volumes of data from non-relational data sources.
Apache Hadoop is designed to handle and process data that typically comes from non-relational data sources, in volumes beyond what relational databases can handle.
You can use Oracle Data Integrator to design the 'what' of an integration flow, and assign knowledge modules to define the 'how' of the flow through an extensible range of mechanisms. The 'how' might be Oracle, Teradata, Hive, Spark, Pig, and so on.
Employing familiar and easy-to-use tools and preconfigured knowledge modules (KMs), Oracle Data Integrator lets you do the following:
- Reverse-engineer non-relational and relational data stores, such as Hive, HBase, and Cassandra.
  For more information, see Creating ODI Models and Data Stores to represent Hive, HBase and Cassandra Tables, and HDFS Files.
- Load data into Hadoop directly from files or SQL databases.
  For more information, see Integrating Hadoop Data.
- Validate and transform data within Hadoop, with the ability to make the data available in various forms such as Hive, HBase, or HDFS.
  For more information, see Validating and Transforming Data Within Hive.
- Load the processed data from Hadoop into an Oracle database, a SQL database, or files.
  For more information, see Integrating Hadoop Data.
- Execute integration projects as Oozie workflows on Hadoop.
  For more information, see Executing Oozie Workflows with Oracle Data Integrator.
- Audit Oozie workflow execution logs from within Oracle Data Integrator.
  For more information, see Auditing Hadoop Logs.
- Generate code in different languages for Hadoop, such as HiveQL, Pig Latin, or Spark Python, as illustrated in the sketch following this list.
  For more information, see Generating Code in Different Languages.
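To make the last point concrete, the following is a minimal sketch of the kind of Spark Python code a mapping might produce when Spark is selected as the execution engine. The paths, table name, column names, and filter condition are hypothetical; they are not the output of any specific KM.

```python
# A minimal, hypothetical sketch of the kind of Spark Python code a
# mapping might generate. Paths, names, and the filter are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("odi_mapping_sketch")
         .enableHiveSupport()  # needed to write to a Hive table
         .getOrCreate())

# Source component: a delimited file in HDFS (hypothetical location)
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/source/orders"))

# Filter and projection components of the mapping
shipped = (orders
           .filter(orders["status"] == "SHIPPED")
           .select("order_id", "customer_id", "amount"))

# Target component: a Hive table (hypothetical name)
shipped.write.mode("overwrite").saveAsTable("analytics.shipped_orders")
```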
1.2 Big Data Knowledge Modules Matrix
This matrix depicts the Big Data Loading and Integration KMs that Oracle Data Integrator provides.
Depending on the source and target technologies, you can use the KMs shown in the following table in your integration projects. You can also use a combination of these KMs. For example, to read data from SQL into Spark, you can first load the data from SQL into Spark using LKM SQL to Spark, and then use LKM Spark to HDFS to continue.
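The two KMs in this example together implement a flow equivalent to the following hedged PySpark sketch: read the SQL source over JDBC into a Spark DataFrame, then persist it to HDFS. The connection details, table name, and target path are placeholders, and the code the KMs actually generate will differ.

```python
# Hypothetical equivalent of chaining LKM SQL to Spark with
# LKM Spark to HDFS. All connection details and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_to_hdfs_sketch").getOrCreate()

# LKM SQL to Spark: load a SQL table into a Spark DataFrame over JDBC
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:oracle:thin:@//dbhost:1521/PDB1")
             .option("dbtable", "SALES.CUSTOMERS")
             .option("user", "odi_stage")
             .option("password", "change_me")
             .load())

# LKM Spark to HDFS: persist the DataFrame to an HDFS directory
customers.write.mode("overwrite").parquet("hdfs:///data/staging/customers")
```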
The Big Data knowledge modules whose names start with LKM File (for example, LKM File to SQL SQOOP) support both OS File and HDFS File, as described in this matrix. Additional KMs whose names start with LKM HDFS, such as LKM HDFS to Spark and LKM HDFS File to Hive, support HDFS files only, unlike the other KMs; however, they provide additional capabilities. For example, complex data can be described in an HDFS data store and used in a mapping through the flatten component.
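As an illustration of that flatten capability, the following hedged PySpark sketch shows what flattening complex data read from HDFS amounts to: nested array elements become individual flat rows. The file path and schema are invented for the example; ODI's flatten component performs this inside a mapping rather than through hand-written code.

```python
# Illustrative sketch of flattening complex (nested) data read from HDFS.
# The path and schema are hypothetical; an input record might look like:
# {"order_id": 1, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten_sketch").getOrCreate()

orders = spark.read.json("hdfs:///data/source/orders_json")

# Flatten: emit one row per element of the nested "items" array
flat = (orders
        .select(col("order_id"), explode(col("items")).alias("item"))
        .select("order_id", "item.sku", "item.qty"))

flat.show()
```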
The following table shows the Big Data Loading and Integration KMs that Oracle Data Integrator provides to integrate data between different source and target technologies.
Table 1-1 Big Data Loading and Integration Knowledge Modules
Source | Target | Knowledge Module
---|---|---
OS File | HDFS File | NA
OS File | Hive |
OS File | HBase | NA
OS File | Pig |
OS File | Spark |
SQL | HDFS File |
SQL | Hive |
SQL | HBase |
SQL | Pig |
SQL | Spark |
HDFS | Kafka | NA
HDFS | Spark |
HDFS File | OS File | NA
HDFS File | SQL |
HDFS File | HDFS File | NA
HDFS File | Hive | LKM File to Hive LOAD DATA Direct
HDFS File | HBase | NA
HDFS File | Pig |
HDFS File | Spark |
Hive | OS File |
Hive | SQL |
Hive | HDFS File |
Hive | Hive |
Hive | HBase |
Hive | Pig |
Hive | Spark |
HBase | OS File | NA
HBase | SQL |
HBase | HDFS File | NA
HBase | Hive |
HBase | HBase | NA
HBase | Pig |
HBase | Spark | NA
Pig | OS File |
Pig | HDFS File |
Pig | Hive |
Pig | HBase |
Pig | Pig | NA
Pig | Spark | NA
Spark | OS File |
Spark | SQL |
Spark | HDFS File |
Spark | Hive |
Spark | HBase | NA
Spark | Pig | NA
Spark | Spark |
Spark | Kafka |
Spark | Cassandra |
The following table shows the Big Data Reverse-Engineering KMs provided by Oracle Data Integrator.
Table 1-2 Big Data Reverse-Engineering Knowledge Modules
Technology | Knowledge Module
---|---
HBase |
Hive |
Cassandra |