1 Big Data Integration with Oracle Data Integrator
This chapter includes the following sections:
- Overview of Hadoop Data Integration
- Big Data Knowledge Modules Matrix
1.1 Overview of Hadoop Data Integration
Oracle Data Integrator, combined with Hadoop, can be used to design integration flows that process huge volumes of data from non-relational data sources.
Apache Hadoop is designed to handle and process data that typically comes from non-relational data sources, in volumes beyond what relational databases can handle.
You can use Oracle Data Integrator to design the 'what' of an integration flow, and assign knowledge modules to define the 'how' of the flow through an extensible range of mechanisms. The 'how' might be Oracle, Teradata, Hive, Spark, Pig, and so on.
Employing familiar and easy-to-use tools and preconfigured knowledge modules (KMs), Oracle Data Integrator lets you do the following:
- Reverse-engineer non-relational and relational data stores, such as Hive, HBase, and Cassandra.
  For more information, see Creating ODI Models and Data Stores to represent Hive, HBase and Cassandra Tables, and HDFS Files.
- Load data into Hadoop directly from files or SQL databases.
  For more information, see Integrating Hadoop Data.
- Validate and transform data within Hadoop, with the ability to make the data available in various forms such as Hive, HBase, or HDFS.
  For more information, see Validating and Transforming Data Within Hive.
- Load the processed data from Hadoop into an Oracle database, a SQL database, or files.
  For more information, see Integrating Hadoop Data.
- Execute integration projects as Oozie workflows on Hadoop.
  For more information, see Executing Oozie Workflows with Oracle Data Integrator.
- Audit Oozie workflow execution logs from within Oracle Data Integrator.
  For more information, see Auditing Hadoop Logs.
- Generate code in different languages for Hadoop, such as HiveQL, Pig Latin, or Spark Python, as illustrated in the sketch following this list.
  For more information, see Generating Code in Different Languages.
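To make the last point concrete, the following is a minimal sketch of the kind of Spark Python code a mapping might produce when Spark is selected as the execution engine. The paths, table name, column names, and filter condition are hypothetical; they are not the output of any specific KM.

```python
# A minimal, hypothetical sketch of the kind of Spark Python code a
# mapping might generate. Paths, names, and the filter are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("odi_mapping_sketch")
         .enableHiveSupport()  # needed to write to a Hive table
         .getOrCreate())

# Source component: a delimited file in HDFS (hypothetical location)
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/source/orders"))

# Filter and projection components of the mapping
shipped = (orders
           .filter(orders["status"] == "SHIPPED")
           .select("order_id", "customer_id", "amount"))

# Target component: a Hive table (hypothetical name)
shipped.write.mode("overwrite").saveAsTable("analytics.shipped_orders")
```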
1.2 Big Data Knowledge Modules Matrix
This matrix depicts the Big Data Loading and Integration KMs that Oracle Data Integrator provides.
Depending on the source and target technologies, you can use the KMs shown in the following table in your integration projects. You can also use a combination of these KMs. For example, to read data from SQL into Spark, you can first load the data from SQL into Spark using LKM SQL to Spark, and then use LKM Spark to HDFS to continue.
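The two KMs in this example together implement a flow equivalent to the following hedged PySpark sketch: read the SQL source over JDBC into a Spark DataFrame, then persist it to HDFS. The connection details, table name, and target path are placeholders, and the code the KMs actually generate will differ.

```python
# Hypothetical equivalent of chaining LKM SQL to Spark with
# LKM Spark to HDFS. All connection details and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_to_hdfs_sketch").getOrCreate()

# LKM SQL to Spark: load a SQL table into a Spark DataFrame over JDBC
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:oracle:thin:@//dbhost:1521/PDB1")
             .option("dbtable", "SALES.CUSTOMERS")
             .option("user", "odi_stage")
             .option("password", "change_me")
             .load())

# LKM Spark to HDFS: persist the DataFrame to an HDFS directory
customers.write.mode("overwrite").parquet("hdfs:///data/staging/customers")
```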
The Big Data knowledge modules whose names start with LKM File (for example, LKM File to SQL SQOOP) support both OS File and HDFS File, as described in this matrix. Additional KMs whose names start with LKM HDFS, such as LKM HDFS to Spark and LKM HDFS File to Hive, support HDFS files only, unlike the other KMs; however, they provide additional capabilities. For example, complex data can be described in an HDFS data store and used in a mapping through the flatten component.
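As an illustration of that flatten capability, the following hedged PySpark sketch shows what flattening complex data read from HDFS amounts to: nested array elements become individual flat rows. The file path and schema are invented for the example; ODI's flatten component performs this inside a mapping rather than through hand-written code.

```python
# Illustrative sketch of flattening complex (nested) data read from HDFS.
# The path and schema are hypothetical; an input record might look like:
# {"order_id": 1, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten_sketch").getOrCreate()

orders = spark.read.json("hdfs:///data/source/orders_json")

# Flatten: emit one row per element of the nested "items" array
flat = (orders
        .select(col("order_id"), explode(col("items")).alias("item"))
        .select("order_id", "item.sku", "item.qty"))

flat.show()
```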
The following table shows the Big Data Loading and Integration KMs that Oracle Data Integrator provides to integrate data between different source and target technologies.
Table 1-1 Big Data Loading and Integration Knowledge Modules
Source | Target | Knowledge Module
---|---|---
OS File | HDFS File | NA
OS File | Hive |
OS File | HBase | NA
OS File | Pig |
OS File | Spark |
SQL | HDFS File |
SQL | Hive |
SQL | HBase |
SQL | Pig |
SQL | Spark |
HDFS | Kafka | NA
HDFS | Spark |
HDFS File | OS File | NA
HDFS File | SQL |
HDFS File | HDFS File | NA
HDFS File | Hive | LKM File to Hive LOAD DATA Direct
HDFS File | HBase | NA
HDFS File | Pig |
HDFS File | Spark |
Hive | OS File |
Hive | SQL |
Hive | HDFS File |
Hive | Hive |
Hive | HBase |
Hive | Pig |
Hive | Spark |
HBase | OS File | NA
HBase | SQL |
HBase | HDFS File | NA
HBase | Hive |
HBase | HBase | NA
HBase | Pig |
HBase | Spark | NA
Pig | OS File |
Pig | HDFS File |
Pig | Hive |
Pig | HBase |
Pig | Pig | NA
Pig | Spark | NA
Spark | OS File |
Spark | SQL |
Spark | HDFS File |
Spark | Hive |
Spark | HBase | NA
Spark | Pig | NA
Spark | Spark |
Spark | Kafka |
Spark | Cassandra |
The following table shows the Big Data Reverse-Engineering KMs provided by Oracle Data Integrator.
Table 1-2 Big Data Reverse-Engineering Knowledge Modules
Technology | Knowledge Module
---|---
HBase |
Hive |
Cassandra |