B Pig Knowledge Modules
This appendix provides information about the Pig knowledge modules.
This appendix includes the following sections:
B.1 LKM File to Pig
This KM loads data from a file into Pig.
The supported data formats are:
- Delimited
- JSON
- Pig Binary
- Text
- Avro
- Trevni
- Custom
Data can be loaded from and written to the local file system or HDFS.
The following table describes the options for LKM File to Pig.
Table B-1 LKM File to Pig
| Option | Description |
|---|---|
| Storage Function | The storage function to be used to load data. |
| Schema for Complex Fields | The Pig schema for simple/complex fields, separated by commas (,). Redefines the datatypes of the fields in Pig schema format, primarily to override the default datatype conversion for data store attributes. For example: PO_NO:int,PO_TOTAL:long or MOVIE_RATING:{(RATING:double,INFO:chararray)}. The field names defined here must match the attribute names of the data store. |
| Function Class | The fully qualified name of the class to be used as the storage function to load data. |
| Function Parameters | The parameters that the custom function expects. For example, a call to the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options). The first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema, where MusicStore is the root element of the XML, movie is the element that wraps child elements such as id and name, and the third argument is the representation of the data in Pig schema. The parameter names are arbitrary and there can be any number of parameters. |
| Options | Additional options required for the storage function. For example, in the XMLLoader call above, the last argument, options, can be specified as -namespace com.imdb -encoding utf8. |
| Jars | The jars containing the storage function class and dependent libraries, separated by colons (:). |
| Storage Convertor | The converter that provides functions to cast from bytearray to each of Pig's internal types. The supported converter is Utf8StorageConverter. |
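As a sketch of the kind of load statement these options produce (the file path, alias, and schema below are hypothetical), a delimited load with the default PigStorage function might look like:

```pig
-- Hypothetical sketch: load a comma-delimited HDFS file with PigStorage,
-- with the field schema redefined as in the Schema for Complex Fields example.
PO_DATA = LOAD '/user/odi/po_data.csv' USING PigStorage(',')
          AS (PO_NO:int, PO_TOTAL:long);
```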
B.2 LKM Pig to File
This KM unloads data from Pig to a file.
The supported data formats are:
- Delimited
- JSON
- Pig Binary
- Text
- Avro
- Trevni
- Custom
Data can be stored in the local file system or in HDFS.
The following table describes the options for LKM Pig to File.
Table B-2 LKM Pig to File
| Option | Description |
|---|---|
| Storage Function | The storage function to be used to store data. |
| Store Schema | If selected, stores the schema of the relation using a hidden JSON file. |
| Record Name | The Avro record name to be assigned to the bag of tuples being stored. |
| Namespace | The namespace to be assigned to Avro/Trevni records while storing data. |
| Delete Target File | If selected, the target file is deleted before Pig writes to it, effectively enabling the target file to be overwritten. |
| Function Class | The fully qualified name of the class to be used as the storage function to store data. |
| Function Parameters | The parameters that the custom function expects. For example, a call to the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options). The first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema, where MusicStore is the root element of the XML, movie is the element that wraps child elements such as id and name, and the third argument is the representation of the data in Pig schema. The parameter names are arbitrary and there can be any number of parameters. |
| Options | Additional options required for the storage function. For example, in the XMLLoader call above, the last argument, options, can be specified as -namespace com.imdb -encoding utf8. |
| Jars | The jars containing the storage function class and dependent libraries, separated by colons (:). |
| Storage Convertor | The converter that provides functions to cast from bytearray to each of Pig's internal types. The supported converter is Utf8StorageConverter. |
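A sketch of the kind of store statement these options produce (the path and alias are hypothetical) might look like the following; the rmf command corresponds to the Delete Target File option:

```pig
-- Hypothetical sketch: remove the target path first (Delete Target File),
-- then store the relation as comma-delimited text.
rmf /user/odi/out/po_data;
STORE PO_DATA INTO '/user/odi/out/po_data' USING PigStorage(',');
```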
B.3 LKM HBase to Pig
This KM loads data from an HBase table into Pig using the HBaseStorage function.
The following table describes the options for LKM HBase to Pig.
Table B-3 LKM HBase to Pig
| Option | Description |
|---|---|
| Storage Function | The storage function to be used to load data. HBaseStorage is used to load from an HBase table into Pig. |
| Load Row Key | If selected, loads the row key as the first value in every tuple returned from HBase. The row key is mapped to the 'key' column of the HBase data store in ODI. |
| Greater Than Min Key | Loads rows with a row key greater than the specified key value. |
| Less Than Min Key | Loads rows with a row key less than the specified key value. |
| Greater Than Or Equal Min Key | Loads rows with a row key greater than or equal to the specified key value. |
| Less Than Or Equal Min Key | Loads rows with a row key less than or equal to the specified key value. |
| Limit Rows | The maximum number of rows to retrieve per region. |
| Cached Rows | The number of rows to cache. |
| Storage Convertor | The class name of the Caster to use to convert values. The supported values are HBaseBinaryConverter and Utf8StorageConverter. If unspecified, the default is Utf8StorageConverter. |
| Column Delimiter | The delimiter used to separate columns in the columns list of the HBaseStorage function. If unspecified, the default is whitespace. |
| Timestamp | Returns cell values that have a creation timestamp equal to this value. |
| Min Timestamp | Returns cell values that have a creation timestamp greater than or equal to this value. |
| Max Timestamp | Returns cell values that have a creation timestamp less than this value. |
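As a sketch of the HBaseStorage call these options translate to (the table name, column family, and values below are hypothetical), -loadKey corresponds to Load Row Key, -gt and -limit to the key-range and row-limit options, -caching to Cached Rows, and -caster to Storage Convertor:

```pig
-- Hypothetical sketch: load selected columns from an HBase table.
MOVIES = LOAD 'hbase://movie' USING
  org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'info:name info:rating',
    '-loadKey true -gt 1000 -limit 500 -caching 100 -caster Utf8StorageConverter')
  AS (key:chararray, name:chararray, rating:double);
```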
B.4 LKM Pig to HBase
This KM stores data into an HBase table using the HBaseStorage function.
The following table describes the options for LKM Pig to HBase.
Table B-4 LKM Pig to HBase
| Option | Description |
|---|---|
| Storage Function | The storage function to be used to store data. This is a read-only option, which cannot be changed. The HBaseStorage function is used to load data into the HBase table. |
| Storage Convertor | The class name of the Caster to use to convert values. The supported values are HBaseBinaryConverter and Utf8StorageConverter. If unspecified, the default is Utf8StorageConverter. |
| Column Delimiter | The delimiter used to separate columns in the columns list of the HBaseStorage function. If unspecified, the default is whitespace. |
| Disable Write Ahead Log | If selected, the write-ahead log is set to false for faster loading into HBase. Use this with extreme caution, since it could result in data loss. The default value is false. |
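A sketch of the corresponding store statement (table name and columns are hypothetical); -noWAL corresponds to the Disable Write Ahead Log option and should be used with the same caution:

```pig
-- Hypothetical sketch: store a relation into an HBase table,
-- disabling the write-ahead log for faster (but riskier) loading.
STORE MOVIES INTO 'hbase://movie' USING
  org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'info:name info:rating', '-caster Utf8StorageConverter -noWAL true');
```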
B.5 LKM Hive to Pig
This KM loads data from a Hive table into Pig using HCatalog.
The following table describes the options for LKM Hive to Pig.
Table B-5 LKM Hive to Pig
| Option | Description |
|---|---|
| Storage Function | The storage function to be used to load data. This is a read-only option, which cannot be changed. HCatLoader is used to load data from a Hive table. |
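A sketch of the resulting load (the database and table names are hypothetical, and the HCatLoader package name may differ by Hive/HCatalog version):

```pig
-- Hypothetical sketch: load a Hive table through HCatalog.
SALES = LOAD 'default.sales' USING org.apache.hive.hcatalog.pig.HCatLoader();
```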
B.6 LKM Pig to Hive
This KM stores data into a Hive table using HCatalog.
The following table describes the options for LKM Pig to Hive.
Table B-6 LKM Pig to Hive
| Option | Description |
|---|---|
| Storage Function | The storage function to be used to store data. This is a read-only option, which cannot be changed. HCatStorer is used to store data into a Hive table. |
| Partition | The new partition to be created, represented as key/value pairs. This is a mandatory argument when you are writing to a partitioned table and the partition column is not in the output columns. The values for partition keys should NOT be quoted. |
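A sketch of a store into a partitioned table (names and the partition value are hypothetical); note the key=value string passed to HCatStorer corresponds to the Partition option, with the value unquoted:

```pig
-- Hypothetical sketch: store into a partitioned Hive table via HCatalog.
STORE SALES INTO 'default.sales_by_day'
  USING org.apache.hive.hcatalog.pig.HCatStorer('sale_date=20140701');
```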
B.7 LKM SQL to Pig SQOOP
This KM integrates data from a JDBC data source into Pig.
It executes the following steps:
- Create a SQOOP configuration file, which contains the upstream query.
- Execute SQOOP to extract the source data and import it into a staging file in CSV format.
- Run the LKM File to Pig KM to load the staging file into Pig.
- Drop the staging file.
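The third step can be sketched as a Pig load of the staging file (the path is hypothetical; the delimiter follows the STAGING_FILE_DELIMITER option, which defaults to tab):

```pig
-- Hypothetical sketch: load the SQOOP staging file produced in step 2.
STAGE = LOAD '/tmp/odi_sqoop_stage' USING PigStorage('\t');
```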
The following table describes the options for LKM SQL to Pig SQOOP.
Table B-7 LKM SQL to Pig SQOOP
| Option | Description |
|---|---|
| STAGING_FILE_DELIMITER | The delimiter Sqoop uses to create the temporary staging file. If not specified, \t is used. |
| Storage Function | The storage function to be used to load data. |
| Schema for Complex Fields | The Pig schema for simple/complex fields, separated by commas (,). Redefines the datatypes of the fields in Pig schema format, primarily to override the default datatype conversion for data store attributes. For example: PO_NO:int,PO_TOTAL:long or MOVIE_RATING:{(RATING:double,INFO:chararray)}. The field names defined here must match the attribute names of the data store. |
| Function Class | The fully qualified name of the class to be used as the storage function to load data. |
| Function Parameters | The parameters that the custom function expects. For example, a call to the XMLLoader function may look like XMLLoader('MusicStore', 'movie', 'id:double, name:chararray, director:chararray', options). The first three arguments are parameters, which can be specified as -rootElement MusicStore -tableName movie -schema, where MusicStore is the root element of the XML, movie is the element that wraps child elements such as id and name, and the third argument is the representation of the data in Pig schema. The parameter names are arbitrary and there can be any number of parameters. |
| Options | Additional options required for the storage function. For example, in the XMLLoader call above, the last argument, options, can be specified as -namespace com.imdb -encoding utf8. |
| Jars | The jars containing the storage function class and dependent libraries, separated by colons (:). |
| Storage Convertor | The converter that provides functions to cast from bytearray to each of Pig's internal types. The supported converter is Utf8StorageConverter. |
B.8 XKM Pig Aggregate
Summarize rows, for example using SUM and GROUP BY.
The following table describes the options for XKM Pig Aggregate.
Table B-8 XKM Pig Aggregate
| Option | Description |
|---|---|
| USING_ALGORITHM | The aggregation type: collected or merge. |
| PARTITION_BY | Specify the Hadoop partitioner. |
| PARTITIONER_JAR | The jar containing the Hadoop partitioner specified in PARTITION_BY. |
| PARALLEL_NUMBER | Increase the parallelism of this job. |
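A sketch of the kind of aggregation this XKM generates (aliases and fields are hypothetical); the PARALLEL clause corresponds to PARALLEL_NUMBER, and a GROUP ... USING 'collected' or 'merge' clause would correspond to USING_ALGORITHM:

```pig
-- Hypothetical sketch: GROUP BY with SUM, with explicit parallelism.
GRP = GROUP PO_DATA BY PO_NO PARALLEL 4;
TOTALS = FOREACH GRP GENERATE group AS PO_NO, SUM(PO_DATA.PO_TOTAL) AS TOTAL;
```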
Note:
When a mapping has Pig staging, that is, when processing is done with Pig, and there is an aggregator component in the Pig staging area, the clause must be set differently than in regular mappings for SQL-based technologies.

B.12 XKM Pig Flatten
Un-nest the complex data according to the given options.
The following table describes the options for XKM Pig Flatten.
Table B-9 XKM Pig Flatten
Option | Description |
---|---|
Default Expression |
Default expression for null nested table objects, for example, rating_table(obj_rating('-1', 'Unknown')). This is used to return a row with default values for each null nested table object. |
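A sketch of the un-nesting this XKM performs (aliases and fields are hypothetical), using Pig's FLATTEN operator on a nested bag:

```pig
-- Hypothetical sketch: un-nest a bag of rating tuples into top-level fields.
FLAT = FOREACH MOVIES GENERATE name, FLATTEN(MOVIE_RATING);
```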
B.13 XKM Pig Join
Joins two or more input sources based on the join condition.
The following table describes the options for XKM Pig Join.
Table B-10 XKM Pig Join
| Option | Description |
|---|---|
| USING_ALGORITHM | The join type: replicated, skewed, or merge. |
| PARTITION_BY | Specify the Hadoop partitioner. |
| PARTITIONER_JAR | The jar containing the Hadoop partitioner specified in PARTITION_BY. |
| PARALLEL_NUMBER | Increase the parallelism of this job. |
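A sketch of a join generated with these options (aliases and keys are hypothetical); the USING clause corresponds to USING_ALGORITHM and PARALLEL to PARALLEL_NUMBER:

```pig
-- Hypothetical sketch: a replicated (fragment-replicate) join, where the
-- smaller relation is loaded into memory on each mapper.
J = JOIN BIG BY id, SMALL BY id USING 'replicated' PARALLEL 4;
```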
B.14 XKM Pig Lookup
Looks up data for a driving data source.
The following table describes the options for XKM Pig Lookup.
Table B-11 XKM Pig Lookup
| Option | Description |
|---|---|
| Jars | The jars containing the User Defined Function classes and dependent libraries, separated by colons (:). |
B.20 XKM Pig Table Function
Pig table function access.
The following table describes the options for XKM Pig Table Function.
Table B-12 XKM Pig Table Function
| Option | Description |
|---|---|
| PIG_SCRIPT_CONTENT | User-specified Pig script content. |