8.2.17 Flat Files
Oracle GoldenGate for Big Data supports writing data files to a local file system with File Writer Handler.
- File Writer Handler
You can use the File Writer Handler and the event handlers to transform data.
- Optimized Row Columnar (ORC)
You can use the Optimized Row Columnar (ORC) Event Handler to generate data files in ORC format.
- Parquet
Learn how to use the Parquet Event Handler to load files generated by the File Writer Handler into HDFS.
Parent topic: Target
8.2.17.1 File Writer Handler
You can use the File Writer Handler and the event handlers to transform data.
The File Writer Handler supports generating data files in delimited text, XML, JSON, Avro, and Avro Object Container File formats. It is intended to fulfill an extraction, load, and transform use case. Data files are staged on your local file system. Then, when writing to a data file is complete, you can use a third-party application to read the file to perform additional processing.
The File Writer Handler also supports the event handler framework. The event handler framework allows data files generated by the File Writer Handler to be transformed into other formats, such as Optimized Row Columnar (ORC) or Parquet. Data files can be loaded into third-party applications, such as HDFS or Amazon S3. The event handler framework is extensible, allowing you to develop additional event handlers that perform different transformations or load to different targets. Additionally, you can develop a custom event handler for your Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) environment.
GG for DAA provides two handlers to write to HDFS. Oracle recommends that you use the HDFS Handler or the File Writer Handler in the following situations:
- The HDFS Handler is designed to stream data directly to HDFS. Use it when:
  - No post-write processing is occurring in HDFS. The HDFS Handler does not change the contents of the file; it simply uploads the existing file to HDFS.
  - Analytical tools are accessing data written to HDFS in real time, including data in files that are open and actively being written to.
- The File Writer Handler is designed to stage data to the local file system and then to load completed data files to HDFS when writing to a file is complete. Use it when:
  - Analytic tools are not accessing data written to HDFS in real time.
  - Post-write processing is occurring in HDFS to transform, reformat, merge, and move the data to a final location.
  - You want to write data files to HDFS in ORC or Parquet format.
- Detailing the Functionality
- Configuring the File Writer Handler
- Stopping the File Writer Handler
- Review a Sample Configuration
- File Writer Handler Partitioning
Partitioning functionality was added to the File Writer Handler in Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) 21.1. The partitioning functionality uses the template mapper functionality to resolve partitioning strings. This gives you control over how source trail data is partitioned.
Parent topic: Flat Files
8.2.17.1.1 Detailing the Functionality
- Using File Roll Events
- Automatic Directory Creation
- About the Active Write Suffix
- Maintenance of State
Parent topic: File Writer Handler
8.2.17.1.1.1 Using File Roll Events
A file roll event occurs when writing to a specific data file is completed. No more data is written to that specific data file.
Finalize Action Operation
You can configure the finalize action operation to clean up a specific data file after a successful file roll action using the finalizeAction parameter with the following options:
- none: Leave the data file in place (removing any active write suffix, see About the Active Write Suffix).
- delete: Delete the data file (for example, if the data file has been converted to another format or loaded to a third-party application).
- move: Maintain the file name (removing any active write suffix), but move the file to the directory resolved using the movePathMappingTemplate property.
- rename: Maintain the current directory, but rename the data file using the fileRenameMappingTemplate property.
- move-rename: Rename the file using the file name generated by the fileRenameMappingTemplate property and move the file to the directory resolved using the movePathMappingTemplate property.
Typically, event handlers offer a subset of these same actions.
A sample configuration of a finalize action operation:
gg.handlerlist=filewriter
#The File Writer Handler
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.pathMappingTemplate=./dirout/evActParamS3R
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.txt
gg.handler.filewriter.fileRollInterval=7m
gg.handler.filewriter.finalizeAction=delete
gg.handler.filewriter.inactivityRollInterval=7m
File Rolling Actions
Any of the following actions trigger a file roll event:
- A metadata change event.
- The maximum configured file size is exceeded.
- The file roll interval is exceeded (the current time minus the time of first file write is greater than the file roll interval).
- The inactivity roll interval is exceeded (the current time minus the time of last file write is greater than the inactivity roll interval).
- The File Writer Handler is configured to roll on shutdown and the Replicat process is stopped.
Operation Sequence
The file roll event triggers a sequence of operations to occur. It is important that you understand the order of the operations that occur when an individual data file is rolled:
1. The active data file is switched to inactive, the data file is flushed, and the state file is flushed.
2. The configured event handlers are called in the sequence that you specified.
3. The finalize action is executed on all the event handlers in the reverse of the order in which you configured them. Any finalize action that you configured is executed.
4. The finalize action is executed on the data file and the state file. If all actions are successful, the state file is removed. Any finalize action that you configured is executed.
For example, if you configured the File Writer Handler with the Parquet Event Handler and then the S3 Event Handler, the order for a roll event is:
1. The active data file is switched to inactive, the data file is flushed, and the state file is flushed.
2. The Parquet Event Handler is called to generate a Parquet file from the source data file.
3. The S3 Event Handler is called to load the generated Parquet file to S3.
4. The finalize action is executed on the S3 Event Handler. Any finalize action that you configured is executed.
5. The finalize action is executed on the Parquet Event Handler. Any finalize action that you configured is executed.
6. The finalize action is executed for the data file in the File Writer Handler.
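The event handler chaining in this example is expressed through the eventHandler properties. A minimal sketch of the wiring, using the illustrative handler names from the sample configuration later in this topic:
gg.handler.filewriter.eventHandler=parquet
gg.eventhandler.parquet.eventHandler=s3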
Parent topic: Detailing the Functionality
8.2.17.1.1.2 Automatic Directory Creation
The File Writer Handler checks at runtime whether the directories it resolves for writing data files exist, and creates any that do not, so you do not need to create the directory structure in advance.
Parent topic: Detailing the Functionality
8.2.17.1.1.3 About the Active Write Suffix
A common use case is using a third-party application to monitor the write directory in order to read data files. A third-party application can only read a data file when writing to that file has completed, so these applications need a way to determine whether writing to a data file is active or complete. The File Writer Handler allows you to configure an active write suffix using this property:
gg.handler.name.fileWriteActiveSuffix=.tmp
The value of this property is appended to the generated file name. When writing to the file is complete, the data file is renamed and the active write suffix is removed from the file name. You can set your third party application to monitor your data file names to identify when the active write suffix is removed.
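For example, a minimal sketch, assuming a handler named filewriter and an illustrative resolved file name:
gg.handler.filewriter.fileWriteActiveSuffix=.tmp
While being written, the file appears as DBO.ORDERS_2022-07-11_19-04-27.900.txt.tmp; after the file roll event, it is renamed to DBO.ORDERS_2022-07-11_19-04-27.900.txt.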
Parent topic: Detailing the Functionality
8.2.17.1.1.4 Maintenance of State
Previously, all Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) handlers were stateless. These stateless handlers only maintain state in the context of the Replicat process in which they are running. If the Replicat process is stopped and restarted, then all of that state is lost, and with a Replicat restart the handler begins writing with no contextual knowledge of the previous run.
The File Writer Handler can maintain state between invocations of the Replicat process. By default, with a restart:
- the saved state files are read,
- the state is restored,
- and appending to the active data files continues where the previous run stopped.
You can change this default action to require all files be rolled on shutdown by setting this property:
gg.handler.name.rollOnShutdown=true
Parent topic: Detailing the Functionality
8.2.17.1.2 Configuring the File Writer Handler
Lists the configurable values for the File Writer Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the File Writer Handler, you must first configure the
handler type by specifying gg.handler.name.type=filewriter
and the other File Writer properties as follows:
Table 8-19 File Writer Handler Configuration Properties
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.type | Required | filewriter | None | Selects the File Writer Handler for use. |
gg.handler.name.maxFileSize | Optional | The default unit of measure is bytes. You can stipulate k, m, or g to signify kilobytes, megabytes, or gigabytes. | 1g | Sets the maximum file size of files generated by the File Writer Handler. When the file size is exceeded, a roll event is triggered. |
gg.handler.name.fileRollInterval | Optional | The default unit of measure is milliseconds. You can stipulate ms, s, m, or h to signify milliseconds, seconds, minutes, or hours. | File rolling on time is off. | The timer starts when a file is created. If the file is still open when the interval elapses, then a file roll event is triggered. |
gg.handler.name.inactivityRollInterval | Optional | The default unit of measure is milliseconds. You can stipulate ms, s, m, or h to signify milliseconds, seconds, minutes, or hours. | File inactivity rolling is turned off. | The timer starts from the latest write to a generated file. New writes to a generated file restart the counter. If the file is still open when the timer elapses, a roll event is triggered. |
gg.handler.name.fileNameMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate File Writer Handler data file names at runtime. | None | Use keywords interlaced with constants to dynamically generate unique file names at runtime. Typically, file names follow the format ${fullyQualifiedTableName}_${currentTimestamp}.txt. See Template Keywords. |
gg.handler.name.pathMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the directory to which a file is written. | None | Use keywords interlaced with constants to dynamically generate unique path names at runtime. Typically, path names follow the format /ogg/data/${fullyQualifiedTableName}. See Template Keywords. |
gg.handler.name.fileWriteActiveSuffix | Optional | A string. | None | An optional suffix that is appended to files generated by the File Writer Handler to indicate that writing to the file is active. At the finalize action, the suffix is removed. |
gg.handler.name.stateFileDirectory | Required | A directory on the local machine to store the state files of the File Writer Handler. | None | Sets the directory on the local machine to store the state files of the File Writer Handler. The group name is appended to the directory to ensure that the functionality works when operating in a coordinated apply environment. |
gg.handler.name.rollOnShutdown | Optional | true or false | false | Set to true to roll all open data files when the Replicat process is stopped normally. By default, state is saved so that writing can continue where it stopped on restart; see Maintenance of State. |
gg.handler.name.finalizeAction | Optional | none, delete, move, rename, or move-rename | none | Indicates what the File Writer Handler should do at the finalize action; see Using File Roll Events. |
gg.handler.name.partitionByTable | Optional | true or false | true | Set to true to partition data files by table so that each data file contains rows for a single source table. The partitioning functionality requires this property to be true; see File Writer Handler Partitioning Precondition. |
gg.handler.name.eventHandler | Optional | A unique string identifier cross referencing an event handler. | No event handler configured. | A unique string identifier cross referencing an event handler. The event handler is invoked on the file roll event. Event handlers can perform file roll event actions such as loading files to S3, converting to Parquet or ORC format, or loading files to HDFS. |
gg.handler.name.fileRenameMappingTemplate | Required if finalizeAction is set to rename or move-rename. | A string with resolvable keywords and constants used to dynamically generate File Writer Handler data file names for file renaming in the finalize action. | None | Use keywords interlaced with constants to dynamically generate unique file names at runtime. Typically, file names follow the format ${fullyQualifiedTableName}_${currentTimestamp}.txt. See Template Keywords. |
gg.handler.name.movePathMappingTemplate | Required if finalizeAction is set to move or move-rename. | A string with resolvable keywords and constants used to dynamically generate the directory to which a file is written. | None | Use keywords interlaced with constants to dynamically generate unique path names at runtime. Typically, path names follow the format /ogg/data/${fullyQualifiedTableName}. See Template Keywords. |
gg.handler.name.format | Required | delimitedtext, json, json_row, xml, avro_row, avro_op, avro_row_ocf, or avro_op_ocf | delimitedtext | Selects the formatter that determines how the output data is formatted. If you want to use the Parquet or ORC Event Handlers, then the selected format must be avro_row_ocf or avro_op_ocf. |
gg.handler.name.bom | Optional | An even number of hex characters. | None | Enter an even number of hex characters where every two characters correspond to a single byte in the byte order mark (BOM). For example, the string efbbbf represents the 3-byte UTF-8 BOM. |
gg.handler.name.createControlFile | Optional | true or false | false | Set to true to create a control file. A control file contains the names of completed data files, separated by the configured control file delimiter. |
gg.handler.name.controlFileDelimiter | Optional | Any string | new line (\n) | Allows you to control the delimiter separating file names in the control file. |
gg.handler.name.controlFileDirectory | Optional | A path to a directory to hold the control file. | A period (.). | Set to specify where you want to write the control file. |
gg.handler.name.createOwnerFile | Optional | true or false | false | Set to true to create an owner file. The owner file is created when the Replicat process starts and removed when it terminates normally, which allows monitoring applications to determine whether the Replicat process is running. |
gg.handler.name.atTime | Optional | One or more times at which to trigger a roll action of all open files. | None | Configure one or more trigger times in the following format: HH:MM,HH:MM,HH:MM. Entries are based on a 24-hour clock. For example, an entry to configure roll actions at three discrete times of day is: gg.handler.fw.atTime=03:30,21:00,23:51 |
gg.handler.name.format.compressionCodec | Optional | null, bzip2, deflate, snappy, or xz | null (no compression). | Enables the corresponding compression algorithm for generated Avro OCF files. The corresponding compression library must be added to the gg.classpath. |
gg.handler.name.bufferSize | Optional | Positive integer >= 512 | 512 | Sets the size of the buffer that the File Writer Handler uses when writing data files. |
gg.handler.name.rollOnTruncate | Optional | true or false | false | Controls whether the occurrence of a truncate operation causes a rollover of the corresponding data file by the handler. The default is false, which means the corresponding data file is not rolled when a truncate operation is presented. Set to true to roll the data file on a truncate operation. To propagate truncate operations, ensure that the Replicat property GETTRUNCATES is set. |
gg.handler.name.logEventHandlerStatus | Optional | true or false | false | When set to true, logs the status of completed event handlers at the info logging level. Can be used for debugging and troubleshooting of the event handlers. |
gg.handler.name.eventHandlerTimeoutMinutes | Optional | Long integer | 120 | The event handler thread timeout in minutes. The event handler threads spawned by the File Writer Handler are provided a maximum execution time to complete their work. If the timeout value is exceeded, then Replicat assumes that the event handler thread is hung and abends. For stage and merge use cases, event handler threads may take longer to complete their work. The default value is 120 (2 hours). |
gg.handler.name.processBacklogOnShutdown | Optional | true or false | false | Set to true to force the Replicat to process all of the outstanding staged files through the event handler framework. Setting true is recommended for initial load replication to data warehouse targets and for simple data format conversion and/or load to cloud storage scenarios. Setting false is recommended for CDC replication to data warehouse targets, because merges can take long periods of time. |
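For example, a minimal sketch of rolling data files when truncate operations are replicated, assuming a handler named filewriter (an illustrative name). The GETTRUNCATES parameter belongs in the Replicat parameter file, and the handler property in the Java Adapter properties file:
GETTRUNCATES
gg.handler.filewriter.rollOnTruncate=true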
Parent topic: File Writer Handler
8.2.17.1.3 Stopping the File Writer Handler
Note the following when stopping the Replicat process:
- Force stop should never be executed on the Replicat process.
- The Unix kill command should never be used to kill the Replicat process.
An inconsistent state may mean that the Replicat process abends on startup and requires manual removal of state files. For example, the following error occurs when a data file has been removed but the associated .state file has not yet been removed:
ERROR 2022-07-11 19:05:23.000367 [main]- Failed to restore state for UUID [d35f117f-ffab-4e60-aa93-f7ef860bf280] table name [QASOURCE.TCUSTORD] data file name [QASOURCE.TCUSTORD_2022-07-11_19-04-27.900.txt]
Three scenarios can generally cause this problem:
- The Replicat process was force stopped, was killed using the kill command, or crashed while it was in the processing window between when the data file was removed and when the associated .state file was removed.
- The user has manually removed the data file or files but left the associated .state file in place.
- There are two instances of the same Replicat process running. A lock file is created to prevent this, but there is a window on Replicat startup which allows multiple instances of a Replicat process to be started.
If this problem occurs, then you should manually determine whether or not the data file associated with the .state file has been successfully processed. If the data has been successfully processed, then you can manually remove the .state file and restart the Replicat process.
If the data file associated with the problematic .state file has been determined not to have been processed, then do the following:
1. Delete all the .state files.
2. Alter the seqno and rba of the Replicat process to back it up to a period for which it was known that processing successfully occurred (see the sketch after this list).
3. Restart the Replicat process to reprocess the data.
Parent topic: File Writer Handler
8.2.17.1.4 Review a Sample Configuration
This File Writer Handler configuration example uses the Parquet Event Handler to convert data files to Parquet, and then the S3 Event Handler to load the Parquet files into S3:
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.pathMappingTemplate=./dirout
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.txt
gg.handler.filewriter.fileRollInterval=7m
gg.handler.filewriter.finalizeAction=delete
gg.handler.filewriter.inactivityRollInterval=7m
gg.handler.filewriter.format=avro_row_ocf
gg.includeggtokens=true
gg.handler.filewriter.partitionByTable=true
gg.handler.filewriter.rollOnShutdown=true
gg.handler.filewriter.eventHandler=parquet
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
gg.eventhandler.parquet.writeToHDFS=false
gg.eventhandler.parquet.finalizeAction=delete
gg.eventhandler.parquet.eventHandler=s3
gg.eventhandler.parquet.fileNameMappingTemplate=${tableName}_${currentTimestamp}.parquet
gg.eventhandler.s3.type=s3
gg.eventhandler.s3.region=us-west-2
gg.eventhandler.s3.proxyServer=www-proxy.us.oracle.com
gg.eventhandler.s3.proxyPort=80
gg.eventhandler.s3.bucketMappingTemplate=tomsfunbucket
gg.eventhandler.s3.pathMappingTemplate=thepath
gg.eventhandler.s3.finalizeAction=none
Parent topic: File Writer Handler
8.2.17.1.5 File Writer Handler Partitioning
Partitioning functionality was added to the File Writer Handler in Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) 21.1. The partitioning functionality uses the template mapper functionality to resolve partitioning strings. This gives you control over how source trail data is partitioned.
All of the keywords that are supported by the templating functionality are also supported in File Writer Handler partitioning.
- File Writer Handler Partitioning Precondition
In order to use the partitioning functionality, data must first be partitioned by table. The following configuration cannot be set: gg.handler.filewriter.partitionByTable=false.
- Path Configuration
Assume that the path mapping template is configured as follows: gg.handler.filewriter.pathMappingTemplate=/ogg/${fullyQualifiedTableName}. At runtime, the path resolves as follows for the DBO.ORDERS source table: /ogg/DBO.ORDERS.
- Partitioning Configuration
Any of the keywords that are legal for templating are now legal for partitioning: gg.handler.filewriter.partitioner.fully qualified table name=templating keywords and/or constants.
- Partitioning Effect on Event Handler
The resolved partitioning path is carried forward to the corresponding Event Handlers as well.
Parent topic: File Writer Handler
8.2.17.1.5.1 File Writer Handler Partitioning Precondition
In order to use the partitioning functionality, data must first be partitioned by table. The following configuration cannot be set: gg.handler.filewriter.partitionByTable=false.
Parent topic: File Writer Handler Partitioning
8.2.17.1.5.2 Path Configuration
Assume that the path mapping template is configured as follows:
gg.handler.filewriter.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
At runtime, the path resolves as follows for the DBO.ORDERS source table:
/ogg/DBO.ORDERS
Parent topic: File Writer Handler Partitioning
8.2.17.1.5.3 Partitioning Configuration
Any of the keywords that are legal for templating are now legal for partitioning:
gg.handler.filewriter.partitioner.fully qualified table name=templating keywords and/or constants
Example 1
Partitioning for the DBO.ORDERS table is set to the following:
gg.handler.filewriter.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}
This example can result in the following breakdown of files on the file system:
/ogg/DBO.ORDERS/par_sales_region=west/data files
/ogg/DBO.ORDERS/par_sales_region=east/data files
/ogg/DBO.ORDERS/par_sales_region=north/data files
/ogg/DBO.ORDERS/par_sales_region=south/data files
Example 2
Partitioning for the DBO.ORDERS table is set to the following:
gg.handler.filewriter.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}/par_state=${columnValue[STATE]}
This example can result in the following breakdown of files on the file system:
/ogg/DBO.ORDERS/par_sales_region=west/par_state=CA/data files
/ogg/DBO.ORDERS/par_sales_region=east/par_state=FL/data files
/ogg/DBO.ORDERS/par_sales_region=north/par_state=MN/data files
/ogg/DBO.ORDERS/par_sales_region=south/par_state=TX/data files
Caution:
Be extra vigilant when configuring partitioning. Choosing a partitioning column with a very large range of data values results in a proportionally large number of output data files.
Parent topic: File Writer Handler Partitioning
8.2.17.1.5.4 Partitioning Effect on Event Handler
The resolved partitioning path is carried forward to the corresponding Event Handlers as well.
If partitioning is configured as follows:
gg.handler.filewriter.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}
then the partition string might resolve to the following:
par_sales_region=west
par_sales_region=east
par_sales_region=north
par_sales_region=south
Example 2
If the S3 Event Handler is used and the path mapping template of the S3 Event Handler is configured as follows:
gg.eventhandler.s3.pathMappingTemplate=output/dir
then the target directories in S3 are as follows:
output/dir/par_sales_region=west/data files
output/dir/par_sales_region=east/data files
output/dir/par_sales_region=north/data files
output/dir/par_sales_region=south/data files
Parent topic: File Writer Handler Partitioning
8.2.17.2 Optimized Row Columnar (ORC)
You can use the Optimized Row Columnar (ORC) Event Handler to generate data files in ORC format.
This topic describes how to use the ORC Event Handler.
- Overview
- Detailing the Functionality
- Configuring the ORC Event Handler
- Optimized Row Columnar Event Handler Client Dependencies
What are the dependencies for the Optimized Row Columnar (ORC) Event Handler?
Parent topic: Flat Files
8.2.17.2.1 Overview
ORC is a row columnar format that can substantially improve data retrieval times and the performance of Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) analytics. You can use the ORC Event Handler to write ORC files to either a local file system or directly to HDFS. For information, see https://orc.apache.org/.
Parent topic: Optimized Row Columnar (ORC)
8.2.17.2.2 Detailing the Functionality
Parent topic: Optimized Row Columnar (ORC)
8.2.17.2.2.1 About the Upstream Data Format
The ORC Event Handler can only convert Avro Object Container File (OCF) files generated by the File Writer Handler. The ORC Event Handler cannot convert other formats to ORC data files. The format of the File Writer Handler must be avro_row_ocf or avro_op_ocf, see Flat Files.
Parent topic: Detailing the Functionality
8.2.17.2.2.2 About the Library Dependencies
Generating ORC files requires both the Apache ORC libraries and the HDFS client libraries, see Optimized Row Columnar Event Handler Client Dependencies and HDFS Handler Client Dependencies.
Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) does not
include the Apache ORC libraries nor does it include the HDFS client libraries. You
must configure the gg.classpath
variable to include the dependent
libraries.
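For example, a hedged gg.classpath sketch, assuming the Apache ORC libraries and the Hadoop client libraries were downloaded to illustrative local directories:
gg.classpath=/opt/orc/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/etc/hadoop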
Parent topic: Detailing the Functionality
8.2.17.2.2.3 Requirements
The ORC Event Handler can write ORC files directly to HDFS. You must set the writeToHDFS
property to true
:
gg.eventhandler.orc.writeToHDFS=true
Ensure that the directory containing the HDFS core-site.xml
file is in gg.classpath
. This is so the core-site.xml
file can be read at runtime and the connectivity information to HDFS can be resolved. For example:
gg.classpath=/{HDFS_install_directory}/etc/hadoop
If Kerberos authentication is enabled on the HDFS cluster, you have to configure the Kerberos principal and the location of the keytab file so that the password can be resolved at runtime:
gg.eventHandler.name.kerberosPrincipal=principal
gg.eventHandler.name.kerberosKeytabFile=path_to_the_keytab_file
Parent topic: Detailing the Functionality
8.2.17.2.3 Configuring the ORC Event Handler
You configure the ORC Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
The ORC Event Handler works only in conjunction with the File Writer Handler.
To enable the selection of the ORC Handler, you must first configure the handler
type by specifying gg.eventhandler.name.type=orc
and the
other ORC properties as follows:
Table 8-20 ORC Event Handler Configuration Properties
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.type | Required | orc | None | Selects the ORC Event Handler. |
gg.eventhandler.name.writeToHDFS | Optional | true or false | false | The ORC framework allows direct writing to HDFS. Set to true to write the generated ORC files directly to HDFS instead of the local file system. |
gg.eventhandler.name.pathMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the path to which the ORC file is written. | None | Use keywords interlaced with constants to dynamically generate unique ORC path names at runtime. Typically, path names follow the format /ogg/data/${fullyQualifiedTableName}. See Template Keywords. |
gg.eventhandler.name.fileNameMappingTemplate | Optional | A string with resolvable keywords and constants used to dynamically generate the ORC file name at runtime. | None | Use resolvable keywords and constants to dynamically generate the ORC data file name at runtime. If not set, the upstream file name is used. See Template Keywords. |
gg.eventhandler.name.compressionCodec | Optional | NONE, LZ4, LZO, SNAPPY, or ZLIB | The ORC default. | Sets the compression codec of the generated ORC file. |
gg.eventhandler.name.finalizeAction | Optional | none or delete | none | Set to none to leave the ORC data file in place at the finalize action, or to delete to delete it (for example, after it has been loaded to a target by a child event handler). |
gg.eventhandler.name.kerberosPrincipal | Optional | The Kerberos principal name. | None | Sets the Kerberos principal when writing directly to HDFS and Kerberos authentication is enabled. |
gg.eventhandler.name.kerberosKeytabFile | Optional | The path to the Kerberos keytab file. | None | Sets the path to the Kerberos keytab file when writing directly to HDFS and Kerberos authentication is enabled. |
gg.eventhandler.name.blockPadding | Optional | true or false | The ORC default. | Set to true to enable block padding of generated ORC files. |
gg.eventhandler.name.blockSize | Optional | Long | The ORC default. | Sets the block size of generated ORC files. |
gg.eventhandler.name.bufferSize | Optional | Integer | The ORC default. | Sets the buffer size of generated ORC files. |
gg.eventhandler.name.encodingStrategy | Optional | COMPRESSION or SPEED | The ORC default. | Sets whether the ORC encoding strategy is optimized for compression or for speed. |
gg.eventhandler.name.paddingTolerance | Optional | A percentage represented as a floating point number. | The ORC default. | Sets the percentage for padding tolerance of generated ORC files. |
gg.eventhandler.name.rowIndexStride | Optional | Integer | The ORC default. | Sets the row index stride of generated ORC files. |
gg.eventhandler.name.stripeSize | Optional | Integer | The ORC default. | Sets the stripe size of generated ORC files. |
gg.eventhandler.name.eventHandler | Optional | A unique string identifier cross referencing a child event handler. | No event handler configured. | The event handler that is invoked on the file roll event. Event handlers can perform file roll event actions like loading files to S3 or HDFS. |
gg.eventhandler.name.bloomFilterFpp | Optional | A false positive probability greater than zero and less than one (for example, 0.05). | The Apache ORC default. | Sets the false positive probability of a bloom filter index query: the probability that the query indicates the value being searched for is in the block when it is actually not. You select the tables and columns on which to set bloom filters with the following configuration syntax: gg.eventhandler.orc.bloomFilter.QASOURCE.TCUSTMER=CUST_CODE and gg.eventhandler.orc.bloomFilter.QASOURCE.TCUSTORD=CUST_CODE,ORDER_DATE |
gg.eventhandler.name.bloomFilterVersion | Optional | ORIGINAL or UTF8 | The ORC default. | Sets the version of the ORC bloom filter. |
Parent topic: Optimized Row Columnar (ORC)
8.2.17.2.4 Optimized Row Columnar Event Handler Client Dependencies
What are the dependencies for the Optimized Row Columnar (ORC) Event Handler?
The maven central repository artifacts for ORC are:
Maven groupId: org.apache.orc
Maven artifactId: orc-core
Maven version: 1.6.9
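Expressed as Maven dependency coordinates (a sketch for a dependency-download project, not GG for DAA configuration):
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>1.6.9</version>
</dependency>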
The Hadoop client dependencies are also required for the ORC Event Handler, see Hadoop Client Dependencies.
8.2.17.2.4.1 ORC Client 1.6.9
aircompressor-0.19.jar annotations-17.0.0.jar commons-lang-2.6.jar commons-lang3-3.12.0.jar hive-storage-api-2.7.1.jar jaxb-api-2.2.11.jar orc-core-1.6.9.jar orc-shims-1.6.9.jar protobuf-java-2.5.0.jar slf4j-api-1.7.5.jar threeten-extra-1.5.0.jar
Parent topic: Optimized Row Columnar Event Handler Client Dependencies
8.2.17.2.4.2 ORC Client 1.5.5
aircompressor-0.10.jar asm-3.1.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-httpclient-3.1.jar commons-io-2.1.jar commons-lang-2.6.jar commons-logging-1.1.1.jar commons-math-2.1.jar commons-net-3.1.jar guava-11.0.2.jar hadoop-annotations-2.2.0.jar hadoop-auth-2.2.0.jar hadoop-common-2.2.0.jar hadoop-hdfs-2.2.0.jar hive-storage-api-2.6.0.jar jackson-core-asl-1.8.8.jar jackson-mapper-asl-1.8.8.jar jaxb-api-2.2.11.jar jersey-core-1.9.jar jersey-server-1.9.jar jsch-0.1.42.jar log4j-1.2.17.jar orc-core-1.5.5.jar orc-shims-1.5.5.jar protobuf-java-2.5.0.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar xmlenc-0.52.jar zookeeper-3.4.5.jar
Parent topic: Optimized Row Columnar Event Handler Client Dependencies
8.2.17.2.4.3 ORC Client 1.4.0
aircompressor-0.3.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar asm-3.1.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.6.0.jar curator-framework-2.6.0.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.6.4.jar hadoop-auth-2.6.4.jar hadoop-common-2.6.4.jar hive-storage-api-2.2.1.jar htrace-core-3.0.4.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jdk.tools-1.6.jar jersey-core-1.9.jar jersey-server-1.9.jar jsch-0.1.42.jar log4j-1.2.17.jar netty-3.7.0.Final.jar orc-core-1.4.0.jar protobuf-java-2.5.0.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: Optimized Row Columnar Event Handler Client Dependencies
8.2.17.3 Parquet
Learn how to use the Parquet Event Handler to load files generated by the File Writer Handler into HDFS.
See Flat Files.
- Parquet Handler
- Detailing the Functionality
- Configuring the Parquet Event Handler
- Parquet Event Handler Client Dependencies
What are the dependencies for the Parquet Event Handler?
Parent topic: Flat Files
8.2.17.3.1 Parquet Handler
The Parquet Event Handler enables you to generate data files in Parquet format. Parquet files can be written to either the local file system or directly to HDFS. Parquet is a columnar data format that can substantially improve data retrieval times and improve the performance of Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) analytics, see https://parquet.apache.org/.
Parent topic: Parquet
8.2.17.3.2 Detailing the Functionality
Parent topic: Parquet
8.2.17.3.2.1 Configuring the Parquet Event Handler to Write to HDFS
The Apache Parquet framework supports writing directly to HDFS. The Parquet Event Handler can write Parquet files directly to HDFS. These additional configuration steps are required:
- The Parquet Event Handler dependencies and considerations are the same as the HDFS Handler, see HDFS Additional Considerations.
- Set the writeToHDFS property to true:
gg.eventhandler.parquet.writeToHDFS=true
- Ensure that gg.classpath includes the HDFS client libraries.
- Ensure that the directory containing the HDFS core-site.xml file is in gg.classpath. This is so the core-site.xml file can be read at runtime and the connectivity information to HDFS can be resolved. For example:
gg.classpath=/{HDFS_install_directory}/etc/hadoop
- If Kerberos authentication is enabled on the HDFS cluster, you have to configure the Kerberos principal and the location of the keytab file so that the password can be resolved at runtime:
gg.eventHandler.name.kerberosPrincipal=principal
gg.eventHandler.name.kerberosKeytabFile=path_to_the_keytab_file
Parent topic: Detailing the Functionality
8.2.17.3.2.2 About the Upstream Data Format
The Parquet Event Handler can only convert Avro Object Container File (OCF) files generated by the File Writer Handler. The Parquet Event Handler cannot convert other formats to Parquet data files. The format of the File Writer Handler must be avro_row_ocf or avro_op_ocf, see Flat Files.
Parent topic: Detailing the Functionality
8.2.17.3.3 Configuring the Parquet Event Handler
You configure the Parquet Event Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
The Parquet Event Handler works only in conjunction with the File Writer Handler.
To enable the selection of the Parquet Event Handler, you must first configure the
handler type by specifying gg.eventhandler.name.type=parquet
and the other Parquet Event properties as follows:
Table 8-21 Parquet Event Handler Configuration Properties
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.type | Required | parquet | None | Selects the Parquet Event Handler for use. |
gg.eventhandler.name.writeToHDFS | Optional | true or false | false | Set to true to write the generated Parquet files directly to HDFS instead of the local file system. |
gg.eventhandler.name.pathMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the path to write generated Parquet files. | None | Use keywords interlaced with constants to dynamically generate unique path names at runtime. Typically, path names follow the format /ogg/data/${fullyQualifiedTableName}. See Template Keywords. |
gg.eventhandler.name.fileNameMappingTemplate | Optional | A string with resolvable keywords and constants used to dynamically generate the Parquet file name at runtime. | None | Sets the Parquet file name. If not set, the upstream file name is used. See Template Keywords. |
gg.eventhandler.name.compressionCodec | Optional | GZIP, LZO, SNAPPY, or UNCOMPRESSED | UNCOMPRESSED | Sets the compression codec of the generated Parquet file. |
gg.eventhandler.name.finalizeAction | Optional | none or delete | none | Indicates what the Parquet Event Handler should do at the finalize action. |
gg.eventhandler.name.dictionaryEncoding | Optional | true or false | The Parquet default. | Set to true to enable Parquet dictionary encoding. |
gg.eventhandler.name.validation | Optional | true or false | The Parquet default. | Set to true to enable Parquet validation. |
gg.eventhandler.name.dictionaryPageSize | Optional | Integer | The Parquet default. | Sets the Parquet dictionary page size. |
gg.eventhandler.name.maxPaddingSize | Optional | Integer | The Parquet default. | Sets the Parquet padding size. |
gg.eventhandler.name.pageSize | Optional | Integer | The Parquet default. | Sets the Parquet page size. |
gg.eventhandler.name.rowGroupSize | Optional | Integer | The Parquet default. | Sets the Parquet row group size. |
gg.eventhandler.name.kerberosPrincipal | Optional | The Kerberos principal name. | None | Set to the Kerberos principal when writing directly to HDFS and Kerberos authentication is enabled. |
gg.eventhandler.name.kerberosKeytabFile | Optional | The path to the Kerberos keytab file. | None | Set to the path to the Kerberos keytab file when writing directly to HDFS and Kerberos authentication is enabled. |
gg.eventhandler.name.eventHandler | Optional | A unique string identifier cross referencing a child event handler. | No event handler configured. | The event handler that is invoked on the file roll event. Event handlers can perform file roll event actions like loading files to S3, converting to Parquet or ORC format, or loading files to HDFS. |
gg.eventhandler.name.writerVersion | Optional | v1 or v2 | The Parquet library default, which is v1 up through Parquet version 1.11.0. | Allows the ability to set the Parquet writer version. |
Parent topic: Parquet
8.2.17.3.4 Parquet Event Handler Client Dependencies
What are the dependencies for the Parquet Event Handler?
The maven central repository artifacts for Parquet are:
Maven groupId: org.apache.parquet
Maven artifactId: parquet-avro
Maven version: 1.9.0
Maven groupId: org.apache.parquet
Maven artifactId: parquet-hadoop
Maven version: 1.9.0
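Expressed as Maven dependency coordinates (a sketch for a dependency-download project, not GG for DAA configuration):
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.9.0</version>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.9.0</version>
</dependency>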
The Hadoop client dependencies are also required for the Parquet Event Handler, see Hadoop Client Dependencies.
Parent topic: Parquet
8.2.17.3.4.1 Parquet Client 1.12.0
audience-annotations-0.12.0.jar avro-1.10.1.jar commons-compress-1.20.jar commons-pool-1.6.jar jackson-annotations-2.11.3.jar jackson-core-2.11.3.jar jackson-databind-2.11.3.jar javax.annotation-api-1.3.2.jar parquet-avro-1.12.0.jar parquet-column-1.12.0.jar parquet-common-1.12.0.jar parquet-encoding-1.12.0.jar parquet-format-structures-1.12.0.jar parquet-hadoop-1.12.0.jar parquet-jackson-1.12.0.jar slf4j-api-1.7.22.jar snappy-java-1.1.8.jar zstd-jni-1.4.9-1.jar
Parent topic: Parquet Event Handler Client Dependencies
8.2.17.3.4.2 Parquet Client 1.11.1
audience-annotations-0.11.0.jar avro-1.9.2.jar commons-compress-1.19.jar commons-pool-1.6.jar jackson-annotations-2.10.2.jar jackson-core-2.10.2.jar jackson-databind-2.10.2.jar javax.annotation-api-1.3.2.jar parquet-avro-1.11.1.jar parquet-column-1.11.1.jar parquet-common-1.11.1.jar parquet-encoding-1.11.1.jar parquet-format-structures-1.11.1.jar parquet-hadoop-1.11.1.jar parquet-jackson-1.11.1.jar slf4j-api-1.7.22.jar snappy-java-1.1.7.3.jar
Parent topic: Parquet Event Handler Client Dependencies
8.2.17.3.4.3 Parquet Client 1.10.1
avro-1.8.2.jar commons-codec-1.10.jar commons-compress-1.8.1.jar commons-pool-1.6.jar fastutil-7.0.13.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar paranamer-2.7.jar parquet-avro-1.10.1.jar parquet-column-1.10.1.jar parquet-common-1.10.1.jar parquet-encoding-1.10.1.jar parquet-format-2.4.0.jar parquet-hadoop-1.10.1.jar parquet-jackson-1.10.1.jar slf4j-api-1.7.2.jar snappy-java-1.1.2.6.jar xz-1.5.jar
Parent topic: Parquet Event Handler Client Dependencies
8.2.17.3.4.4 Parquet Client 1.9.0
avro-1.8.0.jar commons-codec-1.5.jar commons-compress-1.8.1.jar commons-pool-1.5.4.jar fastutil-6.5.7.jar jackson-core-asl-1.9.11.jar jackson-mapper-asl-1.9.11.jar paranamer-2.7.jar parquet-avro-1.9.0.jar parquet-column-1.9.0.jar parquet-common-1.9.0.jar parquet-encoding-1.9.0.jar parquet-format-2.3.1.jar parquet-hadoop-1.9.0.jar parquet-jackson-1.9.0.jar slf4j-api-1.7.7.jar snappy-java-1.1.1.6.jar xz-1.5.jar
Parent topic: Parquet Event Handler Client Dependencies