8.3 Realtime Parquet Ingestion into Google Cloud Storage with Oracle GoldenGate for Distributed Applications and Analytics
Overview
This Quickstart shows, step by step, how to ingest Parquet files into Google Cloud Storage buckets in real time with GoldenGate for Distributed Applications and Analytics (GG for DAA).
Google Cloud Storage (GCS) is a service for storing objects in Google Cloud Platform. The GG for DAA GCS Handler works in conjunction with the File Writer Handler and, if Parquet output is required, the Parquet Handler. The File Writer Handler produces files locally, the Parquet Handler optionally converts them to Parquet format, and the GCS Handler loads them into GCS buckets.
- Prerequisites
- Install Dependency Files
- Create a Replicat in Oracle GoldenGate for Distributed Applications and Analytics
8.3.1 Prerequisites
To successfully complete this quickstart, you must have the following:
- Google Cloud Storage Bucket
- Google Service Account Key with Bucket and Object Permissions
- Public access to your bucket (GG for DAA supports private bucket access). For more information, see Google Cloud Storage.
This Quickstart uses the sample trail file (named tr) that ships with GG for DAA. It is located at GG_HOME/opt/AdapterExamples/trail/ in your GG for DAA instance.
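Before creating the Replicat, it can help to confirm that the local prerequisites are in place. The following is a minimal sketch, not part of the product: `check_prereqs` is a hypothetical helper, and it assumes that trail files on disk are named with the trail name `tr` followed by a sequence number and that the service account key is a local file.

```shell
# check_prereqs TRAIL_DIR KEY_FILE  (hypothetical helper, not shipped with GG for DAA)
# Returns 0 only if TRAIL_DIR contains at least one file whose name starts
# with "tr" and KEY_FILE is a readable, non-empty file.
check_prereqs() {
  trail_dir="$1"
  key_file="$2"
  status=0

  # Assumption: the sample trail is stored as tr<sequence number>.
  if ! ls "$trail_dir"/tr* >/dev/null 2>&1; then
    echo "no trail files named tr* found in $trail_dir" >&2
    status=1
  fi

  # The key file must exist, be readable, and not be empty.
  if [ ! -r "$key_file" ] || [ ! -s "$key_file" ]; then
    echo "service account key is missing or empty: $key_file" >&2
    status=1
  fi

  return "$status"
}

# Example call (both paths are placeholders for your environment):
#   check_prereqs "$GG_HOME/opt/AdapterExamples/trail" /path/to/key.json
```

The check only covers what is visible on the GG for DAA host; bucket permissions for the service account still have to be verified in Google Cloud.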
8.3.2 Install Dependency Files
Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) uses client libraries in the replication process, and these libraries must be downloaded before you set up replication. You can use the Dependency Downloader to download them. The Dependency Downloader is a set of shell scripts that download dependency JAR files from Maven and other repositories.
GG for DAA ingests Parquet files into GCS in three steps:
- Generating local files from trail files
- Converting local files to Parquet format
- Loading files into GCS
For generating local parquet files with GG for DAA, replicat uses File Writer Handler and Parquet Handler. To load the parquet files into GCS, GG for DAA uses Google Cloud Storage Handler in conjunction with File Writer and Parquet Event Handler.
To install the required dependency files:
- In your GG for DAA VM, go to the Dependency Downloader utility. It is located at GG_HOME/opt/DependencyDownloader/.
- Run parquet.sh, hadoop.sh, and gcs.sh with the required versions. You can check the versions and reported vulnerabilities in Maven Central.
- Three new directories are created in GG_HOME/opt/DependencyDownloader/dependencies, named <dependencyname_version>. Make a note of these directories, as they are used in the Replicat properties. For example: /u01/app/ogg/opt/DependencyDownloader/dependencies/gcs_12.29.1 and /u01/app/ogg/opt/DependencyDownloader/dependencies/hadoop_3.4.0
Figure 8-11 3 directories created in GG_HOME/opt/DependencyDownloader/dependencies
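The steps above can be sketched in shell. The commented script invocations assume that each downloader script takes the version as its only argument; the versions shown are the ones that appear elsewhere in this Quickstart. `build_classpath` is a hypothetical helper that turns the noted directory names into a `gg.classpath` line, on the assumption that each directory is referenced with a trailing `/*` so that the JAR files inside it are picked up.

```shell
# Assumed invocation of the downloader scripts (run from
# GG_HOME/opt/DependencyDownloader/; versions are examples):
#   ./parquet.sh 1.12.3
#   ./hadoop.sh 3.4.0
#   ./gcs.sh 12.29.1

# build_classpath DEP_ROOT DIR...  (hypothetical helper)
# Prints a gg.classpath line that references every JAR in each DIR under DEP_ROOT.
build_classpath() {
  dep_root="${1%/}"   # drop one trailing slash, if present
  shift
  cp=""
  for dep in "$@"; do
    entry="${dep_root}/${dep}/*"
    if [ -z "$cp" ]; then
      cp="$entry"
    else
      cp="${cp}:${entry}"
    fi
  done
  printf 'gg.classpath=%s\n' "$cp"
}

# Example call:
#   build_classpath /u01/app/ogg/opt/DependencyDownloader/dependencies \
#     gcs_12.29.1 hadoop_3.4.0 parquet_1.12.3
```

The printed line can be pasted into the Replicat properties file in place of the `gg.classpath` placeholder.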
8.3.3 Create a Replicat in Oracle GoldenGate for Distributed Applications and Analytics
To create a replicat in Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA):
- In the GG for DAA UI, in the Administration Service tab,
click the + sign to add a replicat.
Figure 8-12 Click the Administration Service tab
- Select Classic Replicat as the Replicat Type and click
Next. There are two different Replicat types available: Classic and
Coordinated. Classic Replicat is a single-threaded process, whereas Coordinated
Replicat is a multithreaded one that applies transactions in parallel.
Figure 8-13 Add Replicat
- Enter the Replicat information, and click Next:
- Replicat Trail: Name of the required trail file. For the sample trail, provide tr.
- Subdirectory: Provide GG_HOME/opt/AdapterExamples/trail/ if using the sample trail.
- Target: Google Cloud Storage
Figure 8-14 Add Replicat
- Leave Managed Options as is and click Next.
Figure 8-15 Add Replicat - Managed Options
- Enter Parameter File details and click Next. In the
Parameter File, you can either specify source to target mapping or leave it
as-is with a wildcard selection. If Coordinated Replicat is selected as the
Replicat Type, then an additional parameter needs to be provided:
TARGETDB LIBFILE libggjava.so SET property=<ggbd-deployment_home>/etc/conf/ogg/your_replicat_name.properties
Figure 8-16 Add Replicat - Enter Parameter Details
- In the Properties File, remove all the pre-configured properties except the first row, which is marked with the Replicat name (# Properties file for Replicat <replicat_name>). Copy and paste the property list below into the properties file, update the properties marked #TODO, and then click Create and Run.
#The File Writer Handler – no need to change
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.pathMappingTemplate=./dirout
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileRollInterval=7m
gg.handler.filewriter.inactivityRollInterval=5s
gg.handler.filewriter.fileWriteActiveSuffix=.tmp
gg.handler.filewriter.finalizeAction=delete
### Avro OCF – no need to change
gg.handler.filewriter.format=avro_row_ocf
gg.handler.filewriter.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.avro
gg.handler.filewriter.format.pkUpdateHandling=delete-insert
gg.handler.filewriter.format.metaColumnsTemplate=${optype},${position}
gg.handler.filewriter.format.iso8601Format=false
gg.handler.filewriter.partitionByTable=true
gg.handler.filewriter.rollOnShutdown=true
#The Parquet Event Handler – no need to change
gg.handler.filewriter.eventHandler=parquet
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
gg.eventhandler.parquet.fileNameMappingTemplate=${groupName}_${fullyQualifiedTableName}_${currentTimestamp}.parquet
gg.eventhandler.parquet.writeToHDFS=false
gg.eventhandler.parquet.finalizeAction=delete
#Select GCS Event Handler – no need to change
gg.eventhandler.parquet.eventHandler=gcs
#TODO Set GCS Event handler – please update as needed
gg.eventhandler.gcs.type=gcs
gg.eventhandler.gcs.pathMappingTemplate=${fullyQualifiedTableName}
#TODO: Edit the GCS bucket name
gg.eventhandler.gcs.bucketMappingTemplate=<bucket_name>
#TODO: Edit the GCS credentialsFile
gg.eventhandler.gcs.credentialsFile=path_to_GCP_Credential_File
gg.eventhandler.gcs.finalizeAction=none
#TODO Set the classpath with the dependency directories you noted in Install Dependency Files
gg.classpath=path_to/gcs_12.29.1/*:path_to/hadoop_3.4.0/*:path_to/parquet_1.12.3/*
jvm.bootoptions=-Xmx512m -Xms32m
- If the Replicat starts successfully, it is in the running state. You
can go to Replicats/Statistics to see the replication statistics.
Figure 8-17 Replicats Statistics
- Go to the GCP Cloud Storage bucket and check the files.
Figure 8-18 GCP Cloud Storage Bucket
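As an alternative to checking the bucket in the console, the uploaded objects can be listed from a shell. This is a minimal sketch that assumes the `gsutil` command-line tool is installed and authenticated with an account that can list the bucket. Both function names are hypothetical, and the bucket name is whatever you set in gg.eventhandler.gcs.bucketMappingTemplate.

```shell
# filter_parquet  (hypothetical helper)
# Reads object names on stdin and keeps only those ending in .parquet.
# "|| true" keeps the exit status at 0 when nothing matches.
filter_parquet() {
  grep -E '\.parquet$' || true
}

# list_parquet_objects BUCKET  (hypothetical helper)
# Lists every object in the bucket and prints only the Parquet files.
# Assumes gsutil is on the PATH and already authenticated.
list_parquet_objects() {
  bucket="$1"
  gsutil ls "gs://${bucket}/**" | filter_parquet
}

# Example call (replace the placeholder with your bucket name):
#   list_parquet_objects "<bucket_name>"
```

With the pathMappingTemplate used in this Quickstart, the listed object names should begin with the fully qualified table name.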
Note:
- If the target GCS bucket does not exist, GG for DAA creates it automatically. You can use Template Keywords to dynamically assign the bucket names.
- GCS Event Handler can be configured for proxy server. For more information, see Replicate Data.
- You can use different properties to control the behaviour of file writing. You can set file sizes, inactivity periods, and more. You can get more details in the File Writer blog post.