9.2.15 Azure Synapse Analytics Data Warehouse
Microsoft Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
- Detailed Functionality
- Operation Aggregation
- Compressed Update Handling
- Configuration
- Troubleshooting and Diagnostics
Parent topic: Target
9.2.15.1 Detailed Functionality
Replication to Synapse uses a stage and merge data flow.
The change data is staged in a temporary location in micro-batches and eventually merged into the target table.
Azure Data Lake Storage (ADLS) Gen 2 is used as the staging area for change data.
The Synapse Event handler is used as a downstream Event handler connected to the output of the Parquet Event handler.
The Parquet Event handler loads files generated by the File Writer Handler into ADLS Gen2.
The Synapse Event handler executes SQL statements to merge the operation records staged in ADLS Gen2.
The SQL operations are performed in batches, providing better throughput. Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) uses the MERGE SQL statement, or a combination of DELETE and INSERT SQL statements, to perform the merge operation.
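The two merge strategies can be sketched as follows, assuming a hypothetical target table dbo.ORDERS keyed on ORDER_ID and a staged micro-batch exposed as stage.ORDERS_DELTA with an op_type column. All names and columns are illustrative, not the SQL that GG for DAA actually generates:

```sql
-- Illustrative only: table and column names are hypothetical.
-- Default path: one MERGE applies the whole staged micro-batch.
MERGE INTO dbo.ORDERS AS t
USING stage.ORDERS_DELTA AS s
    ON t.ORDER_ID = s.ORDER_ID
WHEN MATCHED AND s.op_type = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.STATUS = s.STATUS
WHEN NOT MATCHED AND s.op_type <> 'D' THEN
    INSERT (ORDER_ID, STATUS) VALUES (s.ORDER_ID, s.STATUS);

-- Fallback path (merge SQL disabled): remove the affected keys, then
-- re-insert the latest full row images from the staged batch.
DELETE FROM dbo.ORDERS
WHERE ORDER_ID IN (SELECT ORDER_ID FROM stage.ORDERS_DELTA);

INSERT INTO dbo.ORDERS (ORDER_ID, STATUS)
SELECT ORDER_ID, STATUS
FROM stage.ORDERS_DELTA
WHERE op_type <> 'D';
```

The fallback path assumes the staged batch already holds one net row image per key, which is why it requires uncompressed update records (see Prerequisites).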
Parent topic: Azure Synapse Analytics Data Warehouse
9.2.15.1.1 Database User Privileges
The database user used for replication must be granted the following privileges:
- INSERT, UPDATE, DELETE, and TRUNCATE on the target tables.
- CREATE and DROP Synapse external file format.
- CREATE and DROP Synapse external data source.
- CREATE and DROP Synapse external table.
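A hedged sketch of the corresponding grants follows; the principal and table names are hypothetical. In Synapse SQL, TRUNCATE is covered by ALTER on the table, and the external-object privileges map onto the ALTER ANY EXTERNAL ... permissions:

```sql
-- Illustrative sketch; adapt object and principal names to your environment.
GRANT INSERT, UPDATE, DELETE ON dbo.ORDERS TO gg_replicat_user;
GRANT ALTER ON dbo.ORDERS TO gg_replicat_user;  -- ALTER covers TRUNCATE TABLE

-- Privileges for the external objects used during staging:
GRANT ALTER ANY EXTERNAL FILE FORMAT TO gg_replicat_user;
GRANT ALTER ANY EXTERNAL DATA SOURCE TO gg_replicat_user;
GRANT CREATE TABLE TO gg_replicat_user;  -- external tables are created in the database
```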
Parent topic: Detailed Functionality
9.2.15.1.2 Merge SQL Statement
The merge SQL statement for Azure Synapse Analytics became generally available in late 2022, so Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) uses the merge statement by default. To disable merge SQL, set the following Java System property in the jvm.bootoptions parameter:
jvm.bootoptions=-Dsynapse.use.merge.sql=false
Parent topic: Detailed Functionality
9.2.15.1.3 Prerequisites
The following are the prerequisites:
- Uncompressed UPDATE records: If Oracle GoldenGate is configured to not use the merge statement (see Merge SQL Statement), then the trail files applied to Synapse must contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated. If UPDATE records have missing columns, then Replicat will ABEND on detecting a compressed UPDATE trail record.
- If Oracle GoldenGate is configured to use the merge statement (see Merge SQL Statement), then the target table must be a hash distributed table.
- Target table existence: The target tables should exist in the Synapse database.
- Azure storage account: An Azure storage account and container should exist. Oracle recommends co-locating the Azure Synapse workspace and the Azure storage account in the same Azure region.
- If Oracle GoldenGate is configured to use the merge statement, then the target table cannot define IDENTITY columns, because the Synapse merge statement does not support inserting data into IDENTITY columns. For more information, see Merge SQL Statement.
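For example, a target table that satisfies the hash distribution prerequisite could be created as follows (table and column names are illustrative):

```sql
-- Illustrative: a hash distributed target table, as required when the
-- merge SQL path is enabled.
CREATE TABLE dbo.ORDERS
(
    ORDER_ID  BIGINT       NOT NULL,
    STATUS    NVARCHAR(20) NULL
)
WITH
(
    DISTRIBUTION = HASH (ORDER_ID),
    CLUSTERED COLUMNSTORE INDEX
);
```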
Parent topic: Detailed Functionality
9.2.15.2 Operation Aggregation
Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
Parent topic: Azure Synapse Analytics Data Warehouse
9.2.15.2.1 In-Memory Operation Aggregation
- Operation records are aggregated in-memory by default.
- The gg.aggregate.operations.flush.interval property has been deprecated and is no longer supported. If gg.aggregate.operations.flush.interval is used in GG for DAA 23ai, then Replicat will run, but it adds a warning to the log file about the property being deprecated and not supported. To control the time window for aggregation, use the gg.handler.synapse.fileRollInterval property. By default, it is set to 3 minutes. Longer intervals increase latency and may increase memory usage. Shorter intervals increase overhead in Oracle GoldenGate and the target database.
- Operation aggregation in-memory requires additional JVM memory configuration.
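The aggregation idea itself can be modeled outside the product. The sketch below is an illustrative approximation, not GG for DAA code: within a batch, the last operation per key wins, with an insert folding subsequent updates into a net insert and an insert+delete pair cancelling out:

```python
# Illustrative model of in-memory operation aggregation: multiple
# operations on the same row collapse to one net operation per batch.
def aggregate(ops):
    net = {}  # key -> (op_type, row)
    for op_type, key, row in ops:
        prev = net.get(key)
        if prev and prev[0] == "I" and op_type == "U":
            op_type = "I"      # the row is still new to the target
        if prev and prev[0] == "I" and op_type == "D":
            net.pop(key)       # insert followed by delete cancels out
            continue
        net[key] = (op_type, row)
    return [(t, k, r) for k, (t, r) in net.items()]

batch = [
    ("I", 1, {"status": "NEW"}),
    ("U", 1, {"status": "PAID"}),   # folded into the insert
    ("I", 2, {"status": "NEW"}),
    ("D", 2, None),                 # cancels the insert
    ("U", 3, {"status": "SHIPPED"}),
]
print(aggregate(batch))
# -> [('I', 1, {'status': 'PAID'}), ('U', 3, {'status': 'SHIPPED'})]
```

Five staged operations net out to two output operations, which is why aggregation reduces both the staged file size and the merge work on Synapse.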
Parent topic: Operation Aggregation
9.2.15.2.2 Operation Aggregation Using SQL
- To use SQL aggregation, the trail files must contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated.
- Operation aggregation using SQL can provide better throughput if the trail files contain uncompressed update records.
- Replicat can aggregate operations using SQL statements by setting gg.aggregate.operations.using.sql=true.
- You can tune the frequency of the merge interval using the File Writer gg.handler.synapse.fileRollInterval property; the default value is 3m (three minutes).
- Operation aggregation using SQL does not require additional JVM memory configuration.
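Taken together, a minimal properties sketch for SQL aggregation might look like this (the interval value is an example, not a recommendation):

```properties
# Aggregate operations using SQL on the staged data; requires
# uncompressed UPDATE records in the trail.
gg.aggregate.operations.using.sql=true
# Example merge window; the default is 3m. Shorter windows lower
# latency but add SQL overhead on Synapse.
gg.handler.synapse.fileRollInterval=2m
```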
Parent topic: Operation Aggregation
9.2.15.3 Compressed Update Handling
A compressed update record contains values for the key columns and the modified columns.
An uncompressed update record contains values for all the columns.
Oracle GoldenGate trails may contain compressed or uncompressed update records. The default Extract configuration writes compressed updates to the trails. The parameter gg.compressed.update can be set to true or false to indicate compressed or uncompressed update records.
9.2.15.3.1 MERGE statement with Uncompressed Updates
In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.
If you want to use DELETE+INSERT SQL statements instead of a MERGE SQL statement, then set gg.eventhandler.synapse.deleteInsert=true.
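The reason the DELETE+INSERT path requires uncompressed updates can be modeled as follows. This is an illustrative sketch, not product code: a compressed update lacks values for unmodified columns, so re-inserting the partial row image would silently lose data:

```python
# Illustrative: applying an update via DELETE+INSERT is only safe when
# the record carries the full row image (an uncompressed update).
ROW_COLUMNS = ["order_id", "status", "amount"]

def apply_delete_insert(table, update_record):
    missing = [c for c in ROW_COLUMNS if c not in update_record]
    if missing:
        # Replicat ABENDs in this situation rather than writing NULLs
        # for the columns the compressed record does not carry.
        raise ValueError(f"compressed update, missing columns: {missing}")
    table[update_record["order_id"]] = dict(update_record)

table = {1: {"order_id": 1, "status": "NEW", "amount": 10.0}}

# Uncompressed update: full row image, safe to delete and re-insert.
apply_delete_insert(table, {"order_id": 1, "status": "PAID", "amount": 10.0})
print(table[1]["status"])  # -> PAID

# Compressed update: key plus changed column only; rejected instead of
# destroying the value of 'amount'.
try:
    apply_delete_insert(table, {"order_id": 1, "status": "SHIPPED"})
except ValueError as e:
    print(e)
```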
Parent topic: Compressed Update Handling
9.2.15.4 Configuration
- Automatic Configuration
- Synapse Database Credentials
- Classpath Configuration
- INSERTALLRECORDS Support
- Large Object (LOB) Performance
- End-to-End Configuration
Parent topic: Azure Synapse Analytics Data Warehouse
9.2.15.4.1 Automatic Configuration
Synapse replication involves configuration of multiple components, such as File Writer handler, Parquet Event handler, and Synapse Event handler.
The Automatic Configuration functionality helps to auto configure these components so that the user configuration is minimal.
The properties modified by auto configuration will also be logged in the handler log file.
To enable auto-configuration to replicate to a Synapse target, set the parameter gg.target=synapse.
When replicating to a Synapse target, customization of the Parquet Event handler name and the Synapse Event handler name is not allowed.
- File Writer Handler Configuration
- Parquet Event Handler Configuration
- Synapse Event Handler Configuration
Parent topic: Configuration
9.2.15.4.1.1 File Writer Handler Configuration
The File Writer Handler name is pre-set to the value synapse. The following is an example of editing a File Writer handler property:
gg.handler.synapse.pathMappingTemplate=./dirout
Parent topic: Automatic Configuration
9.2.15.4.1.2 Parquet Event Handler Configuration
The Parquet Event Handler name is pre-set to the value parquet. The Parquet Event Handler is auto-configured to write to HDFS. The Hadoop configuration file core-site.xml must be configured to write data files to the respective container in the Azure Data Lake Storage (ADLS) Gen2 account. See Azure Data Lake Gen2 using Hadoop Client and ABFS.
The following is an example of editing a Parquet Event handler property:
gg.eventhandler.parquet.finalizeAction=delete
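A minimal core-site.xml sketch for Shared Key authentication to ADLS Gen2 over ABFS follows; the account name and key are placeholders, and other authentication mechanisms are described in the linked topic:

```xml
<configuration>
  <!-- Placeholder account name and key; never commit real keys. -->
  <property>
    <name>fs.azure.account.key.<storage-account-name>.dfs.core.windows.net</name>
    <value><storage-account-key></value>
  </property>
</configuration>
```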
Parent topic: Automatic Configuration
9.2.15.4.1.3 Synapse Event Handler Configuration
Synapse Event Handler name is pre-set to the value
synapse
.
Table 9-12 Synapse Event Handler Configuration
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.synapse.connectionURL | Required | jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;database=<db-name>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=300; | None | JDBC URL to connect to Synapse. |
gg.eventhandler.synapse.UserName | Required | Database username. | None | Synapse database user in the Synapse workspace. The username has to be qualified with the Synapse workspace name. Example: sqladminuser@synapseworkspace. |
gg.eventhandler.synapse.Password | Required | Supported database string. | None | Synapse database password. |
gg.eventhandler.synapse.credential | Required | Credential name. | None | Synapse database credential name to access Azure Data Lake Gen2 files. See Synapse Database Credentials for steps to create the credential. |
gg.eventhandler.synapse.maxConnnections | Optional | Integer value | 10 | Controls the number of concurrent JDBC database connections to the target Synapse database. |
gg.eventhandler.synapse.dropStagingTablesOnShutdown | Optional | true or false | false | If set to true, the temporary staging tables created by GoldenGate are dropped on graceful Replicat stop. |
gg.maxInlineLobSize | Optional | Integer value | 16000 | Sets the maximum inline size of large object (LOB) columns in bytes. For more information, see Large Object (LOB) Performance. |
gg.handler.synapse.fileRollInterval | Optional | The default unit of measure is milliseconds. You can stipulate ms, s, m, or h to signify milliseconds, seconds, minutes, or hours respectively. Examples of legal values include 10000, 10000ms, 10s, 10m, or 1.5h. Values of 0 or less turn off file rolling on time. | 3m (three minutes) | Determines how often the data is merged into Synapse. Note: Use this parameter with caution. Increasing its value above the default (3m) increases the amount of data stored in the internal memory of the Replicat process, which can cause out-of-memory errors and stop the Replicat. Note: Starting with the 23ai release, the gg.aggregate.operations.flush.interval property is deprecated and no longer supported. For more information, see In-Memory Operation Aggregation. |
gg.compressed.update | Optional | true or false | true | If set to true, the source trail files are expected to contain compressed update operations. If set to false, the source trail files are expected to contain uncompressed update operations. |
gg.eventhandler.synapse.connectionRetryIntervalSeconds | Optional | Integer value | 30 | Specifies the delay in seconds between connection retry attempts. |
gg.eventhandler.synapse.connectionRetries | Optional | Integer value | 3 | Specifies the number of times connections to the target data warehouse are retried. |
gg.validate.keyupdate | Optional | true or false | false | If set to true, Replicat validates key update operations (optype 115) and corrects them to a normal update if no key values have changed. Compressed key update operations do not qualify for merge. |
gg.eventhandler.synapse.deleteInsert | Optional | true or false | false | If set to true, Replicat merges records using SQL DELETE+INSERT statements instead of the SQL MERGE statement. Note: Applicable only if gg.compressed.update is set to false. |
Parent topic: Automatic Configuration
9.2.15.4.2 Synapse Database Credentials
- Connect to the respective Synapse SQL dedicated pool using the Azure Web SQL console (https://web.azuresynapse.net/en-us/).
- Create a DB master key if one does not already exist, using your own password.
- Create a database scoped credential. This credential allows the Oracle GoldenGate Replicat process to access the Azure Storage Account. Provide the Azure Storage Account name and Access key when creating this credential. Storage Account Access keys can be retrieved from the Azure cloud console.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Your own password';
CREATE DATABASE SCOPED CREDENTIAL OGGBD_ADLS_credential
WITH
  -- IDENTITY = '<storage_account_name>',
  IDENTITY = 'sanavaccountuseast',
  -- SECRET = '<storage_account_key>'
  SECRET = 'c8C0yR-this-is-a-fake-access-key-Gc9c5mENOJ1mLyxlO1vSRDlRG0/Ke+tbAvi6xe73HAAhLtdMFZRA==';
Parent topic: Configuration
9.2.15.4.3 Classpath Configuration
The Synapse Event handler relies on the upstream File Writer handler and the Parquet Event handler.
Parent topic: Configuration
9.2.15.4.3.1 Dependencies
- Microsoft SQL Server JDBC driver: The JDBC driver can be downloaded from Maven Central using the following coordinates:

<dependency>
  <groupId>com.microsoft.sqlserver</groupId>
  <artifactId>mssql-jdbc</artifactId>
  <version>8.4.1.jre8</version>
  <scope>provided</scope>
</dependency>

Alternatively, the driver can be downloaded using the Dependency Downloader script <OGGDIR>/DependencyDownloader/synapse.sh.
- Parquet Event handler dependencies: See Parquet Event Handler Configuration to configure the classpath to include Parquet dependencies.
- Hadoop dependencies: Hadoop dependencies can be downloaded using the Dependency Downloader script <OGGDIR>/DependencyDownloader/hadoop.sh.
- Azure Storage dependencies: Azure Storage dependencies can be downloaded from Maven using the following coordinates:

<dependencies>
  <dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-storage-blob</artifactId>
    <version>12.13.0</version>
  </dependency>
  <dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-identity</artifactId>
    <version>1.3.3</version>
  </dependency>
</dependencies>
Parent topic: Classpath Configuration
9.2.15.4.3.2 Classpath
Edit the gg.classpath configuration parameter to include the path to the Parquet Event Handler, Synapse JDBC, Azure Storage, and Hadoop dependencies:
gg.classpath=/path/to/parquet-deps/*:/path/to/abs-deps/*:/path/to/synapse-deps/mssql-jdbc-8.4.1.jre8.jar:/path/to/hadoop-deps/*
Parent topic: Classpath Configuration
9.2.15.4.4 INSERTALLRECORDS Support
Stage and merge targets support the INSERTALLRECORDS parameter. See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).
Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of bulk inserts using the File Writer property gg.handler.synapse.maxFileSize; the default value is 1GB. The frequency of bulk inserts can be tuned using the File Writer property gg.handler.synapse.fileRollInterval; the default value is 3m (three minutes).
Note:
- When using the Synapse internal stage, the staging files can be compressed by setting gg.handler.synapse.putSQLAutoCompress to true.
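A minimal Replicat parameter file sketch using INSERTALLRECORDS follows; the Replicat name and MAP pattern are placeholders:

```
REPLICAT synr
-- Apply every operation as an insert, keeping a full change history
-- in the target table.
INSERTALLRECORDS
MAP dbo.*, TARGET dbo.*;
```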
Parent topic: Configuration
9.2.15.4.5 Large Object (LOB) Performance
Operations containing LOB column values larger than gg.maxInlineLobSize do not qualify for batch processing, and such operations are slower.
If the compute machine has sufficient RAM, you can increase this parameter to speed up processing.
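For example, to raise the inline threshold (the values below are illustrative; ensure the Replicat JVM heap can absorb the larger inline images):

```properties
# Rows whose LOB columns fit within 64,000 bytes stay on the batched path.
gg.maxInlineLobSize=64000
# Larger inline LOBs are held in Replicat memory; size the heap accordingly.
jvm.bootoptions=-Xmx8g -Xms8g
```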
Parent topic: Configuration
9.2.15.4.6 End-to-End Configuration
The following is an end-to-end configuration example that uses auto-configuration for the File Writer handler and the Parquet and Synapse Event handlers.
This sample properties file can also be found in the directory
AdapterExamples/big-data/synapse/synapse.props
:
# Azure Synapse Analytics Data Warehouse Template
# Configuration to load GoldenGate trail operation records into Azure Synapse Analytics by chaining
# File writer handler -> Parquet Event handler -> Synapse Event handler.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=synapse

#The Parquet Event Handler
# No properties are required for the Parquet Event handler.
#gg.eventhandler.parquet.finalizeAction=delete

# ADLS Gen 2 stage (Using Azure Blob SDK).
#Azure Blob Event handler
#TODO: Edit the Azure Blob Storage container name
gg.eventhandler.abs.bucketMappingTemplate=<abs-container-name>
#TODO: Edit the Azure storage account name.
gg.eventhandler.abs.accountName=<storage-account-name>
#TODO: Edit the Azure storage account key.
gg.eventhandler.abs.accountKey=<storage-account-key>

#The Synapse Event Handler
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.synapse.connectionURL=jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;database=<db-name>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=300;
#TODO: Edit JDBC user name
gg.eventhandler.synapse.UserName=<db user name>@<synapse-workspace>
#TODO: Edit JDBC password
gg.eventhandler.synapse.Password=<db password>
#TODO: Edit Credential to access Azure storage.
gg.eventhandler.synapse.credential=OGGBD_ADLS_credential

#TODO: Edit the classpath to include dependencies for Parquet Event Handler, ABS Event handler and the Synapse JDBC driver.
gg.classpath=/path/to/parquet-deps/*:/path/to/abs-deps/*:/path/to/synapse-deps/mssql-jdbc-8.4.1.jre8.jar:/path/to/hadoop-deps/*

#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g -DSYNAPSE_STAGE=parquet,abs
Parent topic: Configuration
9.2.15.5 Troubleshooting and Diagnostics
- Connectivity Issues to Synapse:
- Validate JDBC connection URL, username and password.
- Check if http/https proxy is enabled. Synapse does not support connections over http(s) proxy.
- DDL not applied on the target table: Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) does not support DDL replication.
- Target table existence: It is expected that the Synapse target table exists before starting the Replicat process. The Replicat process will ABEND if the target table is missing.
- SQL Errors: In case there are any errors while executing any SQL, the entire SQL statement along with the bind parameter values are logged into the GG for DAA handler log file.
- Co-existence of the components: The location/region of the machine where the Replicat process is running, the Azure Data Lake Storage container region, and the Synapse region impact the overall throughput of the apply process. Data flows as follows: Oracle GoldenGate -> Azure Data Lake Gen 2 -> Synapse. For best throughput, the components need to be located as close together as possible.
- Replicat ABEND due to partial LOB records in the trail file: GG for DAA Synapse apply does not support replication of partial LOBs. The trail file needs to be regenerated by Oracle Integrated capture using the TRANLOGOPTIONS FETCHPARTIALLOB option in the Extract parameter file.
- Error: com.microsoft.sqlserver.jdbc.SQLServerException: Conversion failed when converting date and/or time from character string: This occurs when the source datetime column and the target datetime column are incompatible. For example, the source column is a timestamp type while the target column is Synapse time.
- If the Synapse table or column names contain double quotes, then GG for DAA replicat will ABEND.
- Error: com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer: This indicates that the data in the external table backed by the Azure Data Lake file is not readable. Contact Oracle Support.
- IDENTITY column in the target table: The Synapse MERGE statement does not support inserting data into IDENTITY columns. Therefore, if the MERGE statement is enabled using jvm.bootoptions=-Dsynapse.use.merge.sql=true, then Replicat will ABEND with the following error message: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot update identity column 'ORDER_ID'.
- Error: com.microsoft.sqlserver.jdbc.SQLServerException: Merge statements with a WHEN NOT MATCHED [BY TARGET] clause must target a hash distributed table: This indicates that the merge SQL statement is on and the Synapse target table is not a hash distributed table. You need to create the target table with a hash distribution.
Parent topic: Azure Synapse Analytics Data Warehouse