9.2.5 Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The purpose of the Redshift Event Handler is to apply operations into Redshift tables.
See Flat Files.
- Detailed Functionality
Ensure to use the Redshift Event handler as a downstream Event handler connected to the output of the S3 Event handler. The S3 Event handler loads files generated by the File Writer Handler into Amazon S3. - Operation Aggregation
- Unsupported Operations and Limitations
- Uncompressed UPDATE records
It is mandatory that the trail files used to apply to Redshift contain uncompressed UPDATE operation records, which means that theUPDATE
operations contain full image of the row being updated. - Error During the Data Load Proces
Staging operation data from AWS S3 onto temporary staging tables and updating the target table occurs inside a single transaction. In case of any error(s), the entire transaction is rolled back and the replicat process will ABEND. - Troubleshooting and Diagnostics
- Classpath
Redshift apply relies on the upstream File Writer handler and the S3 Event handler. - Configuration
- INSERTALLRECORDS Support
- Redshift COPY SQL Authorization
The Redshift event handler usesCOPY SQL
to read staged files in Amazon Web Services (AWS) S3 buckets. TheCOPY SQL
query may need authorization credentials to access files in AWS S3. - Co-ordinated Apply Support
- Support for Mixed Case Identifiers
Parent topic: Target
9.2.5.1 Detailed Functionality
Ensure to use the Redshift Event handler as a downstream Event handler connected to the output of the S3 Event handler. The S3 Event handler loads files generated by the File Writer Handler into Amazon S3.
Redshift Event handler uses the COPY SQL to bulk load operation data available in S3 into temporary Redshift staging tables. The staging table data is then used to update the target table. All the SQL operations are performed in batches providing better throughput.
Parent topic: Amazon Redshift
9.2.5.2 Operation Aggregation
- In-Memory Operation Aggregation
- Aggregation using SQL post loading data into the staging table
In this aggregation operation, the in-memory operation aggregation need not be performed. The operation data loaded into the temporary staging table is aggregated using SQL queries, such that the staging table contains just one row per key.
Parent topic: Amazon Redshift
9.2.5.2.1 In-Memory Operation Aggregation
- Operation records are aggregated in-memory by default.
- The
gg.aggregate.operations.flush.interval
property has been deprecated and is no longer supported. If thegg.aggregate.operations.flush.interval
is used in GG for DAA 23ai, then replicat will run; but add a warning to log file about the property being deprecated and not supported.To control the time window for aggregation, use the
gg.handler.redshift.fileRollInterval
property. By default, it is set to 3 minutes. Longer intervals will increase latency, and may increase memory usage. Shorter intervals will increase overhead in Oracle GoldenGate and the target database. - Operation aggregation in-memory requires additional JVM memory configuration.
Parent topic: Operation Aggregation
9.2.5.2.2 Aggregation using SQL post loading data into the staging table
In this aggregation operation, the in-memory operation aggregation need not be performed. The operation data loaded into the temporary staging table is aggregated using SQL queries, such that the staging table contains just one row per key.
Table 9-3 Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.aggregateStagingTableRows |
Optional | True| False |
False |
Use SQL to aggregate staging table data before updating the target table. |
Parent topic: Operation Aggregation
9.2.5.3 Unsupported Operations and Limitations
The following operations are not supported by the Redshift Handler:
- DDL changes are not supported.
- Timestamp and Timestamp with Time zone data types: The maximum precision supported is up to microseconds, the nanoseconds portion will be truncated. This is a limitation we have observed with the Redshift COPY SQL.
- Redshift COPY SQL has a limitation on the maximum size of a single input row from any source is 4MB.
Parent topic: Amazon Redshift
9.2.5.4 Uncompressed UPDATE records
It is mandatory that the trail files used to apply to Redshift contain
uncompressed UPDATE operation records, which means that the UPDATE
operations contain full image of the row being updated.
If UPDATE
records have missing columns, then such columns are
updated in the target as null. By setting the parameter
gg.abend.on.missing.columns=true
, replicat can fail
fast on detecting a compressed update trail record. This is the recommended
setting.
Parent topic: Amazon Redshift
9.2.5.5 Error During the Data Load Proces
Staging operation data from AWS S3 onto temporary staging tables and updating the target table occurs inside a single transaction. In case of any error(s), the entire transaction is rolled back and the replicat process will ABEND.
If there are errors with the COPY SQL, then the Redshift system table
stl_load_errors
is also queried and the error traces are made
available in the handler log file.
Parent topic: Amazon Redshift
9.2.5.6 Troubleshooting and Diagnostics
- Connectivity issues to Redshift
- Validate JDBC connection URL, user name and password.
- Check if http/https proxy is enabled. Generally, Redshift endpoints cannot be accessed via proxy.
- DDL and Truncate operations not applied on the target table: The Redshift handler will ignore DDL and truncate records in the source trail file.
- Target table existence: It is expected that the Redshift target table
exists before starting the apply process. Target tables need to be designed with
primary keys, sort keys, partition distribution key columns. Approximations based on
the column metadata in the trail file may not be always correct. Therefore, Redshift
apply will
ABEND
if the target table is missing. - Operation aggregation in-memory
(
gg.aggregagte.operations=true
) is memory intensive where as operation aggregation using SQL(gg.eventhandler.name.aggregateStagingTableRows=true
) requires more SQL processing on the Redshift database. These configurations are mutually exclusive and only one of them should be enabled at a time. Tests within Oracle have revealed that operation aggregation in memory delivers better apply rate. This may not always be the case on all the customer deployments. - Diagnostic information on the apply process is logged onto the handler log file.
- Operation aggregation time (in milli-seconds) in-memory:
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********START*********************************
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Number of update operations merged into an existing update operation: [232653]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent aggregating operations : [22064]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent flushing aggregated operations : [36382]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********END***********************************
- Stage and load processing time (in milli-seconds) for SQL queries
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********START*********************************
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for staging process [277093]
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for load process [32650]
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********END***********************************
- Stage time (in milli-seconds) will also include additional statistics if operation aggregation using SQL is enabled.
- Co-existence of the components: The location/region of the machine
where replicat process is running, AWS S3 bucket region and the Redshift cluster
region would impact the overall throughput of the apply process. Data flow is as
follows:
GoldenGate => AWS S3 => AWS Redshift
. For best throughput, the components need to be located as close as possible.
Parent topic: Amazon Redshift
9.2.5.7 Classpath
Redshift apply relies on the upstream File Writer handler and the S3 Event handler.
Include the required jars needed to run the S3 Event handler in gg.classpath.
See Amazon S3. Redshift Event handler uses the Redshift JDBC driver. Ensure to
include the jar file in gg.classpath
as shown in the following example:
gg.classpath=aws_sdk_2.28.11/lib/:aws_sdk_2.28.11/third-party/lib/:./redshift-jdbc42-2.1.0.29.jar
Parent topic: Amazon Redshift
9.2.5.8 Configuration
Automatic Configuration
AWS Redshift Data warehouse replication involves configuring of multiple components, such as file writer handler, S3 event handler and Redshift event handler. The Automatic Configuration feature auto configures these components so that you need to perform minimal configurations. The properties modified by auto configuration will also be logged in the handler log file.
gg.target=redshift
gg.target
Required
Legal Value: redshift
Default: None
Explanation: Enables replication to Redshift target
When replicating to Redshift target, the customization of S3 event hander name and Redshift event handler name is not allowed.
File Writer Handler Configuration
File writer handler name is pre-set to the value redshift
. The
following is an example to edit a property of file writer handler:
gg.handler.redshift.pathMappingTemplate=./dirout
S3 Event Handler Configuration
S3 event handler name is pre-set to the value s3. The following is an
example to edit a property of the S3 event handler:
gg.eventhandler.s3.bucketMappingTemplate=bucket1
.
Redshift Event Handler Configuration
The Redshift event handler name is pre-set to the value redshift.
Table 9-4 Properties
Properties | Required/Optional | Legal Value | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.redshift.connectionURL |
Required | Redshift JDBC Connection URL | None |
Sets the Redshift JDBC connection URL. Example:
|
gg.eventhandler.redshift.UserName |
Required | JDBC User Name | None | Sets the Redshift database user name. |
gg.eventhandler.redshift.Password |
Required | JDBC Password | None | Sets the Redshift database password. |
gg.eventhandler.redshift.awsIamRole |
Optional | AWS role ARN in the format:
arn:aws:iam::<aws_account_id>:role/<role_name> |
None | AWS IAM role ARN that the Redshift cluster uses for
authentication and authorization for executing COPY
SQL to access objects in AWS S3 buckets.
|
gg.eventhandler.redshift.useAwsSecurityTokenService |
Optional | true | false |
Value is set from the configuration property set in
the upstream s3 Event handler
gg.eventhandler.s3.enableSTS |
Use AWS Security Token Service for authorization. For more information, see Redshift COPY SQL Authorization. |
gg.eventhandler.redshift.awsSTSEndpoint |
Optional | A valid HTTPS URL. | Value is set from the configuration property set in
the upstream s3 Event handler
gg.eventhandler.s3.stsURL .
|
The AWS STS endpoint string. For example: https://sts.us-east-1.amazonaws.com. For more information, see Redshift COPY SQL Authorization. |
gg.eventhandler.redshift.awsSTSRegion |
Optional | A valid AWS region. | Value is set from the configuration property set in
the upstream s3 Event handler
gg.eventhandler.s3.stsRegion .
|
The AWS STS region. For example,
us-east-1 . For more information, see Redshift COPY SQL Authorization.
|
gg.initialLoad |
Optional | true | false |
false |
If set to true , initial load mode
is enabled. See INSERTALLRECORDS Support.
|
gg.eventhandler.redshift.connectionRetryIntervalSeconds
|
Optional | Integer Value | 30 | Specifies the delay in seconds between connection retry attempts. |
gg.eventhandler.redshift.connectionRetries
|
Optional | Integer Value | 3 | Specifies the number of times connections to the target data warehouse will be retried. |
gg.handler.redshift.fileRollInterval |
Optional | The default unit of measure is milliseconds. You can stipulate ms, s, m, h to signify milliseconds, seconds, minutes, or hours respectively. Examples of legal values include 10000, 10000ms, 10s, 10m, or 1.5h. Values of 0 or less indicate that file rolling on time is turned off. | 3m (three minutes) | The parameter determines how often the data will be
merged into Redshift. Use with caution, the higher this value is the
more data will need to be stored in the memory of the Replicat
process.
Note: Use the parameter with caution. Increasing its default value (3m) will increase the amount of data stored in the internal memory of the Replicat. This can cause out of memory errors and stop the Replicat if it runs out of memory.Note: Starting with the 23ai release, thegg.aggregate.operations.flush.interval
property is deprecated and no longer supported. For more
information, see In-Memory Operation Aggregation |
End-to-End Configuration
The following is an end-end configuration example which uses auto configuration for FW handler, S3 and Redshift Event handlers.
The sample properties are available at the following location
- In an Oracle GoldenGate Classic install:
<oggbd_install_dir>/AdapterExamples/big-data/redshift-via-s3/rs.props
- In an Oracle GoldenGate Microservices install:
<
oggbd_install_dir>/opt/AdapterExamples/big-data/redshift-via-s3/rs.props
# Configuration to load GoldenGate trail operation records # into Amazon Redshift by chaining # File writer handler -> S3 Event handler -> Redshift Event handler. # Note: Recommended to only edit the configuration marked as TODO
gg.target=redshift #The S3 Event Handler #TODO: Edit the AWS region gg.eventhandler.s3.region=<aws region> #TODO: Edit the AWS S3 bucket gg.eventhandler.s3.bucketMappingTemplate<s3bucket>
#The Redshift Event Handler #TODO: Edit ConnectionUrl gg.eventhandler.redshift.connectionURL=jdbc:redshift://aws-redshift-instance.cjoaij3df5if.us-east-2.redshift.amazonaws.com:5439/mydb #TODO: Edit Redshift user name gg.eventhandler.redshift.UserName=<db user name> #TODO: Edit Redshift password gg.eventhandler.redshift.Password=<db password> #TODO:Set the classpath to include AWS Java SDK and Redshift JDBC driver. gg.classpath=aws_sdk_2.28.11/lib/:aws_sdk_2.28.11/third-party/lib/:./redshift-jdbc42-2.1.0.29.ja
Parent topic: Amazon Redshift
9.2.5.9 INSERTALLRECORDS Support
Stage and merge targets supports INSERTALLRECORDS
parameter.
See INSERTALLRECORDS in Reference for
Oracle GoldenGate. Set the INSERTALLRECORDS
parameter in
the Replicat parameter file (.prm
).
Setting this property directs the Replicat process to use bulk insert
operations to load operation data into the target table. You can tune the batch size
of bulk inserts using the File Writer property
gg.handler.redshift.maxFileSize
. The default value is set to
1GB. The frequency of bulk inserts can be tuned using the File Writer property
gg.handler.redshift.fileRollInterval
, the default value is set
to 3m (three minutes).
Note:
Parent topic: Amazon Redshift
9.2.5.10 Redshift COPY SQL Authorization
The Redshift event handler uses COPY SQL
to read staged files in
Amazon Web Services (AWS) S3 buckets. The COPY SQL
query may need
authorization credentials to access files in AWS S3.
Authorization can be provided by using an AWS Identity and Access Management (IAM) role that is attached to the Redshift cluster or by providing a AWS access key and a secret for the access key. As a security consideration, it is a best practise to use role-based access when possible.
AWS Key-Based Authorization
With key-based access control, you provide the access key ID and secret access key for an AWS IAM user that is authorized to access AWS S3. The access key id and secret access key are retrieved by looking up the credentials as follows:
- Environment variables -
AWS_ACCESS_KEY/AWS_ACCESS_KEY_ID
andAWS_SECRET_KEY/AWS_SECRET_ACCESS_KEY
. - Java System Properties -
aws.accessKeyId
andaws.secretAccessKey
. - Credential profiles file at the default location
(
~/.aws/credentials
). - Amazon Elastic Container Service (ECS) container credentials loaded
from Amazon ECS if the environment variable
AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
is set. - Instance profile credentials retrieved from Amazon Elastic Compute Cloud (EC2) metadata service.
Running Replicat on an AWS EC2 Instance
If the replicat process is started on an AWS EC2 instance, then the access key ID and secret access key are automatically retrieved by Oracle GoldenGate for BigData and no explicit user configuration is required.
Temporary Security Credentials using AWS Security Token Service (STS)
If you use the key-based access control, then you can further limit the access users have to your data by retrieving temporary security credentials using AWS Security Token Service. The auto configure feature of the Redshift event handler automatically picks up the AWS Security Token Service (STS) configuration from S3 event handler.
Table 9-5 S3 Event Handler Configuration and Redshift Event Handler Configuration
S3 Event Handler Configuration | Redshift Event Handler Configuration |
---|---|
enableSTS |
useAwsSTS |
stsURL |
awsSTSEndpoint |
stsRegion |
awsSTSRegion |
AWS IAM Role-based Authorization
With role-based authorization, Redshift cluster temporarily assumes an
IAM role when executing COPY SQL
. You need to provide the role
Amazon Resource Number (ARN) as a configuration value as follows:
gg.eventhandler.redshift.AwsIamRole
. For example:
gg.eventhandler.redshift.AwsIamRole=arn:aws:iam::<aws_account_id>:role/<role_name>
.
The role needs to be authorized to read the respective S3 bucket. Ensure that the
trust relationship of the role contains the AWS redshift service. Additionally,
attach this role to the Redshift cluster before starting the Redshift cluster. For
example, AWS IAM policy that can be used in the the trust relationship of the
role.
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "redshift.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] }
If the role-based authorization is configured
(gg.eventhandler.redshift.AwsIamRole
), then it is given
priority over key-based authorization.
Parent topic: Amazon Redshift
9.2.5.11 Co-ordinated Apply Support
To enable co-ordinated apply for Redshift, ensure that the Redshift
database's isolation level is set to SNAPSHOT
. The Redshift
SNAPSHOT ISOLATION
option allows higher concurrency, where
concurrent modifications to different rows in the same table can complete
successfully.
SQL Query to Alter the Database's Isolation Level
ALTER DATABASE <sampledb>
ISOLATION LEVEL SNAPSHOT;
Parent topic: Amazon Redshift
9.2.5.12 Support for Mixed Case Identifiers
MAP QASOURCE.TCUSTMER, TARGET "QaTarget"."TCustmer";
Parent topic: Amazon Redshift