13 Using the Redshift Event Handler
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The purpose of the Redshift Event Handler is to apply operations into Redshift tables.
- Detailed Functionality
  Use the Redshift Event handler as a downstream Event handler connected to the output of the S3 Event handler. The S3 Event handler loads files generated by the File Writer Handler into Amazon S3.
- Operation Aggregation
- Unsupported Operations and Limitations
- Uncompressed UPDATE records
  The trail files used to apply to Redshift must contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated.
- Error During the Data Load Process
  Staging operation data from AWS S3 onto temporary staging tables and updating the target table occur inside a single transaction. If any error occurs, the entire transaction is rolled back and the replicat process will ABEND.
- Troubleshooting and Diagnostics
- Classpath
  Redshift apply relies on the upstream File Writer handler and the S3 Event handler.
- Configuration
- Redshift COPY SQL Authorization
  The Redshift event handler uses COPY SQL to read staged files in Amazon Web Services (AWS) S3 buckets. The COPY SQL query may need authorization credentials to access files in AWS S3.
13.1 Detailed Functionality
Use the Redshift Event handler as a downstream Event handler connected to the output of the S3 Event handler. The S3 Event handler loads files generated by the File Writer Handler into Amazon S3.
The Redshift Event handler uses COPY SQL to bulk load the operation data staged in S3 into temporary Redshift staging tables. The staging table data is then used to update the target table. All SQL operations are performed in batches, providing better throughput.
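The following is a minimal sketch of this staged-load pattern, shown only to illustrate the flow. The table, bucket, and role names and the staged file format are assumptions; the handler generates its own SQL internally.

```sql
-- Illustrative sketch only: object names and the staged file format are
-- hypothetical, and the handler's generated SQL differs in detail.
BEGIN;

-- Stage the operation data that the S3 Event handler uploaded.
CREATE TEMP TABLE stage_orders (LIKE orders);
COPY stage_orders
FROM 's3://my-ogg-bucket/ogg/orders'
IAM_ROLE 'arn:aws:iam::<aws_account_id>:role/<role_name>'
FORMAT AS CSV;

-- Apply the staged rows to the target table in bulk.
DELETE FROM orders
USING stage_orders
WHERE orders.order_id = stage_orders.order_id;
INSERT INTO orders SELECT * FROM stage_orders;

COMMIT;
```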
13.2 Operation Aggregation
- Aggregation In Memory
  Before loading the operation data into S3, the operations in the trail file are aggregated. Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
- Aggregation using SQL post loading data into the staging table
  In this aggregation operation, the in-memory operation aggregation need not be performed. The operation data loaded into the temporary staging table is aggregated using SQL queries, such that the staging table contains just one row per key.
13.2.1 Aggregation In Memory
Before loading the operation data into S3, the operations in the trail file are aggregated. Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold. For example, an INSERT followed by two UPDATE operations on the same row can be collapsed into a single INSERT that carries the final column values.
Table 13-1 Configuration Properties

Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.aggregate.operations | Optional | true or false | false | Aggregate operations based on the primary key of the operation record. |
13.2.2 Aggregation using SQL post loading data into the staging table
In this aggregation operation, the in-memory operation aggregation need not be performed. The operation data loaded into the temporary staging table is aggregated using SQL queries, such that the staging table contains just one row per key, as shown in the sketch following the table below.
Table 13-2 Configuration Properties

Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.aggregateStagingTableRows | Optional | true or false | false | Use SQL to aggregate staging table data before updating the target table. |
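For illustration, a deduplication query of this kind keeps only the most recent operation per key. This is a hypothetical sketch assuming a staging table stage_orders keyed on order_id with an operation-sequence column op_seq; the handler's actual SQL is generated internally.

```sql
-- Hypothetical sketch: discard all but the latest staged operation per key.
DELETE FROM stage_orders
WHERE op_seq < (
    SELECT MAX(s2.op_seq)
    FROM stage_orders s2
    WHERE s2.order_id = stage_orders.order_id
);
```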
13.3 Unsupported Operations and Limitations
The following operations are not supported by the Redshift Handler:
- DDL changes are not supported.
- Timestamp and Timestamp with Time zone data types: The maximum precision supported is up to microseconds; the nanoseconds portion is truncated. This is a limitation we have observed with the Redshift COPY SQL.
- Redshift COPY SQL limits the maximum size of a single input row from any source to 4 MB.
13.4 Uncompressed UPDATE records
The trail files used to apply to Redshift must contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated.
If UPDATE records have missing columns, then such columns are updated in the target as null. By setting the parameter gg.abend.on.missing.columns=true, replicat can fail fast on detecting a compressed update trail record. This is the recommended setting.
13.5 Error During the Data Load Process
Staging operation data from AWS S3 onto temporary staging tables and updating the target table occur inside a single transaction. If any error occurs, the entire transaction is rolled back and the replicat process will ABEND.
If there are errors with the COPY SQL, then the Redshift system table stl_load_errors is also queried and the error traces are made available in the handler log file. You can also query this table directly, as shown below.
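To diagnose a failed load yourself, a query such as the following shows the most recent COPY errors; the columns listed are a subset of what stl_load_errors records.

```sql
-- Show the most recent COPY errors recorded by Redshift.
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```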
13.6 Troubleshooting and Diagnostics
- Connectivity issues to Redshift:
  - Validate the JDBC connection URL, user name, and password.
  - Check if an HTTP/HTTPS proxy is enabled. Generally, Redshift endpoints cannot be accessed via a proxy.
- DDL and truncate operations not applied on the target table: The Redshift handler ignores DDL and truncate records in the source trail file.
- Target table existence: It is expected that the Redshift target table exists before starting the apply process. Target tables need to be designed with primary keys, sort keys, and partition distribution key columns. Approximations based on the column metadata in the trail file may not always be correct. Therefore, Redshift apply will ABEND if the target table is missing.
- Operation aggregation in memory (gg.aggregate.operations=true) is memory intensive, whereas operation aggregation using SQL (gg.eventhandler.name.aggregateStagingTableRows=true) requires more SQL processing on the Redshift database. These configurations are mutually exclusive, and only one of them should be enabled at a time. Tests within Oracle have revealed that operation aggregation in memory delivers a better apply rate, although this may not always be the case on all customer deployments.
- Diagnostic information on the apply process is logged in the handler log file.
  - Operation aggregation time (in milliseconds) in memory:
    INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********START*********************************
    INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Number of update operations merged into an existing update operation: [232653]
    INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent aggregating operations : [22064]
    INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent flushing aggregated operations : [36382]
    INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********END***********************************
  - Stage and load processing time (in milliseconds) for SQL queries:
    INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********START*********************************
    INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for staging process [277093]
    INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for load process [32650]
    INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********END***********************************
  - Stage time (in milliseconds) also includes additional statistics if operation aggregation using SQL is enabled.
- Co-existence of the components: The region of the machine where the replicat process is running, the AWS S3 bucket region, and the Redshift cluster region impact the overall throughput of the apply process. The data flow is: GoldenGate => AWS S3 => AWS Redshift. For best throughput, the components need to be located as close to each other as possible.
13.7 Classpath
Redshift apply relies on the upstream File Writer handler and the S3 Event handler. Include the required jars needed to run the S3 Event handler in gg.classpath; see Using the S3 Event Handler. The Redshift Event handler uses the Redshift JDBC driver. Ensure that you include the jar file in gg.classpath as shown in the following example:

gg.classpath=aws_sdk_2.28.11/lib/*:aws_sdk_2.28.11/third-party/lib/*:./redshift-jdbc42-2.1.0.29.jar
13.8 Configuration
Automatic Configuration
AWS Redshift data warehouse replication involves configuring multiple components, such as the File Writer handler, the S3 event handler, and the Redshift event handler. The Automatic Configuration feature auto-configures these components so that you need to perform only minimal configuration. The properties modified by auto configuration are also logged in the handler log file.
gg.target=redshift

gg.target
Required
Legal Value: redshift
Default: None
Explanation: Enables replication to the Redshift target.

When replicating to a Redshift target, the customization of the S3 event handler name and the Redshift event handler name is not allowed.
File Writer Handler Configuration
The File Writer handler name is pre-set to the value redshift. The following is an example of editing a property of the File Writer handler:
gg.handler.redshift.pathMappingTemplate=./dirout
S3 Event Handler Configuration
The S3 event handler name is pre-set to the value s3. The following is an example of editing a property of the S3 event handler:
gg.eventhandler.s3.bucketMappingTemplate=bucket1
Redshift Event Handler Configuration
The Redshift event handler name is pre-set to the value redshift.
Table 13-3 Properties

Properties | Required/Optional | Legal Value | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.redshift.connectionURL | Required | Redshift JDBC Connection URL | None | Sets the Redshift JDBC connection URL. Example: jdbc:redshift://aws-redshift-instance.cjoaij3df5if.us-east-2.redshift.amazonaws.com:5439/mydb |
gg.eventhandler.redshift.UserName | Required | JDBC User Name | None | Sets the Redshift database user name. |
gg.eventhandler.redshift.Password | Required | JDBC Password | None | Sets the Redshift database password. |
gg.eventhandler.redshift.awsIamRole | Optional | AWS role ARN in the format: arn:aws:iam::<aws_account_id>:role/<role_name> | None | AWS IAM role ARN that the Redshift cluster uses for authentication and authorization for executing COPY SQL to access objects in AWS S3 buckets. |
gg.eventhandler.redshift.useAwsSecurityTokenService | Optional | true or false | Value is set from the configuration property gg.eventhandler.s3.enableSTS in the upstream S3 event handler. | Use AWS Security Token Service for authorization. For more information, see Redshift COPY SQL Authorization. |
gg.eventhandler.redshift.awsSTSEndpoint | Optional | A valid HTTPS URL. | Value is set from the configuration property gg.eventhandler.s3.stsURL in the upstream S3 event handler. | The AWS STS endpoint string. For example: https://sts.us-east-1.amazonaws.com. For more information, see Redshift COPY SQL Authorization. |
gg.eventhandler.redshift.awsSTSRegion | Optional | A valid AWS region. | Value is set from the configuration property gg.eventhandler.s3.stsRegion in the upstream S3 event handler. | The AWS STS region. For example, us-east-1. For more information, see Redshift COPY SQL Authorization. |
End-to-End Configuration
The following is an end-to-end configuration example which uses auto configuration for the File Writer handler, the S3 Event handler, and the Redshift Event handler.
The sample properties are available at the following location (in an Oracle GoldenGate Classic install):
<oggbd_install_dir>/AdapterExamples/big-data/redshift-via-s3/rs.props

```
# Configuration to load GoldenGate trail operation records
# into Amazon Redshift by chaining
# File writer handler -> S3 Event handler -> Redshift Event handler.
# Note: Recommended to only edit the configuration marked as TODO

gg.target=redshift

# The S3 Event Handler
# TODO: Edit the AWS region
gg.eventhandler.s3.region=<aws region>
# TODO: Edit the AWS S3 bucket
gg.eventhandler.s3.bucketMappingTemplate=<s3bucket>

# The Redshift Event Handler
# TODO: Edit ConnectionUrl
gg.eventhandler.redshift.connectionURL=jdbc:redshift://aws-redshift-instance.cjoaij3df5if.us-east-2.redshift.amazonaws.com:5439/mydb
# TODO: Edit Redshift user name
gg.eventhandler.redshift.UserName=<db user name>
# TODO: Edit Redshift password
gg.eventhandler.redshift.Password=<db password>
# TODO: Set the classpath to include AWS Java SDK and Redshift JDBC driver.
gg.classpath=aws_sdk_2.28.11/lib/*:aws_sdk_2.28.11/third-party/lib/*:./redshift-jdbc42-2.1.0.29.jar
```
13.9 Redshift COPY SQL Authorization
The Redshift event handler uses COPY SQL to read staged files in Amazon Web Services (AWS) S3 buckets. The COPY SQL query may need authorization credentials to access files in AWS S3.
Authorization can be provided by using an AWS Identity and Access Management (IAM) role that is attached to the Redshift cluster or by providing an AWS access key and a secret for the access key. As a security consideration, it is a best practice to use role-based access when possible.
AWS Key-Based Authorization
With key-based access control, you provide the access key ID and secret access key for an AWS IAM user that is authorized to access AWS S3. The access key ID and secret access key are retrieved by looking up the credentials as follows (a sketch of the resulting COPY authorization clause follows the list):
- Environment variables: AWS_ACCESS_KEY/AWS_ACCESS_KEY_ID and AWS_SECRET_KEY/AWS_SECRET_ACCESS_KEY.
- Java system properties: aws.accessKeyId and aws.secretAccessKey.
- Credential profiles file at the default location (~/.aws/credentials).
- Amazon Elastic Container Service (ECS) container credentials, loaded from Amazon ECS if the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set.
- Instance profile credentials, retrieved from the Amazon Elastic Compute Cloud (EC2) metadata service.
Running Replicat on an AWS EC2 Instance
If the replicat process is started on an AWS EC2 instance, then the access key ID and secret access key are automatically retrieved by Oracle GoldenGate for Big Data and no explicit user configuration is required.
Temporary Security Credentials using AWS Security Token Service (STS)
If you use key-based access control, then you can further limit the access users have to your data by retrieving temporary security credentials using AWS Security Token Service. The auto configure feature of the Redshift event handler automatically picks up the AWS Security Token Service (STS) configuration from the S3 event handler.
Table 13-4 S3 Event Handler Configuration and Redshift Event Handler Configuration

S3 Event Handler Configuration | Redshift Event Handler Configuration |
---|---|
enableSTS | useAwsSTS |
stsURL | awsSTSEndpoint |
stsRegion | awsSTSRegion |
AWS IAM Role-based Authorization
With role-based authorization, the Redshift cluster temporarily assumes an IAM role when executing COPY SQL. You need to provide the role Amazon Resource Name (ARN) as a configuration value as follows: gg.eventhandler.redshift.AwsIamRole. For example: gg.eventhandler.redshift.AwsIamRole=arn:aws:iam::<aws_account_id>:role/<role_name>.
The role needs to be authorized to read the respective S3 bucket. Ensure that the trust relationship of the role contains the AWS Redshift service. Additionally, attach this role to the Redshift cluster before starting the Redshift cluster. The following is an example of an AWS IAM policy that can be used in the trust relationship of the role:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "redshift.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] }
If role-based authorization is configured (gg.eventhandler.redshift.AwsIamRole), then it is given priority over key-based authorization.
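For illustration, with role-based authorization the COPY statement carries the role ARN instead of access keys; the staging table and bucket names below are hypothetical:

```sql
-- Illustrative sketch: role-based COPY authorization.
-- The cluster assumes the attached IAM role to read the staged files.
COPY stage_orders
FROM 's3://my-ogg-bucket/ogg/orders'
IAM_ROLE 'arn:aws:iam::<aws_account_id>:role/<role_name>';
```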