9.2.17 Databricks
Overview
Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.
9.2.17.1 Detailed Functionality
Replication to Databricks uses a stage-and-merge data flow.
The change data from the Oracle GoldenGate trails is staged in micro-batches at a temporary staging location, typically a cloud object store.
The staged records are then merged into the Databricks target tables using a merge SQL statement.
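The following is a simplified sketch of the kind of statement involved. The actual statements are generated by the Databricks Event Handler, and the staging table, target table, column names, and operation-type column shown here are illustrative only:
MERGE INTO sales.orders AS t
USING ogg_staging_orders AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.op_type = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED AND s.op_type <> 'D' THEN INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount);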
Parent topic: Databricks
9.2.17.1.1 Staging location
The change data records from the GoldenGate trail files are formatted into the Avro Object Container File (OCF) format and uploaded to the staging location. Change data can be staged in one of the following object stores, based on the Databricks configuration.
- Azure Data Lake Storage (ADLS) Gen2
- AWS Simple Storage Service (S3)
- Google Cloud Storage (GCS)
Parent topic: Detailed Functionality
9.2.17.1.2 Database User Privileges
The database user used for replicating into Databricks has to be granted the following privileges:
- CREATE, INSERT, UPDATE, DELETE, and TRUNCATE on the target tables.
- CREATE, ALTER, and DROP external tables.
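On a Unity Catalog workspace, these requirements are typically satisfied with Databricks SQL grants. The following is a hedged sketch only; the catalog (main), schema (sales), and principal (ogg_etl) are hypothetical, and Unity Catalog privilege names differ from the generic list above (for example, MODIFY covers the DML privileges):
GRANT USE CATALOG ON CATALOG main TO `ogg_etl`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.sales TO `ogg_etl`;
GRANT SELECT, MODIFY ON SCHEMA main.sales TO `ogg_etl`;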
Parent topic: Detailed Functionality
9.2.17.1.3 Prerequisites
- You must have Azure, Amazon Web Services, or Google Cloud Platform cloud accounts set up for Databricks.
- Azure storage accounts must have hierarchical namespace enabled for replication to Databricks on Azure.
- Databricks JDBC driver. For more information about the recommended JDBC driver version, see Databricks documentation.
Parent topic: Detailed Functionality
9.2.17.2 Configuration
The configuration of the Databricks replication properties is stored in the Replicat properties file.
- Automatic Configuration
- Authentication to Databricks
- Unity Catalog
- Create an External Location
- Compute Clusters Without Unity Catalog
- Classpath Configuration
- Proxy Configuration
- INSERTALLRECORDS Support
- Operation Aggregation
- Compressed Update Handling
- End-to-End Configuration
- Table Mapping
Parent topic: Databricks
9.2.17.2.1 Automatic Configuration
Databricks replication involves configuring multiple components, such as the File Writer Handler, the ABS, S3, or GCS Event Handler, and the target Databricks Event Handler.
The automatic configuration functionality autoconfigures these components so that only minimal manual configuration is required.
The properties modified by autoconfiguration are also logged in the handler log file.
To enable autoconfiguration to replicate to the Databricks target, set the parameter gg.target=databricks.
The parameter gg.stage determines the staging location.
If gg.stage is unset, ADLS Gen2 is used as the staging location.
If gg.stage is set to abs, s3, or gcs, then ADLS Gen2, AWS S3, or GCS, respectively, is used as the staging location.
The JDBC Metadata Provider is also automatically enabled to retrieve target table metadata from Databricks.
Target tables are automatically created if missing.
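For example, a minimal sketch of the properties that typically need to be set when staging through S3 (the remaining handler properties are autoconfigured):
gg.target=databricks
gg.stage=s3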
- File Writer Handler Configuration
- ABS Event Handler Configuration
- S3 Event Handler Configuration
- GCS Event Handler Configuration
- Databricks Event Handler Configuration
Parent topic: Configuration
9.2.17.2.1.1 File Writer Handler Configuration
The File Writer Handler name is pre-set to the value databricks, and its properties are automatically set to the required values for Databricks.
Parent topic: Automatic Configuration
9.2.17.2.1.2 ABS Event Handler Configuration
The ABS Event Handler name is pre-set to the value abs and must be configured to match your ADLS Gen2 configuration.
The following is an example of editing a property of the ABS Event Handler:
gg.eventhandler.abs.bucketMappingTemplate=container1
For more information about integrating with ADLS Gen2, see Azure Blob Storage Configuration.
Parent topic: Automatic Configuration
9.2.17.2.1.3 S3 Event Handler Configuration
The S3 Event Handler name is pre-set to the value s3 and must be configured to match your S3 configuration. The following is an example of editing a property of the S3 Event Handler:
gg.eventhandler.s3.bucketMappingTemplate=bucket1
For more information about integrating with S3, see S3 Event Handler Configuration.
Parent topic: Automatic Configuration
9.2.17.2.1.4 GCS Event Handler Configuration
The GCS Event Handler name is pre-set to the value gcs and must be configured to match your GCS configuration. The following is an example of editing a GCS Event Handler property:
gg.eventhandler.gcs.bucketMappingTemplate=bucket1
For more information about integrating with GCS, see GCS Event Handler Configuration.
Parent topic: Automatic Configuration
9.2.17.2.1.5 Databricks Event Handler Configuration
The Databricks Event Handler name is pre-set to the value databricks.
The following configuration properties are available for the Databricks Event Handler. The required properties must be set to match your Databricks configuration.
Table 9-13
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.databricks.connectionURL | Required | jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>] | None | JDBC URL to connect to Databricks. See Databricks Authentication Methods. |
gg.eventhandler.databricks.UserName | Optional | Supported database user name string. | None | Databricks database user or token. |
gg.eventhandler.databricks.Password | Optional | Supported database password string. | None | Databricks database password or token value. |
gg.eventhandler.databricks.credential | Optional | Storage credential name. | None | External storage credential name used to access files on object storage such as ADLS Gen2, S3, or GCS. For more information, see Create a Storage Credential. |
gg.eventhandler.databricks.createTable | Optional | true or false | true | If set to true, target tables are automatically created if missing. |
gg.eventhandler.databricks.maxConnections | Optional | Integer value | 10 | Controls the number of concurrent JDBC database connections to the target database. |
gg.eventhandler.databricks.connectionRetries | Optional | Integer value | 3 | Specifies the number of times connections to the target data warehouse are retried. |
gg.eventhandler.databricks.connectionRetryIntervalSeconds | Optional | Integer value | 30 | Specifies the delay in seconds between connection retry attempts. |
gg.eventhandler.databricks.deleteInsert | Optional | true or false | false | If set to true, Replicat merges records using SQL DELETE+INSERT statements instead of a SQL MERGE statement. Note: Applicable only if gg.compressed.update is set to false. |
gg.eventhandler.databricks.detectMissingBaseRow | Optional | true or false | false | Diagnostic parameter to find UPDATE operations without a base row. If set to true, Replicat ABENDs if there are UPDATE operations without a base row. These rows are collected into another table that can be investigated. |
gg.eventhandler.databricks.dropStagingTablesOnShutdown | Optional | true or false | false | If set to true, the temporary staging tables created by Oracle GoldenGate are dropped on graceful stop of the Replicat. |
gg.handler.databricks.fileRollInterval | Optional | The default unit of measure is milliseconds. You can stipulate ms, s, m, or h to signify milliseconds, seconds, minutes, or hours respectively. Examples of legal values include 10000, 10000ms, 10s, 10m, or 1.5h. Values of 0 or less turn off file rolling on time. | 3m (three minutes) | Determines how often the data is merged into Databricks. Note: Use this parameter with caution. Increasing its default value (3m) increases the amount of data stored in the internal memory of the Replicat, which can cause out-of-memory errors and stop the Replicat. |
gg.validate.keyupdate | Optional | true or false | false | If set to true, Replicat validates key update operations (optype 115) and corrects them to normal updates if no key values have changed. Compressed key update operations do not qualify for merge. |
gg.compressed.update | Optional | true or false | true | If set to true, indicates that the source trail files contain compressed update operations. If set to false, the source trail files are expected to contain uncompressed update operations. |
gg.eventhandler.databricks.deltaUniversalFormat | Optional | iceberg or '' (empty) | '' (empty) | If set to iceberg, Replicat automatically creates target tables with Delta Lake Universal Format set to iceberg, which allows you to read Delta tables with Iceberg reader clients. Universal Format can be enabled only on Databricks tables managed by Unity Catalog. This feature requires Databricks Runtime 14.3 LTS or above. See: https://docs.databricks.com/en/delta/uniform.html |
Parent topic: Automatic Configuration
9.2.17.2.2 Authentication to Databricks
Databricks JDBC connection URLs use the following format:
jdbc:databricks://<Host>:<Port>;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>]
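For example, a hypothetical URL for a Databricks SQL warehouse (the host and HTTP path values are placeholders for illustration; the ssl=1 setting comes from the compute settings table that follows):
jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443;httpPath=/sql/1.0/warehouses/abcdef1234567890;ssl=1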
Parent topic: Configuration
9.2.17.2.2.1 Compute Settings for the Databricks JDBC Driver
The driver requires the following compute resource configuration settings:
Table 9-14 Compute Settings for the Databricks JDBC Driver
Setting | Description |
---|---|
Host | The Databricks compute resource’s Server Hostname value. |
Port | 443 |
httpPath | The Databricks compute resource’s HTTP Path value. |
ssl | 1 |
- Connection Details for Compute Cluster
- Connection Details for Databricks SQL Warehouse
- Databricks Authentication Methods
- Databricks Personal Access Token
- Databricks Username and Password
- Use a Service Principal to Authenticate with Databricks (OAuth M2M)
Parent topic: Authentication to Databricks
9.2.17.2.2.1.1 Connection Details for Compute Cluster
To get the connection details for the Databricks compute cluster:
- Log in to your Databricks workspace.
- In the sidebar, click Compute.
- In the list of available clusters, click the target cluster's name.
- On the Configuration tab, expand Advanced options.
- In the JDBC/ODBC tab, the Server hostname, Port, and HTTP path can be found.
Parent topic: Compute Settings for the Databricks JDBC Driver
9.2.17.2.2.1.2 Connection Details for Databricks SQL Warehouse
To get the connection details for the Databricks SQL warehouse:
- Log in to your Databricks workspace.
- In the sidebar, click SQL Warehouses.
- In the list of available warehouses, click the target warehouse’s name.
- In the Connection details tab, the Server hostname, Port, and HTTP path can be found.
Parent topic: Compute Settings for the Databricks JDBC Driver
9.2.17.2.2.1.3 Databricks Authentication Methods
The Databricks JDBC Driver supports the following authentication methods:
- Databricks personal access token
- Databricks username and password
- OAuth 2.0 tokens:
  - OAuth user-to-machine (U2M) authentication
  - OAuth machine-to-machine (M2M) authentication
Note:
OAuth U2M or OAuth 2.0 browser-based authentication works only with applications that run locally. It does not work with server-based or cloud-based applications. Oracle GoldenGate does not support authentication using OAuth user-to-machine (U2M).
Parent topic: Compute Settings for the Databricks JDBC Driver
9.2.17.2.2.1.4 Databricks Personal Access Token
To create a Databricks personal access token:
- In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down menu.
- Click Developer.
- Next to Access tokens, click Manage.
- Click Generate New Token.
- (Optional) Enter a comment that helps you to identify this token in the
future, and change the token’s default lifetime of 90
days.
To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).
- Click Generate.
- Copy the displayed token to a secure location, and then click Done.
Parent topic: Compute Settings for the Databricks JDBC Driver
9.2.17.2.2.1.5 Databricks Username and Password
Databricks username and password authentication is also known as Databricks basic authentication. Username and password authentication is possible only if single sign-on is disabled.
Set the gg.eventhandler.databricks.UserName and gg.eventhandler.databricks.Password properties to the respective values.
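For example, a sketch using basic authentication (the values are placeholders; for personal access token authentication, set UserName to the literal string token and Password to the token value, as shown in the end-to-end examples later in this section):
gg.eventhandler.databricks.UserName=<databricks-user>
gg.eventhandler.databricks.Password=<databricks-password>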
Note:
In the 23ai release, Oracle has not yet certified authentication using OAuth 2.0 tokens, OAuth user-to-machine (U2M), and OAuth machine-to-machine (M2M).
Parent topic: Compute Settings for the Databricks JDBC Driver
9.2.17.2.2.1.6 Use a Service Principal to Authenticate with Databricks (OAuth M2M)
To create a service principal at the workspace level, follow these steps:
- Log in to the Databricks workspace as a workspace admin.
- Click your username in the top bar of the Databricks workspace and select Settings.
- Click Identity and Access.
- Next to Service Principals, click Manage.
- Click Add service principal.
- Click the drop-down arrow in the search box and then click Add New.
- Enter a name for the service principal.
- Click Add.
Parent topic: Compute Settings for the Databricks JDBC Driver
9.2.17.2.3 Unity Catalog
Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.
In Unity Catalog, the hierarchy of primary data objects flows from metastore to table or volume:
- Metastore: The top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data. Note: If your workspace includes a legacy Hive metastore, then the data in that metastore will still be available alongside data defined in Unity Catalog, in a catalog named hive_metastore.
- Catalog: The first layer of the object hierarchy, used to organize your data assets.
- Schema: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.
- Tables, views, and volumes: At the lowest level in the data object hierarchy are tables, views, and volumes. Volumes provide governance for non-tabular data.
9.2.17.2.3.1 Managed Tables
Oracle GoldenGate replicates data to Databricks managed tables. Managed
tables use the DELTA
table format.
Parent topic: Unity Catalog
9.2.17.2.3.1.1 Tables Inside hive_metastore Catalog
If Unity Catalog was enabled on an existing Databricks workspace, then the existing tables are available in the hive_metastore catalog.
Tables under the hive_metastore catalog do not support primary keys. The Oracle GoldenGate Replicat MAP statement should use KEYCOLS to define the key columns required for stage-and-merge replication.
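A minimal sketch of such a mapping in the Replicat parameter file; the source table, target schema, and key column names are hypothetical:
MAP SRC_SCHEMA.ORDERS, TARGET "hive_metastore"."sales"."orders", KEYCOLS (ORDER_ID);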
Parent topic: Managed Tables
9.2.17.2.3.2 External Tables
External tables are file-backed tables that reference data stored in an external location.
An external location is object storage such as ADLS Gen2, AWS S3, or GCS.
- Storage credential: A storage credential allows Databricks to access data in cloud storage.
- External location: An external location contains a reference to a storage credential and a cloud storage path.
Parent topic: Unity Catalog
9.2.17.2.3.3 Create a Storage Credential
A storage credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant.
- Storage Credential to Access Azure Storage Account
- Storage Credential to Access Google Storage Account
Parent topic: Unity Catalog
9.2.17.2.3.3.1 Storage Credential to Access Azure Storage Account
To create a storage credential to access an Azure storage account:
- Create an Azure resource called “Access Connector for Azure Databricks”
- In the Azure portal, search for Access Connector for Azure Databricks and select the Access Connector for Azure Databricks service.
- Follow the steps to create a connector resource.
- In the Azure portal, search for “Storage accounts” and select the storage account that should be used as a GoldenGate staging location.
- In the sidebar, click Access Control (IAM).
- On Role assignments, click +Add, and select Add role assignment from the drop-down menu.
- In the search bar, enter Storage Blob Data Contributor, select the role from the list, and click Next.
- Click Members: +Select members, in the search bar, enter the name of the Azure Databricks connector resource, and click Select.
- Back in the Azure portal, search for the Azure Databricks connector resource, and note down the Resource ID.
- Create a storage credential in Databricks:
- Log in to your Databricks workspace.
- In the sidebar, click Catalog.
- Click External Data, then click Storage Credentials.
- Click Create Storage Credential.
- In the Create Storage Credential dialog, enter the following
details:
- Credential Type: Choose Azure Managed Identity from the drop-down menu.
- Storage credential name: Enter a name for the storage credential.
- Access connector ID: Enter the resource ID of the Azure Databricks connector resource.
- Click Advanced Options, check the box Limit to read-only use.
- Click Create.
Note:
You can create multiple storage credentials for different storage accounts.
Parent topic: Create a Storage Credential
9.2.17.2.3.3.2 Storage Credential to Access Google Storage Account
To create a storage credential to access a Google storage account:
- Create a storage credential to access Google Cloud Storage:
- Log in to your Databricks workspace.
- In the sidebar, click Catalog.
- Click External Data, then click Storage Credentials.
- Click Create Storage Credential.
- In the Create Storage Credential dialog, enter the
following details:
- Credential Type: Choose Google Service Account from the drop-down menu.
- Storage credential name: Enter a name for the storage credential.
- Click Advanced Options, check the box Limit to read-only use.
- Click Create.
- On the Storage credential created dialog, make a note of the service account ID, which is in the form of an email address, and click Done.
- Configure permissions for the service account:
- Go to the Google Cloud console and open the GCS bucket that you want to
access from your Databricks workspace.
The bucket should be in the same region as your Databricks workspace.
- On the Permissions tab, click + Grant access and assign the service account the following roles:
- Storage Legacy Bucket Reader
- Storage Object Admin
Use the service account’s email address as the principal identifier.
Note:
You can create multiple storage credentials for different storage accounts.
Parent topic: Create a Storage Credential
9.2.17.2.4 Create an External Location
External locations associate Unity Catalog storage credentials with cloud object storage containers.
External locations are used to define managed storage locations for catalogs and schemas, and to define locations for external tables and external volumes.
Oracle GoldenGate recommends creating an external location to simplify configuration, but you may decide to skip this step.
- External Location to Access Azure Storage Account
- External Location to Access Google Storage Account
Parent topic: Configuration
9.2.17.2.4.1 External Location to Access Azure Storage Account
To create an external location:
- Log in to your Databricks workspace.
- In the sidebar, click Catalog.
- Click External Data, then click External Locations.
- Click Create Location.
- In the Create a new external location form, enter the
following details:
- External location name: Enter a name for the external location.
- Storage credential: Choose the storage credential that you created earlier from the drop-down menu.
- URL: Enter the bucket path
that you want to use as the external location. For example:
abfss://[email protected]/
- Click Advanced Options, check the box Limit to read-only use.
- Click Create.
Note:
If an external location is created and tied to a storage credential, then there is no need to set the event handler property gg.eventhandler.databricks.credential.
Parent topic: Create an External Location
9.2.17.2.4.2 External Location to Access Google Storage Account
To create an external location:
- Log in to your Databricks workspace.
- In the sidebar, click Catalog.
- Click External Data, and then click External Locations.
- Click Create Location.
- In the Create a new external location form, enter the
following details:
- External location name: Enter a name for the external location.
- Storage credential: Choose the storage credential that you created earlier from the drop-down menu.
- URL: Enter the bucket path
that you want to use as the external location. For example:
gs://gcs-bucket/ogg
- Click Advanced Options, check the box Limit to read-only use.
- Click Create.
Parent topic: Create an External Location
9.2.17.2.5 Compute Clusters Without Unity Catalog
Some legacy Databricks compute clusters do not support Unity Catalog, and others do not have Unity Catalog enabled.
These compute clusters cannot use storage credentials or external locations; instead, access to the staging location is configured through the cluster's Spark configuration:
- Log in to your Databricks workspace.
- In the sidebar, click Compute.
- In the list of available clusters, click the target cluster’s name.
- On the Configuration tab, expand Advanced options.
- In the Spark tab, you can specify the spark configuration for the cluster.
- Spark Configuration to Access Azure Storage Account
- Spark Configuration to Access Google Cloud Storage
- Spark Configuration to Access AWS S3
- Creating Databricks Secrets
Parent topic: Configuration
9.2.17.2.5.1 Spark Configuration to Access Azure Storage Account
Access to ADLS Gen2 can be configured using the storage account name and key.
The storage account key can be securely stored in Databricks secrets. The following is an example of the Spark configuration:
fs.azure.account.key.storageaccountname.dfs.core.windows.net {{secrets/gg/azureAccountKey-for-storageaccountname}}
In this example, storageaccountname
is the Azure storage account
name, and {{secrets/gg/azureAccountKey-for-storageaccountname}}
is
the Databricks secret that contains the storage account key.
Parent topic: Compute Clusters Without Unity Catalog
9.2.17.2.5.2 Spark Configuration to Access Google Cloud Storage
Access to Google Cloud Storage can be configured using a Google service account that has permission to access the storage bucket. The following details are required:
- Service account email
- Google project ID
- Service account private key
- Service account private key ID
The service account private key and private key ID can be securely stored in Databricks secrets. The following is an example of setting the Spark configuration for Databricks on Azure:
google.cloud.auth.service.account.enable true
fs.gs.auth.service.account.email <client-email>
fs.gs.project.id <project-id>
fs.gs.auth.service.account.private.key {{secrets/scope/gsa_private_key}}
fs.gs.auth.service.account.private.key.id {{secrets/scope/gsa_private_key_id}}
Parent topic: Compute Clusters Without Unity Catalog
9.2.17.2.5.3 Spark Configuration to Access AWS S3
Access to AWS S3 can be configured in multiple ways. The following environment variables can be set in the Spark configuration to access S3:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
For example:
AWS_ACCESS_KEY_ID={{secrets/gg/aws_access_key_id}}
AWS_SECRET_ACCESS_KEY={{secrets/gg/aws_secret_access_key}}
Parent topic: Compute Clusters Without Unity Catalog
9.2.17.2.5.4 Creating Databricks Secrets
To create a Databricks secret:
- Log in to your Databricks workspace.
- In the sidebar, click Compute.
- In the list of available clusters, click the target cluster’s name.
- On the Apps tab, click Web Terminal. This should open up a terminal session tab in the web browser.
- Create a Databricks secrets scope using
databricks secrets create-scope <scope-name>
. For example,databricks secrets create-scope gg
- Create a secret using databricks secrets put-secret --json '{ "scope": "<scope-name>", "key": "<key-name>", "string_value": "<secret>" }'. For example: databricks secrets put-secret --json '{ "scope": "gg", "key": "storageaccountname", "string_value": "---------storage-account-key-----------------" }'
Note:
These commands were run using Databricks Runtime (DBR) 15.0.
Note:
On Unity Catalog-enabled workspaces, the tables inside the hive_metastore catalog cannot use external locations or external storage credentials. The tables inside the hive_metastore catalog also require the Spark cluster configuration listed in this section.
Parent topic: Compute Clusters Without Unity Catalog
9.2.17.2.6 Classpath Configuration
Parent topic: Configuration
9.2.17.2.6.1 Dependencies
The Databricks JDBC driver can be downloaded using the Dependency Downloader script <OGGDIR>/DependencyDownloader/databricks.sh.
Running this script without any input parameters downloads JDBC driver version 2.6.36. The script can also be run with a single argument to download a specific version of the JDBC driver.
For more information about Dependency Downloader, see Dependency Downloader.
Alternatively, you can download the JDBC driver from Maven Central using the following coordinates:
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-jdbc</artifactId>
  <version>2.6.36</version>
</dependency>
- If staging location is set to ADLS Gen2, classpath should include the ABS Event handler dependencies. See ABS Event handler dependencies.
- If staging location is set to S3, classpath should include the S3 Event handler dependencies. See S3 Event Handler.
- If staging location is set to GCS, classpath should include the GCS Event handler dependencies. See GCS Event Handler.
Edit the gg.classpath configuration parameter to include the path to the object store Event Handler dependencies (if an external stage is in use) and the Databricks JDBC driver.
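For example, when using the ADLS Gen2 stage (the paths follow the end-to-end samples later in this section and assume the dependencies were downloaded into the directories shown):
gg.classpath=$THIRD_PARTY_DIR/abs/*:$THIRD_PARTY_DIR/databricks/*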
Parent topic: Classpath Configuration
9.2.17.2.7 Proxy Configuration
When the Replicat process runs behind a proxy server, the JDBC connection URL must be appended with the following property values:
UseProxy
ProxyHost
ProxyPort
The following is an example of a connection URL with the proxy settings appended:
jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0;UseProxy=1;ProxyHost=<proxy_host>;ProxyPort=<proxy_port>
Parent topic: Configuration
9.2.17.2.8 INSERTALLRECORDS Support
Stage and merge targets support the INSERTALLRECORDS parameter. See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).
Setting this parameter directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of bulk inserts using the File Writer Handler property gg.handler.databricks.maxFileSize; the default value is 1GB. The frequency of bulk inserts can be tuned using the File Writer Handler property gg.handler.databricks.fileRollInterval; the default value is 3m (three minutes).
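A minimal sketch of a Replicat parameter file using INSERTALLRECORDS; the Replicat name, properties file path, and table mapping are illustrative only:
REPLICAT rdbx
TARGETDB LIBFILE libggjava.so SET property=dirprm/rdbx.props
INSERTALLRECORDS
MAP SRC_SCHEMA.TABLE_1, TARGET "main"."sales"."table_1";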
Parent topic: Configuration
9.2.17.2.9 Operation Aggregation
Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
9.2.17.2.9.1 In-memory Operation Aggregation
- Operation records can be aggregated in memory; this is the default configuration.
- You can tune the frequency of the merge interval using the File Writer Handler property gg.handler.databricks.fileRollInterval; the default value is 3m (three minutes).
- In-memory operation aggregation requires additional JVM memory configuration.
Parent topic: Operation Aggregation
9.2.17.2.9.2 Operation Aggregation using SQL
- To use SQL aggregation, the trail files must contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated.
- Operation aggregation using SQL can provide better throughput if the trail files contain uncompressed update records.
- Replicat can aggregate operations using SQL statements by setting gg.aggregate.operations.using.sql=true (see the sketch after this list).
- You can tune the frequency of the merge interval using the File Writer Handler property gg.handler.databricks.fileRollInterval; the default value is 3m (three minutes).
- Operation aggregation using SQL does not require additional JVM memory configuration.
Parent topic: Operation Aggregation
9.2.17.2.10 Compressed Update Handling
The parameter gg.compressed.update can be set to true or false to indicate whether the trail contains compressed or uncompressed update records.
Parent topic: Configuration
9.2.17.2.10.1 MERGE Statement with Uncompressed Updates
In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.
If you want to use DELETE+INSERT SQL statements instead of a MERGE SQL statement, then set gg.eventhandler.databricks.deleteInsert=true.
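For example, a sketch of the relevant properties, assuming the Extract writes full-image (uncompressed) update records to the trail:
gg.compressed.update=false
# Optional: merge using DELETE+INSERT instead of MERGE
gg.eventhandler.databricks.deleteInsert=true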
Parent topic: Compressed Update Handling
9.2.17.2.11 End-to-End Configuration
The following is an end-to-end configuration example that uses autoconfiguration.
The sample properties files can also be found in the directory <OGGDIR>/AdapterExamples/big-data/databricks/.
- dbx-az.props: Configuration using ADLS Gen2 stage for Databricks on Azure.
- dbx-s3.props: Configuration using S3 stage for Databricks on AWS.
- dbx-gcs.props: Configuration using GCS stage for Databricks on GCP.
Parent topic: Configuration
9.2.17.2.11.1 Databricks on Azure
# Configuration to load GoldenGate trail operation records into Databricks using ADLS Gen2 stage.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=databricks
# Azure Blob Event handler.
gg.eventhandler.abs.bucketMappingTemplate=<azure_adls_gen2_container_name>
gg.eventhandler.abs.accountName=<azure_storage_account_name>
gg.eventhandler.abs.accountKey=<azure_storage_account_key>
# Databricks Event Handler.
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.databricks.connectionURL=jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0
#TODO: Edit JDBC username or 'token'
gg.eventhandler.databricks.UserName=token
#TODO: Edit JDBC password
gg.eventhandler.databricks.Password=<password>
#TODO: Edit the classpath to include Azure Blob Event Handler dependencies and Databricks JDBC driver.
gg.classpath=$THIRD_PARTY_DIR/abs/*:$THIRD_PARTY_DIR/databricks/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms1g
Parent topic: End-to-End Configuration
9.2.17.2.11.2 Databricks on AWS
# Configuration to load GoldenGate trail operation records into Databricks using S3 stage.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=databricks
gg.stage=s3
#The S3 Event Handler
#TODO: Edit the AWS region
gg.eventhandler.s3.region=<aws region>
#TODO: Edit the AWS S3 bucket
gg.eventhandler.s3.bucketMappingTemplate=<s3 bucket>
# Databricks Event Handler.
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.databricks.connectionURL=jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0
#TODO: Edit JDBC username or 'token'
gg.eventhandler.databricks.UserName=token
#TODO: Edit JDBC password
gg.eventhandler.databricks.Password=<password>
#TODO: Edit the classpath to include S3 Event Handler dependencies and Databricks JDBC driver.
gg.classpath=$THIRD_PARTY_DIR/s3/*:$THIRD_PARTY_DIR/databricks/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms1g
Parent topic: End-to-End Configuration
9.2.17.2.11.3 Databricks on GCP
# Configuration to load GoldenGate trail operation records into Databricks using GCS stage.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=databricks
gg.stage=gcs
## The GCS Event handler
#TODO: Edit the GCS bucket name
gg.eventhandler.gcs.bucketMappingTemplate=<gcs bucket>
#TODO: Edit the GCS credentialsFile
gg.eventhandler.gcs.credentialsFile=<oggbd-project-credentials.json>
# Databricks Event Handler.
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.databricks.connectionURL=jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0
#TODO: Edit JDBC username or 'token'
gg.eventhandler.databricks.UserName=token
#TODO: Edit JDBC password
gg.eventhandler.databricks.Password=<password>
#TODO: Edit the classpath to include GCS Event Handler dependencies and Databricks JDBC driver.
gg.classpath=$THIRD_PARTY_DIR/gcs/*:$THIRD_PARTY_DIR/databricks/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms1g
Parent topic: End-to-End Configuration
9.2.17.2.12 Table Mapping
If the MAP statement does not specify a target catalog, then the default
catalog for the Databricks workspace will be used. The handler will log the default
catalog during initialization. Example log message: Connection catalog is
set to [dbx-catalog]
.
Parent topic: Configuration
9.2.17.2.12.1 Mapping Table
Table 9-15
MAP statement in the Replicat parameter file | Databricks Catalog | Databricks Schema | Databricks Table |
---|---|---|---|
MAP SCHEMA_1.TABLE_1, TARGET "schema_1"."table_1"; | Default catalog | schema_1 | table_1 |
MAP DB_1.SCHEMA_1.TABLE_1, TARGET "db_1"."schema_1"."table_1"; | db_1 | schema_1 | table_1 |
Parent topic: Table Mapping
9.2.17.3 Troubleshooting and Diagnostics
- Unsupported Databricks data types: ARRAY, MAP, STRUCT, INTERVAL, and VOID.
- Databricks JDBC Driver Exception: java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.ResourceBundleLookup
This exception has been observed with Databricks JDBC driver 2.6.36. Setting the property EnableArrow=0 in the connection URL is the workaround. Oracle is working with Databricks to address this.
- org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot add column ‘<column name>’ with type ‘void’. Please explicitly specify a non-void type.
If the Databricks table defines a column with data type VOID, all DML operations fail even if the VOID column is not mapped by Oracle GoldenGate. To proceed with the replication, drop the VOID column from the target table.
column from the target table. - Databricks INTERVAL data type:
If the target table contains an
INTERVAL
type, the Databricks JDBC driver ignores the presence of such a column.You cannot map any source column to an
INTERVAL
type. You should also avoid definingINTERVAL
types in the target table without a default value and with aNOT NULL
constraint. - Connectivity issues to Databricks:
- Validate JDBC connection URL, username, and password.
- Check proxy configuration if running Replicat process behind a proxy.
- DDL not applied on the target table: GoldenGate for Distributed Applications and Analytics does not support DDL replication.
- SQL errors: If any errors occur while executing SQL, the SQL statements, along with the bind parameter values, are logged in the GoldenGate for Distributed Applications and Analytics handler log file.
- Co-existence of the components: When using an external stage location (ADLS Gen2, S3, or GCS), the region of the machine where the Replicat process is running and the region of the object store affect the overall throughput of the apply process.
For the best possible throughput, the components should ideally be located in the same region, or as close together as possible.
- Replicat ABEND due to partial LOB records in the trail file: Oracle
GoldenGate for Distributed Applications and Analytics does not support
replication of partial LOB data.
The trail file needs to be regenerated by Oracle Integrated capture using
TRANLOGOPTIONS FETCHPARTIALLOB
option in the Extract parameter file. - When replicating to more than ten target tables, the parameter
maxConnections
can be increased to a higher value which can improve throughput. - Identity column in the target table:
- If the target table contains an Identity column, then the
MERGE
statement would fail because the Identity cannot be updated.Example Error Message:
To proceed, review the following points:`Query: MERGE INTO***, Error message from Server: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: UPDATE on IDENTITY column "col9" is not supported.`
- If the Identity column is defined using
GENERATED ALWAYS AS IDENTITY
, then Replicat would result in the following error: Error message from Server:org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Providing values for GENERATED ALWAYS AS IDENTITY column col9 is not supported
. To proceed further, the Identity column should be excluded from mapping on the source database usingCOLSEXCEPT
or removed from the target table. - If the Identity column is defined using
GENERATED BY DEFAULT AS IDENTITY
, then Replicat can be configured to useDELETE-INSERT
instead ofMERGE
by settinggg.eventhandler.databricks.deleteInsert=true
provided the prerequisites for enablingDELETE-INSERT
are met.
- If the target table contains an Identity column, then the
Parent topic: Databricks