8.12 Realtime Data Ingestion into Azure Databricks (Unity Catalog Enabled) with GoldenGate for DAA

Overview

This document covers a step-by-step process showing how to ingest real-time data into Azure Databricks Delta tables with GoldenGate for Distributed Applications and Analytics (GG for DAA).

Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.

The GG for DAA Databricks handler uses the stage and merge data flow: change data is staged in a temporary location in microbatches and is eventually merged into the target table using Merge SQL.
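
For illustration, the merge step conceptually resembles the following Databricks SQL. This is a simplified sketch, not the exact statement the handler generates: the table names and the op_type column (standing in for the GoldenGate operation type) are hypothetical.

    -- Apply one staged microbatch of change records to the target Delta table.
    MERGE INTO target_schema.orders AS t
    USING staged_changes AS s
      ON t.order_id = s.order_id
    WHEN MATCHED AND s.op_type = 'D' THEN DELETE          -- source row was deleted
    WHEN MATCHED THEN UPDATE SET *                        -- source row was updated
    WHEN NOT MATCHED AND s.op_type <> 'D' THEN INSERT *;  -- source row was inserted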

The entire replication process is handled automatically by the GoldenGate for Distributed Applications and Analytics (GG for DAA) Databricks Handler.

GG for DAA supports Databricks workspaces configured with and without Databricks Unity Catalog. This Quickstart covers Databricks workspaces with Unity Catalog and creates external tables. External tables are file-backed tables that reference data stored in an external location; for Azure Databricks, that external location is an Azure Storage container.
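
For example, an external table definition in Unity Catalog only references files at an external location. GG for DAA creates the tables automatically, so you do not run this yourself; the hypothetical statement below merely illustrates how such a table points at Delta files already present in an ADLS Gen2 container.

    -- All names and the storage path are placeholders.
    CREATE TABLE <your_catalog>.<your_schema_name>.<table_name>
    LOCATION 'abfss://<container>@<storage_account>.dfs.core.windows.net/<path>';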

8.12.1 Prerequisites for Databricks Replication with Unity Catalog

To successfully complete this Quickstart, you must have the following:

In this Quickstart, a sample trail file (named tr), which is shipped with GG for DAA, is used. If you want to continue with the sample trail file, it is located at GG_HOME/opt/AdapterExamples/trail/ in your GG for DAA instance.

GG for DAA will create the tables automatically.

8.12.2 Install Dependency Files

GG for DAA uses the Java SDK provided by Databricks. You can download the SDKs using the Dependency Downloader utility shipped with GG for DAA. The Dependency Downloader is a set of shell scripts that download dependency JAR files from Maven and other repositories.

  1. In your GG for DAA VM, go to the Dependency Downloader utility, located at GG_HOME/opt/DependencyDownloader/, and locate databricks.sh.
  2. Run databricks.sh with the required version; a sample invocation is shown after this list. You can check the available versions and reported vulnerabilities in Maven Central. This document uses 2.6.38, the latest version at the time this Quickstart was published.
  3. A new directory named <databricks_jdbc_version> is created in GG_HOME/opt/DependencyDownloader/dependencies. Make a note of this directory, as it is used in the replicat properties. For example: /u01/app/ogg/opt/DependencyDownloader/dependencies/databricks-jdbc-2.6.38.

    Figure 8-82 Run databricks.sh with the required version

  4. Follow the same steps for the Azure Storage dependencies: run azure_blob_storage.sh to download them.

    Figure 8-83 Run azure_blob_storage.sh for downloading Azure Storage dependencies


    A new directory named <azure-storage-blob_version> is created in GG_HOME/opt/DependencyDownloader/dependencies. Make a note of this directory, as it is used in the replicat properties. For example: /home/oracle/ggdaa23_home/opt/DependencyDownloader/dependencies/azure-storage-blob_12.13.0.
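
Taken together, the dependency download steps might look like the following shell session. This is a sketch only: the exact arguments the scripts accept can vary by release, and the versions shown are examples.

    # Assumes the scripts take the dependency version as their argument.
    cd $GG_HOME/opt/DependencyDownloader
    ./databricks.sh 2.6.38            # Databricks JDBC driver
    ./azure_blob_storage.sh 12.13.0   # Azure Storage Blob dependencies
    # The jars land under dependencies/; note the full paths for gg.classpath.
    ls dependencies/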

8.12.3 Create a Replicat in Oracle GoldenGate for Distributed Applications and Analytics

To create a replicat in Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA):

  1. In the GG for DAA UI, in the Administration Service tab, click the + sign to add a replicat.

    Figure 8-84 Click + in the Administration Service tab.

  2. Select the Classic Replicat Type and click Next. There are two Replicat types available: Classic and Coordinated. A Classic Replicat is a single-threaded process, whereas a Coordinated Replicat is multithreaded and applies transactions in parallel.

    Figure 8-85 Select a Replicat Option

  3. Enter the Replicat options and click Next:
    1. Trail Name: Name of the required trail file (if using the sample trail, enter tr).
    2. Subdirectory: Enter GG_HOME/opt/AdapterExamples/trail/ if using the sample trail.
    3. Target: Databricks
    4. Available Staging Locations: Azure Data Lake Storage

      Figure 8-86 Provide Replicat Options and Select Target

  4. Leave Managed Options as is and click Next.

    Figure 8-87 Managed Options

  5. Enter the Parameter File details and click Next. In the Parameter File, you can specify the source-to-target mapping. If you are using the sample trail file (tr), provide the following (a minimal parameter file sketch follows the figure below): MAP QASOURCE.*, TARGET <your_schema_name>.*;

    Figure 8-88 Parameter File

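    If you are using the sample trail, a minimal parameter file might look like the sketch below. The replicat name DBRDELTA matches the properties file in the next step; replace <your_schema_name> with your Databricks schema.

      REPLICAT DBRDELTA
      MAP QASOURCE.*, TARGET <your_schema_name>.*;
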
  6. In the Properties file, update the properties marked as TODO and click Create and Run:
    # Properties file for Replicat DBRDELTA
    # Configuration to load GoldenGate trail operation records into Databricks using ADLS Gen2 stage.
    # Note: Recommended to only edit the configuration marked as TODO
    gg.target=databricks
    gg.stage=abs
    # Azure Blob Event handler.
    #TODO: Edit Azure connection settings
    gg.eventhandler.abs.bucketMappingTemplate=<azure_adls_gen2_container_name>
    gg.eventhandler.abs.accountName=<azure_storage_account_name>
    gg.eventhandler.abs.accountKey=<azure_storage_account_key>
    
    # Databricks Event Handler.
    
    # Edit your Databricks JDBC connection settings.
    #TODO: Edit JDBC ConnectionUrl
    gg.eventhandler.databricks.connectionURL=jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0
    #TODO: Edit JDBC username or 'token'
    gg.eventhandler.databricks.UserName=token
    #TODO: Edit JDBC password
    gg.eventhandler.databricks.Password=<password>
    
    # Provide the path to the dependency JAR files that you downloaded in Install Dependency Files.
    #TODO: Edit the classpath to include Azure Blob Event Handler dependencies and Databricks JDBC driver.
    gg.classpath=/path/to/abs_dependencies/*:/path/to/databricks_dependencies/*
    #TODO: Provide sufficient memory (at least 8GB).
    jvm.bootoptions=-Xmx8g -Xms1g
    

    Figure 8-89 Properties File

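    As a concrete illustration, a filled-in connection URL might look like the following. The hostname and HTTP path here are invented placeholders; copy the real values from your Databricks SQL warehouse's connection details, and keep EnableArrow=0 as in the template above.

      gg.eventhandler.databricks.connectionURL=jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443;httpPath=/sql/1.0/warehouses/0123456789abcdef;EnableArrow=0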

    See Databricks for more replicat configuration details.

    When the replicat starts successfully, it will be in the Running state. Note that by default the batch window is set to 3 minutes, so it may take a short while for your data to be loaded into Databricks. You can go to Action > Details > Statistics to see the replication statistics.

    Figure 8-90 Replication Statistics

  7. Go to Databricks and check the tables. It may take a short while for the tables to be created and loaded. You can verify them with a quick query, as sketched after the figure below.

    Figure 8-91 Databricks Tables

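    To verify, you can run a quick query in a Databricks notebook or the SQL editor. The catalog, schema, and table names below are placeholders; substitute your own.

      -- List the replicated tables and spot-check row counts.
      SHOW TABLES IN <your_catalog>.<your_schema_name>;
      SELECT COUNT(*) FROM <your_catalog>.<your_schema_name>.<table_name>;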

Note:

For more information, see Databricks.