1.13.3 Download Multiple Files from Cloud Storage

ETL jobs that run on cloud platforms often write their output as multiple part files. For example, an AWS Glue ETL job can generate up to 20 Parquet files because Apache Spark writes its output in parallel.

EDQ 12.2.1.4.3 adds a new Download Multiple Files from Cloud Storage external task type, which supports downloading files from Oracle Object Store (OCI), Amazon Web Services (AWS), Azure Data Storage, and Google Cloud Platform. This provides a mechanism to download all files in a storage bucket folder to the server landing area for processing in EDQ. Once the files are downloaded, you can use the Parquet data store to read them as a single source.

User authentication details, and proxy/firewall settings where required, can be set as part of the file download task.

The complete set of options that may be set for a Multiple File Download task is as follows:

  • Source: lists all the supported cloud storage providers. Click the drop-down arrow to select the required storage provider.
  • Bucket URL: specifies the location of the storage bucket. The supported URL formats are as follows:
    • OCI Object Storage: https://objectstorage.region.oraclecloud.com/n/tenancy/b/bucketname/o
    • AWS S3: https://bucketname.s3.region.amazonaws.com
    • Azure Storage: https://account.blob.core.windows.net/containername
    • Google Cloud Storage: https://storage.googleapis.com/storage/v1/b/bucketname/o/
  • Folder in bucket: specifies the folder path within the bucket.
  • File name pattern: allows a subset of files to be selected using * and ? wildcards.
  • Credentials: lists the stored credentials required for bucket access. Click the drop-down arrow to select the required stored credential. Leave this blank for buckets with public access.
  • Delete existing files: select this check box to delete all files in the landing area directory before the download. If no landing area directory is specified, this option is disabled to avoid inadvertently deleting all files in the landing area.
  • Proxy Host: the name or IP address of the proxy/firewall server, where the server is not directly connected to the internet/network.

    For file downloads from Google Cloud Platform, authentication requires a call to a token endpoint to obtain an access token. The proxy server set in the download task properties is used for the file download only; it is not used for calls to the token endpoint. If a proxy server is required to access external sites, you need to define it using the standard Java properties. To do this, create a file named jvm.properties in your EDQ local configuration directory (oedq_local_home by default) and add entries similar to the following:

    https.proxyHost    = myproxy
    https.proxyPort    = 80
    http.nonProxyHosts = *.example.com|localhost

    Adjust the http.nonProxyHosts value to include hosts and domains that do not require a proxy.

  • Proxy Port: the port used to connect to the proxy/firewall server.

  • Directory in Landing Area: specifies the name of the directory in the landing area where the downloaded files are stored. To put the files in a subfolder of the landing area, use forward slashes to specify the directory structure (for example, DownloadedData/downloadedfile.csv).

  • Use Project Specific Folder?: selecting this option will automatically put the files into a project-specific landing area. This is normally used where project permissions are in place, so that the landing area can only be accessed by processes within the same project.
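
To illustrate how the * and ? wildcards in the File name pattern option narrow a set of files, here is a minimal Python sketch. It is not EDQ code: it uses Python's standard fnmatchcase function as a stand-in, and the object names are hypothetical. EDQ's exact matching rules (for example, case sensitivity) are not stated here, and fnmatchcase additionally treats square brackets as special characters, so treat this as an approximation only.

```python
from fnmatch import fnmatchcase


def select_files(names, pattern):
    """Return the names matching a pattern, where * matches any run of
    characters and ? matches exactly one character (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical object names, similar to the part files a Spark-based job writes.
names = [
    "part-00000.parquet",
    "part-00001.parquet",
    "part-00010.parquet",
    "_SUCCESS",
]

# Every Parquet part file, skipping the marker file.
print(select_files(names, "*.parquet"))

# Only the part files whose final digit varies (part-00000 to part-00009).
print(select_files(names, "part-0000?.parquet"))
```

Under these assumptions, the first pattern selects the three Parquet files and the second selects only part-00000.parquet and part-00001.parquet.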

Finally, you can name the file download task and give it an optional description.
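
When preparing the Bucket URL value for several environments, it can help to generate it from the documented formats rather than typing it by hand. The following Python sketch copies the four URL formats listed under Bucket URL into templates; the provider keys ("oci", "aws", "azure", "gcp"), the function name, and the sample bucket and region values are illustrative assumptions, not part of EDQ.

```python
# Templates copied from the Bucket URL formats listed for this task.
# The dictionary keys are hypothetical labels chosen for this sketch.
BUCKET_URL_TEMPLATES = {
    "oci": "https://objectstorage.{region}.oraclecloud.com/n/{tenancy}/b/{bucketname}/o",
    "aws": "https://{bucketname}.s3.{region}.amazonaws.com",
    "azure": "https://{account}.blob.core.windows.net/{containername}",
    "gcp": "https://storage.googleapis.com/storage/v1/b/{bucketname}/o/",
}


def bucket_url(provider, **parts):
    """Fill in the URL template for a provider.

    Raises ValueError for an unknown provider and KeyError if a
    placeholder required by the template is not supplied.
    """
    if provider not in BUCKET_URL_TEMPLATES:
        raise ValueError(f"unsupported provider: {provider!r}")
    return BUCKET_URL_TEMPLATES[provider].format(**parts)


# Sample values are placeholders only.
print(bucket_url("aws", bucketname="my-bucket", region="eu-west-1"))
print(bucket_url("azure", account="myaccount", containername="exports"))
```

The resulting string is what would be entered in the Bucket URL field; the folder path and file name pattern are still set separately in the task.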