Data Flow Integration
The Data Flow support feature in ML Pipelines lets users integrate Data Flow Applications as steps within a pipeline.
With this functionality, users can orchestrate runs of Data Flow Applications (Apache Spark as a Service) alongside other pipeline steps, streamlining large-scale data processing tasks.
When a pipeline containing a Data Flow step is run, it automatically creates and manages a new run of the Data Flow Application associated with that step. The Data Flow run is treated the same as any other step run in the pipeline: when it completes successfully, the pipeline continues, starting subsequent steps as part of its orchestration.
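For example, the Data Flow step surfaces its state through the same step-run interface as every other step in the pipeline run. The following is a minimal sketch using the OCI Python SDK; the run OCID is a placeholder, and field names such as step_runs should be verified against the SDK reference for your version.

```python
# Minimal sketch (OCI Python SDK): inspect a pipeline run and the state of each
# of its step runs, including a Data Flow step. The OCID is a placeholder.
import oci

ds_client = oci.data_science.DataScienceClient(oci.config.from_file())
pipeline_run = ds_client.get_pipeline_run(
    "ocid1.datasciencepipelinerun.oc1..exampleuniqueid"
).data

print("pipeline run:", pipeline_run.lifecycle_state)
for step_run in pipeline_run.step_runs or []:
    # A Data Flow step run reports the same lifecycle fields as any other step.
    print(step_run.step_name, step_run.step_type, step_run.lifecycle_state)
```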
Using Data Flow Applications in ML Pipelines is straightforward:
1. Add a Data Flow step: select the Data Flow step type in your ML Pipeline.
2. Select a Data Flow Application: choose the Data Flow Application you want to run as a step and configure options such as cluster size and environment variables.
3. Run the pipeline: start a run of the pipeline. When the Data Flow step is reached, the associated application runs; when it completes, the results are reflected in the step run and the pipeline seamlessly proceeds to the next steps (see the sketch after this list).
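The same flow can be scripted. The sketch below uses the OCI Python SDK to create a pipeline with a single Data Flow step pointing at an existing Data Flow Application and then starts a run. The model names PipelineDataflowStepDetails and PipelineDataflowConfigurationDetails, and their fields, are assumptions to be checked against the SDK reference for your version; all OCIDs are placeholders.

```python
# Hedged sketch (OCI Python SDK): define a pipeline with one Data Flow step and
# start a run of it. Model and field names flagged below are assumptions.
import oci

ds_client = oci.data_science.DataScienceClient(oci.config.from_file())
models = oci.data_science.models

# Assumption: PipelineDataflowStepDetails / PipelineDataflowConfigurationDetails
# are the SDK models for a Data Flow step; verify against your SDK version.
dataflow_step = models.PipelineDataflowStepDetails(
    step_name="spark_preprocess",
    application_id="ocid1.dataflowapplication.oc1..exampleuniqueid",
    step_dataflow_configuration_details=models.PipelineDataflowConfigurationDetails(
        num_executors=2,  # shapes, logs bucket, and other options can also be overridden here
    ),
)

pipeline = ds_client.create_pipeline(
    models.CreatePipelineDetails(
        project_id="ocid1.datascienceproject.oc1..exampleuniqueid",
        compartment_id="ocid1.compartment.oc1..exampleuniqueid",
        display_name="dataflow-pipeline",
        step_details=[dataflow_step],
    )
).data

# In practice, wait for the pipeline to reach the ACTIVE state before running it.
pipeline_run = ds_client.create_pipeline_run(
    models.CreatePipelineRunDetails(
        project_id=pipeline.project_id,
        compartment_id=pipeline.compartment_id,
        pipeline_id=pipeline.id,
        display_name="dataflow-pipeline-run",
    )
).data
print("started pipeline run:", pipeline_run.id)
```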
Policies
- Data Flow and Pipelines Integration.
- Pipeline Run Access to OCI Services.
- (Optional) Custom networking policies, required only when using custom networking.
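As an illustration only, the integration policies generally take the shape below: a dynamic group that matches pipeline runs, plus statements letting that group read Data Flow Applications and manage Data Flow runs. The group name, compartment, and exact resource-type names are placeholders; use the statements from the policy pages above as the source of truth.

```
Dynamic group matching rule (compartment OCID is a placeholder):
  ALL {resource.type = 'datasciencepipelinerun', resource.compartment.id = '<compartment-ocid>'}

Example policy statements (group and compartment names are placeholders):
  allow dynamic-group pipeline-run-dg to read dataflow-application in compartment <compartment-name>
  allow dynamic-group pipeline-run-dg to manage dataflow-run in compartment <compartment-name>
```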
When a Data Flow run is triggered by a pipeline run, it inherits the datasciencepipelinerun resource principal. Therefore, granting privileges to datasciencepipelinerun also grants privileges to the code running inside the Data Flow run started by the pipeline run.
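For example, code running inside the Data Flow Application can authenticate to other OCI services with that inherited resource principal. A minimal sketch using the OCI Python SDK follows; the namespace and bucket names are placeholders.

```python
# Inside the Data Flow Application's driver code: authenticate with the
# resource principal the run inherits (datasciencepipelinerun when started by a
# pipeline run), then call another OCI service it has been granted access to.
import oci

signer = oci.auth.signers.get_resource_principals_signer()
object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

# Placeholders: replace with a namespace and bucket the resource principal can read.
objects = object_storage.list_objects("my-namespace", "my-bucket")
for obj in objects.data.objects:
    print(obj.name)
```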
Configuring Data Flow with Pipelines
Ensure you have the appropriate Policies applied.
Quick Start Guide
This is a step-by-step guide for creating a Data Flow pipeline.