A Pipeline models an algorithm. It is a directed acyclic graph of input, intermediate, and output data nodes linked together by tasks. In other words, a pipeline is a set of tasks that together perform a set of functions.

For instance, a typical machine learning application may have several pipelines: one dedicated to preprocessing and preparing data, one for training a model, and one dedicated to scoring.

In the example, we have chosen to model only two pipelines. They correspond to a manufacturer that must first forecast its sales and then, based on that forecast, plan production in its plant.


First, the sales pipeline (boxed in green in the picture) contains the training and predict tasks.

Second, the production pipeline (boxed in dark gray in the picture) contains the planning task.
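
To make the graph structure concrete, here is a minimal sketch in plain Python of the sales pipeline as a directed acyclic graph. It does not use the library's API: the Task class, the execution_order helper, and the data node names (sales_history, trained_model, current_month, sales_predictions) are all assumptions made for illustration. Only the task names, training and predict, come from the example.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a task: reads some data nodes, writes others."""
    name: str
    inputs: tuple
    outputs: tuple


def execution_order(tasks):
    """Return task names in an order that respects data dependencies."""
    # Map each data node to the task that produces it.
    producer = {out: t.name for t in tasks for out in t.outputs}
    # A task depends on the producers of its inputs.
    # Pure input data nodes have no producer and add no dependency.
    graph = {
        t.name: {producer[i] for i in t.inputs if i in producer}
        for t in tasks
    }
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(graph).static_order())


# Data node names below are assumed, not taken from the documentation.
sales_pipeline = [
    Task("predict", inputs=("trained_model", "current_month"),
         outputs=("sales_predictions",)),
    Task("training", inputs=("sales_history",), outputs=("trained_model",)),
]

order = execution_order(sales_pipeline)
print(order)
```

Because predict consumes the data node that training produces, training is ordered first regardless of the order in which the tasks are listed.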

The problem is modeled as two pipelines: one for the forecasting algorithm and one for the production planning algorithm. As a consequence, the two algorithms can follow different workflows and run independently, under different schedules. For example, one can run on a fixed schedule (e.g. every week) and the other on demand, triggered interactively by end users.

Note that the pipelines are not necessarily disjoint.
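
One possible reading of this note is that two pipelines may have tasks or data nodes in common. The sketch below, in plain Python, checks both kinds of overlap for the two pipelines of the example. The dictionary layout and the data node names are assumptions made for illustration. In particular, it assumes that the planning task reads the sales predictions produced by the predict task.

```python
# Hypothetical description of each pipeline: task name -> (input nodes, output nodes).
sales_pipeline = {
    "training": ({"sales_history"}, {"trained_model"}),
    "predict": ({"trained_model", "current_month"}, {"sales_predictions"}),
}
production_pipeline = {
    "planning": ({"sales_predictions", "capacity"}, {"production_orders"}),
}


def data_nodes(pipeline):
    """Collect every data node read or written by the pipeline's tasks."""
    nodes = set()
    for inputs, outputs in pipeline.values():
        nodes |= inputs | outputs
    return nodes


# Tasks that belong to both pipelines (none in this example).
shared_tasks = set(sales_pipeline) & set(production_pipeline)
# Data nodes touched by both pipelines.
shared_nodes = data_nodes(sales_pipeline) & data_nodes(production_pipeline)

print(shared_tasks, shared_nodes)
```

Under these assumptions the two pipelines share no task, but they do share one data node: the forecast written by one pipeline is the input of the other.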

The attributes of a pipeline (the set of tasks) are populated from the pipeline configuration (PipelineConfig), which must be provided when instantiating a new pipeline. Refer to the configuration documentation for details.
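
The relationship between a configuration and the pipelines created from it can be illustrated with a small sketch. The classes below are hypothetical stand-ins written in plain Python, not the library's actual PipelineConfig or Pipeline classes. Their fields and the from_config method are assumptions chosen only to show the idea of populating a pipeline's tasks from its configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskConfig:
    """Hypothetical task configuration: an identifier and the function to run."""
    id: str
    function: Callable


@dataclass
class PipelineConfig:
    """Hypothetical pipeline configuration: an identifier and its task configs."""
    id: str
    task_configs: List[TaskConfig]


@dataclass
class Pipeline:
    """Hypothetical pipeline whose task list is filled in from a configuration."""
    config_id: str
    tasks: List[str] = field(default_factory=list)

    @classmethod
    def from_config(cls, config: PipelineConfig) -> "Pipeline":
        # The set of tasks is derived entirely from the configuration.
        return cls(config_id=config.id,
                   tasks=[tc.id for tc in config.task_configs])


def plan(sales_predictions, capacity):
    """Placeholder planning function: never produce more than the capacity."""
    return min(sales_predictions, capacity)


production_config = PipelineConfig(
    id="production",
    task_configs=[TaskConfig(id="planning", function=plan)],
)

# One configuration can be used to instantiate several pipelines.
first = Pipeline.from_config(production_config)
second = Pipeline.from_config(production_config)
print(first.tasks, first is second)
```

In this sketch the configuration acts as a reusable template: each call creates a distinct pipeline object whose tasks mirror the configuration.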

The next section introduces the Scenario concept.