Local Stream Pipelines

SQL and Spark Pipelines are helpful features for chaining multiple recipes together without materializing intermediate datasets. However, no equivalent is currently available for chains of "local stream" recipes.

In my environment, I'm trying to process ~30 million records via local stream. I wish there were another way, but for various reasons it's my only option right now. I have several types of recipes that need to be executed this way. I'm bottlenecked on reading from and writing to my database at ~10k records per second. The processing itself is quite fast; the vast majority of the 4-hour processing time per recipe goes to transmitting data to and from the database.

It would save me 4 hours per recipe if multiple local stream recipes could be chained together in pipelines, just like SQL and Spark Pipelines. Much of the data transformation I'm doing is linear and can be processed at the per-record level.
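
As a rough illustration of the idea, here is a minimal sketch in plain Python of chaining per-record steps so that records are read once and written once, with nothing materialized in between. This is not Dataiku's API: `chain_steps`, `normalize_name`, `add_name_length`, and the sample records are all hypothetical names made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, Sequence

Record = Dict[str, object]
Step = Callable[[Record], Record]


def chain_steps(records: Iterable[Record], steps: Sequence[Step]) -> Iterator[Record]:
    """Apply every step to each record in order, yielding one record at a time.

    No intermediate dataset is stored: each record flows through all steps
    before the next record is read from the source.
    """
    for record in records:
        for step in steps:
            record = step(record)
        yield record


# Hypothetical per-record steps standing in for separate local stream recipes.
def normalize_name(record: Record) -> Record:
    return {**record, "name": str(record["name"]).strip().lower()}


def add_name_length(record: Record) -> Record:
    return {**record, "name_length": len(str(record["name"]))}


# Stand-in for a single read from the database.
source = [{"name": "  Alice "}, {"name": "BOB"}]

# Stand-in for a single write to the database after all steps have run.
result = list(chain_steps(source, [normalize_name, add_name_length]))
```

With a chain like this, the database is touched once on the way in and once on the way out, no matter how many steps sit in the middle.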


Labels

Data Exploration and Preparation
