Local Stream Pipelines

Nate · April 2023

SQL and Spark Pipelines are helpful for chaining multiple recipes together without materializing intermediate datasets. However, nothing equivalent is currently available for chains of "local stream" recipes.

In my environment, I'm trying to process ~30 million records via local stream. I wish there were another way, but for various reasons, it's my only option right now. I have several types of recipes that need to be executed this way. I'm bottlenecked on reading from and writing to my database at ~10k records per second. The processing itself is quite fast; the vast majority of the 4-hour processing time per recipe goes to moving data to and from the database.

It would save me 4 hours per recipe if multiple local stream recipes could be chained together in pipelines just like SQL and Spark Pipelines. Much of the data transformation I'm doing is linear and can be processed on a per-record level.
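
To illustrate what I mean by chaining per-record steps, here is a minimal Python sketch. It is not an existing API: the names `run_pipeline`, `normalize_name`, and `add_name_length` are hypothetical, and the in-memory list stands in for a database read. Each step is a generator, so a record flows through every step before the next record is read and no intermediate dataset is written.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
# A step consumes a stream of records and yields transformed records.
Step = Callable[[Iterator[Record]], Iterator[Record]]


def normalize_name(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical step 1: trim and lowercase the 'name' field."""
    for record in records:
        yield {**record, "name": str(record["name"]).strip().lower()}


def add_name_length(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical step 2: derive a column from the output of step 1."""
    for record in records:
        yield {**record, "name_length": len(str(record["name"]))}


def run_pipeline(source: Iterable[Record], steps: List[Step]) -> Iterator[Record]:
    """Chain the steps lazily; nothing is materialized between them."""
    stream: Iterator[Record] = iter(source)
    for step in steps:
        stream = step(stream)
    return stream


# Stand-in for a single read from the database.
rows = [{"id": 1, "name": "  Alice "}, {"id": 2, "name": "BOB"}]

# Stand-in for a single write: only the final output is collected.
result = list(run_pipeline(rows, [normalize_name, add_name_length]))
print(result)
```

With this shape, the data would be read once at the start and written once at the end, however many steps sit in between.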
