Queue downstream execution if I already started running an upstream recipe

When developing pipelines, I often find myself kicking off an operation upstream, only to wish I'd waited until I had finished developing the downstream portion of the pipeline so I could build recursively. It would be really convenient to be able to simply queue downstream operations to run after upstream ones complete. That way, if I run a downstream recipe recursively while an upstream recipe is already running, the downstream recipe would start from the upstream dataset as soon as it finishes building. This is especially nice when I have hours-long operations. I might be 2-3 hours into a 5-hour dataset build upstream, but want to run other time-consuming portions of the pipeline as soon as it's done. I don't want to lose those hours by aborting the upstream recipe, but I also don't want to wait another two hours to hit the build button for my downstream recipes. With this feature, I'd be able to simply queue up my work, close the laptop for the night, and come back in the morning to a fully-processed dataset. While it's rare, I sometimes even have multi-day build times for my data flows, and in those cases this feature would be an enormous time saver.
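
To make the ask concrete, here's a toy sketch of the scheduling behavior I have in mind (purely illustrative Python, not actual DSS internals; all of the names here are mine): when planning a recursive build, an upstream dataset whose build is already in flight becomes a "wait" step instead of triggering a restart or an error.

```python
# Toy sketch of the requested behavior (illustrative only, not DSS internals):
# a recursive build planner that queues behind an in-flight upstream build
# rather than restarting it.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    inputs: list = field(default_factory=list)  # upstream datasets
    up_to_date: bool = False

def plan_recursive_build(target, running):
    """Return an ordered plan of ("wait", name) / ("build", name) steps.

    `running` is the set of dataset names whose builds are in progress.
    """
    plan, seen = [], set()

    def visit(ds):
        if ds.name in seen:
            return
        seen.add(ds.name)
        for upstream in ds.inputs:
            visit(upstream)
        if ds.name in running:
            # Requested behavior: queue behind the in-flight build
            # instead of aborting it or failing the downstream job.
            plan.append(("wait", ds.name))
        elif not ds.up_to_date:
            plan.append(("build", ds.name))

    visit(target)
    return plan

# Example: "raw" is mid-build; the plan waits on it, then builds the rest.
raw = Dataset("raw")
cleaned = Dataset("cleaned", inputs=[raw])
report = Dataset("report", inputs=[cleaned])
print(plan_recursive_build(report, running={"raw"}))
# -> [('wait', 'raw'), ('build', 'cleaned'), ('build', 'report')]
```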

4 Comments
CoreyS
Dataiker Alumni

Hey @natejgardner, would you mind elaborating on this idea by providing more detail and an example of the need and use case? Thanks for your help here!

Status changed to: Gathering Input

natejgardner

Sure, basically, I'd like Dataiku's ability to build recursively to work even if an upstream build is already running. Right now, if I start some long-running build upstream and later realize I also want a downstream dataset built after the upstream one is done, I have to wait until the upstream build finishes before I can queue the downstream one. I might have a dataset with 100 million rows, but be able to work with results after the first 10,000 rows are loaded. I may create numerous downstream transformations of my data while the upstream dataset is still loading (and will continue to load for several hours or days). Once I'm satisfied with my flow, I'd like to be able to trigger a dataset's downstream outputs ("build flow outputs reachable from here") to be built as soon as that dataset finishes building, without having to restart the long-running process. I'd also like recursive builds to use an upstream build that's already in progress when constructing their build queues. This way, if I've already started a long-running build, I can set the rest of my flow to be built as soon as it's done, even if that happens in the middle of the night or on a weekend, without losing all the progress I've made so far. This compounds if I have multiple long-running steps in a flow.

Teradata connections are only able to stream around 5,000 records per second to Dataiku (until Teradata TPT support is added, at least 🙂), so anything that involves local processing on the Dataiku server runs pretty slowly. Even a small dataset of 10 million records thus takes around 30 minutes to load. If I've already spent 15 minutes loading that data and now have some downstream steps I'd like to build as well, my current options are to wait another 15 minutes for the dataset to finish building and then trigger the downstream datasets by hand, or to abort the build and start over. Instead, I'd like to be able to enqueue my downstream steps to be built as soon as the first dataset is done building. Some of the datasets I work with take several days to load. By the time they're done, sometimes I've lost track of what I intended to do next. If I could just queue things up to run as soon as a build completes, I wouldn't need to worry about that and could work on other projects for a couple of days until everything is done loading. This would reduce mental context switching, get results generated faster, and give me the flexibility to keep working on downstream transformations while long-running processes are underway.
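
In the meantime, something close to this can be scripted from outside DSS with the public Python client (dataikuapi). This is only a rough sketch under assumptions: the host, API key, project key, job ID, and dataset name are placeholders, and the status payload fields and build() signature may differ between DSS versions.

```python
# Rough workaround sketch using Dataiku's public Python client (dataikuapi).
# Placeholders/assumptions: the host, API key, project key, job ID, and
# dataset name below are made up, and the "baseStatus"/"state" fields of the
# job status payload may vary by DSS version.
import time

import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "my-api-key")
project = client.get_project("MY_PROJECT")

# Handle to the long-running upstream build that's already in progress.
upstream_job = project.get_job("UPSTREAM_JOB_ID")

# Poll until the upstream job reaches a terminal state.
while True:
    state = upstream_job.get_status()["baseStatus"]["state"]
    if state not in ("RUNNING", "NOT_STARTED"):
        break
    time.sleep(60)

# Only queue the downstream work if the upstream build actually succeeded.
if state == "DONE":
    downstream = project.get_dataset("downstream_output")
    # A recursive build re-runs only what's out of date, so the freshly
    # built upstream dataset is reused rather than rebuilt.
    downstream.build(job_type="RECURSIVE_BUILD")
```

Of course, this is exactly the kind of babysitting script the feature would make unnecessary; having recursive builds queue behind in-flight jobs natively would be far cleaner.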


AshleyW
Dataiker

Thanks for your idea, @natejgardner. Your idea meets the criteria for submission; we'll reach out should we require more information.

If you're reading this post and think that being able to more easily queue jobs would be a great capability to add to DSS, be sure to kudos the original post! Feel free to leave a comment in the discussion about how this would help you or your team.

Take care,
Ashley

Status changed to: In the Backlog


patwatt
Level 1

This would be a great feature to have, for the same reasons @natejgardner mentioned. An easy way to queue jobs or flows to run when a specific task has completed would go a long way toward avoiding wasted time, long waits, and unnecessary context switching.
