Start Flow from specific dataset

Tuong-Vi
Tuong-Vi Partner, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Neuron 2020, Dataiku DSS Adv Designer, Registered, Neuron 2021, Neuron 2022 Posts: 33 Partner

Hello,

As a DSS developer, I would like to be able to run part of the Flow starting from a dataset of my choice (a start point).

Currently, I can run one recipe ("build only this dataset") or rebuild the whole Flow from the beginning, which can take a long time. In the Flow, I can also select "Build flow outputs reachable from here". It would be useful to also have "Build flow from here until the last dataset".

Best of all would be an option to "force rebuild dependencies with a customizable start point / end point".

Regards

8 votes

Released

Comments

  • AshleyW
    AshleyW Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161 Dataiker

    Hi @Tuong-Vi,

    I'm not sure what you meant by the difference between "build flow outputs reachable from here" and "build flow from here until the last dataset". Could you clarify that?

    Best,

    Ashley

    Note: I'll log the 'top option' you mentioned in which you'd like to be able to build a section of the Flow you've selected. Based on previous discussions we've had around this idea, it looks unlikely that we'll implement it, but I've added your idea to the existing group of requests.

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 318 Neuron

    Hello @Tuong-Vi and @AshleyW. I've wanted to be able to do this as well.

    I would describe it as building the datasets that are selected if I right click on a dataset and click "Select all downstream".

    Maybe it could be implemented by offering an option to build selected parts of the flow. So one could "Select all downstream" and then "Build selected".

    Note that since I typically work with SQL datasets I always use the force rebuild option.

    Also, I thought that "build flow outputs reachable from here" might do this (i.e., build downstream datasets), but it seems to want to rebuild more than that, at least when I select "force rebuild dependencies". If I select "build required dependencies", then none of the downstream SQL datasets are rebuilt. So it will either rebuild far more than I want or nothing at all, and I don't use it.

    It'd be nice to have the requested option in a Scenario step as well. I can work around this by including a bunch of "build only this dataset" steps (with force rebuild). So I can accomplish what I need; it's just more difficult to do than it needs to be.

    Marlan
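
Marlan's "Select all downstream" then "Build selected" idea can be sketched in plain Python. This is only an illustration, not the Dataiku API: the `flow` dictionary, the dataset names, and both function names are hypothetical, and the sketch assumes the start dataset itself is already built and is therefore left out of the selection.

```python
from collections import deque


def select_downstream(flow, start):
    """Return every dataset reachable from `start`, excluding `start` itself.

    `flow` is a hypothetical mapping: dataset name -> datasets built directly from it.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in flow.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def build_order(flow, selected):
    """Order the selected datasets so each comes after its selected parents."""
    # Count, for each selected dataset, how many of its parents are also selected.
    pending = {name: 0 for name in selected}
    for parent in selected:
        for child in flow.get(parent, []):
            if child in selected:
                pending[child] += 1
    # Datasets with no selected parent can be built first.
    ready = deque(sorted(name for name, count in pending.items() if count == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for child in sorted(flow.get(name, [])):
            if child in pending:
                pending[child] -= 1
                if pending[child] == 0:
                    ready.append(child)
    return order


# Made-up flow that fans out at the end.
flow = {
    "raw_orders": ["orders_clean"],
    "raw_customers": ["customers_clean"],
    "orders_clean": ["orders_joined"],
    "customers_clean": ["orders_joined"],
    "orders_joined": ["report_daily", "report_monthly"],
}

selected = select_downstream(flow, "orders_clean")
order = build_order(flow, selected)
print(order)
```

In this sketch nothing upstream of the chosen start point (such as `raw_customers`) ends up in the selection, which is the behavior the request describes: rebuild only what lies downstream, in dependency order.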

  • Tuong-Vi

    Hello,

    I agree with you @Marlan, and this option would be useful in a Scenario too. Another option I would like to see in scenarios is an action for "all datasets". It would avoid having to select datasets one by one for global actions such as synchronizing the Hive metastore or building metrics. Best of all would be the ability to check/uncheck datasets in the list (datasets to compute).


  • AshleyW

    Hi @Tuong-Vi,

    This idea has been added to our backlog.

  • Tuong-Vi

    Hello, thank you for your attention to this matter.

    Have a nice day.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,877 Neuron

    So this is not exactly what you asked for, but we have come up with a solution to the problem that Scenarios cannot be rerun from the last point of failure. We now design our scenarios as follows:

    1. What you would create as a scenario step, we create as a separate scenario.
    2. We name each of these scenarios with a numeric prefix (e.g. 010_Build_Inputs, 020_Calculate_Metrics), leaving gaps so we can insert steps in the middle. This also makes them sort alphabetically.
    3. We then have a "Full Run" scenario that runs each of the above scenarios as a step.

    This design allows us to quickly resume running a long scenario from the last step that failed without having to enable / disable steps in the Full Run scenario. Using a design like the one I suggested will allow you to start running a flow from any arbitrary point.
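
The resume behavior of this design can be sketched in plain Python. This is a rough illustration under stated assumptions, not Dataiku's scenario API: `run_from`, the runner callables, and the third scenario name `030_Export_Reports` are hypothetical; only the first two scenario names come from the comment above.

```python
def run_from(scenario_names, start, run_scenario):
    """Run prefix-numbered scenarios in sorted order, beginning at `start`.

    `run_scenario` is a hypothetical callable that takes a scenario name and
    returns True on success. Returns the name of the first scenario that
    fails, or None if every scenario succeeds.
    """
    ordered = sorted(scenario_names)  # numeric prefixes give the run order
    if start not in ordered:
        raise ValueError(f"unknown scenario: {start}")
    for name in ordered[ordered.index(start):]:
        if not run_scenario(name):
            return name
    return None


names = ["020_Calculate_Metrics", "010_Build_Inputs", "030_Export_Reports"]
executed = []


def failing_runner(name):
    # Stand-in runner: the metrics scenario fails on the first attempt.
    executed.append(name)
    return name != "020_Calculate_Metrics"


def fixed_runner(name):
    # Stand-in runner used after the problem has been fixed.
    executed.append(name)
    return True


# First full run stops at the failing scenario.
failed = run_from(names, "010_Build_Inputs", failing_runner)

# Resume from the failure point without rerunning 010_Build_Inputs.
result = run_from(names, failed, fixed_runner)
print(executed)
```

Because the run order comes from the name prefixes alone, resuming only needs the name of the scenario that failed; no steps have to be enabled or disabled.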

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    If it is OK with folks, I'd like to +1 the idea that a Scenario could offer a step type for "Build Flow outputs reachable from here". I have a number of projects that actually fan out at the end rather than coming down to a single dataset or a very small number of end datasets.

    In my use case, the project is a data normalization project. It serves cleaned-up, "normalized" datasets that give other projects useful data to start from, rather than having every project go after the data on its own.

    (Screenshot: somewhat complex flow that fans out.)

    (Screenshot: pointers to the datasets one would rebuild to get all of the other datasets in the project to rebuild.)

  • Katie
    Katie Dataiker, Registered, Product Ideas Manager Posts: 106 Dataiker

    Hello @Tuong-Vi!

    Thank you for your feedback. I wanted to let you know that this is something our dev team is currently working on.

    Stay tuned for more updates once it's released!

    Katie

  • tgb417

    @ktgross15,

    It is great to hear that there will be some work on this.

  • Katie

    Hello all!

    As promised, we have just released improvements to make schema propagation & building behavior even more intuitive in V12! You can now run downstream & propagate schema from the "run" button within a recipe, or from the flow's "build" button.

    See more detail in our reference docs & knowledge base article.

    Let me know if you have any questions!

    Katie

  • Tuong-Vi

    Thanks a lot @ktgross15, the teams will save a lot of time.
