Is there a simple way to reapply flow/recipes to a second dataset?

UserBird
UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
Hi

I created a flow that chains several recipes on a first dataset. The goal is to create a prediction model at the end of the flow.

Next I need to apply my model to an input dataset that has the same schema as my first dataset.

I could not figure out how to apply the whole flow to the 2nd dataset and I had to copy the flow steps (analysis, joins, python) and apply them to the 2nd dataset.

Is there a more straightforward way? The issue is that whenever I add a new step or modify a recipe to enrich the dataset, I have to do it twice.

Best regards

Geoff

Answers

  • jereze
    jereze Alpha Tester, Dataiker Alumni Posts: 190 ✭✭✭✭✭✭✭✭
    Hi,

    This is a good question. There is no out-of-the-box feature to do that today. But we are thinking about it.

    Actually, it has worked in a simple case since v2.0.0: when your data preparation is in the script of an Analysis (called "Analyse"), deploying the model also reproduces the script (i.e. all the processors).

    If you have a more complex flow (with Python recipes, etc.), there is a workaround: stack your two sources into a single dataset, apply your flow of transformations, then split the result back in two before modelling.

    Jeremy
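
The stack-then-split workaround described above can be sketched in plain pandas. This is only an illustration, not a built-in DSS feature: the `source` marker column, the helper names, and the `prepare` step are all hypothetical, and in a real Python recipe the two input frames would be read from your datasets rather than built inline.

```python
import pandas as pd


def stack_sources(train_df, score_df, marker="source"):
    """Concatenate two same-schema frames, tagging each row with its origin."""
    if list(train_df.columns) != list(score_df.columns):
        raise ValueError("both inputs must share the same schema")
    return pd.concat(
        [
            train_df.assign(**{marker: "train"}),
            score_df.assign(**{marker: "score"}),
        ],
        ignore_index=True,
    )


def split_sources(df, marker="source"):
    """Undo stack_sources: return (train, score) without the marker column."""
    train = df[df[marker] == "train"].drop(columns=marker).reset_index(drop=True)
    score = df[df[marker] == "score"].drop(columns=marker).reset_index(drop=True)
    return train, score


def prepare(df):
    """Hypothetical stand-in for the shared preparation steps of the flow."""
    out = df.copy()
    out["amount_per_item"] = out["amount"] / out["items"]
    return out


# Toy inputs with identical schemas (assumed column names).
first = pd.DataFrame({"amount": [10.0, 30.0], "items": [2, 3]})
second = pd.DataFrame({"amount": [8.0], "items": [4]})

stacked = stack_sources(first, second)
prepared = prepare(stacked)  # the preparation logic is written only once
train_df, score_df = split_sources(prepared)
```

Row-wise steps like the one in `prepare` are safe to share this way. Steps that aggregate over the whole stacked dataset (means, normalisation) would mix information between the two sources, so they need more care.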
  • okiriza
    okiriza Registered Posts: 5 ✭✭✭✭
    Hi, I am also interested in this feature. Is there any support for it in Dataiku v2.3?

    Thanks.
  • bored_panda
    bored_panda Registered Posts: 11 ✭✭✭✭
    Isn't it possible to export your project (project home page > actions > export this project), reimport it under a different name (DSS home page > mouse on the left > import), and then change the input dataset?
  • tjh
    tjh Registered Posts: 20 ✭✭✭✭
    However, this leads to very strange behaviour in the new project.
  • biodan25
    biodan25 Registered Posts: 2 ✭✭✭✭
    Is there any update? I've downloaded DSS 4.3.1 to evaluate whether it can support a project in which a series of data tables will be inserted as records in a database over time. I'd like to run a code recipe on each new 'record'. Note that each database record corresponds to a file/table containing multiple columns and rows. Is this currently possible?
  • asdfasdf
    asdfasdf Registered Posts: 2

    I’m doing timeseries demand forecasting on ~10K products, with distinct behaviors. So one model per product.

    In Spark I’d just do a groupby(productID).apply(modelCode).

    What’s an efficient-to-code, efficient-to-run way to do this in pandas in Dataiku?

    Is a partitioned model the best option? The data is sitting on Snowflake, so for partitioning, do I cluster on productID? (Actually, products are identified by a combination of 5 features.)
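
For reference, the pandas counterpart of the Spark `groupby(productID).apply(modelCode)` pattern mentioned above might look like the sketch below. It is a plain pandas illustration, not a Dataiku-specific answer and not a recommendation on partitioning: the column names are assumed, and the per-product "model" is just a straight-line trend fitted with `numpy.polyfit` as a placeholder for real forecasting code.

```python
import numpy as np
import pandas as pd

KEY = "product_id"  # assumed name; a list of columns also works for composite keys


def fit_trend(group):
    """Placeholder per-product model: least-squares line through demand over time."""
    slope, intercept = np.polyfit(group["t"], group["demand"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})


# Toy data: two products, three time steps each (assumed schema).
sales = pd.DataFrame(
    {
        "product_id": ["A", "A", "A", "B", "B", "B"],
        "t": [0, 1, 2, 0, 1, 2],
        "demand": [1.0, 3.0, 5.0, 10.0, 9.0, 8.0],
    }
)

# Select only the modelling columns so the function never sees the grouping key,
# then fit one model per product.
params = sales.groupby(KEY)[["t", "demand"]].apply(fit_trend).reset_index()
```

`params` ends up with one row per product holding the fitted coefficients. With around 10K products this loop runs in a single process, so whether it is fast enough depends on how expensive the real per-product model is.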

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,166 Neuron

    Please post a new thread. This thread is 8 years old.
