Best practice - one prep recipe or multiple ones?

aw30

We are adjusting a data set that needs a lot of changes in terms of cleaning the data, adding columns, etc.

From what I have seen, the best practice seems to be to have one prep recipe and group the steps inside it so you don't get lost, rather than having multiple prep recipes that break the steps up.

My reasoning is that each recipe writes an output data set when it is processed, so splitting the work over a number of prep recipes takes more resources than including all the steps in one recipe.
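
To make that reasoning concrete, here is a rough sketch in plain Python (not the Dataiku API) that contrasts one grouped pass with one "recipe" per step. The step functions, column names, and file names are all hypothetical; the only point is that both routes should give the same result while the split version writes more data sets along the way.

```python
import csv
import os
import tempfile

# Hypothetical cleaning steps: each takes a row (dict) and returns a new row.
def strip_whitespace(row):
    return {k: v.strip() for k, v in row.items()}

def add_full_name(row):
    return {**row, "full_name": f"{row['first']} {row['last']}"}

def uppercase_country(row):
    return {**row, "country": row["country"].upper()}

STEPS = [strip_whitespace, add_full_name, uppercase_country]

def write_rows(path, rows):
    rows = list(rows)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

def read_rows(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def one_recipe(src, dst, steps):
    """Apply every step in a single pass; returns the number of data sets written."""
    rows = read_rows(src)
    for step in steps:
        rows = [step(r) for r in rows]
    write_rows(dst, rows)
    return 1

def many_recipes(src, dst, steps, workdir):
    """One 'recipe' per step; every step reads and writes a data set."""
    current, writes = src, 0
    for i, step in enumerate(steps):
        last = i == len(steps) - 1
        out = dst if last else os.path.join(workdir, f"intermediate_{i}.csv")
        write_rows(out, [step(r) for r in read_rows(current)])
        writes += 1
        current = out
    return writes

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.csv")
    write_rows(src, [{"first": " Ada ", "last": "Lovelace", "country": "uk"}])
    single_writes = one_recipe(src, os.path.join(tmp, "single.csv"), STEPS)
    multi_writes = many_recipes(src, os.path.join(tmp, "multi.csv"), STEPS, tmp)
    single_out = read_rows(os.path.join(tmp, "single.csv"))
    multi_out = read_rows(os.path.join(tmp, "multi.csv"))
```

In this toy version the extra cost is just the additional reads and writes; how much that matters in a real flow depends on data size and on where the data sets are stored.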

Can someone confirm my understanding or identify elements that should be considered that may impact resources when adding steps to your flow?

Thank you for the help on this!!!

Best Answers

  • Ignacio_Toledo
    Answer ✓

    Hi @aw30,

    I see two aspects or sides to your question:

    • Best practices in terms of keeping the preparation flow simple and understandable
    • Optimizing the preparation flow to minimize the use of resources and/or computing time

    For small data (when you don't need a cluster to do the job, the data fits in memory, etc.) I would focus on keeping the preparation flow as simple as possible while still keeping it understandable. As an example, when doing data preparation and cleaning, some of our users would use a Python recipe (instead of a visual approach), but we actively recommend that they use visual recipes instead, because the flow is then easier for someone else to read. We also stress that people should separate the cleaning steps from the analysis steps.

    However, when you have big data, the focus changes to optimizing the use of resources and computing time, and in that case "readability" becomes secondary. How you optimize the cleaning will depend on the computational engines you have available. For example, if you are using Spark, you will most probably try to reduce the number of "recipes" to a minimum because of the overhead of setting up a Spark instance.

    That is my view, but I would love to hear what other people think and what their approaches are.
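
    As a back-of-the-envelope illustration of that overhead argument, the sketch below assumes every recipe pays a fixed engine start-up cost and that the total processing work is the same however it is split. The function name and all the numbers are made up for illustration; they are not measurements from Spark or DSS.

```python
def estimated_runtime(n_recipes, startup_seconds, processing_seconds):
    """Total time if each recipe pays one fixed start-up cost (assumed model)."""
    return n_recipes * startup_seconds + processing_seconds

# Hypothetical figures: 60 s to start the engine, 300 s of actual processing.
one_recipe_time = estimated_runtime(1, startup_seconds=60, processing_seconds=300)
five_recipe_time = estimated_runtime(5, startup_seconds=60, processing_seconds=300)

# Share of the total runtime spent on start-up alone in each layout.
one_overhead_share = 60 / one_recipe_time
five_overhead_share = (5 * 60) / five_recipe_time
```

    Under these assumptions the split layout spends half its time on start-up. The model ignores the cost of writing and re-reading intermediate data sets, which would widen the gap further.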

    Cheers!

  • CoreyS
    Answer ✓

    Hi @aw30, thanks for your question. Because the Prepare recipe is such a popular recipe, there should be a lot of different perspectives to answer from. Although this will not directly answer your question, as a resource I do recommend this course from the Dataiku Academy: Advanced Prepare Recipe Usage

    I'd be interested to see though how others respond and if they agree with your logic, which I believe is pretty sound.

  • tgb417
    Answer ✓

    @aw30, @Ignacio_Toledo

    I work with relatively small data sets, usually fewer than a few million rows. Sometimes column counts grow to 100 columns or so.

    I find myself:

    • Using a Prepare recipe for each data source: dealing with dates and times, fixing obvious data dirtiness, and loading from the original data source into a local analysis repository, usually PostgreSQL.
    • The following might happen multiple times and in various orders
      • Then I will do joins of various Data sources
      • Then windowing functions
    • Sometimes I'll have a final prepare recipe. This produces the master data set for modeling and visualizing.
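
    The shape of that flow can be sketched in plain Python as a chain of small functions, one per stage. This is only an illustration of the structure, not DSS code: the data sets, column names, and function names are hypothetical, and in DSS each stage would be a visual recipe rather than a function.

```python
from itertools import groupby
from operator import itemgetter

def prepare_sales(rows):
    """Per-source prepare: parse types and drop obviously dirty rows."""
    return [
        {"customer_id": r["customer_id"], "day": r["day"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] not in ("", None)
    ]

def prepare_customers(rows):
    """Per-source prepare: normalize the region label."""
    return [
        {"customer_id": r["customer_id"], "region": r["region"].strip().title()}
        for r in rows
    ]

def join_on_customer(sales, customers):
    """Join stage: left join sales to customers on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    return [{**s, "region": by_id.get(s["customer_id"], {}).get("region")} for s in sales]

def running_total(rows):
    """Window stage: cumulative amount per customer, ordered by day."""
    out = []
    ordered = sorted(rows, key=itemgetter("customer_id", "day"))
    for _, group in groupby(ordered, key=itemgetter("customer_id")):
        total = 0.0
        for r in group:
            total += r["amount"]
            out.append({**r, "running_amount": total})
    return out

def final_prepare(rows):
    """Final prepare: keep only the columns needed downstream."""
    keep = ("customer_id", "region", "day", "running_amount")
    return [{k: r[k] for k in keep} for r in rows]

raw_sales = [
    {"customer_id": "c1", "day": "2023-01-02", "amount": "5"},
    {"customer_id": "c1", "day": "2023-01-01", "amount": "10"},
    {"customer_id": "c2", "day": "2023-01-01", "amount": ""},  # dirty row, dropped
]
raw_customers = [{"customer_id": "c1", "region": " north "}]

master = final_prepare(
    running_total(
        join_on_customer(prepare_sales(raw_sales), prepare_customers(raw_customers))
    )
)
```

    Keeping per-source cleaning separate from the join and window stages means a change to one source's cleaning rules only touches one function (or, in a flow, one recipe).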

    Sometimes, interspersed in the above flow, there may be some data enrichment models such as clustering, or prediction of something like future home prices.

    For Modeling and Visualization:

    • We have a split for Validate & Train Steps.
    • Within the Visual Analysis (Lab)
      • There I will script steps to do model-specific filtering
      • Model-specific Feature Creation.

    I've not come to any conclusions about the best way to do "unit of analysis" changes that will be specifically used for visualization. (Sometimes this adds to the messiness of my project flows.)

    I've not fully explored the idea of having projects for ETL and separate projects for Modeling and Visualization; however, this is something I've considered for "re-usability".

    Interested in hearing what others think.
