Best practice - one prep recipe or multiple ones?

Solved!
aw30
Level 3
We are adjusting a data set that needs a lot of changes in terms of cleaning the data, adding columns, etc.

From what I have seen, the best practice seems to be a single prep recipe with related steps grouped together so you don't get lost, rather than multiple prep recipes that break the steps up.

My reasoning is that each recipe creates a data set once it is processed, so splitting the steps over several prep recipes takes more resources than including all the steps in one recipe.

Can someone confirm my understanding or identify elements that should be considered that may impact resources when adding steps to your flow?

Thank you for the help on this!!!

3 Solutions
Ignacio_Toledo

Hi @aw30,

I see two aspects or sides to your question:

  • Best practices in terms of keeping the preparation flow simple and understandable
  • Optimizing the preparation flow to minimize the use of resources and/or computing time

For small data (when you don't need a cluster to do the job, the data fits in memory, etc.), I would focus on keeping the preparation flow as simple as possible while still keeping it understandable. For example, some of our users would do data preparation and cleaning in a Python recipe instead of taking a visual approach, but we actively recommend visual recipes because the flow is then easier for someone else to read. We also stress that people should separate the cleaning steps from the analysis steps.

However, when you have big data, the focus shifts to optimizing the use of resources and computing time, and "readability" becomes secondary. How you optimize the cleaning will depend on the computational engines you have available. For example, if you are using Spark, you will most probably try to keep the number of "recipes" to a minimum because of the overhead of setting up a Spark instance.
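To make the overhead argument concrete, here is a rough back-of-the-envelope sketch. The function `flow_cost` and all the numbers are hypothetical, chosen only for illustration; they are not measurements from DSS or Spark. The assumption is that every recipe pays a fixed engine startup cost plus one read/write of its intermediate data set, while the transformation work itself is the same however the steps are split.

```python
def flow_cost(n_recipes, startup_s, io_s, compute_s):
    """Estimated wall-clock seconds for a chain of prepare recipes.

    Assumption: each recipe pays a fixed startup cost and one
    read/write of an intermediate data set; the total transformation
    work (compute_s) does not depend on how the steps are split.
    """
    return n_recipes * (startup_s + io_s) + compute_s


# Hypothetical numbers, not measurements.
single = flow_cost(n_recipes=1, startup_s=30, io_s=20, compute_s=120)
split = flow_cost(n_recipes=4, startup_s=30, io_s=20, compute_s=120)

print(single, split)  # 170 320
```

Under these assumptions the per-recipe overhead dominates once startup and I/O are large relative to the actual work, which is the big-data case. For small data the same overhead is usually negligible, so readability can win.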

That is my view, but I would love to hear what other people think and what their approaches are.

Cheers!

CoreyS
Community Manager

Hi @aw30, thanks for your question. Because the Prepare recipe is so popular, there should be many different perspectives on this. Although it will not directly answer your question, I do recommend this course from the Dataiku Academy as a resource: Advanced Prepare Recipe Usage

I'd be interested to see though how others respond and if they agree with your logic, which I believe is pretty sound. 

tgb417
Neuron

@aw30, @Ignacio_Toledo

I work with relatively small data sets, usually fewer than a few million rows. Column counts sometimes grow to 100 or so.

I find myself:

  • Using a prepare recipe for each data source: dealing with dates and times, fixing obvious data dirtiness, and loading from the original data source into a local analysis repository, usually PostgreSQL.
  • Repeating the following, possibly multiple times and in various orders:
    • Joins of various data sources
    • Windowing functions
  • Sometimes adding a final prepare recipe that produces the master data set for modeling and visualizing.

Sometimes, interspersed in the above flow, there are data enrichment models such as clustering, or prediction of something like future home prices.
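As a rough illustration of the flow shape described above (one prepare step per source, then a join, then a windowing step), here is a minimal sketch in plain Python. It is not DSS code: the function names, the sample rows, and the `store`/`day`/`amount`/`region` columns are all hypothetical stand-ins for whatever the real data sources contain.

```python
from datetime import date
from itertools import groupby


def prepare_sales(rows):
    # Per-source prepare: parse ISO dates, cast amounts, drop rows with no amount.
    return [
        {"store": r["store"],
         "day": date.fromisoformat(r["day"]),
         "amount": float(r["amount"])}
        for r in rows
        if r["amount"] not in ("", None)
    ]


def prepare_stores(rows):
    # Per-source prepare: normalise region names into a lookup keyed by store.
    return {r["store"]: r["region"].strip().title() for r in rows}


def join_region(sales, stores):
    # Join step: left join of sales onto the store lookup.
    return [{**s, "region": stores.get(s["store"])} for s in sales]


def running_total(rows):
    # Window step: cumulative amount per store, ordered by day.
    out = []
    ordered = sorted(rows, key=lambda r: (r["store"], r["day"]))
    for _, group in groupby(ordered, key=lambda r: r["store"]):
        total = 0.0
        for r in group:
            total += r["amount"]
            out.append({**r, "running_total": total})
    return out


# Hypothetical sample data.
raw_sales = [
    {"store": "A", "day": "2021-01-02", "amount": "5"},
    {"store": "A", "day": "2021-01-01", "amount": "10"},
    {"store": "B", "day": "2021-01-01", "amount": ""},
    {"store": "B", "day": "2021-01-03", "amount": "7.5"},
]
raw_stores = [
    {"store": "A", "region": " north "},
    {"store": "B", "region": "SOUTH"},
]

master = running_total(
    join_region(prepare_sales(raw_sales), prepare_stores(raw_stores))
)
```

Keeping each source's cleaning in its own function mirrors the one-prepare-recipe-per-source habit: the join and window steps can then assume clean, typed inputs.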

For Modeling and Visualization:

  • We have a split for the validate and train steps.
  • Within the Visual Analysis (Lab), I script steps for:
    • Model-specific filtering
    • Model-specific feature creation

I've not come to any conclusions about the best way to do "unit of analysis" changes that will be specifically used for visualization.  (Sometimes this adds to the messiness of my project flows.)

I've not fully explored the idea of having separate projects for ETL and for modeling and visualization; however, this is something I've considered for "re-usability".

Interested in hearing what others think.

--Tom
