
Do I need to create a dataset for every recipe?

Solved!
DogaS
Level 3

I think having to create a dataset for every intermediary step in Dataiku is not very efficient, especially from a data storage standpoint. It causes a lot of redundant data to be stored while building a workflow.

Is there any way of combining and executing multiple recipes together, or of not physically storing the data at every intermediary step when multiple recipes lead to a final output?

E.g. if I first use a Group By recipe to get the counts of certain categories, and then a Sort recipe to order by the highest frequency, can I skip generating the intermediate dataset (or at least not actually store it) and only have the output of the sort stored?
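For reference, here is the kind of fusion I mean, sketched in pandas with made-up sample data and a hypothetical `category` column. The Group By and Sort steps chain into one expression, so no intermediate result ever needs to be stored:

```python
import pandas as pd

# Made-up sample data standing in for the input dataset
df = pd.DataFrame({"category": ["a", "b", "a", "c", "a", "b"]})

# Step 1 (Group By recipe): count rows per category
# Step 2 (Sort recipe): order by highest frequency
# ...chained together, so nothing intermediate is persisted:
counts = (
    df.groupby("category")
      .size()
      .reset_index(name="count")
      .sort_values("count", ascending=False)
      .reset_index(drop=True)
)
print(counts)
```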


Operating system used: Windows 10

0 Kudos
1 Solution
Turribeach

The Dataiku design decision to persist every dataset at every step is one of the key unique selling points of the product. It allows less experienced users, aka clickers, to see how the data changes through the pipeline, making it much easier to understand. It also lets them slowly build very complicated data transformations in a visual, step-by-step way, while assisting with debugging and enabling data quality checks at each step. Coders (the other user persona Dataiku tries to appeal to) may find this pattern wasteful of storage and too verbose. Having said that, Dataiku does not force you to design the flow in a specific way. If you are a coder and prefer to bundle multiple steps into a single recipe, you are welcome to do so using the available code recipes (Python, SQL, Shell, Spark, etc.). However, please note that your project will not be easy to understand for other users who are not coders like you (aka clickers: testers, data analysts, BAs, PMs, etc.).
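As an illustration, a single Python code recipe could bundle the Group By and Sort steps from the original question into one transformation. This is only a sketch: the dataset and column names are made up, and the dataiku read/write calls are shown as comments since they only work inside an actual Dataiku recipe:

```python
import pandas as pd

# Inside a real Dataiku Python recipe you would read the input with the dataiku API:
#   import dataiku
#   df = dataiku.Dataset("my_input").get_dataframe()

def count_and_sort(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Group By (row counts per value of `col`) and Sort (descending
    by count) performed in one step, with no intermediate dataset."""
    return (df.groupby(col)
              .size()
              .reset_index(name="count")
              .sort_values("count", ascending=False)
              .reset_index(drop=True))

# Demo with made-up data:
result = count_and_sort(pd.DataFrame({"cat": ["x", "y", "x"]}), "cat")

# Back in Dataiku, only this final result would be persisted:
#   dataiku.Dataset("my_output").write_with_schema(result)
```

The trade-off discussed above applies: this recipe stores nothing in between, but a clicker opening the flow sees one opaque Python box instead of two inspectable visual steps.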

There is also a hybrid option called SQL pipelines. It allows you to execute a group of recipes as a single entity without persisting the individual recipe outputs. There are several caveats to using this option: all your recipes must run in SQL engine mode, which means a lot of code recipes and some visual recipes will not be compatible with SQL pipelines, and all the recipes must use the same Dataiku connection. Finally, since you won't be persisting the intermediate outputs, problems in your flow become harder to debug, although you can temporarily turn SQL pipelines off on the datasets where you need to check something. And of course SQL pipelines make the project harder to understand for clickers and prevent you from implementing data quality checks on your intermediate datasets, as these are not persisted.

Ultimately this is a case of having your cake and eating it too. You can't have it both ways, so you need to make the decision that works best for your use case. This, again, is one of Dataiku's strengths: you can make this choice on a project-by-project basis, and even mix and match these patterns in the same flow, using the solution that best fits the problem you are trying to solve.

One thing to note is that these days data storage is very cheap and Data Engineers / Data Scientists are not. So in my view, Dataiku's way of doing things saves far more in human costs, by allowing more users to use the product, than it adds in extra storage costs from persisting intermediate results. Finally, whereas there used to be concerns about how much data you can store in a database, these days there are plenty of database technologies like Snowflake, Databricks or Google BigQuery that can store and handle virtually unlimited amounts of data.


4 Replies

qweenfo
Level 2

Yes, with Dataiku you can optimize your work processes. First by combining multiple recipes together without having to create datasets for each step in between. This can help reduce redundant data storage. Additionally, you can configure recipes to overwrite existing data sets or use in-memory processing to minimize physical data storage.

0 Kudos

@qweenfo wrote:

Yes, with Dataiku you can optimize your work processes. First by combining multiple recipes together without having to create datasets for each step in between. 


How do you suggest combining multiple Visual recipes of different types (i.e. Group By, Window, Prepare, etc.)? Your suggestion makes no sense.


@qweenfo wrote:

Additionally, you can configure recipes to overwrite existing data sets or use in-memory processing to minimize physical data storage.


This is the default behavior, so it makes no sense to suggest it.

Are you a GenAI chatbot?

0 Kudos
DogaS
Level 3
Author

My thoughts exactly...

0 Kudos