
The ability to chain recipes without enforcing input/output datasets

Hi all,

Something I do a lot in DSS is build flows that chain together SQL, Python and R jobs. Often these jobs write to a database as part of their operation. In these use cases I simply do not need an input or output dataset, and I find myself creating endless dummy outputs purely to satisfy DSS.

I think it would be a cleaner experience if one could choose to simply chain recipes together without enforcing an output, in the way of tools like Apache Airflow.

I understand this would be quite a core change to how DSS functions, so keen to hear your thoughts!


Dataiker Alumni


Don't forget that you can optimise the performance of SQL pipelines to eliminate the creation of intermediate datasets. See


Community Manager
Status changed to: Acknowledged
Dataiker Alumni


Intermediate datasets are a fundamental part of DSS DNA. They give clear visualisation of data lineage. They allow for intuitive partial rebuilding of pipelines and debugging. They allow for click-driven building of flows with visual recipes, in combination with custom code-based recipes and plugins.

I believe Airflow is just a DAG of notebooks to be run. The data dependency chain is completely hidden. We don't believe that is an improvement.

So the question is: what do you mean by dummy outputs? Can you give some examples? We are aware of some very legitimate situations where dummy inputs or outputs are needed, and we are looking to address these. But I am not sure these few edge cases are what you are referring to?

So could you give a bit more detail about the sort of flows you have, and why these intermediate datasets are irritating? 

It sounds like it is not about the performance of writing intermediate datasets? SQL and Spark pipelining are important tools for this concern in big-data use cases.

Do you have long chains of 1-in 1-out recipes, and if so what are the recipes doing?

Is it typing the output names? Screen real-estate the datasets use?  Are datasets too big on the screen?

Is an option to hide them in 1-1-1 chains a worthwhile improvement? 

It would be great to see in more detail what you are trying to do, and what benefits you hope to achieve. 




Hi Mark,

Really appreciate you taking the time to respond, here are a few examples:

This is a rather extreme SQL example, but here we are chaining many SQL recipes together. Each one runs a CREATE OR REPLACE operation in BigQuery, and no output is generated:


Because we don't need an output but DSS requires one, we add the following code to every SQL job to load some dummy data:

SELECT 1 as dummy;

Here is another example, with Python recipes that make API calls and send some data away. Again, some of these steps have no output. I have highlighted the steps which contain dummy data:


Screen real estate is not an issue for us; we have some large flows but that's not a problem. Nor is processing speed, although without these dummy steps we would see some small improvements.

The ability to hide these datasets would not be a significant improvement as the work would have already gone into creating them and loading their dummy schema, if you see what I mean.

Hope that helps, let me know if I can clarify anything further.


Dataiker Alumni


It is important to remember that the Flow is a data pipeline, not a task pipeline.  It's currently a fundamental principle that recipes have outputs because the pipeline builds datasets, not runs recipes.

In your example, clearly things work OK. Maybe it can help to think of the "dummy" output as a status record?
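
To make the "status record" idea concrete, here is a minimal sketch of what a side-effect-only Python recipe might write instead of a meaningless dummy row. The function name, column names and the recipe/dataset names are all made up for illustration, not a DSS convention.

```python
from datetime import datetime, timezone


def build_status_record(recipe_name, rows_sent, started_at, finished_at, ok=True):
    """Return one row describing a side-effect-only run (hypothetical schema)."""
    return {
        "recipe": recipe_name,
        "status": "success" if ok else "failed",
        "rows_sent": rows_sent,
        "started_at": started_at.isoformat(),
        "duration_s": (finished_at - started_at).total_seconds(),
    }


# Example values; a real recipe would take these from its own run.
started = datetime(2021, 11, 1, 9, 0, 0, tzinfo=timezone.utc)
finished = datetime(2021, 11, 1, 9, 0, 42, tzinfo=timezone.utc)
record = build_status_record("push_to_api", 1250, started, finished)
print(record)

# Inside a DSS Python recipe the row could then be written to the output
# dataset, for example (assumed dataset name, not executed here):
#   import dataiku, pandas as pd
#   dataiku.Dataset("push_to_api_status").write_with_schema(pd.DataFrame([record]))
```

The output dataset then carries a small audit trail of each run rather than a placeholder value.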

For the SQL example, I am not sure what you are currently doing, but you probably can use a single SQL Script containing multiple statements, rather than a chain of SQL Queries.  See our docs on the differences
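
As a rough illustration of the "one script, many statements" idea, the sketch below runs several statements in a single call. It uses Python's built-in sqlite3 as a stand-in for BigQuery, so the table names are invented and DROP TABLE IF EXISTS followed by CREATE TABLE substitutes for CREATE OR REPLACE, which SQLite does not support.

```python
import sqlite3

# Statements that might otherwise be separate chained recipes,
# each needing its own dummy output.
SCRIPT = """
DROP TABLE IF EXISTS staging;
CREATE TABLE staging AS
    SELECT 1 AS id, 10 AS amount
    UNION ALL
    SELECT 2, 32;

DROP TABLE IF EXISTS totals;
CREATE TABLE totals AS
    SELECT SUM(amount) AS total FROM staging;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCRIPT)  # runs all statements in order, in one step

# Only the final table needs to be exposed as a real output.
total = conn.execute("SELECT total FROM totals").fetchone()[0]
print(total)
conn.close()
```

With this shape, only the last table has to be declared as an output, and the intermediate statements no longer need a dataset each.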

Another option is to run a SQL step in a scenario instead of putting the code in the Flow. The same goes for Python code.

There are clearly pros and cons of these alternatives.

Level 6

Another use-case is for terminal recipes. There are many cases where a recipe has no output, so a dummy dataset needs to be created. I frequently use Python to load graphs created with Dataiku pipelines into ArangoDB. Since there's no connector currently for that database and I like to have tight control over my insertion logic, I tend to insert the data using Python recipes. Since the output dataset is in an unsupported database, the recipe is the end of the flow. But in a project like this, I often have to create 10 to 20 dummy datasets to become the empty outputs of these recipes. I expect it's quite common for Python recipes to never touch their output dataset, at least for connecting to unsupported databases, sending data to APIs, or loading output files to external servers.

I think allowing terminal recipes would be a big improvement for workflows without many downsides.