The ability to chain recipes without enforcing input/output datasets

Hi all,

Something I do a lot in DSS is build flows that chain together SQL, python and R jobs. Often these jobs are writing to a database as part of their operation. In these use cases I simply do not need an input or output dataset and I find myself creating endless dummy outputs simply for DSS.

I think it would be a cleaner experience if one could choose to simply chain recipes together without enforcing an output, in the way tools like Apache Airflow do.

I understand this would be quite a core change to how DSS functions, so I'm keen to hear your thoughts!

Ben

6 Comments
Mark_Treveil
Dataiker Alumni

Ben

Don't forget that you can optimise performance with SQL pipelines, which eliminate the creation of intermediate datasets. See https://doc.dataiku.com/dss/latest/sql/pipelines/sql_pipelines.html

 

CoreyS
Community Manager
Status changed to: Acknowledged
 
Mark_Treveil
Dataiker Alumni

Ben,

Intermediate datasets are a fundamental part of the DSS DNA. They give clear visualisation of data lineage. They allow for intuitive partial rebuilding of pipelines and debugging. They allow for click-driven building of flows with visual recipes, in combination with custom code-based recipes and plugins.

I believe Airflow is just a DAG of notebooks to be run.  The data dependency chain is completely hidden.  We don't believe that is an improvement.  

So the question is: what do you mean by dummy outputs? Can you give some examples? We are aware of some very legitimate situations where dummy inputs / outputs are needed, and we are looking to address these. But I am not sure whether these few edge cases are what you are referring to.

So could you give a bit more detail about the sort of flows you have, and why these intermediate datasets are irritating? 

It sounds like it is not about the performance cost of writing intermediate datasets? SQL and Spark pipelining are important tools for that concern in big-data use cases.

Do you have long chains of 1-in 1-out recipes, and if so what are the recipes doing?

Is it typing the output names? The screen real estate the datasets use? Are the datasets too big on the screen?

Is an option to hide them in 1-1-1 chains a worthwhile improvement? 

It would be great to see in more detail what you are trying to do, and what benefits you hope to achieve. 

Regards
Mark

 

ben_p
Neuron

Hi Mark,

Really appreciate you taking the time to respond. Here are a few examples:

This is a rather extreme SQL example, but here we are chaining many SQL recipes together, and each is running a CREATE OR REPLACE operation in BigQuery; no output is generated:

[Screenshot: flow.PNG]

Because we don't need an output but DSS requires one, we add the following code to every SQL job to load some dummy data:

SELECT 1 as dummy;

Here is another example with a Python recipe which is making API calls and sending some data away. Again, there is no output for some of these steps; I have highlighted the steps which contain dummy data:

[Screenshot: Flow2.PNG]
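
To make that concrete, the Python recipes in question end up looking roughly like this (the dataset name and API endpoint below are made up, it's just the shape of the pattern): a few lines of real work, then a placeholder write purely to satisfy the required output.

import dataiku
import pandas as pd
import requests

# The real work: push some data to an external API
# (endpoint and payload are illustrative only)
requests.post("https://api.example.com/ingest", json={"example": "value"})

# DSS still requires an output dataset, so we write a one-row placeholder
dummy_df = pd.DataFrame([{"dummy": 1}])
dataiku.Dataset("python_dummy_output").write_with_schema(dummy_df)  # dummy dataset name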

Screen real estate is not an issue for us; we have some large flows, but that's not a problem. Nor is processing speed, although without these dummy steps we would see some small improvements.

The ability to hide these datasets would not be a significant improvement as the work would have already gone into creating them and loading their dummy schema, if you see what I mean.

Hope that helps, let me know if I can clarify anything further.

Ben

Mark_Treveil
Dataiker Alumni

Ben

It is important to remember that the Flow is a data pipeline, not a task pipeline. It's currently a fundamental principle that recipes have outputs, because the pipeline builds datasets rather than running recipes.

In your example, clearly things work OK. Maybe it would help to think of the "dummy" output as a status record?
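
As a rough sketch of that idea (dataset and column names below are placeholders), a Python recipe could write a small status record rather than a meaningless row:

import dataiku
import pandas as pd
from datetime import datetime, timezone

# Record when the recipe ran and what it did, so the enforced output
# carries at least some useful information
status_df = pd.DataFrame([{
    "run_at": datetime.now(timezone.utc).isoformat(),
    "rows_pushed": 12345,   # whatever the recipe actually processed
    "status": "OK",
}])
dataiku.Dataset("recipe_status").write_with_schema(status_df)  # placeholder name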

For the SQL example, I am not sure what you are currently doing, but you can probably use a single SQL Script containing multiple statements, rather than a chain of SQL Queries. See our docs on the differences between the two.

Another option is to run a SQL step in a scenario instead of putting the code in the flow. The same goes for Python code.
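
For the Python case, a scenario "Execute Python code" step is just a script with no dataset contract at all, so something along these lines (endpoint purely illustrative) needs no dummy output; the SQL equivalent would be an "Execute SQL" scenario step.

import requests

# Runs as a scenario step: no input or output dataset is involved
resp = requests.post("https://api.example.com/ingest", json={"example": "value"})
resp.raise_for_status()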

There are clearly pros and cons of these alternatives.

natejgardner
Level 5

Another use-case is for terminal recipes. There are many cases where a recipe has no output, so a dummy dataset needs to be created. I frequently use Python to load graphs created with Dataiku pipelines into ArangoDB. Since there's no connector currently for that database and I like to have tight control over my insertion logic, I tend to insert the data using Python recipes. Since the output dataset is in an unsupported database, the recipe is the end of the flow. But in a project like this, I often have to create 10 to 20 dummy datasets to become the empty outputs of these recipes. I expect it's quite common for Python recipes to never touch their output dataset, at least for connecting to unsupported databases, sending data to APIs, or loading output files to external servers.
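
A stripped-down sketch of what one of these recipes looks like (connection details, dataset names, and the python-arango calls are illustrative, not copied from a real project):

import dataiku
import pandas as pd
from arango import ArangoClient  # python-arango

# Read the graph data built by the upstream Dataiku pipeline
edges = dataiku.Dataset("graph_edges").get_dataframe()  # hypothetical dataset name

# Insert into ArangoDB with full control over the insertion logic
client = ArangoClient(hosts="http://arangodb.example.com:8529")
db = client.db("graphs", username="loader", password="***")
db.collection("edges").insert_many(edges.to_dict("records"))

# The real output lives in ArangoDB, but DSS insists on an output dataset,
# so a throwaway one-row dummy gets written to terminate the flow
dataiku.Dataset("arango_load_dummy").write_with_schema(pd.DataFrame([{"dummy": 1}]))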

I think allowing terminal recipes would be a big improvement for workflows without many downsides.
