
The ability to chain recipes without enforcing input/output datasets

Hi all,

Something I do a lot in DSS is build flows that chain together SQL, Python, and R jobs. Often these jobs write to a database as part of their operation. In these cases I simply don't need an input or output dataset, and I find myself creating endless dummy outputs purely to satisfy DSS.

I think it would be a cleaner experience if one could simply chain recipes together without an enforced output, in the way tools like Apache Airflow allow.

I understand this would be quite a core change to how DSS functions, so keen to hear your thoughts!

Ben

11 Comments
Mark_Treveil
Dataiker Alumni

Ben

Don't forget that you can optimise the performance of SQL pipelines to eliminate the creation of  intermediate datasets. See https://doc.dataiku.com/dss/latest/sql/pipelines/sql_pipelines.html

 

CoreyS
Community Manager
Status changed to: Acknowledged
 
Mark_Treveil
Dataiker Alumni

Ben,

Intermediate datasets are a fundamental part of DSS's DNA. They give a clear visualisation of data lineage. They allow intuitive partial rebuilding of pipelines and debugging. They allow click-driven building of flows with visual recipes, in combination with custom code-based recipes and plugins.

I believe Airflow is just a DAG of notebooks to be run.  The data dependency chain is completely hidden.  We don't believe that is an improvement.  

So the question is: What do you mean by dummy outputs? Can you give some examples?  We are aware of some very legitimate situations where dummy input / output are needed, and we are looking to address these.  But I am not sure these few edge cases are what you are referring to?

So could you give a bit more detail about the sort of flows you have, and why these intermediate datasets are irritating? 

It sounds like it is not about the performance of writing intermediate datasets? SQL and Spark pipelining are important tools for this concern in big-data use cases.

Do you have long chains of 1-in 1-out recipes, and if so what are the recipes doing?

Is it typing the output names? Screen real-estate the datasets use?  Are datasets too big on the screen?

Is an option to hide them in 1-1-1 chains a worthwhile improvement? 

It would be great to see in more detail what you are trying to do, and what benefits you hope to achieve. 

Regards
Mark

 

ben_p
Neuron

Hi Mark,

Really appreciate you taking the time to respond, here are a few examples:

This is a rather extreme SQL example, but here we are chaining many SQL recipes together and each is running a CREATE OR REPLACE operation into BigQuery, no output is generated:

[Screenshot: flow.PNG]

Because we don't need an output but DSS requires one, we add the following code to every SQL job to load some dummy data:

SELECT 1 as dummy;

Here is another example with a python recipe which is making API calls and sending some data away, again there is no output to some of these steps, I have highlighted steps which contain dummy data:

[Screenshot: Flow2.PNG]

Screen real estate is not an issue for us; we have some large flows, but that's not a problem. Nor is processing speed, although without these dummy steps we would see some small improvements.

The ability to hide these datasets would not be a significant improvement as the work would have already gone into creating them and loading their dummy schema, if you see what I mean.

Hope that helps, let me know if I can clarify anything further.

Ben

Mark_Treveil
Dataiker Alumni

Ben

It is important to remember that the Flow is a data pipeline, not a task pipeline.  It's currently a fundamental principle that recipes have outputs because the pipeline builds datasets, not runs recipes.

In your example, clearly things work OK. Maybe it can help to think of the "dummy" output as a status record?

For the SQL example, I am not sure what you are currently doing, but you can probably use a single SQL script recipe containing multiple statements, rather than a chain of SQL query recipes. See our docs on the differences.
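To illustrate Mark's suggestion, a chain of one-statement SQL recipes can be collapsed into one script run from a single recipe. This is only a sketch: the table names are made up, and the actual execution call (e.g. via `SQLExecutor2` inside DSS) is left commented out.

```python
# Sketch: collapsing a chain of one-statement SQL recipes into a single
# multi-statement script. Table names are hypothetical.

def build_script(statements):
    """Join individual DDL/DML statements into one multi-statement script."""
    return ";\n".join(s.strip().rstrip(";") for s in statements) + ";"

statements = [
    "CREATE OR REPLACE TABLE staging.orders_clean AS SELECT * FROM raw.orders",
    "CREATE OR REPLACE TABLE marts.orders_daily AS "
    "SELECT order_date, COUNT(*) AS n FROM staging.orders_clean GROUP BY order_date",
]

script = build_script(statements)

# Inside DSS you could then run the whole script in one go, e.g.:
# from dataiku import SQLExecutor2
# SQLExecutor2(connection="bigquery").query_to_df(script)
```

The win is that the intermediate "dummy" datasets between each single-statement recipe disappear; only one recipe (and one output) remains in the flow.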

Another option is to run a SQL step in a scenario instead of putting the code in the flow. The same goes for Python code.

There are clearly pros and cons of these alternatives.

natejgardner
Neuron

Another use-case is for terminal recipes. There are many cases where a recipe has no output, so a dummy dataset needs to be created. I frequently use Python to load graphs created with Dataiku pipelines into ArangoDB. Since there's no connector currently for that database and I like to have tight control over my insertion logic, I tend to insert the data using Python recipes. Since the output dataset is in an unsupported database, the recipe is the end of the flow. But in a project like this, I often have to create 10 to 20 dummy datasets to become the empty outputs of these recipes. I expect it's quite common for Python recipes to never touch their output dataset, at least for connecting to unsupported databases, sending data to APIs, or loading output files to external servers.

I think allowing terminal recipes would be a big improvement for workflows without many downsides.
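The terminal-recipe pattern Nate describes can be sketched as below. This is an assumption-laden illustration: the collection names are invented, and the `python-arango` client call is commented out since it needs a live server; only the document-shaping step is shown runnable.

```python
# Sketch of a "terminal" Python recipe: push graph edges into ArangoDB,
# then write a one-row dummy output purely because DSS requires one.
# Edge data and collection names are hypothetical.

edges = [{"src": "a", "dst": "b"}, {"src": "b", "dst": "c"}]

# ArangoDB edge documents need _from/_to handles of the form "collection/key".
docs = [{"_from": f"nodes/{e['src']}", "_to": f"nodes/{e['dst']}"} for e in edges]

# With the python-arango library (not bundled with DSS), the insert would be:
# from arango import ArangoClient
# db = ArangoClient().db("graphs", username="...", password="...")
# db.collection("edges").insert_many(docs)

# Dummy output so the recipe has one, as the flow currently requires:
dummy_row = {"dummy": 1}
# import dataiku, pandas as pd
# dataiku.Dataset("arango_load_status").write_with_schema(pd.DataFrame([dummy_row]))
```

If terminal recipes were allowed, everything after the insert call could simply be deleted.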

Turribeach
Level 6

Very interesting discussion as I agree with all 3 posters at the same time. 😀

Like Ben, we also run DDL and DML statements against BigQuery. Since the only supported pattern is Python code recipes with BigQuery inputs and outputs using SQLExecutor2 to generate the results, we use BigQuery's Python API instead. While this is not directly supported by Dataiku, we don't see any issue with it, since we are using the official Google BQ Python library. Ben, wouldn't it be much simpler to run all your BQ DDLs in a single Python recipe talking directly to BigQuery's Python API? You would still need one dummy output dataset, but you would greatly simplify your flow.
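A rough sketch of that suggestion, assuming the official `google-cloud-bigquery` client: project, dataset, and table names are made up, and the client call is commented out so the snippet runs without credentials.

```python
# Sketch: run several BigQuery DDLs from one Python recipe via the official
# client, instead of one SQL recipe (plus dummy output) per statement.
# Names are hypothetical.

ddls = [
    "CREATE OR REPLACE TABLE proj.ds.t1 AS SELECT 1 AS x",
    "CREATE OR REPLACE TABLE proj.ds.t2 AS SELECT 2 AS y",
]

def run_ddls(ddls, client=None):
    """Run each DDL in order; with client=None, just report what would run."""
    executed = []
    for ddl in ddls:
        if client is not None:
            client.query(ddl).result()  # blocks until the job completes
        executed.append(ddl)
    return executed

# from google.cloud import bigquery
# run_ddls(ddls, client=bigquery.Client(project="proj"))
executed = run_ddls(ddls)  # dry run: no client, nothing sent to BigQuery
```

Ordering matters here: running the statements sequentially in one recipe preserves the dependency chain that the flow previously expressed with dummy datasets.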

Like Nate, most of our use cases for the above are terminal recipes, where we need to insert our ML output data into our BigQuery data warehouse. Our data warehouse is in a different GCP project and uses different BQ datasets than all our Dataiku flows. We found lots of issues trying to make sure Dataiku wouldn't drop/alter the table we want to insert data into. So moving to a fully custom Python solution gave us full control over the destination table. In addition, we needed a way to indicate to our DWH that the insert load had finished and was complete. So we developed a custom plugin that does all of this in Python, which we can easily reuse across projects to push data to our DWH. All the user needs to do is select the destination table name and whether the code should overwrite the table contents (truncate + insert) or just append data (insert).

Now to Mark's point: "Maybe it can help to think of the "dummy" output as a status record?" This is exactly what we ended up doing. As the Python recipes that insert data into our DWH are terminal, we simply created a dummy output dataset that populates a status table in append mode with all the stats of the recipe insert (start time, end time, table inserted to, total records, etc.). However, while we may still create a dataset with insert stats behind the scenes, I think it would be a cleaner look if we could remove these dummy datasets from our flow. Using dummy status datasets is confusing because you get a different output than the input, and nothing was really transformed in the recipe. What if the output of your flow is a call to an external API to pass on the result of some ML output? (Another use case we had.)
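The status-record idea described above can be sketched as a one-row payload built at the end of the terminal recipe. The field names are illustrative, not the plugin's actual schema.

```python
# Sketch of the status-record pattern: the terminal recipe appends one row
# of load stats to a status dataset instead of a meaningless dummy row.
# Field names are hypothetical.
from datetime import datetime, timezone

def make_status_row(table, n_rows, started_at, finished_at):
    """Build one status record summarising a completed insert."""
    return {
        "table_inserted_to": table,
        "total_records": n_rows,
        "start_time": started_at.isoformat(),
        "end_time": finished_at.isoformat(),
    }

started = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
finished = datetime(2023, 1, 1, 12, 5, tzinfo=timezone.utc)
row = make_status_row("dwh.orders", 1000, started, finished)

# In DSS this row would be appended to the status dataset, e.g.:
# import dataiku, pandas as pd
# dataiku.Dataset("load_status").write_dataframe(pd.DataFrame([row]))
```

This at least gives the mandatory output dataset a real meaning, even if a true terminal recipe would make it optional.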

Finally, to Mark's point that "it's currently a fundamental principle that recipes have outputs because the pipeline builds datasets, not runs recipes": I agree, but the exception should be when the resulting dataset is pushed beyond Dataiku's control (whether via SQL, API, code, etc.). In such cases it would be a lot less confusing to be able to terminate these recipes with no output. The principle also breaks for recipes with no inputs: I have a flow that downloads files from the SEC and starts with a Python recipe with no inputs, and another flow that calls an internal Python API to download data, again with no inputs. So for consistency it should be possible to have no outputs as well.

So I totally support the idea of being able to omit outputs on terminal recipes. Ultimately it is clear that there is no technical limitation to this feature: Python recipes don't check whether inputs and outputs are actually used, merely that they are defined. I even use this to add extra dependencies where needed. And, as I said, you can already have Python recipes with no inputs.

 

importthepandas
Level 4

Bumping this to, at the very least, hide output "metadata" datasets in Dataiku. Preferably we'd be able to do exactly what @Turribeach mentions: disable outputs from Python recipes or other code-based recipes where needed.

snehk
Level 1

I agree with the comments above! This change would really enhance the user experience, as it can be confusing to create unnecessary datasets and folders as outputs for SQL and Python recipes. It would also make flows simpler to understand and manage.

ktgross15
Dataiker
Status changed to: In Backlog

Thanks for all your feedback here! We have added the ability to create code recipes without output datasets to our backlog and will provide potential relevant updates here.

Katie