Distinct recipe running in-database

davidmakovoz Neuron, Posts: 67
edited July 16 in Using Dataiku

I have a Distinct recipe that computes counts of rows for each distinct combination of 8 columns.

I see that the recipe is running In-database, which to the best of my knowledge means that it is equivalent to running the corresponding SQL statement directly in the database. Moreover, the log file contains the SQL statement.
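
For reference, the Distinct recipe essentially boils down to a GROUP BY with a COUNT(*) over the selected columns. A rough sketch of the kind of statement the log shows (table and column names here are placeholders, not my real ones):

    -- Placeholder table/column names standing in for my real dataset
    SELECT
      col1, col2, col3, col4, col5, col6, col7, col8,
      COUNT(*) AS row_count
    FROM material
    GROUP BY
      col1, col2, col3, col4, col5, col6, col7, col8;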

However, when I take that SQL statement and create a SQL recipe with it, it runs much, much faster, something like 10 sec vs 5 min.

When I examine the log of the distinct recipe, I see a long list of emit statements:

[2020/09/30-16:56:25.441] [FRT-57-FlowRunnable] [INFO] [dku.flow.sql] act.compute_material_distinct_NP - Emitted 1450000 rows

The log of the SQL recipe doesn't have those emit statements.

So, in the end it looks like running a Distinct recipe, even In-database, is not the same as running a SQL recipe.

Are there any explanations in the documentation of the differences between these two recipes?

Thanks

David

Best Answers

  • fchataigner2 (Dataiker, Posts: 355)
    Answer ✓

    Hi,

    The differences in runtime for SQL recipes can usually be traced to:

    - writing the output via DSS (as in your case): the database isn't handling the write, so the data has to be transmitted to the DSS process and written from there. This can happen if the output isn't a dataset the database can write to, or if it's on a different connection than the input (see the sketch after this list)

    - comparing against the runtime of the bare SELECT statement in a notebook (not your case, since you used a recipe): most interactive tools only retrieve the first rows and present them to the user, not the full output of the query
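
    To illustrate the first point, here is a rough sketch with placeholder table and column names: when the output dataset is on the same connection, DSS can push the whole job down as a single statement; otherwise it only runs the SELECT and has to fetch and write every row itself, which is what the "Emitted N rows" lines correspond to.

        -- Same connection: everything stays in the database, no rows transit through DSS
        INSERT INTO material_distinct
        SELECT col1, col2, col3, col4, col5, col6, col7, col8, COUNT(*) AS row_count
        FROM material
        GROUP BY col1, col2, col3, col4, col5, col6, col7, col8;

        -- Different connection or non-SQL output: DSS runs only the SELECT,
        -- then streams the 1.45M rows out and writes them itself
        SELECT col1, col2, col3, col4, col5, col6, col7, col8, COUNT(*) AS row_count
        FROM material
        GROUP BY col1, col2, col3, col4, col5, col6, col7, col8;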

  • GCase (Dataiker, Posts: 27)
    Answer ✓

    Hi @davidmakovoz,

    Based upon this, it appears that you used the SQL engine, but your output was written to a file. Depending upon the network and its speed, writing out 1.45 million rows likely took most of the time. If you go into the log, you should be able to see the time between the execution of the script, the start of row emission, and the completion of row emission. If you are writing from a database to separate storage (e.g. a filesystem on a network server), the data needs to flow as follows:

    1. The SQL query is sent to the database
    2. The database finishes computing the results
    3. The database sends the data to the DSS server
    4. DSS sends the data to the final storage

    To increase the speed, write back into the database instead of an outside file. Depending upon your source DB and final storage location, you may be able to use the fast path in the Sync recipe. For example, if you want to move data from Snowflake to S3, it is best to write to a Snowflake table and then use a Sync recipe with the fast path between Snowflake and S3. The fast path ensures that data does not pass through DSS and is typically orders of magnitude faster.

    https://doc.dataiku.com/dss/latest/other_recipes/sync.html
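
    As a rough, simplified sketch of those two steps in SQL terms (placeholder table, bucket, and credential values; in practice DSS generates the real statements when the output dataset is on the Snowflake connection, and the Sync recipe drives the unload):

        -- Step 1: keep the GROUP BY result inside Snowflake (output dataset on the same connection)
        CREATE TABLE material_distinct AS
        SELECT col1, col2, col3, col4, col5, col6, col7, col8, COUNT(*) AS row_count
        FROM material
        GROUP BY col1, col2, col3, col4, col5, col6, col7, col8;

        -- Step 2: with the fast path, Snowflake unloads straight to S3,
        -- so the rows never stream through the DSS backend
        COPY INTO 's3://my-bucket/material_distinct/'
        FROM material_distinct
        CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
        FILE_FORMAT = (TYPE = CSV);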

    Hope this helps.

    Grant
