Dataiku with backend staging areas

ejhharris · March 2023

I'm looking to build a pipeline where inputs and outputs should be taken from/written to a "clean area" where users can come in and access raw tables for their own projects etc.

The intermediate tables should be stored away in a separate "staging area" where most users are unable to access them.

All of the above is happening on a Snowflake instance.

Is there any performance benefit/penalty/best practice decision in Dataiku to be made in terms of whether these two areas are two schemas under the same database or two separate databases entirely.

e.g clean_db.project_name.input_table & staging_db.project_name.calc_table

vs

project_db.clean_schema.input_table & project_db.staging_schema.calc_table

Happy to clarify further if that doesn't make sense.

Turribeach · March 2023

The words “schema” and “database” have different meanings and implications depending on the database technology. I am not too familiar with Snowflake as I only connected to it a few times but I do remember having to choose a “data warehouse” where to run my queries in Snowflake.
In general you would want to avoid having separate Dataiku connections for your layers. If you have separate Dataiku connections Dataiku will treat these as complete separate servers and will have to pivot the data via the DSS local server. So I suggest you create a Snowflake connection and test your proposed setup.

Marlan · March 2023

Hi @ejhharris
,

Some notes in addition to @Turribeach
's comments.

We have used both the separate database schema approach and the separate database approach. The schema approach was used in Netezza and while it looks like it would be a bit easier in Snowflake I'd suggest using separate databases if you can. It's just simpler.

Note that we primarily use SQL Script recipes and these can handle input and output datasets that use different connections (although one still needs to specify the different input schema or database via hard coding or a project variable). I believe visual recipes don't offer this option so if you primarily use these then two separate schemas or databases and thus two DSS connections would not be a good way to go (otherwise as @Turribeach
noted these recipes would run inside DSS rather in Snowflake).

Do you really need two different schemas or databases? Could you accomplish the same result by only giving permissions to the output tables? Or only document the output tables and share those? This is what we do. You could also delete the intermediate tables after each run. It'd be most straightforward to run the entire process in one database and one DSS connection to that database.

Marlan

Dataiku with backend staging areas

Answers

Categories

Setup Info

Tags