Dataiku with backend staging areas

Options
ejhharris
ejhharris Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1

I'm looking to build a pipeline where inputs and outputs should be taken from/written to a "clean area" where users can come in and access raw tables for their own projects etc.

The intermediate tables should be stored away in a separate "staging area" where most users are unable to access them.

All of the above is happening on a Snowflake instance.

Is there any performance benefit/penalty/best practice decision in Dataiku to be made in terms of whether these two areas are two schemas under the same database or two separate databases entirely.

e.g clean_db.project_name.input_table & staging_db.project_name.calc_table

vs

project_db.clean_schema.input_table & project_db.staging_schema.calc_table

Happy to clarify further if that doesn't make sense.

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,740 Neuron
    Options

    The words “schema” and “database” have different meanings and implications depending on the database technology. I am not too familiar with Snowflake as I only connected to it a few times but I do remember having to choose a “data warehouse” where to run my queries in Snowflake.
    In general you would want to avoid having separate Dataiku connections for your layers. If you have separate Dataiku connections Dataiku will treat these as complete separate servers and will have to pivot the data via the DSS local server. So I suggest you create a Snowflake connection and test your proposed setup.

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 317 Neuron
    Options

    Hi @ejhharris
    ,

    Some notes in addition to @Turribeach
    's comments.

    We have used both the separate database schema approach and the separate database approach. The schema approach was used in Netezza and while it looks like it would be a bit easier in Snowflake I'd suggest using separate databases if you can. It's just simpler.

    Note that we primarily use SQL Script recipes and these can handle input and output datasets that use different connections (although one still needs to specify the different input schema or database via hard coding or a project variable). I believe visual recipes don't offer this option so if you primarily use these then two separate schemas or databases and thus two DSS connections would not be a good way to go (otherwise as @Turribeach
    noted these recipes would run inside DSS rather in Snowflake).

    Do you really need two different schemas or databases? Could you accomplish the same result by only giving permissions to the output tables? Or only document the output tables and share those? This is what we do. You could also delete the intermediate tables after each run. It'd be most straightforward to run the entire process in one database and one DSS connection to that database.

    Marlan

Setup Info
    Tags
      Help me…