
Scaling snowflake virtual warehouse as a part of recipe

tommas
Level 1

Hello,

Just a suggestion for an improvement to the Snowflake integration in DSS. It would be nice to have the option to dynamically scale the warehouse up or down as part of a visual recipe, because some heavy joins or other operations require more computing power than the rest of the flow.

Currently I work around this by defining the scaling as separate steps in a scenario, but that also means I have to divide the scenario into smaller steps.

I think it could be part of the Advanced tab of the recipe, where you would call the Snowflake ALTER WAREHOUSE command to scale up for that one recipe. It would also be nice to have a post-recipe command to scale it back down, or to have that happen automatically.
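The pre/post-recipe commands described here could be as simple as a pair of Snowflake statements (the warehouse name and sizes below are made up for illustration):

```sql
-- Hypothetical pre-recipe statement: scale up before the heavy join
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'XLARGE';

-- ... the recipe's generated SQL runs here ...

-- Hypothetical post-recipe statement: scale back down afterwards
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'SMALL';
```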

Cheers,

Tomas

 

 

21 Replies
JCR
Dataiker

Hi Tomas,

Thanks very much for the suggestion. I'm the product manager looking after our Snowflake integrations. 

Indeed, this is something I have on my radar and have started to discuss with Snowflake. We do want to eventually help our users optimize compute costs, although we're not quite sure how yet.

If you can spare some time, I'd be interested in following up with you directly.

Thanks

JC 

dan_fejtek
Level 2

Hi, I support this topic, as we are facing the same issue. Running a pre-SQL or post-SQL would be very welcome (similar to what is possible with the Impala engine). Currently the only way is to use a code/SQL recipe.

Dan

JCR
Dataiker

Thanks @dan_fejtek . Quick question for you both: if you are already changing the DW size with code in your flows, how do you manage concurrent execution of flows and the possible conflict between two flows trying to resize?

For context, Snowflake recommended that we not resize the DW within the flow, but instead allow switching to a different DW for each recipe.

Thoughts?

JC

tomass
Level 1

I'm having an issue with my account and cannot log in to the one I created yesterday, so I created a new one 🙂

Yes, two flows trying to resize the same warehouse could be an issue. It is not a problem for me right now, because I use a separate connection per project, or, when the same connection is shared by multiple projects, they run one after another rather than side by side. Each connection uses a different warehouse, so when scaling I scale the warehouse for just that one flow.
 
But we are also preparing another initiative containing tens of projects with complex dependencies between them, and there it could happen that two flows try to resize the same warehouse at the same time. Here I agree it would be much better to allow switching to a different DW for each recipe than to allow scaling, because scaling could cause problems.
 
As far as I know, each recipe is a standalone Snowflake session, so allowing the developer to override the defined warehouse should work. How to do it? I can see multiple options. As @dan_fejtek stated, it could be pre/post-SQL. It could also be an option on the Advanced tab of the recipe (similar to how options are defined for Spark or Impala).
I can see another place as well: where you define the computation engine for the recipe, there is a small gear icon. After opening it, there could be the possibility to change the WH for that session.
 
And because each recipe is a standalone session, developers would not have to change it back, meaning the next recipe will again use the default warehouse defined at the connection level unless the developer changes it to something else. But I am not a UI/UX specialist 🙂 They will probably know better where to place such an option 🙂
 
Coming back to following up with me directly: no problem with that, you can reach me anytime. Actually, I had a call with some of your colleagues (Walter and Daniel) yesterday. It was about this: https://medium.com/@tomas.sobotik/pitfalls-when-using-snowflake-in-dataiku-dss-f26e890c8967
 
There I wrote down some other pitfalls I have faced during the last year. They told me you should already have the article in your Slack and are trying to solve some of them 🙂 They suggested I use this community to submit other ideas/comments, so that's why I created this topic 🙂
 
Cheers!
Tomas
 
JCR
Dataiker

Thanks Tomas. Yes, I've seen your blog 🙂 

dan_fejtek
Level 2

@JCR One more thought, which is relevant for using different warehouse sizes. At the moment, DSS will not allow you to use the "In-database (SQL)" engine unless all source and target datasets of a recipe use the same DSS connection. In a sense, DSS forces developers to use a single DSS connection for the whole Snowflake DB.

Since the warehouse is defined on the DSS connection, I can imagine that changing the warehouse could also be achieved by allowing the SQL engine across different DSS connections, provided the connections share the same Snowflake account and user (which means they have the same privileges on the source and target connections).

The warehouse to use could then be picked from any of the recipe's source or target connections.

For example: a staging connection points to the STAGING schema using an S warehouse, and an analytics connection points to the ANALYTICS schema using an XL warehouse. When you have a recipe loading data from staging to analytics, you could pick whether to use the pre-defined S or XL warehouse.

JCR
Dataiker

Absolutely! The concept of connection for Snowflake is quite different from traditional databases. My sense is we need to revisit how we define connections to Snowflake in order to allow the enhancements discussed in this thread. We're looking into this. 

dan_fejtek
Level 2

That is actually my aim: not to use ALTER WAREHOUSE but USE WAREHOUSE. We plan to use pre-defined warehouses of different sizes for different jobs.

ALTER WAREHOUSE is, in my opinion, not really usable in concurrent flows (using the same WH), only in sequential jobs, which might also be a valid case. So perhaps leaving it to the developer and just enabling pre/post-SQL would work just fine.

dan_fejtek
Level 2

@JCR I just found a very interesting workaround for running a recipe with a specific warehouse (different from the connection's default one). When you set up a recipe with all sources and the target and pick the "In-database (SQL)" engine, you can go to the output dataset, Settings > Advanced tab, where you have "Pre-write statements" and "Post-write statements" fields. You actually CAN put "USE WAREHOUSE xxx" there, and from quick testing, it seems to work.
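For reference, a minimal sketch of that workaround (the warehouse names WH_XL and WH_S are assumed here; WH_S stands in for the connection's default):

```sql
-- "Pre-write statements" field: run the write on a bigger warehouse
USE WAREHOUSE WH_XL;

-- "Post-write statements" field (optional): switch back to the default
USE WAREHOUSE WH_S;
```

Since each recipe runs in its own Snowflake session, the post-write statement is mostly for clarity; the next recipe starts from the connection's default warehouse anyway.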

rikro
Level 2

Hey, we're just trying out Snowflake, and your suggestion seems to work.

If the visual recipe is pushed down to Snowflake, both USE WAREHOUSE and ALTER WAREHOUSE work (so it is also possible to increase the size for a single visual recipe).

JCR
Dataiker

Sounds good, and please don't hesitate to open a thread when you find opportunities for improvement in the Snowflake integration.

BTW: with 8.0.3 we released a feature that allows you to select a warehouse using the advanced properties of the recipe. 

In the recipe variables you can add:

{
  "snowflake_warehouse": "name_of_the_warehouse"
}

rikro
Level 2

That's also pretty cool. Is there a comprehensive list of variables one could use?

Thanks and best regards,
Richard

JCR
Dataiker

Here you go:

  • snowflake_role
  • snowflake_warehouse
  • snowflake_schema
  • snowflake_database

These will be added to the documentation soon.

JCR
Dataiker

As I was working on the documentation, I realized some information is missing from this thread.

Specifically, the variables mentioned above need to be substituted in the connection definition. Here is what it looks like:

In the connection:

[Screenshot: Screen Shot 2021-01-15 at 8.43.32 AM.png]

In the recipe properties:

[Screenshot: Screen Shot 2021-01-15 at 8.44.53 AM.png]
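In text form, a plausible reading of the two screenshots (the exact field names are assumptions, inferred from the posts above) is:

```
Snowflake connection, "Warehouse" field:

    ${snowflake_warehouse}

Recipe advanced properties, variables:

    {
      "snowflake_warehouse": "name_of_the_warehouse"
    }
```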

As always, feedback welcome!

JC

rikro
Level 2

Thanks for the clarification! Where would one store the connection's default variables? In the DSS system variables?

I know this is getting a little off-topic, but what's the variable precedence?
System > (User?) > Project > (Scenario?) > Recipe?

 

JCR
Dataiker

Good question: reading left to right, where left supersedes right

Recipe > Scenario > Project > User > Global

For each of those artifacts there is a place where you can define the variables.
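For instance (warehouse names made up): if a project-level variable sets a small default warehouse and one recipe overrides it, the recipe value wins, because Recipe supersedes Project:

```
Project variables:  { "snowflake_warehouse": "WH_S" }
Recipe variables:   { "snowflake_warehouse": "WH_XL" }

=> That recipe runs on WH_XL; other recipes in the project fall back to WH_S.
```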

 

rikro
Level 2

Good to know, thanks!

But that means I cannot rely on variables for security, right? Imagine the following: security in Snowflake is controlled via a dynamic role, and I set RoleA in the user's settings.

But if the user knows there is a RoleB (or RoleC), they could just use any of the recipe, scenario, or project variables to override "my" security, correct?

JCR
Dataiker

The security defined in Snowflake always applies. If a user tries to assume a role that was not granted to them in Snowflake, they will get an error message when trying to run the flow.

Now, if a user has been granted multiple roles in Snowflake, they will be able to pick any of those roles to execute the flow. That is the intended behavior, just like in the Snowflake console.

Another situation: if a user has multiple roles in Snowflake but you don't want them to be able to switch roles in Dataiku, then you don't put the substitution variable in the Snowflake connection. If the role field is left blank, Dataiku will pick the user's default role; if the field has a fixed value, that role will always be used.

BTW, if you are in that last case, i.e. multiple roles defined in Snowflake for a user but you don't want them to be able to switch, I'm curious to understand why.

That's because my strategy right now is to make switching roles and context as easy as possible in Dataiku.

 

 

rikro
Level 2

Hi, thanks for the detailed explanation!

As I said, we're not using Snowflake in production; we're running a PoC to see whether it makes sense for the future.

We connect to most third-party systems from Dataiku using technical/system users, and then have multiple connections to the same database with different technical users to segregate data between user groups.

This is something I could do with Snowflake directly (i.e. 5 DSS groups => 5 DSS connections to Snowflake, all using the same technical user but with 5 manually entered roles, or even 5 system users again).

Since a lot of the connection attributes can be filled dynamically, I was wondering whether there is an option to use just one connection and let security be handled dynamically by Dataiku (similar to SQL Server user impersonation). Something like this could be possible if we created a Snowflake role per user, i.e. RL_USER1, RL_USER2, and then used RL_${dssUser} in the connection.
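A sketch of that role-per-user idea on the Snowflake side (role, database, and user names here are hypothetical):

```sql
-- One Snowflake role per DSS user, matching RL_${dssUser} in the connection
CREATE ROLE RL_USER1;
GRANT USAGE ON DATABASE ANALYTICS TO ROLE RL_USER1;  -- per-user privileges go here
GRANT ROLE RL_USER1 TO USER SVC_DATAIKU;             -- SVC_DATAIKU: assumed technical user
```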

Again, these are just some thoughts I had while connecting it; everything else I wanted to test (i.e. does it work at all) worked flawlessly together with DSS 😛

JCR
Dataiker

It's actually even easier than that, if I understood your scenario correctly:

You define users and give them a role in Snowflake.

Then you define only one DSS connection, with security set to 'per user'.

And you leave the role empty in the connection.

=> No parameter to define; Dataiku will use the default role associated with the user in Snowflake.

JC
