Snowflake

Note

You might want to start with our detailed tutorial for your first steps with SQL databases in DSS.

The rest of this page is reference information for Snowflake connections.

DSS supports the full range of features on Snowflake:

  • Reading and writing datasets

  • Executing SQL recipes

  • Performing visual recipes in-database

  • Using live engine for charts

The Snowflake JDBC driver is already preinstalled in DSS and does not need to be installed.

Connect using OAuth2

DSS can also access Snowflake using OAuth2.

OAuth2 access is performed using per-user credentials. Each user must grant DSS permission to access Snowflake on their behalf. You can use your own OAuth2 authorization server so the exact configuration will depend on your environment, but you will need to first set up your OAuth2 authorization server and then a security integration in Snowflake to establish a trust.

In your Snowflake security integration, the oauth_redirect_uri parameter should have the format DSS_BASE_URL/dip/api/oauth2-callback as described under OAuth2 credentials.

(See the official documentation for more details)

Once you have a trust between your OAuth2 authorization server and Snowflake, do in DSS the following:

  • Create a new Snowflake connection

  • Fill in the basic params (Host, Database, Warehouse, Role, Schema) as usual

  • Select “OAuth” as the “Auth Type”. Note that this will force you to use per-user credential.

  • Fill the “Client id”, “Client secret” (if there is one) with the information from your OAuth app

  • Fill the “authorization endpoint” and “token endpoint” with your endpoint. Or leave them blank if you are using the default Snowflake OAuth2 server

  • Fill the scope with the operations and roles permitted for the access token (This depends on your OAuth Server so check the official doc). Or if you do not want to manage Snowflake roles in your OAuth2 server, pass the static value of SESSION:ROLE-ANY in the scope.

  • Depending on your OAuth2 server you may also need to ask for a refresh token in the scope. This is usually done by adding offline_access to the scope but that may vary depending on your OAuth2 provider.

  • Create the connection (you can’t test it yet)

Then for each user:

  • Go to user profile > credentials

  • Click the “Edit” button next to the new connection name

  • Follow the instructions that appear

  • You can now test the connection

Common errors

  • Problem: OAuth2 token request did not return a refresh token.

  • Solution: Be sure that you ask for an OAuth 2.0 refresh token in the scope. This is usually done by adding offline_access to the scope but it may vary depending on your OAuth2 provider.

Writing data into Snowflake

Loading data into a Snowflake database using the regular SQL INSERT or COPY statements is very inefficient and should only be used for extremely small datasets.

The recommended way to load data into a Snowflake table is through a bulk COPY from files stored in Amazon S3 or Azure Blob Storage. DSS will automatically use optimal S3-to-Snowflake or AzureBlob-to-Snowflake copy mechanisms when using a Sync recipe. For more information, see below.

The exception to this rule is when executing a recipe from a Snowflake dataset to a Snowflake dataset using the in-database (SQL) engine.

In other words:

  • You should never have a Flow with a recipe that writes from a non-Snowflake source to a Snowflake dataset, except sync recipes from S3 or Azure Blob Storage

  • “S3 to Snowflake” or “Azure Blob to Snowflake” recipes should only be the “Sync” recipe

  • Snowflake to Snowflake recipes will be fast if and only if the “In-database (SQL)” engine is selected.

For example, if you have a table in Snowflake, and want to use a recipe that does not have an “In-database (SQL) engine”, you should instead use two steps: * A first Snowflake-to-cloud-storage (S3 or Azure Blob) recipe * A cloud-storage-to-Snowflake sync recipe

Syncing to/from S3

When running a Sync recipe where the source is S3 and the target is Snowflake, DSS will automatically use a bulk COPY from files stored in Amazon S3, which is the efficient way to load data into Snowflake.

Due to Snowflake restrictions, the following set of conditions must be met for a Sync recipe to be executed as direct copy:

  • The source dataset must be stored on Amazon S3 and the destination dataset must be stored on Snowflake, or vice-versa

  • The S3 bucket and the Snowflake cluster must be in the same Amazon AWS region

  • The S3 side must be stored with a CSV format, using the UTF-8 charset

  • The CSV separator must be a simple ASCII character

  • For the S3 side, the “Unix” CSV quoting style is not supported

  • For the S3 side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Snowflake column type.

Note that an intermediate Prepare recipe can be used to normalize an input S3 dataset to the above conditions, if the dataset one wants to load into Snowflake does not natively fulfill them.

Syncing to/from Azure Blob Storage

When running a Sync recipe where the source is Azure Blob Storage and the target is Snowflake, DSS will automatically use a bulk COPY from files stored in Azure Blob Storage, which is the efficient way to load data into Snowflake.

Due to Snowflake restrictions, the following set of conditions must be met for a Sync recipe to be executed as direct copy:

  • The source dataset must be stored on Azure Blob Storage and the destination dataset must be stored on Snowflake, or vice-versa

  • The storage account and the Snowflake cluster must be in the same Azure region

  • The blob storage side must be stored with a CSV format, using the UTF-8 charset

  • The CSV separator must be a simple ASCII character

  • For the blob storage side, the “Unix” CSV quoting style is not supported

  • For the blob storage side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Snowflake column type.

Note that an intermediate Prepare recipe can be used to normalize an input Azure Blob Storage dataset to the above conditions, if the dataset one wants to load into Snowflake does not natively fulfill them.

Installation and configuration

In order to take advantage of the native integration, generate a SAS Token, then save the token in the Azure Blob Storage connection settings in the “SAS Token” field.

Syncing to/from Google Cloud Storage

When running a Sync recipe where the source is Google Cloud Storage and the target is Snowflake, DSS will automatically use a bulk COPY from files stored in Google Cloud Storage, which is the efficient way to load data into Snowflake.

Due to Snowflake restrictions, the following set of conditions must be met for a Sync recipe to be executed as direct copy:

  • The source dataset must be stored on Google Cloud Storage and the destination dataset must be stored on Snowflake, or vice-versa

  • The storage side must be stored with a CSV format, using the UTF-8 charset

  • The CSV separator must be a simple ASCII character

  • For the storage side, the “Unix” CSV quoting style is not supported

  • For the storage side, the “Escaping only” CSV quoting style is only supported if the quoting character is \

Additionally, the schema of the input dataset must match the schema of the output dataset, and values stored in fields must be valid with respect to the declared Snowflake column type.

Note that an intermediate Prepare recipe can be used to normalize an input Google Cloud Storage dataset to the above conditions, if the dataset one wants to load into Snowflake does not natively fulfill them.

Installation and configuration

In order to take advantage of the native integration, create a Snowflake storage integration, then save the storage integration name in the GCS connection settings in the “Snowflake storage integration” field.

Spark native integration

The native integration with Spark allows Spark recipes reading from and/or writing to Snowflake datasets to directly exchange data with a Snowflake database. This results in a large increase in performance compared to the default method where data read from or written to Snowflake must be streamed through DSS first.

All Spark recipes which have a Snowflake dataset as an input or an output will automatically take advantage of the native integration with Snowflake once the installation is completed.

Installation and configuration

The Spark native integration is preinstalled and does not need to be installed.

For each Snowflake connection, you will need to check the “Use Spark native integration” box in the Snowflake connection settings.

Switching Role and Warehouse

Snowflake allows users to execute SQL operations while assuming a specific user Role and using a specific Warehouse.

In Dataiku, users have the flexibility to assume a different Role and use a different Warehouse for each recipe of the flow through defining connection variables for:

  • Role

  • Warehouse

How to set it up

1. Parameterize the Snowflake connection

In the definition of the Snowflake connection, substitute the desired parameters by variables. Here, our Roles and Warehouses in Snowflake are uppercase so we capitalize the parameters:

../../_images/snowflake-connection-with-variables.png

2. Define instance-level variables

Defining instance-level variables helps prevent cases where the user will not define a value for the parameters. Global variables will be used as default values.

../../_images/snowflake-global-variables.png

3. Override the variables where needed

This can be recursively done, overriding parameters in the order shown here (where left overrides right): Recipe > Scenario > Project > User > Global

A typical use case involves overloading the Warehouse at the project level:

../../_images/snowflake-project-variables.png

For some use cases it is also possible to overload the Warehouse at the recipe level:

../../_images/snowflake-recipe-variables.png

Another use case involves overloading the Role at the user level. User properties are prefixed via userProperty: so you would need to use the syntax ${userProperty:snowflake_role} instead of ${snowflake_role} when parameterizing the Snowflake connection, and the global variable would be named userProperty:snowflake_role. Then each user would be able to override the role by adding a property snowflake_role to their profile:

../../_images/snowflake-connection-with-variables-user.png ../../_images/snowflake-user-variables.png

Limitations and known issues

Visual recipes

  • Sample/Filter: “contains” operator is not supported

  • Group: aggregating on booleans is not supported

  • Window: cumulative average is not supported (Snowflake limitation)

  • Prepare: in the formula processor, when providing a date parameter to a function, the displayed date might differ between the preview and a real SQL query. It is possible to set the format by setting Advanced JDBC properties in the connection, like TIMESTAMP_OUTPUT_FORMAT for timestamp. See https://docs.snowflake.com/en/user-guide/date-time-input-output.html#output-formats for more information.

Coding recipes

  • Execution plans are not supported

Spark native integration

  • Writing to Snowflake datasets with column names that contain periods is not supported (Snowflake Connector for Spark limitation)

  • The Snowflake Connector for Spark only supports Spark versions 2.2, 2.3, and 2.4

Breaking changes

8.0.2 -> 8.0.3:

  • All columns of type TIMESTAMP_NTZ are now stored in DSS as String, unless users have checked the option Read SQL timestamp without timezone as DSS dates

  • When the Snowflake session TIMESTAMP_TYPE_MAPPING consider TIMESTAMP to be TIMESTAMP_NTZ, all columns of type TIMESTAMP are now stored in DSS as String, unless users have checked the option Read SQL timestamp without timezone as DSS dates

Advanced install of the JDBC driver

The Snowflake JDBC driver is already preinstalled in DSS and does not usually need to be installed. If you need to customize the JDBC driver, followi the instructions

The Snowflake JDBC driver can be downloaded from Snowflake website (https://docs.snowflake.net/manuals/user-guide/jdbc.html)

The driver is made of a single JAR file snowflake-jdbc-VERSION.jar

To install:

  • Copy this JAR file to the lib/jdbc/snowflake subdirectory of the DSS data directory (make it if necessary)

  • Restart DSS

  • In each Snowflake connection, switch the driver mode to “User provided” and enter “lib/jdbc/snowflake” as the Snowflake driver directory

Spark integration

The Spark native integration is preinstalled and does not usually need to be installed. Only follow these steps for custom installations.

Two external JAR files are required to be installed in DSS

  • The same Snowflake JDBC driver mentioned above

  • The Snowflake Connector for Spark (provided by Snowflake). It can be downloaded directly from Maven under the spark-snowflake_2.11 artifact ID (https://search.maven.org/classic/#search%7Cga%7C1%7Cg%3A%22net.snowflake%22). Make sure to choose the correct version of the connector for the version of Spark installed (only versions 2.2-2.4 are supported).

Note

The version of the Snowflake Connector JAR must be compatible with the version of the Snowflake JDBC driver. For example, an older version of the JDBC driver combined with the latest version of the connector may cause errors with confusing exception messages when running jobs. We recommend using the latest version of each JAR.

  • Copy both of these JAR files to the lib/java subdirectory of the DSS data directory (even if the JDBC driver has already been placed in lib/jdbc as described above)

  • Restart DSS

  • Check the “Use Spark native integration” box in the Snowflake connection settings. This checkbox can also be used to toggle the integration on and off without having to remove the JARs.