Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

ecerulm · ‎02-23-2024

I realized that Sampling > Random can only be performed with Engine: DSS , it can't be performed with Engine: In-Database (SQL) currently in 12.4.1.

And I wonder why, because it seems to me that they could just send `select * from input_dataset sample row (10000 rows) seed (1337)` to perform the sampling on the Snowflake side. Using the DSS engine requires downloading the dataset to DSS so if I want to sample 10000 rows out of 500M rows that is really inefficient. On the other side performing the sampling on snowflake is instantaneous.

I wrote following product idea https://community.dataiku.com/t5/Product-Ideas/Implement-Sampling-gt-Random-as-Engine-In-Database-SQ...

But I wonder if anybody already knows why this is not implemented, I mean any technical reason why?

Turribeach · ‎02-23-2024

Most likely because there is no consistency as to how you can retrieve random rows from all the supported database engines, if they support the feature at all.

ecerulm · ‎02-26-2024

Aha, so the SQL support is generic among Snowflake, redshift, etc.?

In any case I did a quick checks and SAMPLE / TABLESAMPLE is supported by many databases (postgress, apache impala, mysql, teradata, BigQuery). Hopefully it somebody at Dataiku can take a look and how to push-down sampling to the SQL engine for the specific databases that support it.

Sign up to take part

Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

Setup info