Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

ecerulm · February 2024

I realized that Sampling > Random can only be performed with Engine: DSS , it can't be performed with Engine: In-Database (SQL) currently in 12.4.1.

And I wonder why, because it seems to me that they could just send `select * from input_dataset sample row (10000 rows) seed (1337)` to perform the sampling on the Snowflake side. Using the DSS engine requires downloading the dataset to DSS so if I want to sample 10000 rows out of 500M rows that is really inefficient. On the other side performing the sampling on snowflake is instantaneous.

I wrote following product idea https://community.dataiku.com/t5/Product-Ideas/Implement-Sampling-gt-Random-as-Engine-In-Database-SQL-for/idi-p/41152

But I wonder if anybody already knows why this is not implemented, I mean any technical reason why?

Turribeach · February 2024

Most likely because there is no consistency as to how you can retrieve random rows from all the supported database engines, if they support the feature at all.

ecerulm · February 2024

Aha, so the SQL support is generic among Snowflake, redshift, etc.?

In any case I did a quick checks and SAMPLE / TABLESAMPLE is supported by many databases (postgress, apache impala, mysql, teradata, BigQuery). Hopefully it somebody at Dataiku can take a look and how to push-down sampling to the SQL engine for the specific databases that support it.

Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

Answers

Welcome!

Welcome!

Quick Links

Categories

Sign up to take part

Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

Answers

Welcome!

Welcome!

Quick Links

Categories