Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

ecerulm
Level 4
Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

I realized that Sampling > Random  can only be performed with Engine: DSS , it can't be performed with Engine: In-Database (SQL) currently in 12.4.1. 

And I wonder why, because it seems to me that they could just send `select * from input_dataset sample row (10000 rows) seed (1337)` to perform the sampling on the Snowflake side. Using the DSS engine requires downloading the dataset to DSS so if I want to sample 10000 rows out of 500M rows that is really inefficient. On the other side performing the sampling on snowflake is instantaneous. 

I wrote  following product idea https://community.dataiku.com/t5/Product-Ideas/Implement-Sampling-gt-Random-as-Engine-In-Database-SQ...  

But I wonder if anybody already knows why this is not implemented, I mean any technical reason why? 


0 Kudos
2 Replies
Turribeach

Most likely because there is no consistency as to how you can retrieve random rows from all the supported database engines, if they support the feature at all. 

0 Kudos
ecerulm
Level 4
Author

Aha, so the SQL support is generic among Snowflake, redshift, etc.? 

In any case I did a quick checks and  SAMPLE / TABLESAMPLE is supported by many databases (postgress, apache impala, mysql, teradata, BigQuery). Hopefully it somebody at Dataiku can take a look and how to push-down sampling to the SQL engine for the specific databases that support it.  

0 Kudos