Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Added on February 23, 2024 12:00PM
Likes: 0
Replies: 2
I realized that Sampling > Random can only be performed with Engine: DSS , it can't be performed with Engine: In-Database (SQL) currently in 12.4.1.
And I wonder why, because it seems to me that they could just send `select * from input_dataset sample row (10000 rows) seed (1337)` to perform the sampling on the Snowflake side. Using the DSS engine requires downloading the dataset to DSS so if I want to sample 10000 rows out of 500M rows that is really inefficient. On the other side performing the sampling on snowflake is instantaneous.
I wrote following product idea https://community.dataiku.com/t5/Product-Ideas/Implement-Sampling-gt-Random-as-Engine-In-Database-SQL-for/idi-p/41152
But I wonder if anybody already knows why this is not implemented, I mean any technical reason why?
Most likely because there is no consistency as to how you can retrieve random rows from all the supported database engines, if they support the feature at all.
Aha, so the SQL support is generic among Snowflake, redshift, etc.?
In any case I did a quick checks and SAMPLE / TABLESAMPLE is supported by many databases (postgress, apache impala, mysql, teradata, BigQuery). Hopefully it somebody at Dataiku can take a look and how to push-down sampling to the SQL engine for the specific databases that support it.