Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

ecerulm
ecerulm Registered Posts: 45 ✭✭✭✭✭

I realized that Sampling > Random can only be performed with Engine: DSS , it can't be performed with Engine: In-Database (SQL) currently in 12.4.1.

And I wonder why, because it seems to me that they could just send `select * from input_dataset sample row (10000 rows) seed (1337)` to perform the sampling on the Snowflake side. Using the DSS engine requires downloading the dataset to DSS so if I want to sample 10000 rows out of 500M rows that is really inefficient. On the other side performing the sampling on snowflake is instantaneous.

I wrote following product idea https://community.dataiku.com/t5/Product-Ideas/Implement-Sampling-gt-Random-as-Engine-In-Database-SQL-for/idi-p/41152

But I wonder if anybody already knows why this is not implemented, I mean any technical reason why?


Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    Most likely because there is no consistency as to how you can retrieve random rows from all the supported database engines, if they support the feature at all.

  • ecerulm
    ecerulm Registered Posts: 45 ✭✭✭✭✭

    Aha, so the SQL support is generic among Snowflake, redshift, etc.?

    In any case I did a quick checks and SAMPLE / TABLESAMPLE is supported by many databases (postgress, apache impala, mysql, teradata, BigQuery). Hopefully it somebody at Dataiku can take a look and how to push-down sampling to the SQL engine for the specific databases that support it.

Setup Info
    Tags
      Help me…