Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

Options
ecerulm
ecerulm Registered Posts: 45 ✭✭

I realized that Sampling > Random can only be performed with Engine: DSS , it can't be performed with Engine: In-Database (SQL) currently in 12.4.1.

And I wonder why, because it seems to me that they could just send `select * from input_dataset sample row (10000 rows) seed (1337)` to perform the sampling on the Snowflake side. Using the DSS engine requires downloading the dataset to DSS so if I want to sample 10000 rows out of 500M rows that is really inefficient. On the other side performing the sampling on snowflake is instantaneous.

I wrote following product idea https://community.dataiku.com/t5/Product-Ideas/Implement-Sampling-gt-Random-as-Engine-In-Database-SQL-for/idi-p/41152

But I wonder if anybody already knows why this is not implemented, I mean any technical reason why?


Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,744 Neuron
    Options

    Most likely because there is no consistency as to how you can retrieve random rows from all the supported database engines, if they support the feature at all.

  • ecerulm
    ecerulm Registered Posts: 45 ✭✭
    Options

    Aha, so the SQL support is generic among Snowflake, redshift, etc.?

    In any case I did a quick checks and SAMPLE / TABLESAMPLE is supported by many databases (postgress, apache impala, mysql, teradata, BigQuery). Hopefully it somebody at Dataiku can take a look and how to push-down sampling to the SQL engine for the specific databases that support it.

Setup Info
    Tags
      Help me…