Is there any reason why Sampling > Random cannot be performed In-Database for Snowflake?

Registered Posts: 47 ✭✭✭✭✭

I realized that Sampling > Random can only be performed with Engine: DSS , it can't be performed with Engine: In-Database (SQL) currently in 12.4.1.

And I wonder why, because it seems to me that they could just send `select * from input_dataset sample row (10000 rows) seed (1337)` to perform the sampling on the Snowflake side. Using the DSS engine requires downloading the dataset to DSS so if I want to sample 10000 rows out of 500M rows that is really inefficient. On the other side performing the sampling on snowflake is instantaneous.

I wrote following product idea https://community.dataiku.com/t5/Product-Ideas/Implement-Sampling-gt-Random-as-Engine-In-Database-SQL-for/idi-p/41152

But I wonder if anybody already knows why this is not implemented, I mean any technical reason why?


Answers

  • Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,390 Neuron

    Most likely because there is no consistency as to how you can retrieve random rows from all the supported database engines, if they support the feature at all.

  • Registered Posts: 47 ✭✭✭✭✭

    Aha, so the SQL support is generic among Snowflake, redshift, etc.?

    In any case I did a quick checks and SAMPLE / TABLESAMPLE is supported by many databases (postgress, apache impala, mysql, teradata, BigQuery). Hopefully it somebody at Dataiku can take a look and how to push-down sampling to the SQL engine for the specific databases that support it.

Welcome!

It looks like you're new here. Sign in or register to get started.

Welcome!

It looks like you're new here. Sign in or register to get started.