Sampling from Millions of Rows into Dataframe

I have a plugin that outputs a SQL table via SQL Executor, and I want to run a webapp in a dashboard on that output. The table contains 10+ million records, and I can't sample it without a full pass of the dataset through DSS. What's the best way to get a random sample without explicitly creating another "sample table" with a SQL recipe?

I expect this is not currently possible; I'm just wondering whether you could think of a workaround.

1 Reply


If your database supports native random sampling without a full scan, you could use SQLExecutor2 to run the sampling query directly in the database and load only the sample into a dataframe. However, if you want a "stable" sample (the same rows every time), creating a sampled table in the Flow would probably be a better idea.
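
For illustration, here is a minimal sketch of the first option. It assumes the table lives in PostgreSQL, where TABLESAMPLE SYSTEM picks random blocks rather than scanning every row; other databases use different syntax, so check yours. The helper names (build_sample_query, load_sample), the dataset name, and the table name are hypothetical. The optional REPEATABLE seed returns the same sample only while the table itself is unchanged, so it is not a substitute for a sampled table in the Flow.

```python
def build_sample_query(table, percent=1.0, seed=None):
    """Build a block-level sampling query (PostgreSQL syntax, an assumption).

    `table` is inserted as-is, so pass only a trusted, already-quoted name.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    query = f"SELECT * FROM {table} TABLESAMPLE SYSTEM ({percent})"
    if seed is not None:
        # Same seed + unchanged table -> same sample on each call.
        query += f" REPEATABLE ({int(seed)})"
    return query


def load_sample(dataset_name, table, percent=1.0, seed=None):
    """Run the sampling query in the database and return a pandas DataFrame."""
    # Imported here so the query builder above also works outside DSS.
    import dataiku
    from dataiku import SQLExecutor2

    # Using the dataset only to pick the SQL connection it is stored on.
    executor = SQLExecutor2(dataset=dataiku.Dataset(dataset_name))
    return executor.query_to_df(build_sample_query(table, percent, seed))
```

In a webapp backend this might be called as load_sample("plugin_output", '"public"."plugin_output"', percent=0.5, seed=42), where both names are placeholders for your own dataset and table. Note that block-level sampling returns an approximate fraction of rows, so the sample size varies somewhat between tables.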