Sampling from Millions of Rows into Dataframe

gblack686 Partner, Registered Posts: 62 Partner

I have a plugin that outputs a SQL table via SQL Executor, and I want to run a webapp in a dashboard on the output. The table contains 10+ million records, and I can't sample it without a full pass of the dataset through DSS. What's the best way to get a random sample without explicitly creating another "sample table" with a SQL recipe?

I expect this is not currently possible; I'm just wondering whether you could think of a workaround.

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker

    Hi,

    If your database supports some form of native random sampling that does not require a full scan, you can try using SQLExecutor2 to run the sampling query in the database and load only the sample. However, assuming that you want a "stable" sample (the same rows on every load), creating a sampled table in the Flow would probably be a better idea.
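
As a rough illustration of the first option, here is a minimal sketch. It assumes a PostgreSQL-style database that supports `TABLESAMPLE SYSTEM` (other databases use different syntax, or none at all). The helper names `build_sample_query` and `load_sample`, and the dataset and table names in the usage note, are hypothetical. Only `SQLExecutor2` and its `query_to_df` method come from the Dataiku Python API.

```python
def build_sample_query(table, percent, seed=None):
    """Build a PostgreSQL-style block-sampling query (assumed dialect).

    TABLESAMPLE SYSTEM picks whole storage blocks, so the database does not
    scan every row. The sample size is approximate, not exact.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # The table name is inserted as-is: only pass trusted identifiers.
    query = "SELECT * FROM {} TABLESAMPLE SYSTEM ({})".format(table, percent)
    if seed is not None:
        # REPEATABLE returns the same rows as long as the table is unchanged.
        query += " REPEATABLE ({})".format(int(seed))
    return query


def load_sample(dataset_name, table, percent=1, seed=None):
    """Run the sampling query in the database and return a pandas DataFrame."""
    # Imported lazily so the query builder can also be used outside DSS.
    import dataiku
    from dataiku import SQLExecutor2

    # The executor uses the SQL connection of the given dataset.
    executor = SQLExecutor2(dataset=dataiku.Dataset(dataset_name))
    return executor.query_to_df(build_sample_query(table, percent, seed))
```

In the webapp backend this could be called as, for example, `df = load_sample("plugin_output", "plugin_output_table", percent=1, seed=42)`. Passing a seed makes the sample repeatable only while the underlying table is unchanged, so a sampled table in the Flow is still the safer choice when stability matters.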
