Prepare Step Schema and random samples

awaldron Registered Posts: 2 ✭✭✭✭

In Prepare steps, we've run into trouble when the schema is auto-inferred from the first 10,000 rows of a dataset. To get around this, we switch the sample to one of the random sampling methods. This works, but some of these datasets are huge, and auto-generating the random sample consumes a lot of resources.

Is there a way to have the first 10,000 rows be a preloaded random sample, with the rest of the dataset following as normal? That way Dataiku only needs to read the first 10,000 rows but still gets a representative sample, without recomputing a sample every time the recipe is opened.
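To make the idea in the question concrete, here is a minimal sketch in plain Python of reordering rows once so that a random sample sits at the top. The function name `put_random_sample_first` and its parameters are hypothetical and not part of any Dataiku API. It also assumes the storage layer preserves row order on read, which many SQL tables do not guarantee.

```python
import random

def put_random_sample_first(rows, k, seed=0):
    """Return rows reordered so that k randomly chosen rows come first.

    The remaining rows keep their original relative order.
    Hypothetical helper for illustration; not a Dataiku function.
    """
    rng = random.Random(seed)  # fixed seed: the same rows are chosen on every run
    k = min(k, len(rows))
    chosen = set(rng.sample(range(len(rows)), k))
    head = [rows[i] for i in sorted(chosen)]
    tail = [row for i, row in enumerate(rows) if i not in chosen]
    return head + tail

# Small demonstration with 100 rows and a 10-row sample.
rows = list(range(100))
reordered = put_random_sample_first(rows, k=10, seed=42)
```

If the reordered data is written out once, a "first records" sample would then read the pre-shuffled head instead of recomputing a random sample each time, provided the order survives storage.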


  • tgb417 Neuron Posts: 1,595


    I don't know if this will work for you or not. Could you create a column called sample, and pre-place a true value on 10,000 random rows of your choice? Then, on the sampling screen, use the filter option to select only the predefined "sample" records. I imagine this would remove much of the computation needed to find the "random" sample. The compute would also be pushed to the database engine, which may be more efficient at finding 10,000 useful records than DSS's two-pass, on-the-fly method of drawing a new "random" sample every time you refresh. You would still have the option to go back to the standard method if you liked. You could also use different values in this column for different types of samples. For efficiency, I'd keep these values as Int, which should be the easiest type for most data repositories to look up efficiently.
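The flag-column idea above can be sketched as follows. This is an illustration in plain Python, not Dataiku code: the helper `flag_random_sample` is hypothetical, the column name `sample` follows the suggestion in the post, and in practice the same flag could be computed once in a recipe or directly in the database.

```python
import random

def flag_random_sample(rows, k, column="sample", seed=0):
    """Return copies of rows with an integer flag: 1 for k random rows, else 0.

    Hypothetical helper for illustration; not a Dataiku function.
    """
    rng = random.Random(seed)  # fixed seed keeps the flagged rows stable
    k = min(k, len(rows))
    chosen = set(rng.sample(range(len(rows)), k))
    return [{**row, column: 1 if i in chosen else 0}
            for i, row in enumerate(rows)]

# Flag 100 of 1,000 rows, then select them with a simple equality filter.
rows = [{"id": i} for i in range(1000)]
flagged = flag_random_sample(rows, k=100)
sample_rows = [r for r in flagged if r["sample"] == 1]
```

Using other integer values in the same column (2, 3, and so on) would allow several predefined samples to coexist, as the post suggests.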

    Just my $0.02 for the evening. Let us know how you get on with this. Others, please jump in if you have any smart ideas about this.


    P.S. You might also consider partitioning your dataset, and engineer a partition as your "sample" data. (This feature is not available on the community edition of the tool.)
