set sample for dataset on v12

NN
NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron

Hi,

One of the challenges I am facing with V12 is reducing the sample size for datasets that take a long time to load 10K records.
https://doc.dataiku.com/dss/latest/explore/sampling.html#sampling-in-explore

By default, the sample size for datasets is 10K records.
In some situations I have very large datasets (i.e. too many columns), so loading 10K records as a sample takes a long time.
With the earlier version of Dataiku I could just abort the sample loading, go to the section of the dataset where the sample is defined, and reduce the sample size to 100 or something lower, which ensured that the sample loaded quickly.
But on V12, when I abort the loading step, it does not show the option to edit the sample.
I have to wait for the whole 10K-record sample to be built before I can go in and change the sample size.

Hence my question: does anyone have a better way to define the sample size for a dataset without having to wait for 10K records to be read?

Best Answer

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    edited July 17 Answer ✓

    Hi @NN,

    Unfortunately, there isn't a way to configure a default sample size; only the memory limit can be configured.

    That said, while the query is still executed with LIMIT 10001, the sample displayed in Explore is truncated once your maximum memory size is reached:

    Screenshot 2023-10-13 at 6.11.29 PM.png

    And this should be reflected in the backend.log file as well:

    [2023/10/13-18:12:17.143] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  - [2023/10/13-18:12:17.143] [FT-BuildSampleFutureThread-Nmy9ZsBX-108] [INFO] [dku.datasets.sql]  - [ct: 208] Executing query SELECT *
    [2023/10/13-18:12:17.143] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  -   FROM "INTEGRATION_TESTS"."SUPPORT"."newimporttoold_DATE_PARTITIONED"
    [2023/10/13-18:12:17.143] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  -   LIMIT 10001 (st=net.snowflake.client.jdbc.SnowflakeStatementV1@1af5c637)
    ...
    [2023/10/13-18:12:17.253] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  - [2023/10/13-18:12:17.253] [FT-BuildSampleFutureThread-Nmy9ZsBX-108] [INFO] [dku.datasets.sql]  - [ct: 318] Done executing query (st=net.snowflake.client.jdbc.SnowflakeStatementV1@1af5c637)
    ...
    [2023/10/13-18:12:17.566] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  - com.dataiku.dip.datalayer.memimpl.MemTableAppendingOutput$MemTableSizeLimitReachedException: The first 1915 rows already used 1 MB
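
To check how often samples are being cut short by the memory limit, the exception message above can be pulled out of backend.log with a short script. This is only a sketch: the regular expression is inferred from the single log line quoted here, so other DSS versions may word the message differently, and the names `LIMIT_RE` and `find_truncated_samples` are hypothetical helpers, not part of any Dataiku API.

```python
import re

# Pattern inferred from the one log line quoted in this thread (assumption):
# "...MemTableSizeLimitReachedException: The first 1915 rows already used 1 MB"
LIMIT_RE = re.compile(
    r"MemTableSizeLimitReachedException: The first (?P<rows>\d+) rows "
    r"already used (?P<mb>\d+) MB"
)


def find_truncated_samples(log_lines):
    """Return a (rows, megabytes) tuple for each line reporting a truncated sample."""
    hits = []
    for line in log_lines:
        match = LIMIT_RE.search(line)
        if match:
            hits.append((int(match.group("rows")), int(match.group("mb"))))
    return hits


# The log line from the answer above, split across string literals for readability.
sample_log = [
    "[2023/10/13-18:12:17.566] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  - "
    "com.dataiku.dip.datalayer.memimpl.MemTableAppendingOutput$"
    "MemTableSizeLimitReachedException: The first 1915 rows already used 1 MB",
]

print(find_truncated_samples(sample_log))
```

In practice the list of lines would come from reading the backend.log file itself; the function accepts any iterable of strings, including an open file handle.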
    


    But indeed, you might want to raise this as a feature request to be able to configure the default/max sample size as well as the memory limit.

    Thank you!
    Sarina 

Answers

  • Sarina

    Hi @NN,

    If this comes up frequently, one option is to adjust the maximum memory setting for dataset samples in your project(s) under project Settings > Resources Control:

    Screenshot 2023-10-12 at 5.38.25 PM.png

    By default, the soft memory limit is 100 MB. If you find that this is too high for some of your datasets to perform well, lowering it will lower the default memory used for newly created datasets (existing datasets will keep the original default max memory usage setting).
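
As a rough guide to picking a memory limit, the figure from the log line quoted in the accepted answer ("The first 1915 rows already used 1 MB") can be scaled to estimate how many rows fit under a given limit, and whether the memory limit or the 10K row cap is reached first. This is back-of-the-envelope arithmetic only: it assumes memory grows linearly with row count, the "1 MB" in the log is rounded, and `rows_that_fit` is a hypothetical helper rather than anything DSS provides.

```python
def rows_that_fit(memory_limit_mb, rows_seen, mb_used):
    """Estimate how many rows fit in memory_limit_mb, assuming linear scaling."""
    return int(memory_limit_mb * rows_seen / mb_used)


ROW_CAP = 10_000  # default Explore sample size mentioned in the question

for limit_mb in (1, 5, 100):
    # 1915 rows per 1 MB is taken from the log line in the accepted answer.
    fit = rows_that_fit(limit_mb, rows_seen=1915, mb_used=1)
    binding = "memory limit" if fit < ROW_CAP else "row cap"
    print(f"{limit_mb} MB -> ~{fit} rows fit; sample bounded by the {binding}")
```

For that particular table, the default 100 MB limit would never be reached before the row cap; a wider dataset uses more memory per row, so the memory limit takes effect sooner.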

    Thanks,
    Sarina 


  • NN

    Hi @SarinaS

    The project MB setting is a good option. It will definitely help in most cases.
    But I think this may not be working with Snowflake datasets.
    It seems that even if we keep a reduced MB limit, the process will still try to get 10K records from Snowflake using
    select * from table limit 10001
    Would you know if there is a way to limit the number of records as well?

  • NN

    Thank you Sarina for looking into this.
    I shall raise a request for this.
