set sample for dataset on v12

Solved!
NN

Hi,

One of the challenges I am facing with V12 is reducing the sample size for datasets that take a long time to load 10K records.
https://doc.dataiku.com/dss/latest/explore/sampling.html#sampling-in-explore

By default, the sample size for datasets is 10K records.
In some situations I have datasets that are very large, i.e. too many columns, so loading a 10K-record sample takes a long time.
With the earlier version of Dataiku I could just abort the sample loading, go to the section of the dataset where we define the sample, and reduce the sample size to 100 or something lower, which ensured that the sample loaded quickly.
But on V12, when I abort the loading step it does not show the option to edit the sample.
I have to wait for the whole 10K-record sample to be built before I can go in and change the sample size.

Hence my question: does anyone have a better way to define a sample size for a dataset without having to wait for 10K records to be read?

4 Replies
SarinaS
Dataiker

Hi @NN,

One option, if this comes up with some frequency, is to adjust the maximum memory setting for dataset samples in your project(s) under project Settings > Resources Control:

[Screenshot: project Settings > Resources Control, showing the dataset sample memory limit]

By default the soft memory limit is 100 MB. If you are finding that is too high for some of your datasets to be performant, lowering it will lower the default memory used for newly-created datasets (existing datasets will keep the original default max memory usage setting).
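
For reference, here is a rough sketch of adjusting that limit programmatically through the public Python API. The host, API key, and project key are placeholders, and the exact key for the sample memory limit in the raw settings dict is my assumption ("memSampleLimit" below is hypothetical) — print settings.get_raw() on your instance to find the real one:

import dataikuapi

# Connect to the DSS instance (host/port and API key are placeholders)
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Fetch the project settings and inspect the raw dict to locate the
# resources-control section for your DSS version
settings = project.get_settings()
raw = settings.get_raw()
print(raw)

# HYPOTHETICAL key name -- replace with the real one found in raw above
raw.setdefault("limits", {})["memSampleLimit"] = 20  # soft memory limit in MB

settings.save()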

Thanks,
Sarina

NN
Author

Hi @SarinaS,

The project memory setting is a good option; it will definitely help in most cases.
But I think this may not be working with Snowflake datasets: it seems that even if we keep a reduced MB limit, the process will still try to get 10K records from Snowflake using

select * from table limit 10001

Would you know if there is a way to limit the record number as well?

SarinaS
Dataiker

Hi @NN,

Unfortunately there isn't a way to configure a default sample size; only the memory can be configured.

That said, while the query still executes with LIMIT 10001, the number of rows loaded into Explore will still be truncated once your max memory limit is reached:

[Screenshot: the Explore sample showing fewer than 10,000 rows loaded after hitting the memory limit]

And this should be reflected in the backend.log file as well:

[2023/10/13-18:12:17.143] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  - [2023/10/13-18:12:17.143] [FT-BuildSampleFutureThread-Nmy9ZsBX-108] [INFO] [dku.datasets.sql]  - [ct: 208] Executing query SELECT *
[2023/10/13-18:12:17.143] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  -   FROM "INTEGRATION_TESTS"."SUPPORT"."newimporttoold_DATE_PARTITIONED"
[2023/10/13-18:12:17.143] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  -   LIMIT 10001 (st=net.snowflake.client.jdbc.SnowflakeStatementV1@1af5c637)
...
[2023/10/13-18:12:17.253] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  - [2023/10/13-18:12:17.253] [FT-BuildSampleFutureThread-Nmy9ZsBX-108] [INFO] [dku.datasets.sql]  - [ct: 318] Done executing query (st=net.snowflake.client.jdbc.SnowflakeStatementV1@1af5c637)
...
[2023/10/13-18:12:17.566] [KNL-FEK-3O7jgUCI-err-1320] [INFO] [dku.utils]  - com.dataiku.dip.datalayer.memimpl.MemTableAppendingOutput$MemTableSizeLimitReachedException: The first 1915 rows already used 1 MB
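
As a rough rule of thumb, that last log line lets you back out how much memory a given row count would need, assuming your rows are all of similar width — a quick back-of-envelope sketch:

# Back-of-envelope sizing from the truncation point seen in the log:
# 1915 rows hit a 1 MB cap, so estimate bytes/row and scale up.
rows_at_limit = 1915
limit_mb = 1
bytes_per_row = limit_mb * 1024**2 / rows_at_limit   # ~548 B/row
needed_mb = 10_000 * bytes_per_row / 1024**2         # ~5.2 MB for 10K rows
print(f"~{bytes_per_row:.0f} B/row -> ~{needed_mb:.1f} MB for 10,000 rows")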


But indeed, you might want to raise this as a feature request to be able to configure the default/max sample size as well as the memory limit.
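
In the meantime, one possible workaround (a sketch only, untested on my side, with placeholder credentials and object names) is to point a separate DSS dataset at a small Snowflake view, so that the LIMIT 10001 query only ever sees a handful of rows:

# Sketch: create a tiny view for DSS to sample from quickly.
# All names and credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="MY_WH", database="INTEGRATION_TESTS", schema="SUPPORT",
)
# SAMPLE (100 ROWS) returns a fixed-size random sample of the table
conn.cursor().execute(
    'CREATE OR REPLACE VIEW "my_table_SAMPLE" AS '
    'SELECT * FROM "my_table" SAMPLE (100 ROWS)'
)
conn.close()

DSS's sample query then runs against the view and returns at most 100 rows, regardless of the memory limit.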

Thank you!
Sarina

NN
Author

Thank you, Sarina, for looking into this.
I shall raise a feature request for this.
