Cannot select whole dataset sampling option

Doga · June 11

Hi,

I don't see the "Whole Dataset" option when I go to the Sample tab that opens on the left side of the screen. Please see the attached screenshot. I used to see it, and I don't know what has changed. Any ideas?

Thanks,

Operating system used: Windows

Ignacio_Toledo · June 11

Hi @DogaS,

As far as I recall, there has never been an option to select the "Whole Dataset" in the Explore tab when configuring the sample.

What happens a lot, especially when you have datasets with fewer than 10 thousand rows and not too many columns, is that you get the message "Whole Dataset" instead of "Sample", but the sample configuration is still "First records". See attached screenshot.

You can try increasing the number of rows to retrieve and see whether you then get "Whole Dataset". But even then, there are memory limits on the maximum amount of data that will be shown in the "Explore" tab.

Hope this helps.

Turribeach · June 11

Sampling the whole dataset is a bad idea. If there were such an option, which there isn't, as @Ignacio_Toledo rightly said, you would risk loading too much data into memory and having the sampling aborted. When you run the recipe, the whole dataset will be used. Additionally, forcing the whole dataset to load for Explore would make navigating the flow very painful. Is there any reason why you need to see all columns and all rows? You can analyze each column separately, and you can easily calculate record counts if you want to see how many rows there are in total.

Doga · June 11

Hi @Turribeach,

I am working on a transactional healthcare database. I grouped the data by the "code" column, showing # claims associated with each code. I wanted to explore the codes for which the claim count was highest, so I sorted desc by the # claims column. Since the data I am viewing is a sample of the full dataset, I wanted to make sure I wasn't missing any code with a very high claim count just because it wasn't in my sample. That's why I wanted to view the entire dataset.

I could potentially use another recipe to sort it on the "# claims" column, but I thought I could quickly look at it in the output of the group by instead.

Does that make sense?

Turribeach · June 11

It totally makes sense and it's a very valid use case. But you should have started this thread with that, as that is your requirement. Always start with your requirement, not with how you think you can achieve it, as there could be better ways of achieving what you want.

The easiest way to achieve what you want is to use the column Analyze option. In the Explore tab of your dataset, hover over the column heading and a small black down arrow will be displayed. Select it, then select Analyze. You will now be able to analyze this column only, and you can easily select the "Whole Data" option without having to create a new recipe or sample the whole dataset. Dataiku will analyze all the values in this column and present you with a summary of them and some very useful statistics. See the attached screenshot.

If this is not enough, you can go to the Statistics tab of the dataset and do further statistical analysis. Finally, if your database is SQL based, you can also do further analysis using a SQL Notebook (under the Notebooks menu). This is an integrated SQL IDE which works very well. Note that the notebook will also pull only 10,000 rows by default, but in SQL you can quickly change the query to perform aggregations or grouping and so reduce the result set.
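To illustrate the aggregation idea, here is a minimal sketch of the kind of query you could adapt for a SQL Notebook. It runs against an in-memory SQLite table so that it is self-contained; the table name `claims`, the column `code`, and the sample rows are all made up for illustration, and your database's SQL dialect may differ slightly.

```python
import sqlite3

# Hypothetical stand-in for the real claims table.
# Table name, column names and rows are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, code TEXT)")
rows = [("A10",)] * 5 + [("B20",)] * 2 + [("C30",)] * 9 + [("D40",)]
conn.executemany("INSERT INTO claims (code) VALUES (?)", rows)

# The database counts over every row, then returns only the top codes,
# so the result stays small however large the table is.
TOP_CODES_SQL = """
SELECT code, COUNT(*) AS n_claims
FROM claims
GROUP BY code
ORDER BY n_claims DESC, code ASC
LIMIT 3
"""

top_codes = conn.execute(TOP_CODES_SQL).fetchall()
print(top_codes)  # [('C30', 9), ('A10', 5), ('B20', 2)]
```

Because the grouping and ordering happen in the database, a row limit on the result only trims the tail of the ranking and not the underlying data being counted.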

Doga · June 11

Hi @Ignacio_Toledo,

I think I am confusing it with the functionality in the chart option. You're probably right that this functionality was never available to begin with in the Explore tab in a dataset.

Your recommendation and callouts about them make sense.

Thanks!

Doga · June 11

Hi @Turribeach,

That's a fair callout. I will include my requirements in future posts.

Regarding the solution you proposed, I am not interested in looking at the distribution of a specific column. I was more interested in ordering the data by the column "# claims" so that I can check what codes (another column in the data) were associated with the highest counts.

I think I can just try to get all the data, as @Ignacio_Toledo suggested, by changing the # records in the sample option (keeping in mind the callouts he made about using this option), or I can simply use another recipe to sort it and take the default first 10k records as the sample to ensure I am not missing any codes.
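A toy sketch of why the second option should work, assuming the sorted output is read back in sorted order. The codes, counts, and the sample size of 3 (standing in for the 10k default) are made up for illustration.

```python
# Hypothetical group-by output: (code, claim_count) pairs in arbitrary storage order.
grouped = [
    ("A10", 12), ("B20", 3), ("C30", 7),
    ("D40", 250), ("E50", 1), ("F60", 98),
]

SAMPLE_SIZE = 3  # stands in for the 10,000-row "first records" sample


def first_records(rows, n):
    """Mimic a 'first records' sample: keep the first n rows as stored."""
    return rows[:n]


# Sampling the unsorted output can miss the largest counts entirely.
unsorted_sample = first_records(grouped, SAMPLE_SIZE)

# Sorting the full output first guarantees the top codes lead the sample.
sorted_rows = sorted(grouped, key=lambda row: row[1], reverse=True)
sorted_sample = first_records(sorted_rows, SAMPLE_SIZE)

print(unsorted_sample)  # [('A10', 12), ('B20', 3), ('C30', 7)]
print(sorted_sample)    # [('D40', 250), ('F60', 98), ('A10', 12)]
```

Sorting the sample in the Explore view only reorders rows that were already sampled, whereas sorting in a recipe runs over the full data before the sample is taken.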

Thanks for the inputs!

tgb417 · June 13

@DogaS

I hear your desire to see the whole dataset. I often have a similar use case, so I will often set things up to load, say, the first 1,000,000 or even 10,000,000 records, which is typically larger than many of my datasets. Bumping the memory limit way up, to say 1550 MB, has worked for me.

I’m not at my instance right now, so I cannot see the actual setting. But to get to 1550 MB one has to go into the Admin settings and make a change. My DSS instance is running on a computer with 32 GB of RAM, so for most of my datasets I can see all of the data if I want to. For larger datasets I take a more database-style approach with grouping, or do random sampling and hope that I get all of the modalities in a particular column or set of columns.

Hope that helps a bit. Let us know how you get on with your data.
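To put a rough number on the "hope" in random sampling, here is a small sketch. It assumes a simple random sample of a fixed number of rows drawn without replacement, which may not match exactly how DSS draws its random samples; the function name and the figures are my own for illustration.

```python
def miss_probability(total_rows, occurrences, sample_size):
    """Chance that a value appearing `occurrences` times in `total_rows` rows
    is completely absent from a simple random sample of `sample_size` rows
    drawn without replacement (hypergeometric probability of zero hits)."""
    p = 1.0
    for j in range(occurrences):
        remaining = total_rows - sample_size - j
        if remaining <= 0:
            # The sample is large enough that at least one occurrence must be in it.
            return 0.0
        p *= remaining / (total_rows - j)
    return p


# A modality that appears on only 5 of 1,000,000 rows, with a 10,000-row sample.
p_rare = miss_probability(1_000_000, 5, 10_000)
print(f"{p_rare:.3f}")

# A modality that appears on 1,000 of 1,000,000 rows, with the same sample size.
p_common = miss_probability(1_000_000, 1_000, 10_000)
print(f"{p_common:.6f}")
```

Under these assumptions the rare modality is missed about 95% of the time, while the more common one is almost always picked up, so random sampling is a reasonable way to see the frequent values but not the rare ones.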

As @Turribeach said, this will slow down the loading of datasets and risks running out of memory.
