Cannot select whole dataset sampling option
Hi,
I don't see the "Whole Dataset" option when I go to the Sample tab that opens on the left side of the screen. Please see the attached screenshot. I used to see it, and I don't know what has changed. Any ideas?
Thanks,
Operating system used: Windows
Answers
-
Ignacio_Toledo
Hi @DogaS,
As far as I recall, there has never been an option to select "Whole Dataset" in the Explore tab when configuring the sample.
What happens a lot, especially with datasets that have fewer than 10 thousand rows and not too many columns, is that you get the message "Whole Dataset" instead of "Sample", even though the sample configuration is still "First records". See attached screenshot.
You can try increasing the number of rows to retrieve and see whether you then get "Whole Dataset". But even then, memory limits cap the maximum amount of data that will be shown in the Explore tab.
Hope this helps.
-
Turribeach
Sampling the whole dataset is a bad idea. If there were such an option (which there isn't, as @Ignacio_Toledo rightly said), you would risk loading too much data into memory and having the sampling aborted. When you run the recipe, the whole dataset will be used. Additionally, forcing the whole dataset to load for exploration will make navigating the flow very painful. Is there any reason why you need to see all columns and all rows? You can analyze each column separately, and you can easily calculate record counts if you want to see how many rows there are in total.
-
Hi @Turribeach,
I am working on a transactional healthcare database. I grouped the data by the "code" column to show the number of claims associated with each code. I wanted to explore the codes with the highest claim counts, so I sorted descending by the "# claims" column. Since the data I am viewing is a sample of the full dataset, I wanted to make sure I wasn't missing any code with a very high claim count just because it wasn't in my sample. That's why I wanted to view the entire dataset.
I could potentially use another recipe to sort it on the "# claims" column, but I thought I could quickly look at it in the output of the group by instead.
Does that make sense?
-
Turribeach
It totally makes sense and it's a very valid use case. But you should have started this thread with that, as that is your requirement. Always start with your requirement, not with how you think you can achieve it, as there could be better ways of achieving what you want.
The easiest way to achieve what you want is to use the column Analyze option. In the Explore tab of your dataset, hover over the column heading and a small black down arrow will be displayed. Select it, then select Analyze. You will now be able to analyze this column only, and you can easily select the "Whole Data" option without having to create a new recipe or sample the whole dataset. Dataiku will analyze all the values in this column and present you with a summary of them and some very useful statistics.
If this is not enough, you can go to the Statistics tab of the dataset and do further statistical analysis. Finally, if your dataset is SQL based, you can also do further analysis using a SQL Notebook (under the Notebooks menu). This is an integrated SQL IDE which works very well. Note that it will also pull only 10,000 rows by default, but in SQL you can quickly change the query to perform aggregations or grouping and so reduce the result set.
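As a rough illustration of the SQL approach, the sketch below runs a group-and-sort query against a tiny in-memory SQLite table. The table name `claims`, the columns `claim_id` and `code`, and the sample rows are all made up for the example; in a SQL Notebook you would run only the query itself against your own table, and the exact syntax may differ slightly by database.

```python
import sqlite3

# Hypothetical table and column names; adjust to your own schema.
QUERY = """
SELECT code, COUNT(*) AS n_claims
FROM claims
GROUP BY code
ORDER BY n_claims DESC, code
LIMIT 3
"""

def top_codes(conn):
    """Return (code, n_claims) rows for the codes with the most claims."""
    return conn.execute(QUERY).fetchall()

# Tiny in-memory stand-in for the real claims table (invented data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, code TEXT)")
rows = [(1, "A10"), (2, "B20"), (3, "A10"), (4, "C30"),
        (5, "A10"), (6, "B20"), (7, "D40")]
conn.executemany("INSERT INTO claims VALUES (?, ?)", rows)

result = top_codes(conn)
print(result)
```

Because the aggregation and the sort happen in the database, the row limit applies to the already-sorted result, so the highest-count codes cannot be dropped by sampling.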
-
Hi @Ignacio_Toledo,
I think I am confusing it with the functionality in the chart option. You're probably right that this functionality was never available in the Explore tab of a dataset.
Your recommendation and the caveats that come with it make sense.
Thanks!
-
Hi @Turribeach,
That's a fair callout. I will include my requirements in future posts.
Regarding the solution you proposed, I am not interested in looking at the distribution of a specific column. I was more interested in ordering the data by the "# claims" column so that I can check which codes (another column in the data) were associated with the highest counts.
I think I can either try to get all the data as @Ignacio_Toledo suggested, by changing the number of records in the sample option (keeping in mind the caveats he mentioned), or simply use another recipe to sort it, so that the default sample of the first 10k records cannot miss the top codes.
Thanks for the inputs!
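For reference, the "sort, then look at the first records" idea can be sketched in plain Python as a single pass over the grouped rows that keeps only the top N. The field names `code` and `n_claims` and the sample rows are invented for the illustration; how the rows are read from the dataset is left out on purpose.

```python
import heapq

def top_n_by_claims(grouped_rows, n=10, count_field="n_claims"):
    """Return the n rows with the largest claim counts from any iterable of row dicts."""
    # heapq.nlargest scans the iterable once and keeps at most n rows in memory.
    return heapq.nlargest(n, grouped_rows, key=lambda row: row[count_field])

# Invented output of a group-by on "code".
grouped = [
    {"code": "A10", "n_claims": 120},
    {"code": "B20", "n_claims": 4500},
    {"code": "C30", "n_claims": 87},
    {"code": "D40", "n_claims": 990},
]

top = top_n_by_claims(grouped, n=2)
print([row["code"] for row in top])
```

The point is that every grouped row is inspected, so a code with a very high count is found regardless of where it sits in the dataset.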
-
tgb417
I hear your desire to see the whole dataset. I often have a similar use case, so I will often set things up to load, say, the first 1,000,000 or even 10,000,000 records, which is typically larger than many of my datasets. I also bump the memory way up; around 1550 MB has worked for me.
I'm not at my instance right now, so I cannot see the actual setting. But to get to 1550 MB one has to go into the Admin settings and make a change. My DSS instance is running on a computer with 32 GB of RAM, so for most of my datasets, if I want to see all of the data, I can. For larger datasets I will take a more database-style approach with grouping, or do random sampling and hope that I get all of the modalities in a particular column or set of columns.
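To put a rough number on "hope that I get all of the modalities", here is a small back-of-the-envelope sketch. It assumes rows are drawn independently (sampling with replacement), which is only an approximation of how a real random sample is taken; the function name and the example shares are made up.

```python
def prob_value_missed(value_share, sample_size):
    """Chance that a value making up `value_share` of all rows never appears
    in a random sample of `sample_size` rows drawn with replacement."""
    return (1.0 - value_share) ** sample_size

# A value present in 1 row out of 10,000, sampled with 10,000 rows:
rare = prob_value_missed(1 / 10_000, 10_000)

# A value present in 1 row out of 100, same sample size:
common = prob_value_missed(1 / 100, 10_000)

print(f"rare value missed with probability {rare:.2f}")
print(f"common value missed with probability {common:.1e}")
```

Under these assumptions, the rare value is missed more than a third of the time, while the common one is practically never missed. Very rare modalities are therefore the ones a random sample is most likely to lose.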
Hope that helps a bit. Let us know how you get on with your data.
As @Turribeach noted, this will slow down the loading of datasets and risks running out of memory.