How to Analyze Entire Datasets in Dataiku Instead of Samples

I’m working in Dataiku to create a usable dataset by combining internal data with public data.
Currently, each dataset is loaded with only a sample of about 10,000 rows, and I used Visual Recipes to build the final dataset.
However, it seems that the final output was also generated based only on the sample data.
How can I verify the results based on the entire dataset instead of just the sample?
I noticed that if I go into each dataset and change the "Sampling method" to "Random (approx. ratio)" with a ratio of 100%, it appears to load all the data.
But I’m not sure if this is the correct approach, and it’s quite time-consuming to manually change the setting for every dataset.
Comments
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron
The dataset sample and the Explore tab are there to assist you in developing your flow; they are not a means of analysing full datasets. When a recipe runs, it always uses all of the data from its input datasets, not the sample, so there is no need to do anything additional. You should not set the sample to 100%, as it will likely lead to memory errors. Dataiku provides many ways to analyse a dataset in its entirety. I will mention some of them:
- In the metrics tab compute the default metrics which will include a live row count
- Add additional metrics as required and compute them
- In the Explore tab click on the column name and select Analyse. Then you can change the Analyse option to be done across the whole dataset
- Create new recipes to see how your data aggregates
- Use the Charts tab to do charts
- Use the Statistics tab to do statistical analysis of your data
- Use Insights and Dashboards
- Use Jupyter Notebooks and Python
- Use SQL Notebooks
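For the Jupyter Notebooks and Python option, here is a minimal sketch of checking a whole dataset without loading it into memory at once. It assumes the code runs in a notebook inside DSS, where the `dataiku` package is available, and that `Dataset.iter_rows()` streams every row as a dict-like object. The helper names `summarize_rows` and `full_dataset_rows` are hypothetical, and treating `None` and empty strings as missing values is an assumption you may need to adjust for your data.

```python
from collections import Counter


def summarize_rows(rows):
    """Count rows and missing values per column over an iterable of dict-like rows."""
    total = 0
    missing = Counter()
    for row in rows:
        total += 1
        for column in row.keys():
            value = row[column]
            # Assumption: None and empty strings count as missing values.
            if value is None or value == "":
                missing[column] += 1
    return total, dict(missing)


def full_dataset_rows(dataset_name):
    """Stream all rows of a DSS dataset (not the Explore sample)."""
    # Imported lazily because the dataiku package only exists inside DSS.
    import dataiku

    return dataiku.Dataset(dataset_name).iter_rows()
```

In a DSS notebook this could be called as `summarize_rows(full_dataset_rows("my_final_dataset"))`, where `my_final_dataset` is a placeholder for your own output dataset name. Streaming rows keeps memory use flat, at the cost of being slower than computing the same figures in the Metrics tab or a SQL notebook.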
It sounds like you could benefit from doing the Dataiku Academy Certifications to understand more deeply how Dataiku works. Unlike many vendors out there Dataiku gives the training and certifications for free.
-
I am seeking consultancy from experienced people. In my project, I am not sure which data collection and analysis methods I should employ, given that the project duration is two years. I am seeking collaboration for data analysis. Please contact me if you have expertise in research design and data analysis.