How to Analyze Entire Datasets in Dataiku Instead of Samples

Data8
Data8 Registered Posts: 5 ✭✭✭

I’m working in Dataiku to create a usable dataset by combining internal data with public data.
Currently, each dataset is loaded with only a sample of about 10,000 rows, and I used Visual Recipes to build the final dataset.
However, it seems that the final output was also generated based only on the sample data.

How can I verify the results based on the entire dataset instead of just the sample?
I noticed that if I go into each dataset and change the "Sampling method" to "Random (approx. ratio)" with a ratio of 100%, it appears to load all the data.
But I’m not sure if this is the correct approach, and it’s quite time-consuming to manually change the setting for every dataset.

Comments

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,638 Neuron

    The Dataset Sample and the Explore tab are there to assist you in developing your flow; they are not a means of analysing full datasets. When a recipe runs it will always use all of the data from the dataset, not the sample. There is no need to do anything additional. You should not set the sample to 100% as it will likely lead to memory errors. Dataiku provides many ways to analyse the dataset in its entirety; I will mention some of them:

    1. In the metrics tab compute the default metrics which will include a live row count
    2. Add additional metrics as required and compute them
    3. In the Explore tab click on the column name and select Analyse. Then you can change the Analyse option to be done across the whole dataset
    4. Create new recipes to see how your data aggregates
    5. Use the Charts tab to do charts
    6. Use the Statistics tab to do statistical analysis of your data
    7. Use Insights and Dashboards
    8. Use Jupyter Notebooks and Python
    9. Use SQL Notebooks
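
    As a rough illustration of option 8, the sketch below streams over every row of a dataset from a notebook instead of loading it all into memory. It assumes the code runs inside DSS, where the `dataiku` package is available, and that the rows yielded by `Dataset.iter_rows()` behave like dicts (expose `.items()`). The helper names `profile_rows` and `profile_dss_dataset` and the dataset name in the usage note are hypothetical.

```python
from collections import Counter


def profile_rows(rows):
    """Stream over dict-like rows and return (row_count, missing_per_column).

    Only one row is held in memory at a time, so this works on inputs
    that are far larger than the Explore sample.
    """
    row_count = 0
    missing = Counter()
    for row in rows:
        row_count += 1
        for column, value in row.items():
            # Treat None and empty strings as missing values.
            if value is None or value == "":
                missing[column] += 1
    return row_count, dict(missing)


def profile_dss_dataset(name):
    """Hypothetical wrapper: profile a full DSS dataset by name.

    Assumption: runs inside a DSS notebook or recipe, and the rows from
    iter_rows() support .items() like a dict.
    """
    import dataiku  # imported lazily so the helper above also runs outside DSS

    return profile_rows(dataiku.Dataset(name).iter_rows())


# Local check with plain dicts standing in for dataset rows.
demo_rows = [
    {"id": 1, "city": None},
    {"id": "", "city": "Paris"},
    {"id": 3, "city": "Lyon"},
]
demo_count, demo_missing = profile_rows(demo_rows)
print(demo_count, demo_missing)
```

    Inside a DSS notebook the call would look like `profile_dss_dataset("my_final_dataset")`, where the dataset name is a placeholder. For heavier analysis, computing metrics or pushing the aggregation down with a SQL notebook is likely to be faster than iterating in Python.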

    It sounds like you could benefit from doing the Dataiku Academy Certifications to understand more deeply how Dataiku works. Unlike many vendors out there Dataiku gives the training and certifications for free.

  • Nebal
    Nebal Registered Posts: 2

    I am seeking consultancy from experienced people. In my project, I am not sure which data collection and analysis methods I should employ, given that the project duration is two years. I am seeking collaboration for data analysis. Please contact me if you have expertise in research design and data analysis.

  • jtb
    jtb Registered Posts: 2 ✭✭

    Honestly, this DSS is a joke. If you join on two fields and your sample comes back with no results, how can you compare your results? I think everyone understands the need for efficiency. However, in the design phase there needs to be an option to see all data as it passes through every recipe.
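
    A small sketch of the situation described here, using made-up data in plain Python rather than DSS itself: when each input is previewed through a first-N-rows sample, the matching keys may simply not fall inside both samples, so the sampled join is empty even though the join over the full inputs is not. The table contents, the `inner_join` helper and the sample size are all hypothetical.

```python
def inner_join(left, right, key):
    """Inner join two lists of dict rows on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical inputs: every order references a customer id in 9000..9999.
customers = [{"customer_id": i, "segment": "A" if i % 2 else "B"} for i in range(10_000)]
orders = [{"order_id": n, "customer_id": 9_000 + n % 1_000} for n in range(5_000)]

SAMPLE = 1_000  # stand-in for a "first N rows" preview sample

# Join computed on the samples only: customer ids 0..999 vs 9000..9999.
sample_join = inner_join(orders[:SAMPLE], customers[:SAMPLE], "customer_id")

# Join computed on the full inputs.
full_join = inner_join(orders, customers, "customer_id")

print(len(sample_join), len(full_join))  # prints "0 5000"
```

    The empty preview says nothing about the built output. Checking the row count of the output dataset after the recipe has run is one way to confirm what the full join produced.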

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,638 Neuron
    edited December 22

    Sorry, but I disagree. You have many options for exploring all attributes of the data, including all the rows, and all of those options are at your disposal. If you want to load 100% of a table in a dataset sample, be my guest. If you have enough memory (and the time to wait for it to load), DSS will do it for you in a data sample, or you can use Jupyter Notebooks to load the data in Python. But no human needs to see millions of rows, nor can they process that amount of data in a sensible time frame. And from working with hundreds of Dataiku users I can tell you none of them wants to see their full datasets; you are in the minority. Finally, even at millions of rows, where your argument is lost, it is possible to load these datasets in memory (assuming you have the memory to spare). But our users are now working with datasets with hundreds of millions, billions and hundreds of billions of rows. "There needs to be an option to see all data as it passes through every recipe" doesn't really work once you reach those sizes.
