
How to analyze all the data, not just the sample?

Dataiker
When using the analyze tool for a column, is it possible to force the analysis to run on the whole dataset instead of just the current sample?

How else could I get, for example, a categorical analysis of a column for all of the data?
3 Replies
Dataiker Alumni
There is no such control inside the Analysis dialog box. You can, of course, change the current sample and set it to be the whole dataset, in which case the analysis (and everything else in the preparation script) will be previewed on the whole dataset.

Note that the interface will hang if your dataset is too big (as a rule of thumb, compare its size to the default sample size, which is 30,000 rows).

For datasets that fit in RAM, I would rather use the value_counts method of pandas.
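As a minimal sketch of that suggestion: the helper below counts every distinct value of a column with `value_counts`, keeping missing values as their own category. The function name `categorical_summary` and the toy data are illustrative assumptions. In a DSS Python notebook or recipe you would presumably load the full dataset into a DataFrame first (the dataset name in the comment is a placeholder).

```python
import pandas as pd


def categorical_summary(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and share of each distinct value in `column`.

    Missing values are kept as their own category (dropna=False).
    """
    counts = df[column].value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "share": counts / len(df)})


# Toy data for illustration only. Inside DSS, the whole dataset could be
# loaded instead, for example (assumption, placeholder dataset name):
#   import dataiku
#   df = dataiku.Dataset("my_dataset").get_dataframe()
df = pd.DataFrame({"color": ["red", "blue", "red", None, "red", "blue"]})

summary = categorical_summary(df, "color")
print(summary)
```

This only works when the whole dataset fits in memory, as noted above.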
Dataiker

With DSS 1.x (I will update this post later if anything changes in DSS 2.0), when you explore a dataset or build a preparation script, you work on a sample. As jrouquie suggested, you can change the sample size.



There is something that could help you: the Visualize tab. By default, it works on the same sample as the Explore tab.

But if your dataset is stored in a SQL database or Impala, you can change the engine and get graphs on the full dataset. Read more here: http://doc.dataiku.com/dss/1.4/visualization/sampling.html#live-in-database-engine
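To illustrate the general idea behind in-database computation (this is not how DSS implements its engine, only a sketch of the principle): the aggregation runs inside the database over every row, and only the small result comes back. The example uses Python's built-in `sqlite3` with a made-up `orders` table and `status` column, both of which are assumptions for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",), ("closed",), ("open",), ("open",), (None,)],
)

# The GROUP BY runs in the database over all rows; only one row per
# distinct value is returned to the client.
rows = conn.execute(
    "SELECT status, COUNT(*) AS n FROM orders GROUP BY status ORDER BY n DESC"
).fetchall()
conn.close()

value_counts = dict(rows)
print(value_counts)
```

The same `GROUP BY` query can be run directly against the SQL table behind a dataset to get a categorical breakdown of all the data, regardless of the sample.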



I hope that helps.

Jeremy, Product Manager at Dataiku
Dataiker
This feature is now available in DSS 4.0