How to analyze all the data, not just the sample?
UserBird
Dataiker, Alpha Tester Posts: 535 Dataiker
When using the analyze tool for a column, is it possible to force the analysis to run on the whole dataset instead of just the current sample?
How else could I get, for example, a categorical analysis of a column for all of the data?
Answers
-
There is no such control inside the Analysis dialog box. You can of course change the current sample and set it to be the whole dataset, in which case the analysis (and everything else in the preparation script) will be previewed on the whole dataset.
Note that the interface will hang if your dataset is too big (as a rule of thumb, compare its size to the default sample size, which is 30 000 rows).
For datasets that fit in RAM, I would rather use the value_counts method of pandas. -
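As a rough illustration of the value_counts suggestion above, here is a minimal sketch. The column name `color` and the tiny in-memory DataFrame are made up for the example; in practice you would load your own full dataset (for instance from an export) into a DataFrame first.

```python
import pandas as pd

# Hypothetical data standing in for the full dataset; replace it with your
# own DataFrame, loaded however suits your setup.
df = pd.DataFrame(
    {"color": ["red", "blue", "red", "green", "red", "blue", None]}
)

# Count every distinct value over all rows.
# dropna=False keeps missing values as their own category.
counts = df["color"].value_counts(dropna=False)

# The same counts expressed as a fraction of all rows.
shares = df["color"].value_counts(normalize=True, dropna=False)

print(counts)
print(shares)
```

Because this runs on the whole DataFrame rather than on a sample, the counts are exact, but the data has to fit in memory.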
With DSS 1.x (I will update my post later for DSS 2.0 if anything changes), when you explore a dataset or build a preparation script, you work on a sample. As jrouquie suggested, you can change the sample size.
There is something that could help you: the Visualize tab. By default, it works on the same sample as the Explore tab.
But if you are on a SQL or Impala dataset, you can change the engine and get graphs on the full dataset. Read more here: http://doc.dataiku.com/dss/1.4/visualization/sampling.html#live-in-database-engine
I hope that helps.
-
This feature is now available in DSS 4.0.