How to analyze all the data, not just the sample?

UserBird
UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
When using the analyze tool for a column, is it possible to force the analysis to run on the whole dataset instead of just the current sample?

How else could I get, for example, a categorical analysis of a column for all of the data?

Answers

  • jrouquie
    jrouquie Dataiker Alumni Posts: 87 ✭✭✭✭✭✭✭
    There is no such control inside the Analysis dialog box. You can, however, change the current sample and set it to the whole dataset, in which case the analysis (and everything else in the preparation script) will be previewed on the whole dataset.

    Note that the interface will hang if your dataset is too big (as a rule of thumb, compare its size to the default sample size, which is 30 000).

    For datasets that fit in RAM, I would rather use the value_counts method of pandas.
  • jereze
    jereze Alpha Tester, Dataiker Alumni Posts: 190 ✭✭✭✭✭✭✭✭

    With DSS 1.x (I will update this post for DSS 2.0 if anything changes), when you explore a dataset or build a preparation script, you work on a sample. As jrouquie suggested, you can change the sample size.

    Something else that could help you is the Visualize tab. By default, it works on the same sample as the Explore tab.

    But if your dataset is stored in SQL or Impala, you can change the engine and get graphs computed on the full dataset. Read more here: http://doc.dataiku.com/dss/1.4/visualization/sampling.html#live-in-database-engine

    I hope that helps.

  • cperdigou
    cperdigou Alpha Tester, Dataiker Alumni Posts: 115 ✭✭✭✭✭✭✭
    This feature is now available in DSS 4.0.
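
For readers who want to try the pandas approach jrouquie mentions, here is a minimal sketch, assuming the whole dataset fits in RAM. The column name `country` and the small inline DataFrame are hypothetical stand-ins. In a DSS Python notebook the DataFrame would presumably be loaded through the `dataiku` package instead, as noted in the comment.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset. In a DSS Python notebook the
# DataFrame would typically be loaded with something like
# dataiku.Dataset("my_dataset").get_dataframe() (dataset name is an assumption).
df = pd.DataFrame({"country": ["FR", "US", "FR", "DE", "US", "FR", None]})

# Count every distinct value over all rows, not a sample.
# dropna=False keeps missing values as their own category.
counts = df["country"].value_counts(dropna=False)

# The same distribution expressed as a share of all rows.
shares = df["country"].value_counts(normalize=True, dropna=False)

print(counts)
print(shares)
```

Because `value_counts` runs over the entire DataFrame, the result reflects every row that was loaded, at the cost of holding the data in memory.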