Dataset -> Analyze column: Approximate or precise?

Thomas_K
Thomas_K Registered Posts: 15 ✭✭✭✭

When I analyze a column in a dataset, I have the options "Sample" and "Whole data". With "Whole data", I only get the percentages of empty vs. non-empty values; with "Sample", I also get the number of unique values. I assume this is because running the analysis on "Whole data" uses an approximate method like HyperLogLog? If so, what are the error rate parameters, and is there a way to get the actual distinct count without using Python?

Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Hi,

    In "whole data" mode, some statistics are indeed not available because they would require heavy computation; we try to limit the statistics to those that can be computed in one or two passes over the data.

    The count of distinct values is not approximated, but the median, P25 and P75 values are computed with approximate percentiles. The implementation depends on the database if the dataset is SQL, and on Hive or Impala if the dataset is HDFS; otherwise the percentiles are computed with t-digests using 100 bins.
  • Thomas_K
    Thomas_K Registered Posts: 15 ✭✭✭✭
    The problem is, in my case there is no "distinct value count" for the "whole data" mode. I only get that for the sample subset. I added a screenshot in my original post.
  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    The distinct value count will be in the numerical tab if your column type is numeric.
  • Thomas_K
    Thomas_K Registered Posts: 15 ✭✭✭✭
    No matter how I format them, I get a distinct value count only for the sample, not for the whole data set - even though I activated it in the settings...
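
For readers who end up needing an exact distinct count outside the Analyze dialog, here is a minimal sketch of the single-pass computation the thread is discussing. It is not the DSS implementation and uses no Dataiku API: the function name `column_stats`, the `city` column and the inline sample data are all hypothetical, and the rows are assumed to arrive as dictionaries (for example from a CSV export of the dataset).

```python
import csv
import io
from typing import Dict, Iterable


def column_stats(rows: Iterable[Dict[str, str]], column: str) -> Dict[str, int]:
    """Count total, empty, non-empty and exactly distinct values in one pass."""
    distinct = set()  # memory grows with the number of unique values
    total = 0
    empty = 0
    for row in rows:
        total += 1
        value = row.get(column)
        if value is None or value == "":
            empty += 1
        else:
            distinct.add(value)
    return {
        "total": total,
        "empty": empty,
        "non_empty": total - empty,
        "distinct": len(distinct),
    }


# Hypothetical sample data standing in for an exported dataset.
SAMPLE_CSV = "id,city\n1,Paris\n2,Berlin\n3,\n4,Paris\n"

stats = column_stats(csv.DictReader(io.StringIO(SAMPLE_CSV)), "city")
print(stats)
```

The count is exact because every non-empty value is kept in a set, so memory use scales with the number of unique values rather than with the number of rows. On a SQL dataset, a `COUNT(DISTINCT ...)` query gives the same exact figure without any Python.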