Dataset -> Analyze column: Approximate or precise?

Thomas_K · February 2018

When I analyze a column in a dataset, I have the options "Sample" and "Whole data". With "Whole data", I only get the percentages of empty vs. non-empty values; with "Sample", I also get the number of unique values. I assume this is because running the analysis on "Whole data" uses an approximate method like HyperLogLog? If so, what are the error rate parameters, and is there a way to get the actual distinct count without using Python?

fchataigner2 · February 2018

Hi,

In "whole data" mode, some statistics are indeed not available because they would require heavy computation; we try to limit the statistics to those that can be computed in one or two passes over the data.

The count of distinct values is not approximated, but the median, P25 and P75 values are computed with approximate percentiles. The implementation depends on where the data lives: the database's own functions if the dataset is SQL, Hive or Impala if the dataset is on HDFS, and t-digests using 100 bins otherwise.
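
To make the distinction concrete, here is a minimal Python sketch of what "exact" means for each statistic. It is only an illustration, not the product's actual code: the function name `exact_column_stats` and the returned keys are made up for this example. An exact distinct count needs one pass plus a set of the values seen, whereas exact quartiles need all values held and sorted, which is why percentiles are the ones typically approximated on whole data.

```python
import statistics


def exact_column_stats(values):
    """Exact statistics for a small in-memory column (hypothetical helper).

    `None` stands for an empty cell. This keeps every value in memory,
    so it only suits columns that fit in RAM.
    """
    non_empty = [v for v in values if v is not None]
    # Exact distinct count: one pass, memory grows with the number of uniques.
    distinct = len(set(non_empty))
    # Exact quartiles: requires the full sorted column.
    p25, median, p75 = statistics.quantiles(non_empty, n=4, method="inclusive")
    return {
        "empty": len(values) - len(non_empty),
        "non_empty": len(non_empty),
        "distinct": distinct,
        "p25": p25,
        "median": median,
        "p75": p75,
    }


stats = exact_column_stats([1, 2, 2, 3, 4, None, 5])
print(stats)
```

On a very large column, the sort behind the quartiles is the expensive part; the set behind the distinct count is only a problem when the column has a very large number of unique values.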

Thomas_K · February 2018

The problem is, in my case there is no "distinct value count" for the "whole data" mode. I only get that for the sample subset. I added a screenshot in my original post.

fchataigner2 · February 2018

The distinct value count will be in the numerical tab if your column type is numeric.

Thomas_K · February 2018

No matter how I format them, I get a distinct value count only for the sample, not for the whole data set - even though I activated it in the settings...
