I have a dataset of ~18M rows. I'm analyzing one of the categorical columns. When I run the analysis on the sample data, I get the count for each categorical value (see attached sample_manufacturer.png).
But when I run the analysis on the whole data, all I get is the number of non-empty rows and the number of distinct values (see attached whole_manufacturer.png).
What am I doing wrong?
I am analyzing a SQL-connected dataset.
What I see when I click to configure is really a topic for a different question.
I see "Min, max, avg, stddev, non empty, histogram" and "Distinct value count" toggles, even though this column is clearly recognized as a categorical variable. Moreover, I can't have "Distinct value count" on without also having "Min, max, avg, stddev, non empty, histogram" on: untoggling the latter automatically switches off "Distinct value count". And then, when I click compute, I see a message about computing max and min on the stream.
When you toggle "full sample" in the analysis, DSS uses the metrics it has computed on the dataset as the source for the indicators in the window (the same metrics you can see in the Status tab). The list of the most frequent values for a categorical column comes from the top-K metric, so you need to make sure the "mode, top-k" toggle is on and then click 'compute' for this column.
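In case it helps to see why the per-value counts are a separate metric from the non-empty and distinct counts, here is a rough sketch in plain Python. It is not DSS code and does not reflect how DSS computes its metrics internally; the `top_k` helper and the sample values are made up purely for illustration.

```python
from collections import Counter

def is_filled(value):
    """Treat None and the empty string as empty cells (an assumption for this sketch)."""
    return value not in (None, "")

def top_k(values, k=3):
    """Return the k most frequent non-empty values with their counts."""
    return Counter(v for v in values if is_filled(v)).most_common(k)

# Made-up sample values standing in for a categorical column
manufacturer = ["acme", "globex", "acme", None, "initech", "acme", "globex", ""]

# What you get without a top-K style metric: only two summary numbers
non_empty_count = sum(1 for v in manufacturer if is_filled(v))
distinct_count = len({v for v in manufacturer if is_filled(v)})

# What a top-K style metric adds: the most frequent values and their counts
top = top_k(manufacturer, k=2)
mode = top[0][0]  # the single most frequent value

print(non_empty_count, distinct_count)
print(top, mode)
```

The first two numbers correspond to what you were seeing on the whole data; the per-value list only appears once something like the top-K computation has actually been run.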
As for the computation of the metrics themselves, you seem to be saying that you see the "dss stream" engine name in the modal during the computation. Is your SQL dataset of type "table" or "query"?
Turning "mode, top-k" on solved the problem.
To answer your question, my source is a "table", not a query.
Btw, I'm still confused about the "mandatory" max and min calculation for categorical data.
Min and max (and count too, IIRC) are mandatory because the display in that modal needs them. If you don't want to compute them, you need to check the metrics via the Status > Metrics tab instead.