Analyze Categorical Column

Solved!
davidmakovoz

I have a dataset of ~18M rows. I'm analyzing one of the categorical columns. When I run the analysis on the sample data, I get the count for each categorical value (see attached sample_manufacturer.png). 

But when I run the analysis on the whole data, all I get is the number of non-empty rows and the number of distinct values (see attached whole_manufacturer.png).

What am I doing wrong?

David

5 Replies
Ignacio_Toledo

Hi @davidmakovoz. What kind of connection does the dataset you are analyzing use? SQL, filesystem, HDFS, other?

Also, when you click on 'Click to configure' what do you see?

davidmakovoz
Author

I am analyzing a SQL-connected dataset. 

What I see when I click to configure is really a topic for a different question.

I see "Min, max, avg, stddev, non empty, histogram" and "Distinct value count" toggles, even though this column is clearly recognized as a categorical variable. Moreover, I can't have "Distinct value count" on without also having "Min, max, avg, stddev, non empty, histogram" on: untoggling the latter automatically switches off "Distinct value count". And then, when I click compute, I see a message about computing max and min on the stream.

fchataigner2
Dataiker

Hi,

When you toggle "full sample" in the analysis, DSS uses the metrics it has computed on the dataset as the source for the indicators in the window (the same metrics you can see in the Status tab). The list of the most frequent values for a categorical column comes from the top-K metric, so you need to make sure the "mode, top-k" toggle is on and then click 'compute' for this column.

As for the computation of the metrics themselves, you seem to be saying that you see the "dss stream" engine name in the modal during the computation. Is your SQL dataset of type "table" or "query"?
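
To make the distinction concrete, here is a minimal Python sketch of what a top-K style metric reports compared with the non-empty and distinct counts. It is only an illustration of the idea, not how DSS computes its metrics; the `top_k` helper and the sample `manufacturers` list are hypothetical.

```python
from collections import Counter

def top_k(values, k=3):
    """Return the k most frequent non-empty values with their counts."""
    non_empty = [v for v in values if v not in (None, "")]
    return Counter(non_empty).most_common(k)

# Hypothetical sample of a 'manufacturer' column (not real data)
manufacturers = ["Acme", "Globex", "Acme", "", "Initech", "Acme", "Globex", None]

# Per-value counts, which is what the list of most frequent values needs
top_values = top_k(manufacturers, k=2)

# The two summary numbers shown when only the basic metrics are available
non_empty_count = sum(1 for v in manufacturers if v not in (None, ""))
distinct_count = len({v for v in manufacturers if v not in (None, "")})

print(top_values)       # [('Acme', 3), ('Globex', 2)]
print(non_empty_count)  # 6
print(distinct_count)   # 3
```

The non-empty and distinct counts alone cannot be turned into per-value counts, which is why the list of most frequent values only appears once the top-K metric has been computed.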

davidmakovoz
Author

Turning "mode, top-k" on solved the problem.

To answer your question, my source is a "table", not a query. 

Btw, I'm still confused about the "mandatory" max and min calculation for categorical data.

fchataigner2
Dataiker

Min and max (and count too, IIRC) are mandatory because the display in that modal needs them. If you don't want to compute them, you need to choose the metrics in the Status > Metrics tab instead.
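
As a side note on min and max for categorical data: these statistics are still well-defined on a text column in a SQL table, where they follow string ordering. The sketch below uses Python's built-in `sqlite3` with a hypothetical `products` table; it is an assumption-laden illustration and not the query DSS actually runs.

```python
import sqlite3

# In-memory database with a hypothetical 'products' table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (manufacturer TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("Acme",), ("Globex",), ("Acme",), (None,), ("Initech",)],
)

# One aggregate query returns min, max, non-empty count and distinct count.
# COUNT(column) and COUNT(DISTINCT column) both ignore NULLs.
row = conn.execute(
    "SELECT MIN(manufacturer), MAX(manufacturer), "
    "COUNT(manufacturer), COUNT(DISTINCT manufacturer) "
    "FROM products"
).fetchone()
conn.close()

print(row)  # ('Acme', 'Initech', 4, 3)
```

For text values, min and max are simply the first and last values in sort order, so they can be computed for a categorical column even if they are not very informative.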
