Analyze Categorical Column

davidmakovoz
davidmakovoz Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 67 Neuron

I have a dataset of ~18M rows. I'm analyzing one of the categorical columns. When I run the analysis on the sample data, I get the count for each categorical value (see attached sample_manufacturer.png).

But when I run the analysis on the whole dataset, all I get is the number of non-empty rows and the number of distinct values (see attached whole_manufacturer.png).

What am I doing wrong?

David

Best Answer

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Answer ✓

    Hi,

    When you toggle "full sample" in the analysis, DSS falls back on the metrics it has computed on the dataset as the source for the indicators in the window (the same metrics you can see in the Status tab). The list of the most frequent values for a categorical column comes from the top-K metric, so you need to click 'compute' for this column after making sure the "mode, top-k" toggle is on.

    As for the computation of the metrics themselves, you seem to say that you see the "dss stream" engine name in the modal during the computation. Is your SQL dataset of type "table" or "query"?
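
As a rough illustration of why the two screenshots differ: the non-empty count, the distinct count, and the top-K list are separate aggregates over the same column, so one can be available while another has not been computed yet. The Python sketch below is not DSS code and does not use any DSS API; the function name `summarize_categorical` and the sample manufacturer values are made up for illustration.

```python
from collections import Counter


def summarize_categorical(values, k=3):
    """Return the non-empty count, distinct count, and k most frequent values."""
    # Treat None and the empty string as "empty" cells (an assumption for this sketch).
    non_empty = [v for v in values if v not in (None, "")]
    counts = Counter(non_empty)
    return {
        "non_empty": len(non_empty),      # number of non-empty rows
        "distinct": len(counts),          # number of distinct values
        "top_k": counts.most_common(k),   # per-value counts, most frequent first
    }


# Hypothetical sample standing in for the manufacturer column.
manufacturers = ["Acme", "Globex", "Acme", "", "Initech", "Acme", "Globex", None]
summary = summarize_categorical(manufacturers, k=2)
print(summary)
```

Only the `top_k` entry carries the per-value counts; the other two entries correspond to what is shown when only the basic and distinct-count metrics are available.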

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron

    Hi @davidmakovoz. What kind of connection does the dataset you are analyzing use: SQL, filesystem, HDFS, other?

    Also, when you click on 'Click to configure', what do you see?

  • davidmakovoz

    I am analyzing a SQL-connected dataset.

    What I see when I click to configure is really a topic for a different question.

    I see "Min, max, avg, stddev, non empty, histogram" and "Distinct value count" toggles, even though this column is clearly recognized as a categorical variable. Moreover, I can't have "Distinct value count" on without also having "Min, max, avg, stddev, non empty, histogram" on: untoggling "Min, max, avg, stddev, non empty, histogram" automatically switches off "Distinct value count". And then, when I click compute, I see a message about computing max and min on the stream.

  • davidmakovoz

    Turning "mode, top-k" on solved the problem.

    To answer your question, my source is a "table", not a query.

    Btw, I'm still confused about the "mandatory" max and min calculation for categorical data.

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    Min and max (and count too, IIRC) are mandatory because the display in that modal needs them. If you don't want to compute them, you need to check the metrics via the Status > Metrics tab instead.
