Create a per-column SQL probe plugin

antonstam
Level 3
Create a per-column SQL probe plugin

I'm trying to create an efficient probe where I can compute multiple metrics for multiple columns in the dataset. For simplicity, let's say that I want a count of all 0's, 1's, and 2's.

With perColumn and singleAggregate set to true, I can create a probe that counts zeroes:

probe.json

{
    "meta" : {
        "label" : "Number of zeroes",
        "description" : "Number of zeroes",
        "icon" : "icon-puzzle-piece"
    },
    "restrictToDatasetType" : "",
    "canHive" : true,
    "canImpala" : true,
    "canSpark" : true,
    "singleAggregate" : true,
    "perColumn" : true
}

probe.sql

SUM(CASE WHEN ${column} = 0 THEN 1 ELSE 0 END)

However, I don't want to create another probe to count the 1's, I want the same probe to tackle that calculation. My guess is to set singleAggregate to false, but then I can't figure out how to write the query. The following is my current attempt:

probe.sql

SELECT
    SUM(CASE WHEN ${column} = 0 THEN 1 ELSE 0 END) AS zero_count,
    SUM(CASE WHEN ${column} = 1 THEN 1 ELSE 0 END) AS one_count
FROM ${DKU_DATASET_TABLE_NAME} ${DKU_WHERE_PARTITION_FILTER}

 Running this probe throws an error: Incorrect syntax near '{'.

3 Replies
fchataigner2
Dataiker

Hi,

the mode in which you can have the same probe applied to several column is only possible when there is a single aggregate, so what you're trying to build is not possible.

Regards,

    Frederic

 

0 Kudos
antonstam
Level 3
Author

Ah, okay. So if I would write different probes for each individual metric, would Dataiku calculate all these probes in one pass on the table, or execute a separate pass for each probe?

0 Kudos
fchataigner2
Dataiker

Hi,

when you have activated metric probes on a dataset, you can get a preview of the "computation runs" that DSS will do at the top of the Status > Edit > Metrics page (the little → is actually a button with a dropdown). For instance, your metrics should end up on the same computation run and thus be a single query to the database.

Note that you can cheat you way out by building the aggregates list in a custom UI (see https://doc.dataiku.com/dss/latest/plugins/reference/other.html#custom-settings-ui 😞 make a JS controller that takes the form values to build an aggregates property in the probe's config, and use it in a simple SQL probe like 

SELECT ${aggregates} FROM ${DKU_DATASET_TABLE_NAME} ${DKU_WHERE_PARTITION_FILTER}

Regards,
    Frederic

 

0 Kudos