Create a per-column SQL probe plugin
I'm trying to create an efficient probe where I can compute multiple metrics for multiple columns in the dataset. For simplicity, let's say that I want a count of all 0's, 1's, and 2's.
With perColumn and singleAggregate set to true, I can create a probe that counts zeroes:
probe.json
{ "meta" : { "label" : "Number of zeroes", "description" : "Number of zeroes", "icon" : "icon-puzzle-piece" }, "restrictToDatasetType" : "", "canHive" : true, "canImpala" : true, "canSpark" : true, "singleAggregate" : true, "perColumn" : true }
probe.sql
SUM(CASE WHEN ${column} = 0 THEN 1 ELSE 0 END)
However, I don't want to create another probe to count the 1's, I want the same probe to tackle that calculation. My guess is to set singleAggregate to false, but then I can't figure out how to write the query. The following is my current attempt:
probe.sql
SELECT SUM(CASE WHEN ${column} = 0 THEN 1 ELSE 0 END) AS zero_count, SUM(CASE WHEN ${column} = 1 THEN 1 ELSE 0 END) AS one_count FROM ${DKU_DATASET_TABLE_NAME} ${DKU_WHERE_PARTITION_FILTER}
Running this probe throws an error: Incorrect syntax near '{'.
Answers
-
Hi,
the mode in which you can have the same probe applied to several column is only possible when there is a single aggregate, so what you're trying to build is not possible.
Regards,
Frederic
-
Ah, okay. So if I would write different probes for each individual metric, would Dataiku calculate all these probes in one pass on the table, or execute a separate pass for each probe?
-
Hi,
when you have activated metric probes on a dataset, you can get a preview of the "computation runs" that DSS will do at the top of the Status > Edit > Metrics page (the little → is actually a button with a dropdown). For instance, your metrics should end up on the same computation run and thus be a single query to the database.
Note that you can cheat you way out by building the aggregates list in a custom UI (see https://doc.dataiku.com/dss/latest/plugins/reference/other.html#custom-settings-ui
make a JS controller that takes the form values to build an aggregates property in the probe's config, and use it in a simple SQL probe like SELECT ${aggregates} FROM ${DKU_DATASET_TABLE_NAME} ${DKU_WHERE_PARTITION_FILTER}
Regards,
Frederic