Metrics and Checks
Hey, hope y'all are doing great
I have a quick question regarding using metrics and checks.
I have a table and I want to run basic checks such as duplicate primary keys, duplicate records, missing values in a particular column, etc.
I used metrics to compute the counts for all of these, along with checks, and displayed them in a dashboard where clicking on a tile takes you to the corresponding sub-dataset. For example, if there are null values in the foreign key column of my dataset, the tile displaying that error takes me to the specific subset where all the foreign key values are null.
These specific datasets were created using SQL recipes, so basically I had to split my dataset into multiple flows, with a SQL query for each check, and then run my checks on those individual subsets.
This has resulted in many, many workflows (literally one per check), and displaying the subset is crucial for the use case. Is there a way to optimize this that does not require creating multiple flows?
Thank you for the assistance
Best Answer
-
Turribeach
So the right way of doing this is to add your metrics as columns so you can filter the rows as needed in the original dataset; there is no need to create separate datasets. In your dataset, create a new column called "Duplicate_Flag" and populate it accordingly. The null check on column ABC doesn't need a separate column, since you can already filter on it. Then create a new Insight, which can simply be your dataset with the correct filters applied for each metric drill-down. Finally, edit the metric tile in your dashboard and set "Behavior on click" to "Open other Insight", pointing it to the corresponding filtered Insight.
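For example, a small Python recipe could populate the flag along these lines (just a sketch; the dataset and column names are placeholders):

```python
import dataiku

# Read the source dataset (name is a placeholder)
src = dataiku.Dataset("my_table")
df = src.get_dataframe()

# A row is a duplicate if every column except the primary key matches
# another row; keep=False marks all members of each duplicate group
non_key_cols = [c for c in df.columns if c != "primary_key"]
df["Duplicate_Flag"] = df.duplicated(subset=non_key_cols, keep=False)

# Write the flagged data to the recipe's output dataset (also a placeholder)
out = dataiku.Dataset("my_table_flagged")
out.write_with_schema(df)
```

The drill-down Insight for duplicates is then just this dataset filtered on Duplicate_Flag = True.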
Answers
-
Turribeach
I don't see why you need to create all these sub-datasets. Can you explain your requirement more clearly?
-
So let's say I have 2 DQ checks that I need to do:
1) Check for duplicate records (rows having the same values but different primary keys)
2) Check for records with null values in column ABC
Now, with metrics and checks I was able to obtain:
1) the # of records that are duplicates
2) the # of records that have null values in ABC
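For reference, a custom Python probe along these lines can produce both counts (a simplified sketch; the column names are placeholders):

```python
def process(dataset, partition_id):
    # dataset is a dataiku.Dataset; load it as a pandas DataFrame
    df = dataset.get_dataframe()

    # Rows that share all values except the primary key
    non_key_cols = [c for c in df.columns if c != "primary_key"]
    duplicate_count = int(df.duplicated(subset=non_key_cols, keep=False).sum())

    # Rows with a null in column ABC
    null_abc_count = int(df["ABC"].isna().sum())

    return {"duplicate_count": duplicate_count, "null_abc_count": null_abc_count}
```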
In the dashboard, there will be a tile that displays a check result of either OK or ERROR based on the metric numbers. Let's say both my checks have errors. What I want to achieve is that clicking on each tile shows me a dataset that has:
1) only those records that are duplicates
2) only those rows that have null values in ABC
-
OK, agreed on applying filters for the nulls and creating a new column for the duplicates, but my flow zone will have 3 datasets now, right? One original, another with a filter recipe on Duplicate_Flag, and a last one with the filter on null values in ABC?
So, if I got this right, you're saying to use only the original dataset to calculate ALL the metrics, but then use filter recipes to filter them out and have those sub-datasets in the flow used by the dashboard? Or is there a way to apply a filter on the original dataset in the dashboard itself?
-
Turribeach
Not really. I never said to use recipes or filters. I said you should use an Insight:
https://doc.dataiku.com/dss/latest/dashboards/concepts.html#insights
An Insight can have filters applied to it. Hence all you need is a separate Insight for each metric, and all of them can come from a single dataset.
-
Ah okay, I'll have a look at that then. Thank you so much for your help!
-
Turribeach
Under the same menu as Dashboards you have Insights. Select Insights, select the dataset, and give the Insight a name. Then click on it, edit it, and filter as desired. Finally, link to it from the metric dashboard tile as I previously advised.
-
Yeah, I think I just created one, and that looks a lot more feasible than having a weird flow in the zone. Thank you, I really appreciate it. It's my first time using Dataiku, so I kind of got lost in between.
-
Actually, one more thing: I have set up scenarios to auto-refresh the data and compute the metrics and checks, but that builds the data from scratch over my Snowflake connection. Is partitioning something that should be used here instead?
-
Turribeach
New question, new thread. Please mark this thread's answer as accepted.
-
Sure, will open another thread. Before that, continuing on what you already said about creating a flag column for duplicates: what approach do you suggest for doing that? Concatenating all columns?
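E.g. would something along these lines be reasonable? (Just a guess on my side; the dataset and column names are placeholders.)

```python
import dataiku

# Read the dataset (name is a placeholder)
df = dataiku.Dataset("my_table").get_dataframe()

# Concatenate all non-key columns into one string per row...
non_key_cols = [c for c in df.columns if c != "primary_key"]
concat_key = df[non_key_cols].astype(str).agg("|".join, axis=1)

# ...and flag every row whose concatenated value occurs more than once
df["Duplicate_Flag"] = concat_key.duplicated(keep=False)
```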
-