List all failed rows based on quality rule
Hi all,
I would like to create a dataset from the rows that fail my data quality rules. I want to list (create a table of) all the failed rows so that I can send them to the team and show what needs to be changed. Knowing only whether a rule failed is not enough; I need to collect every row where a rule fails. I did not find an option for this. I tried adding filters and Prepare recipes, but did not find a way.
Operating system used: Windows
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,073 Neuron
I don’t think there is a better way. If you want the rows that breach the rules, then this is what you need to do. Personally, I would set up all the column rules to return 0 when the rule passes and 1 when it fails. Then you can create simple data quality rules that fail when the value is not zero, and filter for any rows where those columns are > 0. Since the logic that returns 0 or 1 lives in the column, not in the data quality rule, you are not technically recreating the rule; you are merely exposing your column-level rule as a Dataiku data quality rule so that it integrates with the existing functionality and failures are shown there.
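To make the 0/1 idea concrete, here is a minimal sketch in plain Python. It does not use the Dataiku API; the rule names, column names and sample rows are hypothetical, and in DSS the same logic would more likely live in a Prepare recipe formula or a Python recipe. The point is that each rule is written once as a column flag, and the dataset-level check only looks at whether that flag is ever non-zero.

```python
# Hypothetical rules, written once: each returns 0 when the row passes, 1 when it fails.
RULES = {
    "dq_amount_negative": lambda row: 1 if row["amount"] < 0 else 0,
    "dq_email_missing": lambda row: 0 if row.get("email") else 1,
}

def add_flags(rows):
    """Return copies of the rows with one 0/1 flag column per rule."""
    return [{**row, **{name: rule(row) for name, rule in RULES.items()}} for row in rows]

def dataset_rule_status(flagged, flag):
    """Dataset-level check: fails as soon as any row has a non-zero flag."""
    return "OK" if all(r[flag] == 0 for r in flagged) else "ERROR"

def failed_rows(flagged):
    """Row-level view: keep every row where at least one flag is > 0."""
    return [r for r in flagged if any(r[name] > 0 for name in RULES)]

# Made-up sample data for illustration only.
rows = [
    {"id": 1, "amount": 10.0, "email": "a@example.com"},
    {"id": 2, "amount": -5.0, "email": "b@example.com"},
    {"id": 3, "amount": 7.5, "email": ""},
]
flagged = add_flags(rows)
statuses = {name: dataset_rule_status(flagged, name) for name in RULES}
bad = failed_rows(flagged)
```

Here the dataset-level status and the list of failing rows both come from the same flag columns, so the rule logic is not duplicated.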
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,073 Neuron
I don't think there is a built-in feature for your requirement. Data quality rules are computed at dataset level, not at row level. You can add a dataset to your flow that shows all the data quality rules via Dataset ⇒ Internal ⇒ Metrics, and review current and historic DQ rule outcomes at project or instance level. But as I said, this would be at dataset level, not at row level.
However, nothing stops you from building this functionality yourself quite easily. You will need to move your data quality rules into column values so you can identify which rows are failing them. Then use either a Filter recipe or a Split recipe to output the failing rows into a separate flow branch, so you can send them to your DQ team to fix.
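As a rough illustration of the split step, the sketch below routes rows into a "passed" and a "failed" list and records which rules each failed row broke, then renders the failed rows as CSV text to hand over. It is plain Python with the standard library only, not Dataiku code; the check names, columns and sample orders are assumptions made up for the example.

```python
import csv
import io

# Hypothetical checks; each returns True when the row passes.
CHECKS = {
    "quantity_positive": lambda row: row["quantity"] > 0,
    "country_present": lambda row: bool(row["country"]),
}

def split_rows(rows):
    """Mimic a split: passing rows go one way, failing rows the other."""
    passed, failed = [], []
    for row in rows:
        broken = [name for name, check in CHECKS.items() if not check(row)]
        if broken:
            # Keep the original columns and add the names of the broken rules.
            failed.append({**row, "failed_rules": ";".join(broken)})
        else:
            passed.append(row)
    return passed, failed

def to_csv(rows):
    """Render rows as CSV text (empty string when there is nothing to report)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Made-up sample data for illustration only.
orders = [
    {"order_id": "A1", "quantity": 3, "country": "HU"},
    {"order_id": "A2", "quantity": 0, "country": "DE"},
    {"order_id": "A3", "quantity": -1, "country": ""},
]
passed, failed = split_rows(orders)
report = to_csv(failed)
```

The extra `failed_rules` column is a design choice for the hand-off: the team sees not only which rows to fix but also why each one was rejected.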
-
So that means I need to create a quality column for each rule and then a flag column where I flag all errors. In that case I need to set up the rules twice. Is there a better option, please? Thank you.