# Including a variable indirectly influenced by the prediction.

I am using Dataiku to predict which of our job sites will have injuries. It has been very successful in lowering our Injury Rate and improving our safety performance. One way it has done this is by requiring bi-weekly management audits on jobs flagged as high risk for safety defects. We have data showing that high-risk jobs with audits are less likely to have injuries.

Now the critical matter. Audits are a lot of work, and management wants to improve the model so they can do fewer audits. Management correctly argues that if a job is high risk and they perform an audit, some of the risk has been mitigated and the chance of an injury should decrease.

I added a flag for whether a job has had an audit and re-trained the model. Of course, the model flagged jobs with an audit as having a higher risk because... we require audits on jobs that the model already flagged as high risk for an injury.

Can you think of any way to mitigate that bias?

(If you are curious about my model, I did a user group presentation on it, you can watch here: https://community.dataiku.com/t5/Online-Events/Defect-Detection-Watch-on-Demand/ba-p/5687 )

11 Replies

This is a really cool use of a model.  And a great example of leaking in information from unintended data sources.   And a great example of where the "art" of model design can be hard.

I think that the issue is that you are not randomizing your deployment of audits because of the cost of audits.

I do not have a good answer.

That said, I'm wondering about implementing more random audits (maybe in the form of short-form audits or self-audits), then building a feature based on the successful completion of these short-form self-audits. This could be a few Likert scales about how dangerous the activity is perceived to be and how well the team is managing the danger, and maybe a text field asking for comments (2-3 minutes max, randomly deployed over time).

Because those quick-form surveys would be deployed to all projects, they would not be tied to the most hazardous activities, so less unexpected information would leak into the model. This would also increase general safety awareness, and it might show signs of slippage in safety in an area, particularly if scores change from one sample to the next.

Those are my \$0.02.  Love to hear what you end up choosing to do.

--Tom
Author

In the immediate future, I've come up with a work around. It isn't perfect, but I think it makes sense.

We had 812 jobs labeled High Risk with no audits (or no audits before an event took place) since 2018, 543 of those jobs had an event (67%).
We had 171 jobs labeled High Risk with audits since 2018, 102 of those jobs had an event (60%).

That is a reduction of about 7 percentage points.
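
For reference, the arithmetic behind those figures can be reproduced with a short Python sketch. Only the four counts come from the post; the variable names are illustrative. The sketch also separates the absolute difference (in percentage points) from the relative reduction, since the two are easy to conflate.

```python
# Counts quoted in the post above (high-risk jobs since 2018).
no_audit_jobs, no_audit_events = 812, 543
audit_jobs, audit_events = 171, 102

rate_no_audit = no_audit_events / no_audit_jobs  # share of un-audited jobs with an event
rate_audit = audit_events / audit_jobs           # share of audited jobs with an event

# Absolute difference, expressed in percentage points.
abs_diff_pts = (rate_no_audit - rate_audit) * 100
# Relative reduction, as a fraction of the un-audited rate.
rel_reduction = (rate_no_audit - rate_audit) / rate_no_audit

print(f"un-audited: {rate_no_audit:.1%}, audited: {rate_audit:.1%}")
print(f"absolute difference: {abs_diff_pts:.1f} points")
print(f"relative reduction: {rel_reduction:.1%}")
```

On these counts the gap is about 7.2 percentage points, which corresponds to a relative reduction of roughly 11%. This is a raw comparison between two groups and does not by itself show that the audits caused the difference.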

Would it be reasonable to simply subtract 0.07 from the proba_1 field when an audit is completed?

Hmmm.

I'm no statistician, but subtracting a flat 7% seems a bit... how to say this... suspect? (It might be valid; I just don't really know.)

I'm wondering if some of the data scientists working at Dataiku might be able to help out by jumping on this thread.

cc: @CoreyS

--Tom
Dataiker Alumni

Hi @AaronCrouch and thanks for the heads up @tgb417. I will have someone contact you directly about your question. Thanks again!

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

If you feel comfortable sharing the results of the conversation with the Dataiku staff, I'd love to hear what conclusions you have drawn about approaching this interesting problem.

--Tom
Author

@tgb417 - I certainly will keep you updated.

@CoreyS - when should I be hearing from them?

Dataiker Alumni

Hi @AaronCrouch, I'm speaking with the Dataiker now; they should be reaching out to you shortly.

Dataiker

Hi @AaronCrouch ,

Love the use case! I have a few thoughts on how you could factor in the audits. As @tgb417 mentioned, there is a bit of art to this.

Here are some solutions:
• Use two different probability thresholds for classifying what counts as high risk - a lower one for non-audited jobs and a higher one for audited jobs.
• Use sample weights to assign higher importance to jobs that have been audited when building your model.
• Treat jobs that have been audited as categorically different and build a second model for them.
Based on the conversation, I think option #1 is the closest match to what you are looking for.
Author

@AndrewS, thanks for your response. It looks like option 1 is similar to an idea I had above. I'm just not sure what thresholds to use. Currently, I'm using a 35% threshold suggested by the system. Let me throw some numbers at you and see if you have any thoughts about how to adjust:

We had 812 jobs labeled High Risk with no audits (or no audits before an event took place) since 2018, 543 of those jobs had an event (66.9%).
We had 171 jobs labeled High Risk with audits since 2018, 102 of those jobs had an event (59.6%).

Taking the sum of proba_1 on high risk jobs with no audit (or no audit prior to a defect), we would predict 482.75 jobs with at least one defect; we had 543 high risk jobs with no audit with a defect (112%).
Taking the sum of proba_1 on high risk jobs WITH an audit, we would predict 99.9 jobs with at least one defect; we had 102 jobs with a defect (102%).
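
The comparison above is an observed-to-expected ratio per group, where the expected count is the sum of proba_1 over the jobs in that group. A minimal Python sketch of that calculation, using only the totals quoted in this post (the dictionary layout and function name are illustrative):

```python
# Totals quoted in the post above; "expected" is the sum of proba_1 for the group.
groups = {
    "no audit": {"expected": 482.75, "observed": 543},
    "audit": {"expected": 99.9, "observed": 102},
}

def observed_to_expected(observed, expected):
    """A ratio above 1 means the model under-predicted events for that group."""
    return observed / expected

ratios = {
    name: observed_to_expected(g["observed"], g["expected"])
    for name, g in groups.items()
}

for name, ratio in ratios.items():
    print(f"{name}: observed/expected = {ratio:.0%}")
```

On these totals the model under-predicts events for un-audited high-risk jobs by about 12% and is close to calibrated for audited ones.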

How would one implement suggestion 1 in DSS?  Would one create a Partitioned Model, one with audits and one without?  Or is there some other way to achieve what you are suggesting?

--Tom
Dataiker

I'd suggest using a 35% threshold for the un-audited jobs. Since the audited jobs are 7.3 percentage points less likely to have an event, I'd use a threshold of at least 42% for those jobs. This is absolutely something you can move upwards if you feel that the positive effect of audit is still not being fully captured.

The simplest way to implement solution number one would be to have your scoring recipe output the probabilities. To do this, select the 'Output probabilities' option in the scoring recipe.

You can then use a prepare recipe to create a new column that has the "Post-Audit" prediction. A simple formula like this would create that column:

if(proba_1 > 0.42, 1, 0)

You will then be able to access both the 'Pre-Audit' prediction and the 'Post-Audit' prediction.
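
For anyone who prefers to do this step in code rather than in a prepare recipe, the same two-threshold logic can be sketched in plain Python. This is not Dataiku API code: the row layout, the `audited` flag, and the function name are assumptions made for illustration, and only the 0.35 and 0.42 thresholds and the proba_1 column come from this thread.

```python
PRE_AUDIT_THRESHOLD = 0.35   # threshold quoted in the thread for un-audited jobs
POST_AUDIT_THRESHOLD = 0.42  # higher threshold suggested for audited jobs

def add_predictions(rows):
    """Return copies of the rows with pre-audit, post-audit, and effective flags."""
    scored = []
    for row in rows:
        p = row["proba_1"]
        pre = int(p > PRE_AUDIT_THRESHOLD)
        post = int(p > POST_AUDIT_THRESHOLD)
        scored.append({
            **row,
            "pre_audit_prediction": pre,
            "post_audit_prediction": post,
            # Hypothetical convenience column: use the threshold matching audit status.
            "effective_prediction": post if row["audited"] else pre,
        })
    return scored

# Hypothetical scored jobs, for illustration only.
jobs = [
    {"job_id": "A", "proba_1": 0.30, "audited": False},
    {"job_id": "B", "proba_1": 0.40, "audited": True},
    {"job_id": "C", "proba_1": 0.40, "audited": False},
    {"job_id": "D", "proba_1": 0.55, "audited": True},
]

result = add_predictions(jobs)
for r in result:
    print(r["job_id"], r["pre_audit_prediction"],
          r["post_audit_prediction"], r["effective_prediction"])
```

Jobs B and C have the same probability, but only the un-audited one stays flagged as high risk, which is the behavior the two thresholds are meant to produce.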