
Question | Fraud detection

adf057
Level 2

Working on a use case to identify if an expense claim is a fraud or not.

I have a dataset of 200K claims from the last year, of which 10% were identified as suspicious. Out of those 10%, 7% were cleared and 3% were rejected.

Dataset:
Daily claim report, with fields -

Report Id
User
Employee type
City
Expense type (taxi, visa, etc.; 60 distinct values)
Currency
Amount
Submitted date
HasReceipt
IsFraud


Any thoughts on what new features can be created and how to approach this problem?

I am using random sampling by class ratio to balance the dataset.
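For reference, a minimal sketch of one way to rebalance by class ratio: keep every positive row and randomly undersample the negatives. The function name `undersample`, the `ratio` parameter, and the synthetic rows are assumptions made for illustration, not part of any Dataiku API.

```python
import random

def undersample(rows, label="IsFraud", ratio=1.0, seed=0):
    """Keep all positive rows and at most ratio * n_positive random negatives."""
    pos = [r for r in rows if r[label]]
    neg = [r for r in rows if not r[label]]
    rng = random.Random(seed)                 # fixed seed for reproducibility
    k = min(len(neg), int(ratio * len(pos)))  # number of negatives to keep
    sample = pos + rng.sample(neg, k)
    rng.shuffle(sample)                       # avoid all positives coming first
    return sample

# Synthetic data: 1000 rows, 3% labelled as fraud (hypothetical values)
rows = [{"id": i, "IsFraud": i < 30} for i in range(1000)]
balanced = undersample(rows, ratio=2.0)       # two negatives per positive
```

Rebalancing like this is usually applied to the training split only, so that the test metrics still reflect the real class ratio.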

2 Replies
MiguelangelC
Dataiker

Hi,

Improving the performance of the model involves a number of variables that are specific to each analysis and go beyond feature selection.

To keep it simple, they can be very roughly divided into:
1) The significance of the train and test data

2) The selected features for the modelling

3) The selected algorithm and hyperparameter search.

If during your design process you find that the issue lies in feature selection, you can generate more features from interactions between the existing ones. For that, you can automatically generate linear and polynomial interactions. You can also provide your own formulas. You can find this option by going to your visual analysis and from there to Design > Features > Features generation.
Similarly, your existing features may require some preprocessing, which can be done in the Features handling section.
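To make the idea of interaction features concrete, here is a minimal sketch of what a pairwise interaction computes: the product of each pair of numeric columns. It is not how Dataiku implements the option. The helper name and the `DaysSinceLastClaim` column are hypothetical.

```python
from itertools import combinations

def add_pairwise_interactions(row, numeric_cols):
    """Return a copy of row with one product feature per pair of numeric columns."""
    out = dict(row)
    for a, b in combinations(numeric_cols, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out

# Hypothetical claim; HasReceipt is encoded as 0/1
claim = {"Amount": 120.0, "HasReceipt": 0, "DaysSinceLastClaim": 3}
enriched = add_pairwise_interactions(
    claim, ["Amount", "HasReceipt", "DaysSinceLastClaim"]
)
```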

Ultimately, as said at the beginning, it is critical not to underestimate the effect that the chosen algorithm and the quality of the training/validation data have on the results.

 

adf057
Level 2
Author

Thank you for the response.

The new features I had in mind were things like the time since the last claim, and the avg., min., max. and std. of the claim amount by user and expense type.

I will initially run the first model as is, look at the feature importances, and then see whether any new features improve the performance, if needed.
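A minimal sketch of these features in plain Python, assuming each claim is a dict with hypothetical keys `user`, `expense_type`, `amount` and `submitted`. One design choice here is an assumption rather than something stated above: the statistics for a claim use only that user's earlier claims, so that the claim being scored does not leak into its own features.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

def add_history_features(claims):
    """Return copies of the claims, in date order, with history-based features."""
    last_date = {}               # user -> date of that user's previous claim
    amounts = defaultdict(list)  # (user, expense_type) -> earlier amounts
    out = []
    for c in sorted(claims, key=lambda c: c["submitted"]):
        prev = amounts[(c["user"], c["expense_type"])]
        row = dict(c)
        row["days_since_last_claim"] = (
            (c["submitted"] - last_date[c["user"]]).days
            if c["user"] in last_date else None
        )
        row["prev_count"] = len(prev)
        row["prev_mean"] = mean(prev) if prev else None
        row["prev_min"] = min(prev) if prev else None
        row["prev_max"] = max(prev) if prev else None
        # population std; needs at least two earlier claims
        row["prev_std"] = pstdev(prev) if len(prev) > 1 else None
        out.append(row)
        # update the history only after the features have been computed
        last_date[c["user"]] = c["submitted"]
        prev.append(c["amount"])
    return out

# Hypothetical sample data
claims = [
    {"user": "u1", "expense_type": "taxi", "amount": 20.0, "submitted": date(2023, 1, 1)},
    {"user": "u1", "expense_type": "taxi", "amount": 40.0, "submitted": date(2023, 1, 4)},
    {"user": "u1", "expense_type": "taxi", "amount": 90.0, "submitted": date(2023, 1, 10)},
    {"user": "u2", "expense_type": "visa", "amount": 150.0, "submitted": date(2023, 1, 5)},
]
features = add_history_features(claims)
```

On 200K rows the same logic would more likely be written as grouped window operations in a data preparation step, but the ordering constraint stays the same.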

 

If you have any other thoughts on how to approach the problem, that would be great as well.
