Question | Fraud detection

adf057 Registered Posts: 6 ✭✭✭

I am working on a use case to identify whether an expense claim is fraudulent.

I have a dataset of 200K claims from the last year, of which 10% were identified as suspicious. Out of that 10%, 7% were cleared and 3% were rejected.

The daily claim report has the following fields:

Report Id
Employee type
Expense type (taxi, visa, etc.; 60 distinct values)
Submitted date

Any thoughts on what new features can be created and how to approach this problem?

I am using random sampling by class ratio to balance the dataset.
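
For reference, random undersampling to a target class ratio can be sketched as below. This is only an illustration in plain Python, not the sampling that Dataiku performs internally; the `is_fraud` label, the `report_id` key, and the `target_ratio` parameter are assumed names, and the toy data simply uses a 3% positive rate.

```python
import random
from collections import Counter

def undersample(rows, label_key="is_fraud", target_ratio=1.0, seed=0):
    """Randomly drop majority-class rows so that
    majority count <= target_ratio * minority count.
    Assumes label 1 is the minority (fraud) class."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    keep = min(len(majority), int(len(minority) * target_ratio))
    balanced = minority + rng.sample(majority, keep)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 30 positives out of 1000 claims.
rows = [{"report_id": i, "is_fraud": 1 if i < 30 else 0} for i in range(1000)]
balanced = undersample(rows, target_ratio=2.0)
counts = Counter(r["is_fraud"] for r in balanced)
```

Rebalancing like this is normally applied to the training split only, so that the test set keeps the real class ratio and the reported metrics stay realistic.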


  • Miguel Angel
    Miguel Angel Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 118 Dataiker


    Improving the performance of the model involves considering a number of variables specific to each particular analysis, and these go beyond feature selection.

    To keep it simple, they can be very roughly divided into:
    1) The significance of the train and test data

    2) The selected features for the modelling

    3) The selected algorithm and hyperparameter search.

    If during your design process you find that the issue lies in the feature selection, you can generate more features based on interactions between the existing ones. For that, you can automatically generate linear and polynomial interactions, or provide your own formulas. You can find this option in your visual analysis under Design > Features > Features generation.
    Similarly, your existing features may require some preprocessing, which can be done in the Features handling section.
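
    To illustrate what pairwise interaction features look like, here is a minimal sketch in plain Python. It is not Dataiku's implementation; the function name `pairwise_features` and the numeric fields (`amount`, `days_since_last_claim`, `claims_this_month`) are hypothetical, since the fields listed in the question are mostly categorical.

```python
from itertools import combinations

def pairwise_features(row, numeric_keys):
    """Return a copy of row with the sum, difference and product
    of every pair of numeric features added."""
    out = dict(row)
    for a, b in combinations(numeric_keys, 2):
        out[f"{a}_plus_{b}"] = row[a] + row[b]    # linear combination
        out[f"{a}_minus_{b}"] = row[a] - row[b]   # linear combination
        out[f"{a}_times_{b}"] = row[a] * row[b]   # polynomial (degree 2)
    return out

# Hypothetical claim with three numeric features.
claim = {"amount": 120.0, "days_since_last_claim": 4, "claims_this_month": 3}
features = pairwise_features(claim, list(claim))
```

    The number of generated columns grows quadratically with the number of inputs, so it is usually worth restricting this to a handful of features that are plausibly related.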

    Ultimately, as said at the beginning, it is critical not to underestimate the effect that the chosen algorithm and the quality of the training/validation data have on the results.

  • adf057
    adf057 Registered Posts: 6 ✭✭✭

    Thank you for the response.

    My idea for new features was along the lines of: time since last claim, and the avg., min., max. and std. of the claim amount per user per expense type.

    I will initially run the first model as is, look at the feature importance, and then see whether any new features improve the performance, if needed.
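
    A rough sketch of how these features could be computed, in plain Python. The column names (`employee_id`, `expense_type`, `submitted`, `amount`) and the sample rows are assumptions for illustration. One design choice that goes beyond the description above: the statistics here use only each employee's earlier claims, so a claim's own amount (and any future claim) does not leak into its features.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

# Hypothetical sample claims.
claims = [
    {"employee_id": "e1", "expense_type": "taxi", "submitted": date(2023, 1, 2), "amount": 20.0},
    {"employee_id": "e1", "expense_type": "taxi", "submitted": date(2023, 1, 9), "amount": 30.0},
    {"employee_id": "e1", "expense_type": "visa", "submitted": date(2023, 1, 10), "amount": 100.0},
    {"employee_id": "e2", "expense_type": "taxi", "submitted": date(2023, 1, 5), "amount": 15.0},
]

def add_history_features(claims):
    """Add time-since-last-claim and past-amount statistics to each claim."""
    last_seen = {}               # employee_id -> date of the previous claim
    history = defaultdict(list)  # (employee_id, expense_type) -> past amounts
    out = []
    for c in sorted(claims, key=lambda c: c["submitted"]):
        emp = c["employee_id"]
        key = (emp, c["expense_type"])
        past = history[key]
        row = dict(c)
        row["days_since_last_claim"] = (
            (c["submitted"] - last_seen[emp]).days if emp in last_seen else None
        )
        row["past_mean"] = mean(past) if past else None
        row["past_min"] = min(past) if past else None
        row["past_max"] = max(past) if past else None
        row["past_std"] = pstdev(past) if past else None
        out.append(row)
        # Update the history only after the features for this claim are built.
        last_seen[emp] = c["submitted"]
        history[key].append(c["amount"])
    return out

feats = add_history_features(claims)
```

    Claims with no history get `None`, which the model's missing-value handling then has to deal with. A ratio such as the current amount divided by `past_mean` could be a natural follow-up feature.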

    If you have any other thoughts on how to approach the problem, that would be great as well.
