
Conundrum 1: Law and Order Modeling

Community Manager


Welcome to our first Community Conundrum! Get ready to put your deerstalkers on and dive into the world of crime! Below you will find a dataset containing crime data for March 2019 in Hampshire, UK. The data includes, among other things, the type of crime and the location where it took place.

Can you use this data to build a model that predicts if the accused received any form of punishment? 

For the sake of clarity, we count the following, and only the following, outcomes as punishment - you may spot a common word that makes filtering your data easier!

  • Offender given a caution
  • Offender given community sentence
  • Offender fined
  • Offender given conditional discharge
  • Offender sent to prison
  • Offender given suspended prison sentence
  • Offender ordered to pay compensation 
  • Offender deprived of property
  • Offender given a drugs possession warning

You can find the data attached. Now go ahead and get modeling! 

Once you have a solution, please feel free to export your project and upload it here so we can all benefit from each other's efforts! Refer to our Submission guidelines to see how to properly export your project.

Dataiker

I imported the dataset, selected the Lab, built a visual analysis, parsed the date column, and then selected the "Last Outcome Category" column to use in building the prediction model. Then, in the chart for the Random Forest model, I created a filter on "Last Outcome Category" and filtered on the word "Offender".
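As a rough pandas sketch, the same "Offender" filter can be turned into a binary flag (the file name and exact column name are assumptions and may differ from the attachment):

    import pandas as pd

    # Load the March 2019 Hampshire street-level crime file (name assumed).
    df = pd.read_csv("2019-03-hampshire-street.csv")

    # Every punishment outcome in the Conundrum's list starts with the word
    # "Offender", so a prefix test gives a binary punishment flag.
    df["Punishment"] = (
        df["Last outcome category"].fillna("").str.startswith("Offender")
    )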

Level 5

@taraku ,

I've had a look at your example.  Thanks so much for sharing.

I see that you are treating the problem as a multi-class classification problem, predicting each of the "last outcome categories".

When I read @MichaelG's description of the Conundrum:

Can you use this data to build a model that predicts if the accused received any form of punishment? 

For the sake of clarity we count the following, and only the following, outcomes as punishment - you may spot a certain common word to make filtering your data easier! 
  • Offender given a caution
  • Offender given community sentence
  • Offender fined
  • Offender given conditional discharge
  • Offender sent to prison
  • Offender given suspended prison sentence
  • Offender ordered to pay compensation 
  • Offender deprived of property
  • Offender given a drugs possession warning

To me, this looked like what's needed is a two-class model: one class where the "last outcome category" starts with the word "Offender", and one where it does not. That gives us a fairly unbalanced target:

Offender as Target.jpg
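A quick check of the class balance behind that screenshot, as a sketch assuming the dataframe from the earlier snippet:

    # Punished records ("Offender ...") are a small minority of the rows.
    flag = df["Last outcome category"].fillna("").str.startswith("Offender")
    print(flag.value_counts(normalize=True))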

@MichaelG what were you wanting us to try to predict?  All of the outcomes or just "Offenders"?

@taraku is there value in treating this as multi-class rather than two-class?

--Tom
Dataiker

Hello @tgb417 ! Thanks for the post! I think you are on to something! I agree with your analysis....it's more a question of "if" and less a question of "what". In other words, to solve this, I should predict whether or not they received "any" form of punishment rather than trying to predict "what" punishment was received. Hmmmm...this is a fun one!!!...

Level 5

@taraku 

When we take on this second "if" perspective, we end up with a fairly unbalanced target, with about 2.5% of records falling into the offender class. What are your thoughts on approaching that imbalance? I know that Dataiku has some support for sampling in model building.

Do others think that using this to get more even classes is a good idea?

If so what is the best approach?

Down Sampeling.jpg
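A minimal pandas down-sampling sketch of that idea (assuming a boolean Punishment column like the one built earlier; this is just an illustration, not the DSS sampling feature itself):

    import pandas as pd

    # Randomly down-sample the majority (no-punishment) class so the two
    # classes end up the same size, then shuffle the result.
    punished = df[df["Punishment"]]
    not_punished = df[~df["Punishment"]].sample(n=len(punished), random_state=42)
    rebalanced = pd.concat([punished, not_punished]).sample(frac=1, random_state=42)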

--Tom
Level 3

Hi Tom,

I first tried to model this conundrum using the sampling method you mentioned to rebalance the dataset. The problem is that the number of observations for Punishment=True is quite low, which led to models with poor F1-scores. I would have liked to try over-sampling methods like SMOTE, but Dataiku DSS does not offer this option, at least for now 😉

In the end I created my own weight column to use as a sample weight in my models. All the rows for which Punishment=True were given a weight of 36 and the other ones a weight of 1. 

I then trained a random forest using the accuracy metric. Usually I don't pick this one when facing an imbalanced dataset, but when you use a sample weight in Dataiku DSS, all the metrics are re-weighted accordingly, and with the weight column I defined I got a metric close to what is usually called balanced accuracy.

With some feature engineering, I reached 65% balanced accuracy (I computed the metric directly from the confusion matrix, to be more precise). There is clearly room for improvement! Any ideas? 🙂
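A rough scikit-learn sketch of the same idea, for anyone who wants to reproduce it outside the visual ML interface (the weight of 36 comes from above; the feature columns, split, and model settings are assumptions):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # Assumed numeric feature set; categorical columns would need encoding
    # first, and rows with missing values are dropped for simplicity.
    data = df.dropna(subset=["Latitude", "Longitude", "DistFromWinchester"])
    X = data[["Latitude", "Longitude", "DistFromWinchester"]]
    y = data["Punishment"]

    # Weight of 36 for the rare Punishment=True rows, 1 for the rest,
    # mirroring the weight column described above.
    w = np.where(y, 36, 1)

    X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
        X, y, w, test_size=0.2, random_state=42, stratify=y
    )

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train, sample_weight=w_train)

    # Balanced accuracy computed directly from the confusion matrix:
    # the mean of the per-class recalls.
    tn, fp, fn, tp = confusion_matrix(y_test, rf.predict(X_test)).ravel()
    print(0.5 * (tp / (tp + fn) + tn / (tn + fp)))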

Level 5

@anita-clmnt ,

Thanks for sharing your project. I had a little bit of a hard time loading the job, but once I re-uploaded the original file I could see what you are doing.

What do you make of the rows with no last outcome category? I assumed that they were not offenders, and I think that these also don't have Crime IDs. I had found that the presence or absence of a Crime ID was a good feature, but now I realize that feature was probably bogus. You took the approach of deleting these records, and I'm beginning to believe that was the better choice. I like how you used the extract command rather than the formula I used.
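A sketch of that presence/absence feature in pandas (the "Crime ID" column name is assumed from the street-level data):

    # Hypothetical presence/absence flag for the Crime ID column.
    df["HasCrimeID"] = df["Crime ID"].notna()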

How did you decide on your various locations, Winchester, Winchester2...? Why were you taking the distance from those specific geographic points? I've been playing around with clustering; however, I have not figured out a way to get a distance measure from the centers of the offender hot spots. Thoughts?
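One rough sketch of that hot-spot idea (untested against this project; the Latitude/Longitude column names and the choice of five clusters are assumptions):

    from sklearn.cluster import KMeans

    # Cluster only the punished ("offender") records to find hot spots, then
    # measure every record's distance to its nearest hot-spot center.
    coords = df[["Latitude", "Longitude"]].dropna()
    hot_spots = coords[df.loc[coords.index, "Punishment"]]

    km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(hot_spots)

    # transform() returns distances to each center (in degrees, a crude
    # proxy); the minimum is the distance to the nearest hot spot.
    df.loc[coords.index, "DistFromHotSpot"] = km.transform(coords).min(axis=1)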

 

--Tom
Level 3

I finally decided to create a Python recipe to implement SMOTENC (SMOTE for both continuous and categorical variables, an over-sampling technique). I had to divide the dataset into train and test beforehand to avoid having 'synthetic' records in my test set, which would have led to overestimated performance metrics. I got a balanced accuracy of 0.71 this time, which is a great improvement compared to last time.
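A minimal sketch of that step with imbalanced-learn (the dataframe name and the list of categorical columns are assumptions that need to match the prepared dataset):

    from imblearn.over_sampling import SMOTENC
    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["Punishment"])
    y = df["Punishment"]

    # Split first so that synthetic rows never leak into the test set,
    # which would inflate the performance metrics.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Positions of the categorical columns in X (assumed; adjust to your schema).
    cat_idx = [X.columns.get_loc(c) for c in ["Crime type", "LSOA name"]]

    smote = SMOTENC(categorical_features=cat_idx, random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)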

I indeed preferred to remove the rows with no last outcome category, to avoid introducing any kind of bias by trying to impute the missing values.

For the 'DistFromWinchester' variable, I first created geo points using the latitude and longitude variables. Then I computed the distance using the coordinates of the center of the city (latitude=51.063202, longitude=-1.308 according to Google). Both processors are in Geography and are super useful!

'DistFromWinchester2' is just the square of 'DistFromWinchester'.
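If you would rather compute the same feature in a Python recipe instead of the Geography processors, a haversine sketch (the Winchester coordinates come from above; the column names are assumed):

    import numpy as np

    WIN_LAT, WIN_LON = 51.063202, -1.308  # Winchester city center, per above

    def haversine_km(lat, lon, lat0, lon0):
        # Great-circle distance in kilometers.
        lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
        a = (np.sin((lat - lat0) / 2) ** 2
             + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2)
        return 2 * 6371.0 * np.arcsin(np.sqrt(a))

    df["DistFromWinchester"] = haversine_km(
        df["Latitude"], df["Longitude"], WIN_LAT, WIN_LON
    )
    df["DistFromWinchester2"] = df["DistFromWinchester"] ** 2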

Tom, did I answer your questions? And do you have other ideas?

-Anita

Level 5

@anita-clmnt ,

Thanks for uploading your most recent project.

I'm learning a bit more about importing projects. Because you have gone to some Python code, I had to map your Python code environment to one of my code environments. We will see how this works out.

I'm also learning a bit more about how you are handling the weighting. Was there any science to the selection of the values 10, 36, and 40 in the first model and 8 in the second model? Or did that come out of the "Art" in Data Science?

When I try to look at the partial dependencies on your first model, in Quick modeling of Punishment on 2019_03_hampshshire_street_prepared, I end up with some error messages. Are you seeing the same?

Errors when producing Partial Dependencies.jpg

The top "external code" ends in the following line: 

"/Applications/DataScienceStudio.app/Contents/Resources/kit/python/dataiku/doctor/posttraining/partial_depency.py", line 293, in _predict_and_get_pd_value    weights = self.sample_weights[pred.index]IndexError: index 12481 is out of bounds for axis 1 with size 10000

If you are not seeing this, I'll open a support ticket; it may be something going on here. For other projects I'm not seeing this problem, so it may have something to do with my re-mapping of the Python environment when opening your project file.

Thanks for sharing your use of the imbalanced-learn library and what you did with SMOTENC. Well done.

 

--Tom
Level 5

@anita-clmnt 

I've tried a few more quick items with the data in the file we were provided and have not produced any significantly better results.

At one point I did try to add some resident age-related information, bucketed into 5-year age groups.

https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datas...

The data joins on the LSOA Code in our current file to the Area Codes in this new file. However, I'm not clear that this made a big difference; also, when I asked, this turned out to be outside the scope of the Conundrums.
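The join itself is just a left merge on the area code; a sketch with an assumed ONS file name and column names:

    import pandas as pd

    # Hypothetical ONS population-by-age extract, one row per LSOA (file and
    # column names assumed, not taken from the actual ONS download).
    ages = pd.read_csv("ons_population_by_lsoa.csv")

    enriched = df.merge(
        ages,
        left_on="LSOA code",
        right_on="Area Codes",
        how="left",
    )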

I also looked at using the LSOA name without the code. (I had not done a lot of the distance calculations you did.) It seemed that certain jurisdictions had different approaches.

LSO name wo code.jpg

--Tom