
Conundrum 1: Law and Order Modeling

Community Manager


Welcome to our first Community Conundrum! Get ready to put your deerstalkers on and dive into the world of crime! Below you will find a dataset containing crime data for March 2019 in Hampshire, UK. The data includes the type of crime, the location where it took place, and more.

Can you use this data to build a model that predicts if the accused received any form of punishment? 

For the sake of clarity we count the following, and only the following, outcomes as punishment - you may spot a certain common word to make filtering your data easier! 

  • Offender given a caution
  • Offender given community sentence
  • Offender fined
  • Offender given conditional discharge
  • Offender sent to prison
  • Offender given suspended prison sentence
  • Offender ordered to pay compensation 
  • Offender deprived of property
  • Offender given a drugs possession warning

You can find the data attached. Now go ahead and get modeling! 

Once you have a solution, please feel free to export your project and upload it here so we can all benefit from each other's efforts! Refer to our Submission guidelines to see how to properly export your project.

11 Replies
Dataiker

I imported the dataset, selected the Lab, built a visual analysis, parsed the date column, then selected the "Last Outcome Category" column to use in building the prediction model. Then in the chart for Random Forest, I created a filter on "Last Outcome Category" and selected to filter on the word "Offender".

Level 6

@taraku ,

I've had a look at your example.  Thanks so much for sharing.

I see that you are treating the problem as a multi-class classification problem, predicting each of the "last outcome categories".

Here is @MichaelG's description of the Conundrum:

Can you use this data to build a model that predicts if the accused received any form of punishment? 

For the sake of clarity we count the following, and only the following, outcomes as punishment - you may spot a certain common word to make filtering your data easier! 

  • Offender given a caution
  • Offender given community sentence
  • Offender fined
  • Offender given conditional discharge
  • Offender sent to prison
  • Offender given suspended prison sentence
  • Offender ordered to pay compensation 
  • Offender deprived of property
  • Offender given a drugs possession warning

To me, this looked like a two-class model is needed: one class where the "last outcome category" starts with the word "Offender", and one where it does not. That gives us a fairly unbalanced target:

Offender as Target.jpg
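
As a rough illustration of the two-class framing, the target can be derived by checking whether the outcome text starts with "Offender". This is only a sketch in plain Python: the sample outcome values are illustrative, the function name is made up, and it assumes every category beginning with "Offender" in the data is on the punishment list (a stricter version would test membership in the explicit list instead).

```python
PUNISHMENT_PREFIX = "Offender"

def is_punished(last_outcome):
    """Return True when the outcome text starts with the word 'Offender'.

    Missing outcomes (None) are treated as 'no punishment' here, which is an
    assumption; dropping those rows is another option discussed in the thread.
    """
    return isinstance(last_outcome, str) and last_outcome.strip().startswith(PUNISHMENT_PREFIX)

# Illustrative values for the "Last outcome category" column.
outcomes = [
    "Offender fined",
    "Investigation complete; no suspect identified",
    "Offender given a caution",
    "Unable to prosecute suspect",
    None,
]

targets = [is_punished(outcome) for outcome in outcomes]
positive_rate = sum(targets) / len(targets)
print(targets, positive_rate)
```

In DSS itself the same check would more likely be written as a formula in a Prepare recipe.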

@MichaelG what were you wanting us to try to predict?  All of the outcomes or just "Offenders"?

@taraku is there value in treating this as a multi-class rather than a two-class problem?

--Tom
Dataiker

Hello @tgb417 ! Thanks for the post! I think you are on to something! I agree with your analysis....it's more a question of "if" and less a question of "what". In other words, to solve this, I should predict whether or not they received "any" form of punishment rather than trying to predict "what" punishment was received. Hmmmm...this is a fun one!!!...

Level 6

@taraku 

When we take this second "if" perspective, we end up with a fairly unbalanced target: only 2.5% of records fall into the offender class. What are your thoughts on approaching that imbalance? I know that Dataiku has some support for sampling in model building.

Do others think that using this to get more even classes is a good idea?

If so what is the best approach?

Down Sampeling.jpg

--Tom
Level 3

Hi Tom,

I first tried to model this conundrum using the sampling method you mentioned to rebalance the dataset. The problem is that the number of observations with Punishment=True is quite low, which led to models with poor F1 scores. I would have liked to try over-sampling methods like SMOTE, but Dataiku DSS does not offer this option, at least for now 😉

In the end I created my own weight column to use as a sample weight in my models. All the rows for which Punishment=True were given a weight of 36 and the other ones a weight of 1. 

I then trained a random forest using the accuracy metric. Usually I don't pick this one when facing an imbalanced dataset, but when you use a sample weight in Dataiku DSS, all the metrics are re-weighted accordingly, and with the weight column I defined I got a metric close to what is usually called balanced accuracy.

With some feature engineering, I reached a 65% balanced accuracy (to be more precise, I computed the metric directly from the confusion matrix). There is clearly room for improvement! Any ideas? 🙂
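
To make the link between weighted accuracy and balanced accuracy concrete, here is a small sketch in plain Python. The confusion-matrix counts are made up, and the function names are only for illustration. When each positive row is weighted by the ratio of negatives to positives, the weighted accuracy equals the balanced accuracy; with a weight that is only close to that ratio, the two are close but not identical.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the recall on the positive class and the recall on the negative class."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2

def weighted_accuracy(tp, fn, tn, fp, pos_weight):
    """Accuracy where each positive row counts pos_weight times and each negative row once."""
    correct = pos_weight * tp + tn
    total = pos_weight * (tp + fn) + (tn + fp)
    return correct / total

# Made-up confusion matrix: 2 positive rows and 8 negative rows.
tp, fn, tn, fp = 1, 1, 6, 2
pos_weight = (tn + fp) / (tp + fn)  # negatives / positives = 4.0

bal_acc = balanced_accuracy(tp, fn, tn, fp)
w_acc = weighted_accuracy(tp, fn, tn, fp, pos_weight)
print(bal_acc, w_acc)
```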

Level 6

@anita-clmnt ,

Thanks for sharing your project.  I had just a little bit of a hard time loading the job.  But once I re-uploaded the original file I can see what you are doing.

What do you make of the rows with no last outcome category? I assumed that they were not offenders, and I think that these also don't have Crime IDs. I had found that the presence or lack of a Crime ID was a good feature. Now I realize that the feature was probably bogus. You took the approach of deleting these records, and I'm beginning to believe that you did the better thing. I like how you used the extract command rather than a formula, as I did.

How did you decide on your various locations (Winchester, Winchester2, ...)? Why were you taking the distance from those specific geographic points? I've been playing around with clustering. However, I have not figured out a way to get a distance measure from the center of the offender hot spots. Thoughts?

 

--Tom
Level 3

I finally decided to create a Python recipe to implement SMOTENC (an over-sampling technique, SMOTE for both continuous and categorical variables). I had to split the dataset into train and test sets beforehand to avoid having 'synthetic' records in my test set, which would have led to overestimated performance metrics. I got a balanced accuracy of 0.71 this time, which is a great improvement over last time.
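
The order of operations matters more than the particular over-sampler, so here is a minimal sketch of "split first, then rebalance only the training set". To keep it dependency-free it swaps in plain random duplication of minority rows as a stand-in; it is not SMOTENC, which generates synthetic rows and is available in the imbalanced-learn library. The row layout and function name are hypothetical.

```python
import random

def split_then_oversample(rows, target_key, test_fraction=0.25, seed=0):
    """Hold out a test set first, then balance only the training rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    minority = [r for r in train if r[target_key]]
    majority = [r for r in train if not r[target_key]]
    # Duplicate minority rows at random until both classes have the same size.
    n_extra = max(len(majority) - len(minority), 0) if minority else 0
    extra = [rng.choice(minority) for _ in range(n_extra)]
    balanced_train = train + extra
    rng.shuffle(balanced_train)
    return balanced_train, test

# Hypothetical data: 100 rows, 10% of them with punished=True.
rows = [{"id": i, "punished": i % 10 == 0} for i in range(100)]
balanced_train, test_rows = split_then_oversample(rows, "punished")

train_ids = {r["id"] for r in balanced_train}
test_ids = {r["id"] for r in test_rows}
print(len(balanced_train), len(test_rows), train_ids & test_ids)
```

Because the test rows are set aside before any resampling, no duplicated (or, with SMOTENC, synthetic) record can leak into the evaluation.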

I indeed preferred to remove the rows with no last outcome category to avoid introducing any kind of bias trying to impute the missing values.

For the 'DistFromWinchester' variable, I first created geo points using the latitude and longitude variables. Then I computed the distance using the coordinates of the center of the city (latitude=51.063202, longitude=-1.308 according to Google). Both processors are in Geography and are super useful!

'DistFromWinchester2' is just the square of 'DistFromWinchester'.
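
For anyone wanting to reproduce these two features outside the Prepare recipe, here is a rough equivalent in plain Python. It uses the haversine great-circle formula, which is an assumption: the DSS geo processors may compute distance differently or in another unit. The example crime location (roughly central Southampton) is hypothetical.

```python
import math

# City-centre coordinates from the post, written with decimal points.
WINCHESTER_LAT, WINCHESTER_LON = 51.063202, -1.308
EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def winchester_features(lat, lon):
    """Distance to the centre of Winchester and its square, as two features."""
    dist = haversine_km(lat, lon, WINCHESTER_LAT, WINCHESTER_LON)
    return {"DistFromWinchester": dist, "DistFromWinchester2": dist ** 2}

# Hypothetical crime location, roughly central Southampton.
features = winchester_features(50.9097, -1.4044)
print(features)
```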

Tom did I answer your questions? And do you have other ideas?

-Anita

Level 6

@anita-clmnt ,

Thanks for uploading your most recent project.

I'm learning a bit more about importing projects.  Because you have gone to some python code I had to map your python code environment to one of my Code environments.  We will see how this works out.

I'm also learning a bit more about how you are handling the weighting. Was there any science to the selection of the values 10, 36, and 40 in the first model and 8 in the second model? Or did that come out of the "Art" in Data Science?

When I try to look at the partial dependencies on your first model, in Quick modeling of Punishment found on 2019_03_hampshshire_street_prepared, I end up with some error messages. Are you seeing the same?

Errors when producing Partial Dependencies.jpg

The top "external code" ends in the following lines:

"/Applications/DataScienceStudio.app/Contents/Resources/kit/python/dataiku/doctor/posttraining/partial_depency.py", line 293, in _predict_and_get_pd_value
    weights = self.sample_weights[pred.index]
IndexError: index 12481 is out of bounds for axis 1 with size 10000

If you are not seeing this, I'll open a support ticket. It may be something going on on my side. However, I'm not seeing this problem in other projects. It may have something to do with my re-mapping of the Python environment to open your project file.

Thanks for sharing how you used the imbalanced-learn library for SMOTENC. Well done.

 

--Tom
Level 6

@anita-clmnt 

I've tried a few more quick items with the data in the file we were provided and have not produced any significantly better results.

At one point I did try to add some resident age-related information and buckets in 5 year age groups.

https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datas...

The data joins on LSOA Code in our current file to Area Codes in this new file. However, I'm not sure this made a big difference. Also, when I asked, I was told this was outside the scope of the Conundrums.
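
A join like this can be sketched as a simple left join keyed on the LSOA code. The snippet below is plain Python with made-up rows; the key column names follow the description above, while the age-band column names and the LSOA codes are hypothetical. Crimes without a matching area keep their row and get None for the added columns.

```python
def left_join_on_lsoa(crimes, population, crime_key="LSOA code", pop_key="Area Codes"):
    """Attach population columns to each crime row, matching on the LSOA code."""
    lookup = {row[pop_key]: row for row in population}
    pop_columns = [c for c in population[0] if c != pop_key] if population else []
    joined = []
    for crime in crimes:
        match = lookup.get(crime[crime_key], {})
        extra = {column: match.get(column) for column in pop_columns}
        joined.append({**crime, **extra})
    return joined

# Hypothetical rows for illustration only.
crimes = [
    {"Crime type": "Burglary", "LSOA code": "E01000001"},
    {"Crime type": "Shoplifting", "LSOA code": "E01000002"},
]
population = [
    {"Area Codes": "E01000001", "age_15_19": 120, "age_20_24": 95},
]

joined = left_join_on_lsoa(crimes, population)
print(joined)
```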

I also looked at using the LSO name without code.  (I had not done a lot of the distance calculations you did.)  It seemed that certain jurisdictions had different approaches.

LSO name wo code.jpg

--Tom
Level 3

@tgb417 ,

For the different weights, I tried to rebalance the data each time. So, for my last model with some over-sampling for example, I ended up with 4000 records for no punishment (I undersampled this class a bit in my Python recipe) and 500 records for punishment in my training set. That's why I gave a weight of 8 to the punishment class (4000/500).
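
The arithmetic behind that weight can be written out as a small sketch in plain Python. The class counts are the ones quoted above; the row layout, column name and function names are hypothetical.

```python
def class_weight_ratio(n_majority, n_minority):
    """Weight for each minority row so both classes carry the same total weight."""
    return n_majority / n_minority

def add_weight_column(rows, target_key, positive_weight):
    """Return copies of the rows with a 'weight' column: positive_weight or 1.0."""
    return [dict(row, weight=positive_weight if row[target_key] else 1.0) for row in rows]

# Training-set counts from the post: 4000 "no punishment" rows, 500 "punishment" rows.
punishment_weight = class_weight_ratio(4000, 500)

# Hypothetical rows to show the resulting weight column.
rows = [{"Punishment": True}, {"Punishment": False}]
weighted_rows = add_weight_column(rows, "Punishment", punishment_weight)
print(punishment_weight, weighted_rows)
```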

I indeed got the same error when I tried to compute partial dependences.  It seems to be linked to the fact that I used sample weights but I think it is a bug.

Even if outside the conundrum scope, your idea of using other datasets was interesting and we could probably end up, with a bit of work, with some smart feature engineering!

--Anita

Dataiker

Hi @MichaelG, I found some time to spend on this!

At a high level what's being done is:

- Create target column with formula.

- Remove irrelevant columns.

- Create geopoints for the cities in Hampshire with over 40K inhabitants and calculate the distances to them from the event geopoint.

- Prepare and replace values on several columns, a little housekeeping and cleaning up of values since otherwise, the cardinality will be too high for the amount of data we have.
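
One common way to bring cardinality down, shown here only as an illustration and not necessarily what was done in this project, is to fold rarely seen values into a single "Other" bucket. The threshold, label and sample values below are assumptions.

```python
from collections import Counter

def group_rare_values(values, min_count=2, other_label="Other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other_label for value in values]

# Hypothetical location values for illustration.
locations = [
    "On or near Parking Area",
    "On or near Supermarket",
    "On or near Parking Area",
    "On or near High Street",
    "On or near Supermarket",
]

grouped = group_rare_values(locations)
print(len(set(locations)), "->", len(set(grouped)))
```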

Simple models with some class rebalancing and that's it!

PS: As others mentioned, there could be scope to include demographic data, wider conviction rates for the crimes committed, etc. This is likely overkill for a single month of crime data from one county and probably won't boost performance by much.