
Conundrum 7: The Job Hunt

Community Manager
Provided is (anonymised) data about over five hundred thousand people polled in the UK 2011 census. It includes each person's sex, age, marital status and other data points.
 
Using all of this information can you build a model that predicts if the person is currently in work - either Employed or Self-Employed?
 
As you will quickly find, you should be able to build a very accurate model because the data includes the 'hours worked per week', 'economic activity', and 'industry' data points. So, to turn things up a notch, once you have built your first model, try removing those and building a new one on 'hard mode'!
 
You will need to refer to the following Key if you wish to know what the values in the data really mean! https://www.ons.gov.uk/census/2011census/2011censusdata/censusmicrodata/microdatateachingfile/variab...
 
Source: Office for National Statistics licensed under the Open Government Licence v.1.0.
I hope I helped! Do you Know that if I was Useful to you or Did something Outstanding you can Show your appreciation by giving me a KUDOS?
21 Replies
Dataiker

Delicious challenge! I'm going to try to create a model without 'industry' or 'hours worked per week'. Does this mean 'economic activity' is out as well? I could probably use the 'easy' model to see how much the following impacted the target: age, social grade, health, family composition.

Level 7

@taraku

How goes your modeling?

--Tom
Dataiker

Hello! The modeling goes well! However, so far, I can't seem to build a model that isn't 1.00 or close to 1.00. I think I'll try removing all of the obvious features like "student" and "hours worked" and "economic activity" and see what I get when using just the demographics.

Level 7

@taraku 

Have you made any further progress?

See the conversation below. We have discovered that:

  • Making sure that you drop the features suggested in the Conundrum description helps.
  • Making sure that you handle your features correctly also helps. Are they numerical or categorical features?

--Tom

 

Dataiker

Tom - thanks for the tip! I forgot to check that part!

Level 7

Thanks for the new dataset to dig into.

I note the variables in the dataset are almost all encoded as numerical values. However, according to the data dictionary, these values are not numerical in nature; they are actually categorical, including age, where the value 1 = 0-15 years (a 16-year period) and 2 = 16-24 years (a 9-year period). The standard Visual ML setup treats all of these values as if they were numeric rather than categorical. So I'm going to think about how to deal with this in the model setup, or with some type of re-coding of the variables.
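One way to approach the re-coding outside Visual ML is to map each code to a label before modelling, so nothing downstream can treat the codes as quantities. Here is a minimal sketch in plain Python. The labels for codes 1 and 2 come from the data dictionary; the remaining bands and the sample values are assumptions for illustration and should be checked against the ONS key.

```python
# Age codes from the census key. Codes 1 and 2 are quoted in this thread;
# the remaining bands are assumed here and should be checked against the key.
AGE_BANDS = {
    1: "0-15",
    2: "16-24",
    3: "25-34",
    4: "35-44",
    5: "45-54",
    6: "55-64",
    7: "65-74",
    8: "75+",
}


def recode(values, labels):
    """Replace integer codes with string labels; unknown codes become None."""
    return [labels.get(v) for v in values]


# Hypothetical sample of the encoded age column.
age_codes = [1, 2, 2, 5, 8]
age_labels = recode(age_codes, AGE_BANDS)
print(age_labels)

# The codes are evenly spaced, but the bands they stand for are not:
# code 1 spans 16 years of age while code 2 spans only 9.
width = {1: 15 - 0 + 1, 2: 24 - 16 + 1}
print(width)
```

Once the values are strings, a tool that guesses types from the data is much less likely to treat them as numbers.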

Also, the amount of data here keeps my 6-year-old laptop busy for 15 minutes when computing over all of it. I'll have to look at some better sampling techniques. Starting with the first n records seems potentially problematic.
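To illustrate why the first n records can mislead, here is a small sketch comparing them with a uniform random sample. The dataset is made up and deliberately sorted; whether the real census file is ordered is not known.

```python
import random


def first_n(rows, n):
    """Take the first n rows; biased if the file is sorted in any way."""
    return rows[:n]


def random_n(rows, n, seed=42):
    """Draw n rows uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(rows, n)


# Hypothetical dataset sorted by region: 900 rows from region A, then 100 from B.
rows = [{"id": i, "region": "A" if i < 900 else "B"} for i in range(1000)]

head = first_n(rows, 100)
sample = random_n(rows, 100)

# The first-n sample contains no rows from region B at all.
print(sum(r["region"] == "B" for r in head))
# A random draw is expected to include roughly 10% region B rows.
print(sum(r["region"] == "B" for r in sample))
```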

 

--Tom
Community Manager
Author

@taraku good point - I'll add that to the list for hard mode.

 

@tgb417 Glad you're enjoying it!
Nice spot - I thought that particular little wrinkle would provide some interesting challenges.

And yes, I wouldn't advise using the first n records - I don't know that the data is in any particular order, but it wouldn't pay to assume it isn't!

Level 7

How are others doing with the Conundrum? It seems like a dataset for which one can build a fairly high-quality model. Currently, I'm getting a ROC AUC of ~0.95.

I would enjoy hearing from others about what they are seeing and some of the approaches under exploration.

--Tom
Level 3

Wow, that's one great result, how did you achieve that? That'd be an interesting case to look at.

Level 7

@gerryleonugroho ,

In the visual analysis, Random Forest, Gradient Boosted Trees, XGBoost, and Artificial Neural Network all seem to like this data, once I realized that most of the data is categorical even though it is represented in a numeric format.

Features.jpg

With the defaults in DSS, these are all treated as numbers. In that case, I'm only getting a ROC AUC of ~0.917 to 0.928, so the defaults do not give the most useful answers. Moving to Categorical for feature handling helps.

For example, here are the results of one of my runs.

Random Forrest.jpg

That said, I'm still concerned about overfitting and information leakage. Many of the features have the -9 category: "No Code required (Students or schoolchildren living away during term-time)".

The N/As above that are showing up as important features seem like a bad idea to me because of course most of this group is not working.  What do you think?

--Tom

Level 3

Hi @tgb417 

Yeah, I had a similar problem to yours, though mine was even a bit more overwhelming. My first attempt at the flow went like this.

 

- Using the "Extra Trees" algorithm, I managed to get an R2 score of 0.977.

Screen Shot 2020-05-14 at 14.32.02.png

 

- And of course, I changed the variable types to categorical.

Screen Shot 2020-05-14 at 14.33.20.png

 

- And here's the screen capture for the algorithm part.

Screen Shot 2020-05-14 at 14.33.41.png

While here's for the second attempt:

- Got a 1.00 as R2 Score.

Screen Shot 2020-05-14 at 15.03.05.png

 

- As an optimization, I made a couple of tweaks to the column variables and turned some of them into a boolean format.

Screen Shot 2020-05-14 at 15.03.58.png

- As it turned out, this gave me a Decision Tree output that is easier to navigate and understand.

Screen Shot 2020-05-14 at 15.05.32.png

- While here's the cost matrix.

Screen Shot 2020-05-14 at 15.09.56.png

- And here's how the finalized process would take place, with "2" referring to "Employed" and "3" to "Unemployed".

Screen Shot 2020-05-14 at 15.14.30.png

 

It was fun doing this dataset, really enjoyed it! 😊

 

Cheers,

GLN

 
 

 

Level 7

@gerryleonugroho 

Have you tried the more realistic variant of the problem, where you leave out (turn off) the variables 'hours worked per week', 'economic activity', and 'industry', as suggested by @MichaelG. in the conundrum?

In re-reviewing the challenge and my notes, I’ve discovered that I have not fully met those requirements either. In the version I showed, I notice that I left ‘industry’ on. That may be the reason I’m getting such high results: I’m “leaking in” the answer to the prediction of the target variable. In my case, one is unlikely to have an industry if one is not working. Therefore, by leaving industry turned on, the work status is leaking into my model through this highly correlated data. This is not so much a prediction about work status as the model recognizing that, in this dataset, if an industry exists then the person is working. I’m going back to rework my model. This same sort of leakage occurs with the other variables recommended for exclusion.
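This kind of leakage can be checked directly by asking how well a single rule, "has an industry code, therefore is working", predicts the target. A minimal sketch on made-up rows follows; the column names and values are hypothetical, not taken from the census file.

```python
# Hypothetical rows: industry code (-9 means "no code required") and the target.
rows = [
    {"industry": 4, "working": True},
    {"industry": 7, "working": True},
    {"industry": -9, "working": False},
    {"industry": -9, "working": False},
    {"industry": 2, "working": True},
    {"industry": 5, "working": False},  # has an industry but is not working now
]

LEAKY = {"hours_worked", "economic_activity", "industry"}  # assumed column names


def rule_accuracy(rows):
    """Accuracy of the one-line rule: industry present => working."""
    hits = sum((r["industry"] != -9) == r["working"] for r in rows)
    return hits / len(rows)


def drop_leaky(row):
    """Return a copy of the row without the leaky feature columns."""
    return {k: v for k, v in row.items() if k not in LEAKY}


print(rule_accuracy(rows))
print(drop_leaky(rows[0]))
```

If a rule this simple scores nearly as well as the full model, the model is probably reading the answer off that column rather than predicting it.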

In addition, I’ve noticed that you have made only the sex variable into a categorical variable. Some of the other variables on the list also do not actually represent a numerical range. A feature like age is a range. There really are lower values of age and higher values of age, and that has meaning. I know that the feature is ordered because I can answer the question of whether one age is less than, greater than, or equal to another. But I’m not clear that I can make the same kind of less-than, greater-than, or equal-to assertion about, say, religion or ethnicity. I’d like to invite you to consider which features in our dataset are actually numerical ranges and which are actually categorical values hidden in a numerical representation.
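The distinction can be made concrete in code. An ordered feature such as an age band supports less-than comparisons, so an integer rank is a reasonable encoding, while an unordered feature such as religion is safer as one indicator per category. A sketch with hypothetical category labels:

```python
# Ordered (ordinal) feature: the position in the list is meaningful.
AGE_ORDER = ["0-15", "16-24", "25-34", "35-44"]  # illustrative subset of bands

# Unordered (nominal) feature: hypothetical labels, no ranking implied.
RELIGIONS = ["none", "christian", "buddhist", "other"]


def ordinal_encode(value, order):
    """Map a category to its rank so that comparisons stay meaningful."""
    return order.index(value)


def one_hot_encode(value, categories):
    """Map a category to indicator flags; no category is 'greater' than another."""
    return [1 if value == c else 0 for c in categories]


print(ordinal_encode("16-24", AGE_ORDER) < ordinal_encode("35-44", AGE_ORDER))
print(one_hot_encode("buddhist", RELIGIONS))
```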

Have fun. Thanks so much for sharing.

--Tom
Level 3

Hi Tom @tgb417,

Yep, that was so much fun to do, and so true about the categorical/numerical analysis, so here goes my third attempt. I left out 'hours worked per week', 'economic activity', and 'industry' and relied only on the following features: "Family Composition", "Sex", "Age", "Marital Status" and "Country of Birth".

Screen Shot 2020-05-15 at 00.18.36.png

 

- And let it run on Random Forest, as per the following:

Screen Shot 2020-05-15 at 00.18.54.png

- And sure enough, I'm seeing a significant drop in the ROC AUC score, down to 0.811.

Screen Shot 2020-05-15 at 00.16.59.png

But it's still decent enough, and I had a good time running them. Thanks for the input, Tom 😊.

 

-- Cheers

Level 7

@gerryleonugroho ,

I pulled 'industry' out of my latest model run. The scores seem unchanged, still at ~0.95 ROC AUC.

 

--Tom
Level 7

For example, after re-mapping the age categories, I was able to produce a sensible partial dependence graph that looks like the following on the models I'm working on with this data. Age is usually the top feature for me, and never lower than a top-5 feature.

Age Partial Dependency.jpg

True for the Target variable means that the person is working.
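For readers unfamiliar with partial dependence: the value plotted for each age band is the average prediction when every row's age is forced to that band and all other features are left alone. Here is a minimal sketch of that calculation. The `toy_model` function and the rows are hypothetical stand-ins, not the DSS model or the census data.

```python
def partial_dependence(model, rows, feature, values):
    """Average prediction when `feature` is forced to each value in turn."""
    result = {}
    for v in values:
        preds = [model({**row, feature: v}) for row in rows]
        result[v] = sum(preds) / len(preds)
    return result


def toy_model(row):
    """Hypothetical stand-in for a trained model: returns P(working)."""
    base = {1: 0.0, 2: 0.5, 3: 0.75}[row["age"]]
    return base * (0.5 if row["student"] == 1 else 1.0)


# Hypothetical rows; age is a band code and student is 1 = yes, 2 = no.
rows = [
    {"age": 1, "student": 1},
    {"age": 2, "student": 1},
    {"age": 3, "student": 2},
    {"age": 3, "student": 2},
]

pd_age = partial_dependence(toy_model, rows, "age", [1, 2, 3])
print(pd_age)
```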

--Tom
Level 3

@gerryleonugroho I'm surprised your performance metric was the R2 score in the beginning, as this kind of metric is for regression only (and thus not for classification). Have you checked that the prediction type was binary classification? 🙂
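For a binary target, the ROC AUC used elsewhere in this thread can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small sketch of that definition, with made-up labels and scores:

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This O(n^2) version is for illustration, not for 500,000 rows.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical predicted probabilities of "working" for six people.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(labels, scores))
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the result is 8/9. A model that gives everyone the same score gets 0.5.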

-Anita

Level 3

Hi @anita-clmnt, actually I was just following along with the rest of the thread. Aside from that, it was a rather fun, quick and dirty analysis. On top of it all, it's a fun dataset to work with, and it keeps you occupied at home during these uncertain times. I certainly hope that you enjoy it too. 😊

PS: Stay safe during this pandemic.

Cheers,

 

Level 7

@anita-clmnt ,

Based on this data dictionary, do you have any thoughts about how best to handle the -9 values in many of the columns with DSS?

"-9.   No code required (Resident of a communal establishment,
 students or schoolchildren living away during term-time, 
or a short-term resident)"

This value seems to duplicate students and it's getting a lot of weight because it is leaking into so many features.

I sort of want to do one-hot encoding for everything other than these -9 values. (At this point I've emptied out these values, but they keep showing up as high-value features.)

minus9 values.jpg

I've re-coded the -9 as NA above.

Thoughts?

Hope you are well. 

--Tom
Level 3

@tgb417,

Those -9 values are not real missing values as they actually mean something. For example, for Occupation, it represents all the "people aged under 16, people who have never worked and students or schoolchildren living away during term-time". Because of that, getting models showing those categories as important is rather reassuring as it is clear that a large proportion of this category is currently not working. Therefore it helps us discriminate between the two types of individuals.

And actually  Occupation=-9 doesn't duplicate 'Student' as the table below shows so I would keep them both!

 
 

count by occupation
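A count like this can be reproduced with a simple cross-tabulation of "occupation is -9" against "is a student". The sketch below uses made-up rows and assumes the student column is coded 1 = yes, 2 = no. The real counts come from the census file, not from this example.

```python
from collections import Counter

# Hypothetical coded rows; in this sketch student is 1 = yes, 2 = no.
rows = [
    {"occupation": -9, "student": 1},
    {"occupation": -9, "student": 2},  # e.g. aged under 16 or never worked
    {"occupation": -9, "student": 2},
    {"occupation": 4, "student": 1},   # a student who has an occupation
    {"occupation": 6, "student": 2},
]

# Count each combination of "occupation is -9" and "is a student".
table = Counter((r["occupation"] == -9, r["student"] == 1) for r in rows)

for (no_occupation, is_student), count in sorted(table.items()):
    print(no_occupation, is_student, count)
```

If the two columns were duplicates, the off-diagonal cells (students with an occupation, non-students without one) would be empty.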

 

One-hot encoding is a good idea! None of the variables has any kind of order among its categories, so we can definitely use it!

I actually tried to run models with and without one-hot encoding (without using Industry, Hours worked per week and Economic Activity, of course). The unfold processor produces a sparse matrix, which makes the model way faster to train, but we have many categorical features, so we end up with a large number of variables once all of them are one-hot encoded. I got very similar AUC scores: 0.967 without one-hot encoding and 0.966 with one-hot encoding (trained on only 75% of the data, as I ran out of memory with the whole dataset), using XGBoost with 3-fold cross-validation and some hyperparameter tuning on a subsample.
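To show why one-hot encoding widens the data while a sparse representation stays compact, here is a plain-Python sketch. It does not reproduce what the DSS unfold processor does internally; the rows and column names are hypothetical.

```python
def build_index(rows, columns):
    """Assign one output column index to each (column, value) pair seen."""
    index = {}
    for row in rows:
        for col in columns:
            index.setdefault((col, row[col]), len(index))
    return index


def to_sparse(row, columns, index):
    """One-hot encode a row, storing only the positions that are 1."""
    return sorted(index[(col, row[col])] for col in columns)


# Hypothetical coded rows; -9 is kept as a category of its own.
rows = [
    {"religion": 1, "health": 2, "occupation": -9},
    {"religion": 2, "health": 1, "occupation": 5},
    {"religion": 1, "health": 1, "occupation": 3},
]
columns = ["religion", "health", "occupation"]

index = build_index(rows, columns)
print(len(index))  # width of the dense one-hot matrix
print(to_sparse(rows[0], columns, index))  # one active position per feature
```

The dense width grows with the number of distinct values across all columns, but each sparse row only ever stores one position per original feature.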

Do you know if it is documented somewhere which encoding method is used by each algorithm in DSS?

-Anita

Level 7

@anita-clmnt Thanks for jumping into the conversation.

I agree that the -9 has a meaning. However, during earlier looks at the data dictionary, I had read these all as students. While students are common to all of the -9 values, that is not the only meaning. Now that I look at this with more care, I see that it is a mish-mash of meanings related to why the data takers are not going to record this data.

  • Family composition
    • -9.   No code required (Resident of a communal establishment, students or schoolchildren living away during term-time, or a short-term resident)
  • Country of birth
    • -9. No Code required (Students or schoolchildren living away during term-time)
  • Health
    • -9.  No code required (Students or schoolchildren living away during term-time)
  • Ethnic group
    • -9.   No code required (Not resident in England or Wales, students or schoolchildren living away during term-time)
  • Religion
    • -9.   No code required (Not resident in England or Wales, students or schoolchildren living away during term-time)
  • Economic activity
    • -9.   No code required (Aged under 16 or students or schoolchildren living away during term-time)
  • Occupation
    • -9.   No code required (People aged under 16, people who have never worked and students or schoolchildren living away during term-time)
  • Industry
    • -9.   No code required (People aged under 16, people who have never worked, and students or schoolchildren living away during term-time) 
  • Hours worked per week
    • -9.  No code required (People aged under 16, people not working, and students or schoolchildren living away during term-time)
  • Approximated social grade
    • -9. No code required (People aged under 16, people resident in communal establishments, and students or schoolchildren living away during term-time)

Most of those values are better described by other existing columns.  

For example:

  • people not working or people who have never worked
    • This seems like our target variable leaking into the analysis in the column occupation.  (This might be why many of my models are hitting on this as the most important variable.)
  • People aged under 16
    • This might be more accurately modeled in the Age Column
  • resident in communal establishments
    • This might be more accurately measured by the Residence type column
  • Not resident in England or Wales
    • Maybe this is modeled by the population base
  • students
    • This might be more accurately measured by the students column.
  • students living away from home
    • Maybe better modeled by the population base as well.
  • Short-term resident
    • This is also covered by a column in the population base

I'm wondering about doing one-hot encoding and then eliminating or combining some of the options.  Thanks for the pointer to the unfold visual recipe step.  I may give that a bit of a try.  

I've been training on a random sample of ~100,000 to 275,000 records on my 8 GB, 6-year-old laptop using gradient boosted trees. I've been getting a 0.965 AUC with 10-fold cross-validation.

I then tested the model on the remainder of the data, which was never shown to the model, and got results that were almost exactly the same.
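The procedure described here (random training sample, k-fold cross-validation inside it, then a final check on rows the model never saw) can be sketched with index lists alone. The sizes below are small stand-ins, and the model fitting itself is left out.

```python
import random


def holdout_split(n_rows, n_train, seed=0):
    """Shuffle row indices and split them into a training part and a holdout."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


def k_folds(indices, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, valid


# Small stand-in numbers; the thread uses ~100,000+ training rows and k = 10.
train_idx, holdout_idx = holdout_split(n_rows=1000, n_train=200)
for train, valid in k_folds(train_idx, k=10):
    # Fit on `train`, score on `valid`; the holdout is never touched here.
    assert not set(valid) & set(holdout_idx)
print(len(train_idx), len(holdout_idx))
```

Similar scores on the cross-validation folds and on the holdout suggest the model is not simply memorising the training sample.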

When you ask about the encoding method, are you asking about the feature handling or something else?

feature handeling.jpg

One of the ways I would look at this question myself is I'd take a look at the actual code DSS is producing for the model.  

Looking Under the Hood of a DSS Model.jpg

And then you can take the Jupyter notebook and tear it apart.

Jupyter Notebook of Model.jpg

Thoughts?

--Tom