Conundrum 9: Month Forecasting

MichaelG
Community Manager

 

Welcome to Conundrum 9 - We know your love for modeling, so here's another dataset for you to get stuck into!
 
Attached is data pertaining to various weather conditions in a certain location for every day of the year 2019. This data includes minimum and maximum temperature, precipitation levels, windspeed and more!
 
Can you build a model that uses this data to predict the month in which each point was collected? Some important features are likely to be obvious to you - but let's see what the model can find! Bear in mind that, while we have prepared the data somewhat, 'month' currently isn't a column, so you will need to get stuck into data prep yourself.
 
Of course, building the model is just the start - refining is key. Share your methods and most significant features here and together we can reach new heights!
 
 
Note: Values of 99.9, 999.9, and 9999.9 indicate a missing reading, so don't go thinking there was 999cm of snow almost every day that year!
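For anyone prepping the data outside of a visual recipe, the sentinel values in the note can be turned into proper missing values before modeling. This is only a minimal sketch with pandas: the column names and the pairing of each sentinel with a column are assumptions for illustration, so check them against the actual dataset.

```python
import pandas as pd

# Hypothetical mapping of column -> sentinel that marks a missing reading.
# Both the column names and the pairings are assumptions, not taken from the dataset.
SENTINELS = {
    "max_temp": 9999.9,
    "snow_depth": 999.9,
    "precipitation": 99.9,
}

def mask_missing(df: pd.DataFrame, sentinels: dict) -> pd.DataFrame:
    """Return a copy of df where each column's sentinel value becomes NaN."""
    out = df.copy()
    for column, sentinel in sentinels.items():
        if column in out.columns:
            # Series.mask replaces values with NaN wherever the condition is True
            out[column] = out[column].mask(out[column] == sentinel)
    return out

# Tiny made-up sample, just to show the behaviour
sample = pd.DataFrame({
    "max_temp": [12.3, 9999.9, 25.1],
    "snow_depth": [999.9, 999.9, 3.0],
    "precipitation": [0.0, 99.9, 1.2],
})
cleaned = mask_missing(sample, SENTINELS)
print(cleaned.isna().sum())
```

Masking per column rather than replacing the values everywhere avoids wiping out a legitimate reading that happens to equal a sentinel used by a different column.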
 
2 Replies
taraku
Dataiker

Aww.... what a wowzer! I can't wait to try to figure this one out!

ben_p
Level 5

I've taken a first go at breaking this down. The first job was to parse the date column into an actual date; then I broke out the individual date elements - day, month and year:

I selected the month column as my target. Interestingly, the automated setup initially suggested this should be a regression problem, which I then changed to multi-class classification.
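Outside of DSS, the date preparation and target setup described above could look roughly like this in pandas. It is a sketch only: the `date` column name and the `%Y-%m-%d` format are assumptions, and the string-typed `month_label` column is just one way to make it obvious that the target is a set of classes rather than a number to regress on.

```python
import pandas as pd

# Made-up rows; the column name and date format are assumptions
df = pd.DataFrame({"date": ["2019-01-15", "2019-07-04", "2019-12-31"]})

# Parse the text column into real datetimes
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Break out the individual date elements
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# A numeric month (1-12) is easy to mistake for a regression target,
# so keep a string version to use as the class label
df["month_label"] = df["month"].astype(str)

print(df)
```

It is probably worth leaving the parsed date and the day out of the feature set when training, otherwise the model can read the answer straight from the calendar instead of from the weather.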

I trained this first dataset using the Decision Tree, Logistic Regression and Random Forest algorithms, with the last of these winning on ROC score:

Capture.PNG
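For anyone who wants to try the same three-way comparison in code, here is a rough scikit-learn sketch. It runs on synthetic data standing in for the weather file (a seasonal temperature curve plus a rainfall column that carries no signal), so the scores it prints say nothing about the real dataset, and the hyperparameters are arbitrary choices rather than what DSS used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: one row per day of a 365-day year
rng = np.random.default_rng(0)
day_of_year = np.arange(365)
days_per_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
month = np.repeat(np.arange(1, 13), days_per_month)
max_temp = 15 - 10 * np.cos(2 * np.pi * day_of_year / 365) + rng.normal(0, 2, 365)
min_temp = max_temp - 8 + rng.normal(0, 2, 365)
rain = rng.exponential(2.0, 365)  # deliberately unrelated to the month
X = np.column_stack([max_temp, min_temp, rain])

# Stratify so every month appears in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, month, test_size=0.3, stratify=month, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    # One-vs-rest ROC AUC, macro-averaged over the 12 month classes
    scores[name] = roc_auc_score(y_test, proba, multi_class="ovr")

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

With 12 classes there is no single ROC curve, so the one-vs-rest average is one common way to boil it down to a single number for comparing models.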

Now for my favourite part of any modelling process, feature importance! 

Capture.PNG

This suggests that max temperature is the best indicator for a month, which makes sense, right? Interestingly, though, rainfall is actually not a great indicator for month.
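A comparable importance ranking can be pulled from a scikit-learn Random Forest through its `feature_importances_` attribute. The sketch below again uses synthetic data with made-up feature names, where rainfall is generated independently of the month, so it only illustrates the mechanics and not the real ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; feature names are hypothetical
rng = np.random.default_rng(0)
day_of_year = np.arange(365)
days_per_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
month = np.repeat(np.arange(1, 13), days_per_month)
max_temp = 15 - 10 * np.cos(2 * np.pi * day_of_year / 365) + rng.normal(0, 1.5, 365)
min_temp = max_temp - 8 + rng.normal(0, 1.5, 365)
rain = rng.exponential(2.0, 365)  # no relationship to the month

feature_names = ["max_temp", "min_temp", "rain"]
X = np.column_stack([max_temp, min_temp, rain])

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, month)

# Impurity-based importances; they are normalised to sum to 1
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Impurity-based importances tend to give some credit even to pure-noise columns, so a low but non-zero score for rainfall does not by itself mean it carries signal.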

Here's a nice visualisation of the Random Forest split, starting with mean temperature:

Capture.PNG

Would anyone like to suggest some ways to refine this very simple start?

Ben