
Conundrum 9: Month Forecasting

Community Manager


Welcome to Conundrum 9 - We know your love for modeling, so here's another dataset for you to get stuck into!
Attached is data pertaining to various weather conditions in a certain location for every day of the year 2019. This data includes minimum and maximum temperature, precipitation levels, windspeed and more!
Can you build a model that uses this data to predict the month in which each point was collected? Some important features are likely to be obvious to you, but let's see what the model can find! Bear in mind that while we have prepared the data somewhat, 'month' currently isn't a column, so you will need to get stuck into data prep yourself.
Of course building the model is just the start - refining is key. Share your methods and most significant features here and together we can reach new heights!
Note: Values of 99.9, 999.9, and 9999.9 indicate a missing reading, so don't go thinking there was 999cm of snow almost every day that year!
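For anyone prepping the data outside a visual recipe, here is a minimal pandas sketch of how those missing-reading codes could be turned into proper nulls. The column names and which code belongs to which column are assumptions for illustration only, so check them against the attached file. Mapping codes per column, rather than replacing them everywhere, avoids wiping out a legitimate reading that happens to equal 99.9.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and codes: verify both against the real dataset.
SENTINELS = {
    "max_temp": 9999.9,
    "precipitation": 99.9,
    "snow_depth": 999.9,
}

def mask_missing(df, sentinels=SENTINELS):
    """Return a copy of df with each column's missing-reading code set to NaN."""
    out = df.copy()
    for column, code in sentinels.items():
        if column in out.columns:
            out[column] = out[column].replace(code, np.nan)
    return out

# Tiny made-up sample, only to show the behaviour.
sample = pd.DataFrame({
    "max_temp": [51.3, 9999.9, 47.0],
    "precipitation": [0.0, 0.12, 99.9],
    "snow_depth": [999.9, 999.9, 1.2],
})
cleaned = mask_missing(sample)
print(cleaned.isna().sum())
```

Whether to impute or drop the resulting nulls afterwards is a separate modelling choice.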

2 Replies

Aww.... what a wowzer! I can't wait to try to figure this one out!

Level 5

I've taken a first go at breaking this down. The first job was to parse the date column into an actual date; then I broke out the individual date elements: day, month and year.
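For readers who want to try the same date prep in code rather than in a Prepare recipe, a rough pandas equivalent might look like the sketch below. The column name "date" and the ISO date format are assumptions; the real file may use something different.

```python
import pandas as pd

# Hypothetical column name and date format: adjust to match the real file.
raw = pd.DataFrame({"date": ["2019-01-15", "2019-07-04", "2019-12-31"]})

parsed = raw.copy()
# Parse the text column into real datetimes.
parsed["date"] = pd.to_datetime(parsed["date"], format="%Y-%m-%d")

# Break out the individual date elements.
parsed["day"] = parsed["date"].dt.day
parsed["month"] = parsed["date"].dt.month
parsed["year"] = parsed["date"].dt.year

print(parsed)
```

Since the date itself determines the month, the original date column would need to be left out of the model's features, otherwise the target leaks straight into the inputs.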

I selected the month column as my target. Interestingly, the automated setup initially suggested this should be a regression problem, which I then changed to multi-class classification.

I trained this first dataset using the Decision Tree, Logistic Regression and Random Forest algorithms, with the latter winning on ROC score.
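To make that comparison concrete for anyone working outside the visual ML lab, here is a minimal scikit-learn sketch of training the same three algorithm families and scoring them with a one-vs-rest ROC AUC. It is not the DSS implementation: the data is synthetic (a made-up seasonal temperature curve plus noise), and the column names and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 2019 weather data (not the real dataset).
rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
season = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 200) / 365)
X = pd.DataFrame({
    "max_temp": 60 + 20 * season + rng.normal(0, 4, len(dates)),
    "min_temp": 45 + 18 * season + rng.normal(0, 4, len(dates)),
    "precipitation": rng.exponential(0.1, len(dates)),
    "wind_speed": rng.normal(8, 2, len(dates)),
})
y = dates.month.to_numpy()  # target: month number, treated as a class label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)
    # Macro-averaged one-vs-rest ROC AUC across the 12 month classes.
    scores[name] = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

The ranking on synthetic data says nothing about the real dataset; the point is only the shape of the comparison.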


Now for my favourite part of any modelling process, feature importance! 


This suggests that max temperature is the best indicator for month, which makes sense, right? Interestingly, rainfall is actually not a great indicator for month.
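For reference, this is roughly how a similar importance ranking can be pulled from a random forest in scikit-learn. It is only a sketch on synthetic data with assumed column names, so the numbers it prints are not the ones from my model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: one seasonal feature and two noise features.
rng = np.random.default_rng(1)
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
season = np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 200) / 365)
features = pd.DataFrame({
    "max_temp": 60 + 20 * season + rng.normal(0, 4, len(dates)),
    "precipitation": rng.exponential(0.1, len(dates)),
    "wind_speed": rng.normal(8, 2, len(dates)),
})
target = dates.month.to_numpy()

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(features, target)

# Impurity-based importances, one per feature, normalised to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can flatter noisy continuous features, so permutation importance (sklearn.inspection.permutation_importance) on a held-out set is a reasonable cross-check.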

Here's a nice visualisation of the Random Forest split, starting with mean temperature:


Would anyone like to suggest some ways to refine this very simple start?