Added on May 10, 2021 11:37AM
Name:
Makoto Miyazaki
Title:
Data Scientist
Country:
France
Organization:
Dataiku
Description:
Dataiku is the world’s leading AI and machine learning platform, supporting agility in organizations’ data efforts via collaborative, elastic, and responsible AI, all at enterprise scale. Hundreds of companies use Dataiku to underpin their essential business operations and ensure they stay relevant in a changing world.
Awards Categories:
Sakura, Japan’s world-famous cherry blossom, blooms every spring. It is a world-renowned attraction, and many people travel from afar to witness it. However, sakura bloom only for a short time: the petals already start to scatter about seven days after the flowers open, so many people simply miss them. As a Data Scientist at Dataiku, I took it as a challenge to build a prediction model for the sakura bloom using Dataiku DSS, and to see if I could obtain more accurate predictions than other websites!
Dataiku enabled me to automatically update the prediction on a daily basis, thanks to the scenario automation feature of Dataiku DSS:
Every day at 2 a.m., a Python recipe scraped the previous day’s weather information for the three cities and updated the predictions, as in the chart below:
The two other main forecasting websites, tenki.jp and the Japan Weather Association (JWA), updated their predictions once a week and once every two weeks, respectively. Daily updates are a big plus for producing a more accurate forecast of the exact blooming day!
My Dataiku DSS flow can be seen below:
It consists of two zones, data preprocessing and machine learning:
Data preprocessing
Machine learning
This zone includes two Random Forest models, each scoring its own dataset:
These two scored datasets are then combined into a single prediction. I built it this way because I set the target variable to “number of days until blossom.” This target takes a value between 0 and 365 (or even more), but I wanted the model to treat it as a cyclical variable so that it could correctly assess the error. To do this, I scaled the variable to the range 0 to 2π and decomposed it into its sine and cosine. One model therefore predicted the sine value and the other predicted the cosine value. I then combined the two predictions and converted the result back into days.
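The sine/cosine decomposition can be sketched in plain Python. This is a minimal illustration, not the actual recipe from the flow: the function names and the 365-day period are assumptions, since the post does not state the exact scaling used.

```python
import math

PERIOD_DAYS = 365  # assumed cycle length; the post does not give the exact scaling


def encode_days(days_until_blossom, period=PERIOD_DAYS):
    """Scale a day count to an angle in [0, 2*pi) and return (sine, cosine)."""
    angle = 2 * math.pi * (days_until_blossom % period) / period
    return math.sin(angle), math.cos(angle)


def decode_days(sin_pred, cos_pred, period=PERIOD_DAYS):
    """Combine a predicted sine and cosine back into a day count in [0, period)."""
    angle = math.atan2(sin_pred, cos_pred)  # atan2 returns a value in [-pi, pi]
    if angle < 0:
        angle += 2 * math.pi
    return (angle * period / (2 * math.pi)) % period


def circular_error(days_a, days_b, period=PERIOD_DAYS):
    """Distance in days between two points on the cycle, going the short way round."""
    diff = abs(days_a - days_b) % period
    return min(diff, period - diff)


# Round trip: encoding then decoding recovers the original day count.
sin_value, cos_value = encode_days(30)
print(decode_days(sin_value, cos_value))

# Day 364 and day 1 sit next to each other on the cycle,
# although their raw difference is 363.
print(circular_error(364, 1))
```

Using `atan2` on the pair of predictions means the two models do not need to produce a point exactly on the unit circle: only the direction of the (cosine, sine) vector is used when converting back to days.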
I ran my prediction model, humorously named ‘Random Sakura Forest’, for three cities in Japan: Oita prefecture (southern Japan), Aomori prefecture (northern Japan), and Tokyo.
My predictions were a few days behind the two other forecasting websites:
A random forest with 500 trees and a maximum depth of 100 yielded the best result, and I was able to reduce the error to four days.
One interesting finding is that the model relied almost entirely on the temperature-related features. All the other features, such as precipitation and daylight hours, had very little impact on the result.
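For readers who want to try this kind of check outside DSS, here is a rough sketch using scikit-learn as a stand-in for the visual ML used in the project. Everything except the two hyperparameters (500 trees, maximum depth of 100) is an assumption: the data is synthetic, the feature names are hypothetical, and a single model is trained rather than the separate sine and cosine models described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 300

# Hypothetical feature names with synthetic values, not the weather data from the post.
feature_names = ["mean_temperature", "precipitation", "daylight_hours"]
X = rng.normal(size=(n_samples, len(feature_names)))

# Synthetic target driven by the temperature column only, plus a little noise.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n_samples)

# Hyperparameters reported in the post: 500 trees, maximum depth of 100.
model = RandomForestRegressor(n_estimators=500, max_depth=100, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1; a dominant feature takes most of the mass.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because the synthetic target depends only on the temperature column, that column dominates the importances here by construction. On real data the ranking would have to be read from the trained model, as was done in the project.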
In Japan, the sakura blooming forecast makes daily news headlines throughout spring. As a result, many forecasting methods have been proposed since the 1950s, including multiple regression analysis. Nowadays, most sakura blossom forecasters use a formula based on a method developed in 2003 by Yasuyuki Aono, Associate Professor at Osaka Metropolitan University.
Aono’s approach is unique in that it consists of two parts that closely reflect the biology of sakura trees. First, it computes a D-day, the day on which the trees wake up from their winter dormancy. This D-day is computed from a location’s latitude, its distance from the seashore, and its average temperature during January and March, and therefore varies from place to place.
What the Aono method tells us is that the blooming day depends solely on the place’s geographical position and its temperature, which is indeed consistent with my prediction result!