Template Submission - Predicting the Sakura Blooming Day
Name: Makoto Miyazaki
Title: Data Scientist
Description: Dataiku is the world’s leading AI and machine learning platform, supporting agility in organizations’ data efforts via collaborative, elastic, and responsible AI, all at enterprise scale. Hundreds of companies use Dataiku to underpin their essential business operations and ensure they stay relevant in a changing world.
Sakura, Japan’s world-famous cherry blossom, blooms every year in the spring. It is a renowned attraction, and many people travel from afar to witness its wonders. However, sakura blooms only for a short time: seven days after the flowers open, they already start to scatter, so many people simply miss it. As a Data Scientist at Dataiku, I took it as a challenge to build a prediction model for the sakura bloom using Dataiku DSS, and to see if I could obtain more accurate predictions than other websites!
Dataiku enabled me to automatically update the prediction on a daily basis, thanks to the scenario automation feature of Dataiku DSS:
Every day at 2 a.m., a Python recipe scraped the previous day’s weather information for the three cities and updated the predictions like the chart below:
The two other main forecasting websites, tenki.jp and the Japan Weather Association (JWA), updated their predictions once a week and every two weeks, respectively. Daily updates are a big plus for getting a more accurate forecast of the exact blooming day!
My Dataiku DSS flow can be seen below:
It consists of two zones, data pre-processing and machine learning. The data pre-processing zone covers the following:
Inputs: daily weather data from 1991 until today for the three cities, plus the historical blooming days from the past 30 years. The daily weather data is scraped from the Japan Meteorological Agency using a Python recipe and includes average, highest, and lowest temperature, precipitation, and daylight hours.
Feature generation with a window recipe: rolling averages over the past one month, three months, and six months for each weather-related variable in each of the three cities. I also computed the average blooming day over previous years for each city, assuming that the blooming day does not differ much from year to year.
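The rolling averages above were built with a Dataiku window recipe, which needs no code. For readers who want to see the idea spelled out, here is a minimal plain-Python sketch. The function name `rolling_mean`, the window lengths of 30, 90, and 180 days (standing in for one, three, and six months), the choice to include the current day in the window, and the sample temperatures are all assumptions made for illustration, not details taken from the flow.

```python
from collections import deque
from typing import Dict, Iterable, List


def rolling_mean(values: Iterable[float], window: int) -> List[float]:
    """Trailing mean over the current value and up to window-1 earlier ones.

    Early positions, where fewer than `window` values exist, use the
    values seen so far (an assumption; a real recipe may handle this differently).
    """
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to fall out of the window
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


# Made-up daily average temperatures for one city, for illustration only.
daily_avg_temp = [4.0, 6.0, 8.0, 10.0, 12.0]

# One feature column per window length (30/90/180 days are assumed lengths).
features: Dict[str, List[float]] = {
    f"avg_temp_mean_{w}d": rolling_mean(daily_avg_temp, w) for w in (30, 90, 180)
}
```

The same function would be applied to each weather variable and each city separately, so that one city's values never leak into another city's window.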
The machine learning zone includes two Random Forest models, each scoring one dataset:
These two scored datasets are then combined into a single prediction. I set it up this way because the target variable is “number of days until blossom.” This variable takes a value between 0 and 365 (or even more), but I wanted the model to treat it as a cyclical variable so that it could assess the error correctly. To do this, I scaled the variable to a range of 0 to 2π, then decomposed it into its sine and cosine. One model therefore predicted the sine value and the other predicted the cosine value. I then combined the two predictions and converted the result back to days.
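The sine and cosine decomposition, and the step that turns the two predictions back into days, can be sketched as follows. This is an illustration rather than the code used in the flow: the period of 365 days and the function names `encode_days` and `decode_days` are assumptions, and `atan2` is one standard way to recover an angle from a sine and cosine pair.

```python
import math
from typing import Tuple

PERIOD_DAYS = 365.0  # assumption: one full cycle corresponds to 365 days


def encode_days(days_until_blossom: float) -> Tuple[float, float]:
    """Scale the day count to an angle in [0, 2*pi) and return (sin, cos)."""
    angle = 2.0 * math.pi * days_until_blossom / PERIOD_DAYS
    return math.sin(angle), math.cos(angle)


def decode_days(sin_pred: float, cos_pred: float) -> float:
    """Combine a predicted sine and cosine back into a day count."""
    angle = math.atan2(sin_pred, cos_pred)  # angle in (-pi, pi]
    angle %= 2.0 * math.pi                  # shift into [0, 2*pi)
    return angle * PERIOD_DAYS / (2.0 * math.pi)


# Round trip on an example value.
sin_value, cos_value = encode_days(100.0)
recovered = decode_days(sin_value, cos_value)
```

Because the two models are trained independently, their outputs will generally not lie exactly on the unit circle. `atan2` only uses the direction of the (cosine, sine) point, so the decoding still works without normalizing first.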
I ran my prediction, humorously called ‘Random Sakura Forest’, for three cities in Japan: Oita prefecture (southern Japan), Aomori prefecture (northern Japan), and Tokyo.
My predictions were a few days behind the two other forecasting websites:
Sakura in Tokyo bloomed on March 15, so my prediction was already proven wrong. Both of the other forecasting websites also missed the date, although to a lesser extent.
However, my prediction was closer for Oita, which blossomed on March 24 (I predicted March 22, while the other websites predicted March 15). I was also more in line, together with JWA, for Aomori, which bloomed on April 14 (predicted April 21).
A random forest with 500 trees and a maximum depth of 100 yielded the best result, and I was able to reduce the error to four days.
One interesting finding is that the model relied almost entirely on the temperature-related features. All the other features, such as precipitation and daylight hours, had very little impact on the result.
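The model above was trained in Dataiku DSS. As a rough illustration of the same settings outside DSS, here is a sketch using scikit-learn's `RandomForestRegressor`, which is my choice for the example and not something the flow is stated to use. The data is synthetic and deliberately built so that only temperature drives the target. It shows how feature importances are read, and it is not evidence for the finding itself. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_rows = 300

# Synthetic stand-ins for rolling-average features (not real weather data).
feature_names = ["avg_temp_mean_30d", "precip_mean_30d", "daylight_mean_30d"]
avg_temp = rng.normal(8.0, 4.0, n_rows)
precip = rng.normal(4.0, 1.5, n_rows)
daylight = rng.normal(5.5, 1.0, n_rows)
X = np.column_stack([avg_temp, precip, daylight])

# Synthetic target that depends on temperature only, plus a little noise.
y = 60.0 - 3.0 * avg_temp + rng.normal(0.0, 1.0, n_rows)

# Same hyperparameters as quoted in the post: 500 trees, maximum depth 100.
model = RandomForestRegressor(n_estimators=500, max_depth=100, random_state=0)
model.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

In the setup described earlier, two such models would be trained, one on the sine target and one on the cosine target, and their importances inspected separately.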
In Japan, the sakura blooming forecast makes daily news headlines throughout spring. Hence, since the 1950s, many methodologies have been proposed, including multiple regression analysis. Nowadays, most sakura blossom forecasters use a formula based on a method developed in 2003 by Yasuyuki Aono, Associate Professor at Osaka Metropolitan University.
Aono’s approach is unique in that it consists of two parts that incorporate the biology of sakura trees. First, it computes a D-day, the day when the trees wake up from their winter sleep. This D-day is computed from a place’s latitude, distance from the seashore, and average temperature during January and March, so it differs from place to place.
What the Aono method tells us is that the blooming day depends solely on the place’s geographical position and its temperature, which is indeed consistent with my prediction result!