Scoring with TS Visual recipes / Time series resampling
Hi, community!
I am trying to use the Time Series modelling Visual Recipes to train and predict with DeepAR on a dataset with sales in monthly buckets.
My date column uses an end-of-month format, e.g.:
...
2022-09-30T00:00:00.000Z
2022-10-31T00:00:00.000Z
2022-11-30T00:00:00.000Z
2022-12-31T00:00:00.000Z
The problem occurs when I try to use the trained model to score a dataset in which the last available month is 31 days long, like 2022-12-31T00:00:00.000Z.
When I do this, the output predictions always skip the first month: if the last date in the dataset being scored is December '22, as in the example above, the first month for which the scoring recipe generates a prediction is February '23 instead of January '23.
If, however, the last month is 30 or fewer days long, the scoring recipe behaves as expected. E.g. when the last month in the dataset being scored is November '22, the first month for which the scoring recipe generates a prediction is December '22.
My model is set to forecast 4 months and skip 0.
P.S.
I am under the impression that the TS Visual recipes and the old Time series Preparation plugin are connected, and I have noticed similar odd behavior when using the Time series resampling recipe.
If I resample the same dataset, which is already in monthly buckets, and the last month is 31 days long, the output of the recipe contains an additional month (e.g. if my last month is Dec '22, the output also contains Jan '23). Again, this does not happen if the last month in the dataset is 30 or fewer days long.
This leads me to believe that the scoring recipe might be resampling in the background and creating that additional month in the same fashion, therefore seeing it as part of the input dataset and not requiring prediction.
Any help from people that have encountered this or Dataiku folks would be highly appreciated!
Dataiku Version 11.0.3
Operating system used: Windows 10
Answers
-
Hi, thanks for reporting that. This is indeed a known bug: we resample one more time step than necessary when the last timestamp is exactly the last day of the month.
We are currently working on it for a coming release on both the Visual Timeseries Forecasting feature and the Timeseries preparation plugin.
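For anyone wanting to check whether a dataset matches the condition described above, here is a minimal pandas sketch. It only tests whether the latest timestamp falls on the last day of its month; it does not reproduce the resampling itself. The function name `last_timestamp_is_month_end` is made up for illustration and is not part of Dataiku or the plugin.

```python
import pandas as pd

def last_timestamp_is_month_end(dates) -> bool:
    """Return True if the latest timestamp falls on the last day of its month.

    Hypothetical helper: it only checks the condition described in the
    answer above, it does not call any Dataiku API.
    """
    last = pd.to_datetime(pd.Series(dates)).max()
    return bool(last.is_month_end)

# Dates in the same format as the original post.
december = ["2022-11-30T00:00:00.000Z", "2022-12-31T00:00:00.000Z"]
# The same dates moved one day earlier.
shifted = ["2022-11-29T00:00:00.000Z", "2022-12-30T00:00:00.000Z"]

print(last_timestamp_is_month_end(december))  # True
print(last_timestamp_is_month_end(shifted))   # False
```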
-
In the meantime, a workaround would be to not use the last day of the month as timestamps, for instance by shifting all timestamps by one day.
-
Thanks for the quick answer, Stan!
Do you mean shift the timestamps only for the dataset being scored or also for the one the training is done on?
-
You would need to shift both datasets (for training and scoring) because they both use the same resampling method.
Also, if you don't do any extrapolation, this issue won't happen because without extrapolation, no timestamps after the last timestamp of the input dataset can be added.
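The one-day shift could be applied in a Python step before both the training and the scoring recipes. Below is a rough pandas sketch, not an official recipe: the column names `date` and `sales`, the helper `shift_dates`, and the choice to shift backward (so each date stays inside its own month) are all assumptions made for illustration.

```python
import pandas as pd

def shift_dates(df: pd.DataFrame, date_col: str = "date", days: int = -1) -> pd.DataFrame:
    """Return a copy of df with date_col parsed and shifted by `days` days.

    Hypothetical helper. Shifting backward by one day is an assumption:
    it keeps every end-of-month date inside the same calendar month.
    """
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col]) + pd.Timedelta(days=days)
    return out

# Toy stand-ins for the training and scoring datasets.
train = pd.DataFrame({"date": ["2022-09-30", "2022-10-31"], "sales": [10, 12]})
score = pd.DataFrame({"date": ["2022-11-30", "2022-12-31"], "sales": [11, 15]})

# Apply the same shift to both, since both go through the same resampling.
train_shifted = shift_dates(train)
score_shifted = shift_dates(score)

print(score_shifted["date"].dt.is_month_end.any())  # False
```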
-
Thanks, Stan!
Turned off extrapolation since it's not needed for my dataset (used settings in attached screenshot) and the issue went away.
I'll also try shifting the timestamps in case extrapolation is ever needed. Wouldn't this clash with gluonts requiring the timestamps for monthly buckets to be in either end-of-month or beginning-of-month format, as specified below?
-
Good to know it worked for you when turning off extrapolation!
Shifting the timestamps was only a workaround to make the resampling work (so that it does not add an extra date). In any case, after resampling all dates are at the end of the month, just as gluonts requires.
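To illustrate why shifted dates end up back on month ends, here is a small pandas sketch that snaps dates to the end of their month. This is only an analogy using `pd.offsets.MonthEnd(0)`; it is an assumption that the recipe's monthly resampling labels dates the same way, and the variable names are made up.

```python
import pandas as pd

# Dates that were shifted one day earlier than the true month ends.
shifted = pd.Series(pd.to_datetime(["2022-11-29", "2022-12-30"]))

# MonthEnd(0) rolls each date forward to the end of its own month
# and leaves dates that are already on a month end unchanged.
snapped = shifted + pd.offsets.MonthEnd(0)

print(snapped.dt.strftime("%Y-%m-%d").tolist())  # ['2022-11-30', '2022-12-31']
```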