Time series resampling 'KeyError'

valentinaprotti
valentinaprotti · Registered · Posts: 2

Hi everyone, first post here, looking for help. I am currently working on a time series dataset, preparing it for the new time series forecast plugin. In short, after a sequence of recipes I have obtained a dataset with:

  • a parsed date (date)
  • a product code (integer), prodid
  • a pickup quantity (decimal)
  • a requested quantity (decimal)

On the right is a little preview of the dataset.

It should be a simple case of multiple multivariate time series, with prodid as the identifier. I am now trying to resample it with the time series preparation plugin, to get one record per time series per week, because the dates are unevenly spaced. The following resampling recipe setup is what gives me the error:

[screenshot: Cattura.PNG, the resampling recipe settings]

After running this recipe, the following error appears within a few seconds:

[screenshot: Cattura2.PNG, the error message]

I cannot understand its meaning. Could it be caused by prodid being an integer (all the examples had a string column as the identifier)? Or is there something else I'm overlooking?

Please let me know if I should also share the whole log for better understanding. Thanks in advance!

Answers

  • Ignacio_Toledo
    Ignacio_Toledo · Neuron · Posts: 415

    Hi @valentinaprotti,

    Welcome to the community!

    Could you share the whole error message? Your screenshot is missing the part where the Python error is displayed. With that information we could help you more!

    Cheers

  • xav56
    xav56 · Registered · Posts: 2

    I have the exact same issue as @valentinaprotti. Were you able to solve it?

    I have simplified my table to 3 columns:

    • weekParsed, a parsed date that is supposed to be weekly, but some weeks are missing
    • CUSTid, a customer ID stored as an integer, which I'd like to use as the "long format" identifier
    • REAL, the sales value I'm trying to predict; I'd just like to add 0 where a week is missing

    Do you have any idea?
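
    For reference, here is a rough pandas sketch of what I'm after (column names are from my table; the Monday-anchored weekly frequency "W-MON" is just an assumption, the plugin may anchor weeks differently):

```python
import pandas as pd

# Toy version of my table: weekly dates with gaps, an integer customer ID
df = pd.DataFrame({
    "weekParsed": pd.to_datetime(["2021-01-04", "2021-01-18", "2021-01-04"]),
    "CUSTid": [1, 1, 2],
    "REAL": [10.0, 30.0, 5.0],
})

# One row per customer per week; empty weekly bins sum to 0
out = (
    df.set_index("weekParsed")
      .groupby("CUSTid")["REAL"]
      .resample("W-MON")
      .sum()
      .reset_index()
)
```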

    Here is the error log:

    (......)
    [2021/04/10-17:25:20.299] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,296 INFO Dataiku Python entrypoint starting up
    [2021/04/10-17:25:20.300] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,296 INFO executable = /home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python
    [2021/04/10-17:25:20.302] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,296 INFO argv = ['/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/python-exec-wrapper.py', '/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/script.py']
    [2021/04/10-17:25:20.304] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,297 INFO --------------------
    [2021/04/10-17:25:20.304] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,297 INFO Looking for RemoteRunEnvDef in ./remote-run-env-def.json
    [2021/04/10-17:25:20.305] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,297 INFO Found RemoteRunEnvDef environment: ./remote-run-env-def.json 
    [2021/04/10-17:25:20.306] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,298 INFO Running a DSS Python recipe locally, uinsetting env
    [2021/04/10-17:25:20.307] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,299 INFO Setup complete, ready to execute Python code
    [2021/04/10-17:25:20.309] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,300 INFO Sys path: ['/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy', '/home/dataiku/dss/lib/python', '/home/dataiku/dataiku-dss-9.0.1/python', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages', '/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib/python3.6/site-packages', '/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/localconfig/projects/LPR1/lib/python', '/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib']
    [2021/04/10-17:25:20.311] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,300 INFO Script file: /home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/script.py
    [2021/04/10-17:25:24.773] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:24,772 INFO Computing for group: 1
    [2021/04/10-17:25:24.781] [null-err-36] [INFO] [dku.utils]  - *************** Recipe code failed **************
    [2021/04/10-17:25:24.782] [null-err-36] [INFO] [dku.utils]  - Begin Python stack
    [2021/04/10-17:25:24.796] [null-err-36] [INFO] [dku.utils]  - Traceback (most recent call last):
    [2021/04/10-17:25:24.796] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    [2021/04/10-17:25:24.797] [null-err-36] [INFO] [dku.utils]  -     return self._engine.get_loc(key)
    [2021/04/10-17:25:24.797] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.798] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.798] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.799] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.800] [null-err-36] [INFO] [dku.utils]  - KeyError: 'CUSTid'
    [2021/04/10-17:25:24.801] [null-err-36] [INFO] [dku.utils]  - During handling of the above exception, another exception occurred:
    [2021/04/10-17:25:24.802] [null-err-36] [INFO] [dku.utils]  - Traceback (most recent call last):
    [2021/04/10-17:25:24.803] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/python-exec-wrapper.py", line 206, in <module>
    [2021/04/10-17:25:24.804] [null-err-36] [INFO] [dku.utils]  -     exec(f.read())
    [2021/04/10-17:25:24.805] [null-err-36] [INFO] [dku.utils]  -   File "<string>", line 23, in <module>
    [2021/04/10-17:25:24.806] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/resampling.py", line 87, in transform
    [2021/04/10-17:25:24.806] [null-err-36] [INFO] [dku.utils]  -     group_resampled = self._resample(group.drop(groupby_columns, axis=1), datetime_column, columns_to_resample, reference_time_index, df_id=group_id)
    [2021/04/10-17:25:24.807] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/resampling.py", line 127, in _resample
    [2021/04/10-17:25:24.808] [null-err-36] [INFO] [dku.utils]  -     filtered_columns_to_resample = filter_empty_columns(df, columns_to_resample)
    [2021/04/10-17:25:24.810] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/dataframe_helpers.py", line 22, in filter_empty_columns
    [2021/04/10-17:25:24.811] [null-err-36] [INFO] [dku.utils]  -     if np.sum(df[col].notnull()) > 1: # in fact we filter out columns with less than 2 values
    [2021/04/10-17:25:24.811] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    [2021/04/10-17:25:24.812] [null-err-36] [INFO] [dku.utils]  -     return self._getitem_column(key)
    [2021/04/10-17:25:24.813] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    [2021/04/10-17:25:24.813] [null-err-36] [INFO] [dku.utils]  -     return self._get_item_cache(key)
    [2021/04/10-17:25:24.814] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    [2021/04/10-17:25:24.814] [null-err-36] [INFO] [dku.utils]  -     values = self._data.get(item)
    [2021/04/10-17:25:24.815] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
    [2021/04/10-17:25:24.815] [null-err-36] [INFO] [dku.utils]  -     loc = self.items.get_loc(item)
    [2021/04/10-17:25:24.816] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    [2021/04/10-17:25:24.817] [null-err-36] [INFO] [dku.utils]  -     return self._engine.get_loc(self._maybe_cast_indexer(key))
    [2021/04/10-17:25:24.818] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.819] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.819] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.820] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.820] [null-err-36] [INFO] [dku.utils]  - KeyError: 'CUSTid'
    [2021/04/10-17:25:24.821] [null-err-36] [INFO] [dku.utils]  - End Python stack
    [2021/04/10-17:25:24.821] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:24,798 INFO Check if spark is available
    [2021/04/10-17:25:24.822] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:24,800 INFO Not stopping a spark context: No module named 'pyspark'
    [2021/04/10-17:25:24.977] [FRT-33-FlowRunnable] [WARN] [dku.resource] - stat file for pid 2899 does not exist. Process died?
    [2021/04/10-17:25:24.979] [FRT-33-FlowRunnable] [INFO] [dku.resourceusage] - Reporting completion of CRU:{"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"LPR1","jobId":"Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032","activityId":"compute_deliveryWeeklyTime_NP","activityType":"recipe","recipeType":"CustomCode_timeseries-preparation-resampling","recipeName":"compute_deliveryWeeklyTime"},"type":"LOCAL_PROCESS","id":"VbJeXPAfdOBWDzPc","startTime":1618075520039,"localProcess":{"pid":2899,"commandName":"/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python","cpuUserTimeMS":40,"cpuSystemTimeMS":10,"cpuChildrenUserTimeMS":0,"cpuChildrenSystemTimeMS":0,"cpuTotalMS":50,"cpuCurrent":0.0,"vmSizeMB":121,"vmRSSMB":4,"vmHWMMB":4,"vmRSSAnonMB":1,"vmDataMB":1,"vmSizePeakMB":121,"vmRSSPeakMB":4,"vmRSSTotalMBS":0,"majorFaults":3,"childrenMajorFaults":0}}
    [2021/04/10-17:25:24.980] [FRT-33-FlowRunnable] [INFO] [dku.usage.computeresource.jek] - Reporting completion of resource usage: {"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"LPR1","jobId":"Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032","activityId":"compute_deliveryWeeklyTime_NP","activityType":"recipe","recipeType":"CustomCode_timeseries-preparation-resampling","recipeName":"compute_deliveryWeeklyTime"},"type":"LOCAL_PROCESS","id":"VbJeXPAfdOBWDzPc","startTime":1618075520039,"endTime":1618075524979,"localProcess":{"pid":2899,"commandName":"/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python","cpuUserTimeMS":40,"cpuSystemTimeMS":10,"cpuChildrenUserTimeMS":0,"cpuChildrenSystemTimeMS":0,"cpuTotalMS":50,"cpuCurrent":0.0,"vmSizeMB":121,"vmRSSMB":4,"vmHWMMB":4,"vmRSSAnonMB":1,"vmDataMB":1,"vmSizePeakMB":121,"vmRSSPeakMB":4,"vmRSSTotalMBS":0,"majorFaults":3,"childrenMajorFaults":0}}
    [2021/04/10-17:25:24.981] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Error file found, trying to throw it: /home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/error.json
    [2021/04/10-17:25:24.981] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Raw error is{"errorType":"\u003cclass \u0027KeyError\u0027\u003e","message":"CUSTid","detailedMessage":"At line 23: \u003cclass \u0027KeyError\u0027\u003e: CUSTid","stackTrace":[]}
    [2021/04/10-17:25:24.982] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Now err: {"errorType":"\u003cclass \u0027KeyError\u0027\u003e","message":"Error in Python process: CUSTid","detailedMessage":"Error in Python process: At line 23: \u003cclass \u0027KeyError\u0027\u003e: CUSTid","stackTrace":[]}
    [2021/04/10-17:25:24.987] [FRT-33-FlowRunnable] [INFO] [dku.flow.activity] - Run thread failed for activity compute_deliveryWeeklyTime_NP
    (.........)

  • Ignacio_Toledo
    Ignacio_Toledo · Neuron · Posts: 415

    Hi @xav56,

    Thanks for sharing the complete log! It took some time but I was able to reproduce the problem.

    Summary: there is a bug in the plugin code that should be reported to the plugin owners (I believe they are from Dataiku).

    Quick workaround: convert the "CUSTid" column (or whichever column you want to use for the long-format grouping) to a string. It must contain at least one non-digit character.
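
    For instance, in a Python step the conversion could look like this (the "C" prefix is arbitrary; any non-digit character works):

```python
import pandas as pd

df = pd.DataFrame({"CUSTid": [101, 102], "REAL": [1.0, 2.0]})

# Prefix a letter so pandas can no longer infer the column as numeric
df["CUSTid"] = "C" + df["CUSTid"].astype(str)
```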

    Detailed answer: there is a bug that occurs when the column used for the long-format grouping can be recognized by pandas as an integer or a double. When the column is a string, the plugin works as expected.

    The problem is in the Python script "dku_timeseries/resampling.py", within the "transform" method of the "Resample" class, around line 87:

    columns_to_resample = [col for col in df_copy.select_dtypes([int, float]).columns.tolist() if col != datetime_column]

    So, any column that is an integer or a float is assumed to be a column that needs to be resampled. In your case, and in @valentinaprotti's case, the columns used for the long-format grouping are integers, so they are wrongly detected as "columns_to_resample".

    The bug surfaces later in the process, when the resampling algorithm receives a dataframe in which the grouping column is no longer present (it was dropped when the groups were created), while the list "columns_to_resample" still contains it:

    filtered_columns_to_resample = filter_empty_columns(df, columns_to_resample)
    ....
    # filter_empty_columns is defined in another file (dku_timeseries/dataframe_helpers.py):

    def filter_empty_columns(df, columns):
        filtered_columns = []
        for col in columns:  # <- may contain a column no longer present in df
            if np.sum(df[col].notnull()) > 1:  # i.e. filter out columns with fewer than 2 values
                filtered_columns.append(col)
        return filtered_columns
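
    A possible fix, just a sketch and not the actual plugin patch, would be to exclude the grouping columns when building the list, and to make the filter defensive (the function name numeric_columns_to_resample is mine):

```python
import numpy as np
import pandas as pd

def numeric_columns_to_resample(df, datetime_column, groupby_columns):
    # Numeric columns only, excluding the datetime column and the grouping keys
    return [
        col for col in df.select_dtypes([int, float]).columns.tolist()
        if col != datetime_column and col not in groupby_columns
    ]

def filter_empty_columns(df, columns):
    # Defensive version: skip names that are no longer present in df
    return [
        col for col in columns
        if col in df.columns and np.sum(df[col].notnull()) > 1
    ]
```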

    @CoreyS, or any other Dataiker: should a ticket be created on the support page?

    Hope this helps!

  • tgb417
    tgb417 · Neuron · Posts: 1,598

    @Ignacio_Toledo,

    Amazing piece of debugging on your part.

  • CoreyS
    CoreyS · Dataiker Alumni · Posts: 1,150

    Thanks @Ignacio_Toledo, I've reported the potential bug to the plugins team.

  • xav56
    xav56 · Registered · Posts: 2

    I've added a letter at the beginning of the CUSTid column, and it works!

    A big thank you @Ignacio_Toledo for helping!

    Xavier

  • ClemenceB
    ClemenceB · Dataiker · Posts: 18

    Hi,
    Thanks @Ignacio_Toledo for helping on this! Indeed, we have an issue with long format when the identifier column is numerical. This will be fixed in the next release of the plugin (coming soon).
