Time series resampling 'KeyError'
Hi everyone, this is my first post, and I am looking for help. My dataset has these columns:
- a parsed date (date), tprelievo
- a product code (integer), prodid
- a pickup q.ty (decimal)
- a requested q.ty (decimal)
On the right is a little preview of the dataset.
It should be a simple set of multiple multivariate time series, with prodid as the identifier. Because the dates are unevenly spaced, I am trying to resample it with the time series preparation plugin to get one record per time series per week. The following resampling recipe setup gives me the error:
After running this recipe, the following error appears within a few seconds:
I cannot understand its meaning. Can it be caused by prodid being an integer (all the examples had a string column as the identifier)? Or is there something else I'm overlooking?
Please let me know if I should share the whole log too for better understanding. Thanks in advance.
Answers
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
Hi @valentinaprotti, welcome to the community!
Could you share the whole error message? Your screenshot is missing the part where the Python error is displayed. With that information we could help you more!
Cheers
-
I have the exact same issue as @valentinaprotti. I don't know if you were able to solve it?
I have simplified my table to 3 columns:
- weekParsed (dates that are supposed to be weekly, but some weeks are missing),
- CUSTid, a customer ID stored as an integer, which I'd like to use as the "long format" identifier in the table
- REAL, the sales value I'm trying to predict. I'd just like to add '0' where there is no value for a missing week
Do you have any idea?
Here is the error log:
(......)
[2021/04/10-17:25:20.299] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,296 INFO Dataiku Python entrypoint starting up
[2021/04/10-17:25:20.300] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,296 INFO executable = /home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python
[2021/04/10-17:25:20.302] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,296 INFO argv = ['/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/python-exec-wrapper.py', '/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/script.py']
[2021/04/10-17:25:20.304] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,297 INFO --------------------
[2021/04/10-17:25:20.304] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,297 INFO Looking for RemoteRunEnvDef in ./remote-run-env-def.json
[2021/04/10-17:25:20.305] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,297 INFO Found RemoteRunEnvDef environment: ./remote-run-env-def.json
[2021/04/10-17:25:20.306] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,298 INFO Running a DSS Python recipe locally, uinsetting env
[2021/04/10-17:25:20.307] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,299 INFO Setup complete, ready to execute Python code
[2021/04/10-17:25:20.309] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,300 INFO Sys path: ['/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy', '/home/dataiku/dss/lib/python', '/home/dataiku/dataiku-dss-9.0.1/python', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages', '/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib/python3.6/site-packages', '/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/localconfig/projects/LPR1/lib/python', '/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib']
[2021/04/10-17:25:20.311] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:20,300 INFO Script file: /home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/script.py
[2021/04/10-17:25:24.773] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:24,772 INFO Computing for group: 1
[2021/04/10-17:25:24.781] [null-err-36] [INFO] [dku.utils] - *************** Recipe code failed **************
[2021/04/10-17:25:24.782] [null-err-36] [INFO] [dku.utils] - Begin Python stack
[2021/04/10-17:25:24.796] [null-err-36] [INFO] [dku.utils] - Traceback (most recent call last):
[2021/04/10-17:25:24.796] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
[2021/04/10-17:25:24.797] [null-err-36] [INFO] [dku.utils] - return self._engine.get_loc(key)
[2021/04/10-17:25:24.797] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
[2021/04/10-17:25:24.798] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
[2021/04/10-17:25:24.798] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021/04/10-17:25:24.799] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021/04/10-17:25:24.800] [null-err-36] [INFO] [dku.utils] - KeyError: 'CUSTid'
[2021/04/10-17:25:24.801] [null-err-36] [INFO] [dku.utils] - During handling of the above exception, another exception occurred:
[2021/04/10-17:25:24.802] [null-err-36] [INFO] [dku.utils] - Traceback (most recent call last):
[2021/04/10-17:25:24.803] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/python-exec-wrapper.py", line 206, in <module>
[2021/04/10-17:25:24.804] [null-err-36] [INFO] [dku.utils] - exec(f.read())
[2021/04/10-17:25:24.805] [null-err-36] [INFO] [dku.utils] - File "<string>", line 23, in <module>
[2021/04/10-17:25:24.806] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/resampling.py", line 87, in transform
[2021/04/10-17:25:24.806] [null-err-36] [INFO] [dku.utils] - group_resampled = self._resample(group.drop(groupby_columns, axis=1), datetime_column, columns_to_resample, reference_time_index, df_id=group_id)
[2021/04/10-17:25:24.807] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/resampling.py", line 127, in _resample
[2021/04/10-17:25:24.808] [null-err-36] [INFO] [dku.utils] - filtered_columns_to_resample = filter_empty_columns(df, columns_to_resample)
[2021/04/10-17:25:24.810] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/dataframe_helpers.py", line 22, in filter_empty_columns
[2021/04/10-17:25:24.811] [null-err-36] [INFO] [dku.utils] - if np.sum(df[col].notnull()) > 1: # in fact we filter out columns with less than 2 values
[2021/04/10-17:25:24.811] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
[2021/04/10-17:25:24.812] [null-err-36] [INFO] [dku.utils] - return self._getitem_column(key)
[2021/04/10-17:25:24.813] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
[2021/04/10-17:25:24.813] [null-err-36] [INFO] [dku.utils] - return self._get_item_cache(key)
[2021/04/10-17:25:24.814] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
[2021/04/10-17:25:24.814] [null-err-36] [INFO] [dku.utils] - values = self._data.get(item)
[2021/04/10-17:25:24.815] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
[2021/04/10-17:25:24.815] [null-err-36] [INFO] [dku.utils] - loc = self.items.get_loc(item)
[2021/04/10-17:25:24.816] [null-err-36] [INFO] [dku.utils] - File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
[2021/04/10-17:25:24.817] [null-err-36] [INFO] [dku.utils] - return self._engine.get_loc(self._maybe_cast_indexer(key))
[2021/04/10-17:25:24.818] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
[2021/04/10-17:25:24.819] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
[2021/04/10-17:25:24.819] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021/04/10-17:25:24.820] [null-err-36] [INFO] [dku.utils] - File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
[2021/04/10-17:25:24.820] [null-err-36] [INFO] [dku.utils] - KeyError: 'CUSTid'
[2021/04/10-17:25:24.821] [null-err-36] [INFO] [dku.utils] - End Python stack
[2021/04/10-17:25:24.821] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:24,798 INFO Check if spark is available
[2021/04/10-17:25:24.822] [null-err-36] [INFO] [dku.utils] - 2021-04-10 17:25:24,800 INFO Not stopping a spark context: No module named 'pyspark'
[2021/04/10-17:25:24.977] [FRT-33-FlowRunnable] [WARN] [dku.resource] - stat file for pid 2899 does not exist. Process died?
[2021/04/10-17:25:24.979] [FRT-33-FlowRunnable] [INFO] [dku.resourceusage] - Reporting completion of CRU:{"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"LPR1","jobId":"Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032","activityId":"compute_deliveryWeeklyTime_NP","activityType":"recipe","recipeType":"CustomCode_timeseries-preparation-resampling","recipeName":"compute_deliveryWeeklyTime"},"type":"LOCAL_PROCESS","id":"VbJeXPAfdOBWDzPc","startTime":1618075520039,"localProcess":{"pid":2899,"commandName":"/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python","cpuUserTimeMS":40,"cpuSystemTimeMS":10,"cpuChildrenUserTimeMS":0,"cpuChildrenSystemTimeMS":0,"cpuTotalMS":50,"cpuCurrent":0.0,"vmSizeMB":121,"vmRSSMB":4,"vmHWMMB":4,"vmRSSAnonMB":1,"vmDataMB":1,"vmSizePeakMB":121,"vmRSSPeakMB":4,"vmRSSTotalMBS":0,"majorFaults":3,"childrenMajorFaults":0}}
[2021/04/10-17:25:24.980] [FRT-33-FlowRunnable] [INFO] [dku.usage.computeresource.jek] - Reporting completion of resource usage: {"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"LPR1","jobId":"Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032","activityId":"compute_deliveryWeeklyTime_NP","activityType":"recipe","recipeType":"CustomCode_timeseries-preparation-resampling","recipeName":"compute_deliveryWeeklyTime"},"type":"LOCAL_PROCESS","id":"VbJeXPAfdOBWDzPc","startTime":1618075520039,"endTime":1618075524979,"localProcess":{"pid":2899,"commandName":"/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python","cpuUserTimeMS":40,"cpuSystemTimeMS":10,"cpuChildrenUserTimeMS":0,"cpuChildrenSystemTimeMS":0,"cpuTotalMS":50,"cpuCurrent":0.0,"vmSizeMB":121,"vmRSSMB":4,"vmHWMMB":4,"vmRSSAnonMB":1,"vmDataMB":1,"vmSizePeakMB":121,"vmRSSPeakMB":4,"vmRSSTotalMBS":0,"majorFaults":3,"childrenMajorFaults":0}}
[2021/04/10-17:25:24.981] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Error file found, trying to throw it: /home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/error.json
[2021/04/10-17:25:24.981] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Raw error is{"errorType":"\u003cclass \u0027KeyError\u0027\u003e","message":"CUSTid","detailedMessage":"At line 23: \u003cclass \u0027KeyError\u0027\u003e: CUSTid","stackTrace":[]}
[2021/04/10-17:25:24.982] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Now err: {"errorType":"\u003cclass \u0027KeyError\u0027\u003e","message":"Error in Python process: CUSTid","detailedMessage":"Error in Python process: At line 23: \u003cclass \u0027KeyError\u0027\u003e: CUSTid","stackTrace":[]}
[2021/04/10-17:25:24.987] [FRT-33-FlowRunnable] [INFO] [dku.flow.activity] - Run thread failed for activity compute_deliveryWeeklyTime_NP
(.........)
-
Ignacio_Toledo
Hi @xav56, thanks for sharing the complete log! It took some time, but I was able to reproduce the problem.
Summary: There is a bug in the code that should be reported to the plugin owners (I think they are from Dataiku).
Quick workaround: convert the column "CUSTid", or whichever column you use as the long-format grouping identifier, to a string. It must contain at least one non-digit character, so that pandas does not read it back as a number.
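As an illustration of the workaround, here is a minimal pandas sketch, assuming the conversion is done in a Python recipe or notebook before the resampling recipe. The helper name `prefix_identifier` and the prefix "C" are hypothetical and not part of the plugin; any step that yields a string with a non-digit character should do.

```python
import pandas as pd

def prefix_identifier(df, column, prefix="C"):
    """Return a copy of df where `column` is a string with a non-digit prefix.

    Hypothetical helper for illustration only: the prefix guarantees the
    identifier cannot be inferred as an integer or a float.
    """
    out = df.copy()
    out[column] = prefix + out[column].astype(str)
    return out

# Toy data using the column names from this thread.
sales = pd.DataFrame({
    "weekParsed": pd.to_datetime(["2021-01-04", "2021-01-18", "2021-01-04"]),
    "CUSTid": [1, 1, 2],
    "REAL": [10.0, 5.0, 7.0],
})

fixed = prefix_identifier(sales, "CUSTid")
print(fixed["CUSTid"].tolist())
```

The original frame is left unchanged, so the numeric identifier is still available if it is needed elsewhere in the flow.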
Detailed answer: There is a bug that happens when the column used for the long format can be recognized by pandas as either an integer or a double. When the column is a string, the plugin works as expected.
The problem is in the Python script "dku_timeseries/resampling.py", within the method "transform" of the "Resample" class, line 87:
columns_to_resample = [col for col in df_copy.select_dtypes([int, float]).columns.tolist() if col != datetime_column]
So any column that is either an integer or a float is assumed to be a column that needs to be resampled. In your case, and in the case of @valentinaprotti, the columns used for the long-format grouping are integers, so they are wrongly detected as "columns_to_resample".
The error is raised later, when the resampling algorithm works on a dataframe where the grouping column is no longer present (because it was used to create the groups and then dropped), but the list "columns_to_resample" still contains it:
filtered_columns_to_resample = filter_empty_columns(df, columns_to_resample)
....
# filter_empty_columns function is defined in another file, but here it is
def filter_empty_columns(df, columns):
    filtered_columns = []
    for col in columns:  # <- columns contains a column no longer available in df
        if np.sum(df[col].notnull()) > 1:  # in fact we filter out columns with less than 2 values
            filtered_columns.append(col)
    return filtered_columns
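The failure described above can be reproduced outside the plugin with a few lines of pandas. This is a sketch under assumptions: the toy data is invented, `select_dtypes("number")` stands in for the plugin's `[int, float]` selection, and the "possible fix" (excluding the grouping columns from the list) is only one plausible correction, not necessarily the patch the plugin team will ship.

```python
import numpy as np
import pandas as pd

def filter_empty_columns(df, columns):
    # Same logic as the helper quoted above: keep columns with at least 2 values.
    filtered_columns = []
    for col in columns:
        if np.sum(df[col].notnull()) > 1:
            filtered_columns.append(col)
    return filtered_columns

# Invented toy data using the column names from this thread.
df = pd.DataFrame({
    "weekParsed": pd.to_datetime(["2021-01-04", "2021-01-18", "2021-01-04"]),
    "CUSTid": [1, 1, 2],
    "REAL": [10.0, 5.0, 7.0],
})
datetime_column = "weekParsed"
groupby_columns = ["CUSTid"]

# Every numeric column is selected, including the integer identifier.
columns_to_resample = [
    col for col in df.select_dtypes("number").columns.tolist()
    if col != datetime_column
]

# The grouping column is dropped from each group before resampling.
group = df[df["CUSTid"] == 1].drop(groupby_columns, axis=1)

try:
    filter_empty_columns(group, columns_to_resample)
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

# Possible fix (an assumption): never treat grouping columns as resampled columns.
safe_columns = [col for col in columns_to_resample if col not in groupby_columns]
kept = filter_empty_columns(group, safe_columns)
print(kept)
```

With a string identifier, `select_dtypes("number")` would not pick up "CUSTid" in the first place, which is why the workaround avoids the error.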
@CoreyS or any other Dataiker, should a ticket be created on the support page?
Hope this helps!
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Amazing debugging work on your part.
-
CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭
Thanks @Ignacio_Toledo, I've reported the potential bug to the plugins team.
-
I've added a letter at the beginning of the CUSTid column, and it works!!
A big thank you to @Ignacio_Toledo for helping.
Xavier
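For reference, the goal described earlier in the thread (one row per customer per week, with 0 where a week is missing) can also be sketched in plain pandas, independently of the plugin. This is only an illustrative sketch: the data is invented, the weeks are assumed to start on Mondays (hence the "W-MON" frequency), and each customer is filled only between its own first and last week.

```python
import pandas as pd

# Invented toy data: customer 1 has no row for the week of 2021-01-11.
sales = pd.DataFrame({
    "weekParsed": pd.to_datetime(
        ["2021-01-04", "2021-01-18", "2021-01-04", "2021-01-11"]
    ),
    "CUSTid": [1, 1, 2, 2],
    "REAL": [10.0, 5.0, 7.0, 3.0],
})

# Resample each customer to a weekly grid; the sum over an empty week is 0.
weekly = (
    sales.set_index("weekParsed")
         .groupby("CUSTid")["REAL"]
         .resample("W-MON")
         .sum()
         .reset_index()
)
print(weekly)
```

Summing is a choice that suits sales values; another aggregation or an explicit fill would be needed if a missing week should not count as zero.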
-
ClemenceB Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Product Ideas Manager Posts: 18 Dataiker
Hi,
Thanks @Ignacio_Toledo for helping with this! Indeed, we have an issue with the long format when the identifier column is numerical. This will be fixed in the next release of the plugin (which should be soon).