Time series resampling 'KeyError'

valentinaprotti
valentinaprotti · Registered · Posts: 2

Hi everyone, first post here, looking for help. I am currently working on a time series dataset, preparing it for the new time series forecast plugin. In short, after a sequence of recipes I have obtained a dataset with:

  • a parsed date (date)
  • a product code (integer), prodid
  • a pickup quantity (decimal)
  • a requested quantity (decimal)

On the right is a little preview of the dataset.

It should be a simple case of multiple multivariate time series, with prodid as the identifier. I am now trying to resample it with the time series preparation plugin, to get one record per time series per week, because the dates are unevenly spaced. The following resampling recipe setup is what gives me the error:

[screenshot: Cattura.PNG, the resampling recipe settings]

After running this recipe, the following error appears within a few seconds:

[screenshot: Cattura2.PNG, the error message]

I cannot understand its meaning. Could it be caused by prodid being an integer (all the examples had a string column as the identifier)? Or is there something else I'm overlooking?

Please let me know if I should also share the whole log for better understanding. Thanks in advance!

Answers

  • Ignacio_Toledo
    Ignacio_Toledo · Neuron · Posts: 415

    Hi @valentinaprotti,

    Welcome to the community!

    Could you share the whole error message? Your screenshot is missing the part where the Python error is displayed. With that information we could help you more!

    Cheers

  • xav56
    xav56 · Registered · Posts: 2

    I have the exact same issue as @valentinaprotti. Were you able to solve it?

    I have simplified my table to 3 columns:

    • weekParsed, a parsed date that is supposed to be weekly, but some weeks are missing
    • CUSTid, a customer ID stored as an integer, which I'd like to use as the "long format" identifier
    • REAL, the sales value I'm trying to predict; I'd just like to add 0 where a week is missing

    Do you have any idea?
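
    For reference, here is a rough pandas sketch of what I'm after (column names are from my table; the Monday-anchored weekly frequency "W-MON" is just an assumption, the plugin may anchor weeks differently):

```python
import pandas as pd

# Toy version of my table: weekly dates with gaps, an integer customer ID
df = pd.DataFrame({
    "weekParsed": pd.to_datetime(["2021-01-04", "2021-01-18", "2021-01-04"]),
    "CUSTid": [1, 1, 2],
    "REAL": [10.0, 30.0, 5.0],
})

# One row per customer per week; empty weekly bins sum to 0
out = (
    df.set_index("weekParsed")
      .groupby("CUSTid")["REAL"]
      .resample("W-MON")
      .sum()
      .reset_index()
)
```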

    Here is the error log:

    (......)
    [2021/04/10-17:25:20.299] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,296 INFO Dataiku Python entrypoint starting up
    [2021/04/10-17:25:20.300] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,296 INFO executable = /home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python
    [2021/04/10-17:25:20.302] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,296 INFO argv = ['/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/python-exec-wrapper.py', '/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/script.py']
    [2021/04/10-17:25:20.304] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,297 INFO --------------------
    [2021/04/10-17:25:20.304] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,297 INFO Looking for RemoteRunEnvDef in ./remote-run-env-def.json
    [2021/04/10-17:25:20.305] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,297 INFO Found RemoteRunEnvDef environment: ./remote-run-env-def.json 
    [2021/04/10-17:25:20.306] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,298 INFO Running a DSS Python recipe locally, uinsetting env
    [2021/04/10-17:25:20.307] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,299 INFO Setup complete, ready to execute Python code
    [2021/04/10-17:25:20.309] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,300 INFO Sys path: ['/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy', '/home/dataiku/dss/lib/python', '/home/dataiku/dataiku-dss-9.0.1/python', '/usr/lib64/python36.zip', '/usr/lib64/python3.6', '/usr/lib64/python3.6/lib-dynload', '/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages', '/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib/python3.6/site-packages', '/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/localconfig/projects/LPR1/lib/python', '/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib']
    [2021/04/10-17:25:20.311] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:20,300 INFO Script file: /home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/script.py
    [2021/04/10-17:25:24.773] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:24,772 INFO Computing for group: 1
    [2021/04/10-17:25:24.781] [null-err-36] [INFO] [dku.utils]  - *************** Recipe code failed **************
    [2021/04/10-17:25:24.782] [null-err-36] [INFO] [dku.utils]  - Begin Python stack
    [2021/04/10-17:25:24.796] [null-err-36] [INFO] [dku.utils]  - Traceback (most recent call last):
    [2021/04/10-17:25:24.796] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    [2021/04/10-17:25:24.797] [null-err-36] [INFO] [dku.utils]  -     return self._engine.get_loc(key)
    [2021/04/10-17:25:24.797] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.798] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.798] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.799] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.800] [null-err-36] [INFO] [dku.utils]  - KeyError: 'CUSTid'
    [2021/04/10-17:25:24.801] [null-err-36] [INFO] [dku.utils]  - During handling of the above exception, another exception occurred:
    [2021/04/10-17:25:24.802] [null-err-36] [INFO] [dku.utils]  - Traceback (most recent call last):
    [2021/04/10-17:25:24.803] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/python-exec-wrapper.py", line 206, in <module>
    [2021/04/10-17:25:24.804] [null-err-36] [INFO] [dku.utils]  -     exec(f.read())
    [2021/04/10-17:25:24.805] [null-err-36] [INFO] [dku.utils]  -   File "<string>", line 23, in <module>
    [2021/04/10-17:25:24.806] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/resampling.py", line 87, in transform
    [2021/04/10-17:25:24.806] [null-err-36] [INFO] [dku.utils]  -     group_resampled = self._resample(group.drop(groupby_columns, axis=1), datetime_column, columns_to_resample, reference_time_index, df_id=group_id)
    [2021/04/10-17:25:24.807] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/resampling.py", line 127, in _resample
    [2021/04/10-17:25:24.808] [null-err-36] [INFO] [dku.utils]  -     filtered_columns_to_resample = filter_empty_columns(df, columns_to_resample)
    [2021/04/10-17:25:24.810] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/plugins/installed/timeseries-preparation/python-lib/dku_timeseries/dataframe_helpers.py", line 22, in filter_empty_columns
    [2021/04/10-17:25:24.811] [null-err-36] [INFO] [dku.utils]  -     if np.sum(df[col].notnull()) > 1: # in fact we filter out columns with less than 2 values
    [2021/04/10-17:25:24.811] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    [2021/04/10-17:25:24.812] [null-err-36] [INFO] [dku.utils]  -     return self._getitem_column(key)
    [2021/04/10-17:25:24.813] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    [2021/04/10-17:25:24.813] [null-err-36] [INFO] [dku.utils]  -     return self._get_item_cache(key)
    [2021/04/10-17:25:24.814] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    [2021/04/10-17:25:24.814] [null-err-36] [INFO] [dku.utils]  -     values = self._data.get(item)
    [2021/04/10-17:25:24.815] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
    [2021/04/10-17:25:24.815] [null-err-36] [INFO] [dku.utils]  -     loc = self.items.get_loc(item)
    [2021/04/10-17:25:24.816] [null-err-36] [INFO] [dku.utils]  -   File "/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    [2021/04/10-17:25:24.817] [null-err-36] [INFO] [dku.utils]  -     return self._engine.get_loc(self._maybe_cast_indexer(key))
    [2021/04/10-17:25:24.818] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.819] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
    [2021/04/10-17:25:24.819] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.820] [null-err-36] [INFO] [dku.utils]  -   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    [2021/04/10-17:25:24.820] [null-err-36] [INFO] [dku.utils]  - KeyError: 'CUSTid'
    [2021/04/10-17:25:24.821] [null-err-36] [INFO] [dku.utils]  - End Python stack
    [2021/04/10-17:25:24.821] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:24,798 INFO Check if spark is available
    [2021/04/10-17:25:24.822] [null-err-36] [INFO] [dku.utils]  - 2021-04-10 17:25:24,800 INFO Not stopping a spark context: No module named 'pyspark'
    [2021/04/10-17:25:24.977] [FRT-33-FlowRunnable] [WARN] [dku.resource] - stat file for pid 2899 does not exist. Process died?
    [2021/04/10-17:25:24.979] [FRT-33-FlowRunnable] [INFO] [dku.resourceusage] - Reporting completion of CRU:{"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"LPR1","jobId":"Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032","activityId":"compute_deliveryWeeklyTime_NP","activityType":"recipe","recipeType":"CustomCode_timeseries-preparation-resampling","recipeName":"compute_deliveryWeeklyTime"},"type":"LOCAL_PROCESS","id":"VbJeXPAfdOBWDzPc","startTime":1618075520039,"localProcess":{"pid":2899,"commandName":"/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python","cpuUserTimeMS":40,"cpuSystemTimeMS":10,"cpuChildrenUserTimeMS":0,"cpuChildrenSystemTimeMS":0,"cpuTotalMS":50,"cpuCurrent":0.0,"vmSizeMB":121,"vmRSSMB":4,"vmHWMMB":4,"vmRSSAnonMB":1,"vmDataMB":1,"vmSizePeakMB":121,"vmRSSPeakMB":4,"vmRSSTotalMBS":0,"majorFaults":3,"childrenMajorFaults":0}}
    [2021/04/10-17:25:24.980] [FRT-33-FlowRunnable] [INFO] [dku.usage.computeresource.jek] - Reporting completion of resource usage: {"context":{"type":"JOB_ACTIVITY","authIdentifier":"admin","projectKey":"LPR1","jobId":"Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032","activityId":"compute_deliveryWeeklyTime_NP","activityType":"recipe","recipeType":"CustomCode_timeseries-preparation-resampling","recipeName":"compute_deliveryWeeklyTime"},"type":"LOCAL_PROCESS","id":"VbJeXPAfdOBWDzPc","startTime":1618075520039,"endTime":1618075524979,"localProcess":{"pid":2899,"commandName":"/home/dataiku/dss/code-envs/python/plugin_timeseries-preparation_managed/bin/python","cpuUserTimeMS":40,"cpuSystemTimeMS":10,"cpuChildrenUserTimeMS":0,"cpuChildrenSystemTimeMS":0,"cpuTotalMS":50,"cpuCurrent":0.0,"vmSizeMB":121,"vmRSSMB":4,"vmHWMMB":4,"vmRSSAnonMB":1,"vmDataMB":1,"vmSizePeakMB":121,"vmRSSPeakMB":4,"vmRSSTotalMBS":0,"majorFaults":3,"childrenMajorFaults":0}}
    [2021/04/10-17:25:24.981] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Error file found, trying to throw it: /home/dataiku/dss/jobs/LPR1/Build_deliveryWeeklyTime__NP__2021-04-10T17-25-09.032/compute_deliveryWeeklyTime_NP/custom-python-recipe/pyout82HiNcVqdaKy/error.json
    [2021/04/10-17:25:24.981] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Raw error is{"errorType":"\u003cclass \u0027KeyError\u0027\u003e","message":"CUSTid","detailedMessage":"At line 23: \u003cclass \u0027KeyError\u0027\u003e: CUSTid","stackTrace":[]}
    [2021/04/10-17:25:24.982] [FRT-33-FlowRunnable] [INFO] [dip.exec.resultHandler] - Now err: {"errorType":"\u003cclass \u0027KeyError\u0027\u003e","message":"Error in Python process: CUSTid","detailedMessage":"Error in Python process: At line 23: \u003cclass \u0027KeyError\u0027\u003e: CUSTid","stackTrace":[]}
    [2021/04/10-17:25:24.987] [FRT-33-FlowRunnable] [INFO] [dku.flow.activity] - Run thread failed for activity compute_deliveryWeeklyTime_NP
    (.........)

  • Ignacio_Toledo
    Ignacio_Toledo · Neuron · Posts: 415

    Hi @xav56,

    Thanks for sharing the complete log! It took some time but I was able to reproduce the problem.

    Summary: there is a bug in the plugin code that should be reported to the plugin owners (I believe they are from Dataiku).

    Quick workaround: convert the "CUSTid" column (or whichever column you want to use for the long-format grouping) to a string. It must contain at least one non-digit character.
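
    For instance, in a Python step the conversion could look like this (the "C" prefix is arbitrary; any non-digit character works):

```python
import pandas as pd

df = pd.DataFrame({"CUSTid": [101, 102], "REAL": [1.0, 2.0]})

# Prefix a letter so pandas can no longer infer the column as numeric
df["CUSTid"] = "C" + df["CUSTid"].astype(str)
```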

    Detailed answer: there is a bug that occurs when the column used for the long-format grouping can be recognized by pandas as an integer or a double. When the column is a string, the plugin works as expected.

    The problem is in the Python script "dku_timeseries/resampling.py", within the "transform" method of the "Resample" class, around line 87:

    columns_to_resample = [col for col in df_copy.select_dtypes([int, float]).columns.tolist() if col != datetime_column]

    So, any column that is an integer or a float is assumed to be a column that needs to be resampled. In your case, and in @valentinaprotti's case, the columns used for the long-format grouping are integers, so they are wrongly detected as "columns_to_resample".

    The bug surfaces later in the process, when the resampling algorithm receives a dataframe in which the grouping column is no longer present (it was dropped when the groups were created), while the list "columns_to_resample" still contains it:

    filtered_columns_to_resample = filter_empty_columns(df, columns_to_resample)
    ....
    # filter_empty_columns is defined in another file (dku_timeseries/dataframe_helpers.py):

    def filter_empty_columns(df, columns):
        filtered_columns = []
        for col in columns:  # <- may contain a column no longer present in df
            if np.sum(df[col].notnull()) > 1:  # i.e. filter out columns with fewer than 2 values
                filtered_columns.append(col)
        return filtered_columns
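
    A possible fix, just a sketch and not the actual plugin patch, would be to exclude the grouping columns when building the list, and to make the filter defensive (the function name numeric_columns_to_resample is mine):

```python
import numpy as np
import pandas as pd

def numeric_columns_to_resample(df, datetime_column, groupby_columns):
    # Numeric columns only, excluding the datetime column and the grouping keys
    return [
        col for col in df.select_dtypes([int, float]).columns.tolist()
        if col != datetime_column and col not in groupby_columns
    ]

def filter_empty_columns(df, columns):
    # Defensive version: skip names that are no longer present in df
    return [
        col for col in columns
        if col in df.columns and np.sum(df[col].notnull()) > 1
    ]
```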

    @CoreyS, or any other Dataiker: should a ticket be created on the support page?

    Hope this helps!

  • tgb417
    tgb417 · Neuron · Posts: 1,598

    @Ignacio_Toledo,

    Amazing piece of debugging on your part.

  • CoreyS
    CoreyS · Dataiker Alumni · Posts: 1,150

    Thanks @Ignacio_Toledo, I've reported the potential bug to the plugins team.

  • xav56
    xav56 · Registered · Posts: 2

    I've added a letter at the beginning of the CUSTid column, and it works!

    A big thank you @Ignacio_Toledo for helping!

    Xavier

  • ClemenceB
    ClemenceB · Dataiker · Posts: 18

    Hi,
    Thanks @Ignacio_Toledo for helping on this! Indeed, we have an issue with long format when the identifier column is numerical. This will be fixed in the next release of the plugin (coming soon).
