
PicklingError for Custom Preprocessing (Text)


I get an error when applying custom preprocessing to a text feature in a Visual ML model:

[2020/07/14-14:42:34.999] [MRT-16917] [INFO] [dku.block.link.interaction]  - Check result for nullity exceptionIfNull=true result=null
Traceback (most recent call last):
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/server.py", line 47, in serve
    ret = api_command(arg)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/dkuapi.py", line 45, in aux
    return api(**kwargs)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/commands.py", line 311, in train_prediction_models_nosave
    preproc_handler.save_data()
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/preprocessing_handler.py", line 166, in save_data
    self._save_resource(resource_name)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/preprocessing_handler.py", line 106, in _save_resource
    pickle.dump(resource, resource_file, 2)
_pickle.PicklingError: Can't pickle <class 'StringIdentity'>: attribute lookup StringIdentity on builtins failed

 
The processing snippet looks like this:

import numpy as np
import pandas as pd

class StringIdentity:
    def __init__(self, names=["DefaultName"]):
        self.names = names

    def fit(self, series):
        pass

    def transform(self, series):
        a = pd.DataFrame(series.map(lambda x : np.array([x])), columns=self.names)
        return a

processor = StringIdentity(["path"])


The purpose is to pass the unchanged string as an input feature.

What could be the issue here? How can I change the code so that it works with pickle?

Thank you!
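The error message itself hints at the cause: pickle stores an instance by reference to its class (module name plus class name) and looks that class up again by import. The sketch below reproduces the same kind of failure outside DSS. It rests on an assumption not confirmed in this thread, namely that the snippet is run through exec in a namespace without a __name__ entry, so that the class ends up with __module__ set to "builtins".

```python
import pickle

# Assumption: the snippet is executed roughly like this, in a namespace
# that has no "__name__" entry. This mimics, but is not, the DSS internals.
snippet = """
class StringIdentity:
    def __init__(self, names=("DefaultName",)):
        self.names = list(names)
"""
namespace = {}
exec(snippet, namespace)

cls = namespace["StringIdentity"]
# With no "__name__" in the namespace, the lookup falls through to the
# builtins module, which is what the class records as its home module.
module_name = cls.__module__

try:
    pickle.dumps(cls(["path"]), protocol=2)
    pickling_failed = False
except (pickle.PicklingError, AttributeError) as exc:
    # pickle cannot find "StringIdentity" as an attribute of builtins.
    pickling_failed = True
    error_text = str(exc)

print(module_name, pickling_failed)
```

If that assumption holds, the class has to live in a real, importable module for pickling to succeed.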

Author

I understand now that I am supposed to put the class in a file in the "lib/python" directory (from https://doc.dataiku.com/dss/latest/machine-learning/features-handling/custom.html). Is this directory located in the DSS DATA_DIR on the local file system? How can a normal user (i.e. not an admin) add a custom preprocessor?
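A minimal sketch of why moving the class into a module helps, runnable outside DSS. A temporary directory stands in for "lib/python" here, and the module name stringidentity.py is the one used later in this thread; the trimmed-down class body is only for illustration.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Stand-in for the "lib/python" directory; a temporary folder is used
# only so that this sketch runs anywhere.
lib_dir = Path(tempfile.mkdtemp())
(lib_dir / "stringidentity.py").write_text(
    "class StringIdentity:\n"
    "    def __init__(self, names=('DefaultName',)):\n"
    "        self.names = list(names)\n"
)
sys.path.insert(0, str(lib_dir))
importlib.invalidate_caches()

# What the custom preprocessing snippet would then contain:
from stringidentity import StringIdentity

processor = StringIdentity(["path"])

# The class now lives in an importable module, so pickle can store a
# reference to it and resolve that reference again on load.
restored = pickle.loads(pickle.dumps(processor, protocol=2))
print(StringIdentity.__module__, restored.names)
```

The same module must be importable wherever the pickled processor is loaded again, not only where it was trained.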

Dataiker

Hi,

The custom preprocessing can also be put in the per-project libraries editor (https://doc.dataiku.com/dss/latest/python/reusing-code.html#sharing-python-code-within-a-project)

However, from reading your code, am I correct that it would keep string data in the output? A model can only be trained on purely numerical data, so your preprocessing needs to encode the string as numerical values in some way.
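As an illustration of what encoding a string as numerical values can look like, here is a sketch based on feature hashing. This technique is not something suggested in the thread, only one common option, and the function name hash_path and the bucket count are hypothetical choices.

```python
import re
import zlib

N_BUCKETS = 8  # hypothetical size; real use would likely need more buckets


def hash_path(path, n_buckets=N_BUCKETS):
    """Turn a path string into a fixed-length vector of token counts."""
    vector = [0.0] * n_buckets
    # Split on path separators, dots, underscores and dashes.
    for token in re.split(r"[\\/._-]+", path.lower()):
        if token:
            # crc32 is stable across processes, unlike the built-in hash().
            bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
            vector[bucket] += 1.0
    return vector


rows = [hash_path(p) for p in ["data/train/cat_01.png", "data/test/dog_07.png"]]
print(rows)
```

Inside a processor's transform method, such rows could be returned as a frame with something like pd.DataFrame(series.map(hash_path).tolist(), index=series.index), assuming the input is a pandas Series as in the snippet above.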

Author

Thank you for your reply. I am developing a plugin for VisualML and it requires the filename as an input. BTW, should it be possible to create a module in the plugin python-lib and use it in the custom preprocessor?

I tried both methods described there (first, a directory named after the module with an __init__.py inside; second, a file named after the module, stringidentity.py, in the root directory), but neither worked. I previously had it working with a global Python file (DATA_DIR/lib/python/stringidentity.py); after removing that file and switching to the per-project library, it stopped working. Do I have to restart any services to register the per-project library?

Author

I double checked to be sure and ran

import sys
for path in sys.path:
    print("PATH: {}".format(path))

The result is:

PATH: 
PATH: /ws/dss/lib/python
PATH: /ws/dataiku-dss-7.0.2/python
PATH: /ws/dataiku-dss-7.0.2/dku-jupyter/packages
PATH: /ws/dss/tmp/ml-plugins-lib/159479944002615163898647754708507
PATH: /ws/dss/plugins/dev/visual-project/python-lib
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python36.zip
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/lib-dynload
PATH: /usr/lib/python3.6
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/site-packages
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/site-packages/IPython/extensions

So it seems that the plugin python-lib is on the path, but the per-project library is not?!
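Listing sys.path shows which directories are searched, but not whether a given module is actually found in one of them. A small sketch of a more direct check, using importlib.util.find_spec from the standard library; the helper name locate_module is hypothetical, and it would need to run in the same place as the preprocessing snippet to be meaningful.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# "stringidentity" is the module name used earlier in this thread;
# "json" is included only as a module that is known to exist.
for name in ("stringidentity", "json"):
    print("{}: {}".format(name, locate_module(name)))
```

A result of None for stringidentity would mean that no directory currently on sys.path contains the module.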
