
PicklingError for Custom Preprocessing (Text)

Solved!
rmios
Level 3

I get an error when applying custom preprocessing to a text feature in a VisualML model:

Traceback (most recent call last):
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/server.py", line 47, in serve
[2020/07/14-14:42:34.999] [MRT-16917] [INFO] [dku.block.link.interaction]  - Check result for nullity exceptionIfNull=true result=null
    ret = api_command(arg)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/dkuapi.py", line 45, in aux
    return api(**kwargs)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/commands.py", line 311, in train_prediction_models_nosave
    preproc_handler.save_data()
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/preprocessing_handler.py", line 166, in save_data
    self._save_resource(resource_name)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/preprocessing_handler.py", line 106, in _save_resource
    pickle.dump(resource, resource_file, 2)
_pickle.PicklingError: Can't pickle <class 'StringIdentity'>: attribute lookup StringIdentity on builtins failed

 
The processing snippet looks like this:

import numpy as np
import pandas as pd

class StringIdentity:
    def __init__(self, names=["DefaultName"]):
        self.names = names

    def fit(self, series):
        pass

    def transform(self, series):
        a = pd.DataFrame(series.map(lambda x : np.array([x])), columns=self.names)
        return a

processor = StringIdentity(["path"])


The purpose is to pass the unchanged string as an input feature.

What could be the issue here? How can I change the code so that it works with pickle?

Thank you!
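
For readers hitting the same message: the text "attribute lookup StringIdentity on builtins failed" is consistent with the class being defined inside a code snippet rather than in an importable module. The sketch below reproduces that situation in plain Python. It assumes the snippet is run with `exec` in a namespace that has no `__name__`, which may or may not be what DSS does internally; the reduced `StringIdentity` class is only for illustration.

```python
import pickle

# A reduced version of the snippet, kept as a string so it can be exec'd.
snippet = '''
class StringIdentity:
    def fit(self, series):
        return self

    def transform(self, series):
        return series

processor = StringIdentity()
'''

# Assumption: the snippet runs in a bare namespace without "__name__".
namespace = {}
exec(snippet, namespace)
processor = namespace["processor"]

# The class body resolves __name__ through builtins, so the class
# claims to live in the "builtins" module.
module_name = type(processor).__module__
print("class module:", module_name)

# pickle stores classes by reference (module + name), so it tries to
# look up builtins.StringIdentity, which does not exist.
try:
    pickle.dumps(processor, 2)
    error = None
except pickle.PicklingError as exc:
    error = exc
print("pickling error:", error)
```

Because pickle only records where a class can be imported from, moving the class into a real module that is importable at both save and load time avoids this lookup failure.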

rmios
Level 3
Author

I understand now that I am supposed to put the class in a file in the "lib/python" directory (from https://doc.dataiku.com/dss/latest/machine-learning/features-handling/custom.html). Is this directory located in the DSS DATA_DIR on the local file system? How can a normal user (i.e. not an admin) add a custom preprocessor?

Clément_Stenac
Dataiker

Hi,

The custom preprocessing can also be put in the per-project libraries editor (https://doc.dataiku.com/dss/latest/python/reusing-code.html#sharing-python-code-within-a-project)

However, from reading your code, am I correct that it would keep string data in the output? A model can only be trained on purely numerical data, so your preprocessing needs to encode the string as numerical values in some way.
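
To illustrate both points, here is a minimal sketch of a processor that could live in a module file and that returns numerical output. The class name `StringOrdinalEncoder`, the module name `stringidentity.py`, and the choice of ordinal codes are all hypothetical and only one of many possible encodings; the `fit`/`transform`/`names` shape simply follows the snippet in the question.

```python
import pandas as pd


class StringOrdinalEncoder:
    """Hypothetical processor: maps each distinct string to an integer code."""

    def __init__(self, names=None):
        # Avoid a mutable default argument; fall back to one column name.
        self.names = list(names) if names is not None else ["code"]
        self.codes_ = {}

    def fit(self, series):
        # Learn one integer code per distinct non-null value, in order seen.
        self.codes_ = {v: i for i, v in enumerate(pd.unique(series.dropna()))}
        return self

    def transform(self, series):
        # Values not seen during fit (or missing) are encoded as -1.
        encoded = series.map(self.codes_).fillna(-1).astype(int)
        return pd.DataFrame({self.names[0]: encoded.to_numpy()},
                            index=series.index)


train = pd.Series(["a/b.png", "c/d.png", "a/b.png"])
encoder = StringOrdinalEncoder(["path"]).fit(train)
result = encoder.transform(pd.Series(["c/d.png", "unseen.png"]))
print(result)

# Assumed usage in the custom preprocessing snippet, if the class is saved
# in a library file named stringidentity.py (hypothetical name):
#   from stringidentity import StringOrdinalEncoder
#   processor = StringOrdinalEncoder(["path"])
```

With the class imported from a module instead of defined in the snippet, pickle can record it by module and name.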

rmios
Level 3
Author

Thank you for your reply. I am developing a plugin for VisualML and it requires the filename as an input. BTW, should it be possible to create a module in the plugin python-lib and use it in the custom preprocessor?

I tried both of the described methods (first, a directory named after the module with an __init__.py inside; second, a file named after the module (stringidentity.py) in the root directory), but neither worked. I previously had it working with a global Python file (DATA_DIR/lib/python/stringidentity.py); after removing that file and trying the per-project library, it stopped working. Do I have to restart some services in order to register the per-project lib?

rmios
Level 3
Author

I double checked to be sure and ran

import sys
for path in sys.path:
    print("PATH: {}".format(path))

The result is:

PATH: 
PATH: /ws/dss/lib/python
PATH: /ws/dataiku-dss-7.0.2/python
PATH: /ws/dataiku-dss-7.0.2/dku-jupyter/packages
PATH: /ws/dss/tmp/ml-plugins-lib/159479944002615163898647754708507
PATH: /ws/dss/plugins/dev/visual-project/python-lib
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python36.zip
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/lib-dynload
PATH: /usr/lib/python3.6
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/site-packages
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/site-packages/IPython/extensions

So it seems that the plugin-lib should work but the project-lib does not?!
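
A complementary check to listing `sys.path` is to ask Python where a given module would actually be loaded from. The sketch below uses the standard `importlib.util.find_spec`; the helper name `locate_module` is made up for this example, and `stringidentity` is the module name from this thread.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would load from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints None when the module is not importable from the current sys.path.
print("stringidentity:", locate_module("stringidentity"))
# A standard-library module, for comparison.
print("json:", locate_module("json"))
```

Running this in the same environment as the preprocessing code shows whether the module is visible there and which copy wins when several directories contain one.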
