PicklingError for Custom Preprocessing (Text)
I get an error when applying custom preprocessing to a text feature in a VisualML model:
Traceback (most recent call last):
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/server.py", line 47, in serve
[2020/07/14-14:42:34.999] [MRT-16917] [INFO] [dku.block.link.interaction] - Check result for nullity exceptionIfNull=true result=null
    ret = api_command(arg)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/dkuapi.py", line 45, in aux
    return api(**kwargs)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/commands.py", line 311, in train_prediction_models_nosave
    preproc_handler.save_data()
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/preprocessing_handler.py", line 166, in save_data
    self._save_resource(resource_name)
  File "/ws/dataiku-dss-7.0.2/python/dataiku/doctor/preprocessing_handler.py", line 106, in _save_resource
    pickle.dump(resource, resource_file, 2)
_pickle.PicklingError: Can't pickle <class 'StringIdentity'>: attribute lookup StringIdentity on builtins failed
The processing snippet looks like this:
import numpy as np
import pandas as pd

class StringIdentity:
    def __init__(self, names=["DefaultName"]):
        self.names = names

    def fit(self, series):
        pass

    def transform(self, series):
        a = pd.DataFrame(series.map(lambda x: np.array([x])), columns=self.names)
        return a

processor = StringIdentity(["path"])
The purpose is to pass the unchanged string as an input feature.
What could be the issue here? How can I change the code so that it works with pickle?
Thank you!
Best Answer
-
Hi,
The custom preprocessing class can also be put in the per-project libraries editor (https://doc.dataiku.com/dss/latest/python/reusing-code.html#sharing-python-code-within-a-project).
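For context on why moving the class into a library helps: pickle stores a class by reference (module name plus class name) and must be able to import it again. The message "attribute lookup StringIdentity on builtins failed" suggests the class ended up with `__module__ == "builtins"`, which happens when code is executed in a namespace that has no `__name__`. The sketch below reproduces that under the assumption that the editor snippet is executed this way; this is a guess about the mechanism, not confirmed DSS behavior. The `types.ModuleType` object is only a stand-in for a real `stringidentity.py` file in a library folder.

```python
import pickle
import sys
import types

SNIPPET = """
class StringIdentity:
    def __init__(self, names):
        self.names = names
"""

# Case 1: class defined by exec() in a bare namespace.
# ASSUMPTION: this mimics how an inline editor snippet might be run.
bare_namespace = {}
exec(SNIPPET, bare_namespace)
inline_processor = bare_namespace["StringIdentity"](["path"])
inline_module = type(inline_processor).__module__
print("inline class module:", inline_module)

try:
    pickle.dumps(inline_processor, 2)
    inline_pickled = True
except pickle.PicklingError as error:
    inline_pickled = False
    print("inline class cannot be pickled:", error)

# Case 2: the same class living in an importable module.
# The in-memory module is a stand-in for a file named stringidentity.py.
module = types.ModuleType("stringidentity")
exec(SNIPPET, module.__dict__)
sys.modules["stringidentity"] = module

module_processor = module.StringIdentity(["path"])
restored = pickle.loads(pickle.dumps(module_processor, 2))
print("restored names:", restored.names)

# Remove the stand-in module again so it does not shadow a real one.
sys.modules.pop("stringidentity", None)
```

With the class in an importable module, the snippet in the editor would only need something like `from stringidentity import StringIdentity` followed by `processor = StringIdentity(["path"])`.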
However, from reading your code, am I correct that it would keep string data in the output? A model can only be trained on purely numerical data, so your preprocessing needs to encode the string as numerical values in some way.
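As an illustration of what a numeric encoding could look like, here is a minimal sketch that keeps the `fit`/`transform`/`processor` shape from the snippet in the question. The class name `StringStatsEncoder` and the two features (string length and number of `/` characters) are hypothetical choices made only for this example; which features are actually useful depends on the data.

```python
import pandas as pd


class StringStatsEncoder:
    """Hypothetical processor: maps each string to two numeric columns."""

    def __init__(self, names=("length", "depth")):
        self.names = list(names)

    def fit(self, series):
        # Nothing to learn for these simple features.
        pass

    def transform(self, series):
        # Treat missing values as empty strings so every row gets a number.
        text = series.fillna("").astype(str)
        features = pd.DataFrame(
            {
                self.names[0]: text.str.len(),       # number of characters
                self.names[1]: text.str.count("/"),  # number of path separators
            },
            index=series.index,
        )
        return features.astype("float64")


processor = StringStatsEncoder()

# Small usage example with made-up paths.
example = pd.Series(["/data/img/cat.png", "dog.png", None])
encoded = processor.transform(example)
print(encoded)
```

The output contains only float columns, so no string data reaches the model.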
Answers
-
I understand now that I am supposed to put the class in a file in the "lib/python" directory (from https://doc.dataiku.com/dss/latest/machine-learning/features-handling/custom.html). Is this directory located in the DSS DATA_DIR on the local file system? How can a normal user (i.e. not an admin) add a custom preprocessor?
-
Thank you for your reply. I am developing a plugin for VisualML and it requires the filename as an input. BTW, should it be possible to create a module in the plugin python-lib and use it in the custom preprocessor?
I tried both described methods (first, a directory named after the module with an __init__.py inside; second, a file named after the module (stringidentity.py) in the root directory), but alas neither worked. I previously had it working with a global Python file (DATA_DIR/lib/python/stringidentity.py); after removing it and trying the per-project library, it stopped working. Do I have to restart some services in order to register the per-project lib?
-
I double checked to be sure and ran
import sys
for path in sys.path:
    print("PATH: {}".format(path))
The result is:
PATH: 
PATH: /ws/dss/lib/python
PATH: /ws/dataiku-dss-7.0.2/python
PATH: /ws/dataiku-dss-7.0.2/dku-jupyter/packages
PATH: /ws/dss/tmp/ml-plugins-lib/159479944002615163898647754708507
PATH: /ws/dss/plugins/dev/visual-project/python-lib
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python36.zip
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/lib-dynload
PATH: /usr/lib/python3.6
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/site-packages
PATH: /ws/dss/code-envs/python/TensorFlow2/lib/python3.6/site-packages/IPython/extensions
So it seems that the plugin-lib should work but the project-lib does not?!
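A related check, sketched here with the standard library only, is to ask Python directly where a given module would be imported from instead of reading `sys.path` by eye. The helper name `locate_module` is made up for this example; `stringidentity` is the module name used above, and `json` is included only as a module that is known to exist.

```python
import importlib.util


def locate_module(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        # No entry on sys.path provides this module.
        return None
    return spec.origin


for name in ("stringidentity", "json"):
    print("{} -> {}".format(name, locate_module(name)))
```

If this prints `None` for `stringidentity` when run in the same environment as the training job, the library folder holding the file is not on that process's `sys.path`.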