I am trying to use a library called DocTR in Dataiku. Under the hood, DocTR runs deep learning models to perform document text extraction. When I use the library in a Jupyter notebook in Dataiku, the pretrained resnet50 and vgg16 models are downloaded to the notebook's cache and everything works fine.
But when I run the same code in a Python script in Dataiku, I get a PermissionError, presumably because my user does not have permission to download the pretrained models into the cache of the Dataiku instance.
Is there any way to get around this problem other than storing the pretrained models in another location and providing their paths to DocTR?
You should be able to work around the error by exporting the DOCTR_CACHE_DIR and DOCTR_MULTIPROCESSING_DISABLE environment variables. Set DOCTR_CACHE_DIR to a directory that is readable and writable by all users (e.g. chmod 777 permission), such as a directory outside of the DSS data dir and not in another user's home directory. Ref. https://mindee.github.io/doctr/using_doctr/running_on_aws.html
For example, add the following to the Linux user profile of the dssuser (e.g. "~/.bash_profile", or "~/.bashrc") or to the "<DATA_DIR>/bin/env-site.sh" file, then restart DSS:
export DOCTR_MULTIPROCESSING_DISABLE=TRUE
export DOCTR_CACHE_DIR=/tmp
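If editing the dssuser profile is not an option, the same two variables can in principle be set from inside the Python script itself, as long as that happens before DocTR is imported. The following is only a minimal sketch: `configure_doctr_cache` is a hypothetical helper name, the temp-directory location is just an example, and it assumes DocTR reads `DOCTR_CACHE_DIR` at the time it downloads a model.

```python
import os
import tempfile


def configure_doctr_cache(cache_dir):
    """Hypothetical helper: point DocTR at a writable cache directory."""
    os.makedirs(cache_dir, exist_ok=True)
    # Fail early with a clear message instead of a PermissionError mid-download.
    if not os.access(cache_dir, os.W_OK):
        raise PermissionError(f"Cache directory is not writable: {cache_dir}")
    os.environ["DOCTR_CACHE_DIR"] = cache_dir
    os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
    return cache_dir


# Example location only; any directory the job's user can write to will do.
cache_dir = configure_doctr_cache(
    os.path.join(tempfile.gettempdir(), "doctr_cache")
)

# Import DocTR only after the variables are set.
# from doctr.models import ocr_predictor
```

Setting the variables in the script affects only that process, so it does not require a DSS restart, but every script that uses DocTR would need the same lines.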
## Base imports
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## DocTR
# Set DocTR cache directory
# ("doctr" is an example sub-directory of the code env resources folder)
set_env_path("DOCTR_CACHE_DIR", "doctr")
# Disable DocTR multiprocessing
set_env_var("DOCTR_MULTIPROCESSING_DISABLE", "TRUE")

# Import DocTR
from doctr.models import ocr_predictor
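To confirm that a recipe actually picks up the cache location, a small diagnostic like the one below can be run at the top of the script. It is a sketch rather than part of DocTR: `describe_doctr_cache` is a hypothetical name, and the `~/.cache/doctr` fallback is an assumption about DocTR's default when `DOCTR_CACHE_DIR` is unset.

```python
import os


def describe_doctr_cache():
    """Hypothetical diagnostic: report where DocTR would cache models."""
    # Assumed fallback: ~/.cache/doctr when DOCTR_CACHE_DIR is unset.
    default_dir = os.path.join(os.path.expanduser("~"), ".cache", "doctr")
    cache_dir = os.environ.get("DOCTR_CACHE_DIR", default_dir)
    exists = os.path.isdir(cache_dir)
    files = []
    if exists:
        # Collect every file under the cache directory, including subfolders.
        for root, _dirs, names in os.walk(cache_dir):
            files.extend(os.path.join(root, name) for name in names)
    return {
        "cache_dir": cache_dir,
        "from_env": "DOCTR_CACHE_DIR" in os.environ,
        "exists": exists,
        "writable": exists and os.access(cache_dir, os.W_OK),
        "files": sorted(files),
    }


print(describe_doctr_cache())
```

If `from_env` is false or `writable` is false when the recipe runs, the PermissionError is likely to come back, since the download would target a directory the job's user cannot write to.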