PermissionError while trying to run a python recipe
Hello,
I am trying to use a library called DocTR in Dataiku. Under the hood, DocTR runs deep learning models to perform document text extraction. When I use the library in a Jupyter notebook in Dataiku, the pretrained resnet50 and vgg16 models are downloaded into the notebook's cache and everything works fine.
But when I try to run the same code within a Python recipe in Dataiku, I get a PermissionError, presumably because my user does not have permission to download the pretrained models into the cache of the Dataiku instance.
Is there any way I can get around this problem other than storing the pretrained models in another location and providing their paths to the DocTR?
Answers
Hi,
You should be able to work around the error by setting two environment variables: DOCTR_CACHE_DIR, pointing to a directory that is readable and writable by all users (e.g. chmod 777 permission), and DOCTR_MULTIPROCESSING_DISABLE, set to TRUE. The cache directory should be outside of the DSS data dir and not in another user's home directory. Ref. https://mindee.github.io/doctr/using_doctr/running_on_aws.html
For example, add the following to the Linux user profile of the dssuser (e.g. "~/.bash_profile", or "~/.bashrc") or to the "<DATA_DIR>/bin/env-site.sh" file, then restart DSS:
export DOCTR_MULTIPROCESSING_DISABLE=TRUE
export DOCTR_CACHE_DIR=/tmp
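Once DSS has been restarted, a quick check run from a Python recipe can confirm that the variables actually reached the recipe's process and that the cache directory is writable by the user running it. This is only a minimal sketch using the standard library; `check_doctr_cache` is a hypothetical helper name, not part of DocTR or Dataiku.

```python
import getpass
import os
import tempfile


def check_doctr_cache(env=os.environ):
    """Report the DocTR cache settings visible to the current process."""
    cache_dir = env.get("DOCTR_CACHE_DIR")
    writable = False
    if cache_dir and os.path.isdir(cache_dir):
        try:
            # Actually creating a temp file is a stricter probe than os.access
            with tempfile.TemporaryFile(dir=cache_dir):
                writable = True
        except OSError:
            writable = False
    try:
        user = getpass.getuser()
    except (KeyError, OSError, ImportError):
        # The user name cannot always be resolved (e.g. in some containers)
        user = None
    return {
        "user": user,
        "cache_dir": cache_dir,
        "writable": writable,
        "multiprocessing_disabled": env.get("DOCTR_MULTIPROCESSING_DISABLE"),
    }


print(check_doctr_cache())
```

If `cache_dir` comes back as `None`, the exports were probably not picked up by the process running the recipe; if `writable` is `False`, the directory permissions are the more likely culprit.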
Alternatively, the same variables can be set in the code environment's resources initialization script:
## Base imports
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## DocTR
# Set DocTR cache directory (the path is relative to the code env resources directory)
set_env_path("DOCTR_CACHE_DIR", "DOCTR_CACHE_DIR")
set_env_var("DOCTR_MULTIPROCESSING_DISABLE", "TRUE")

# Import DocTR
from doctr.models import ocr_predictor
...
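If neither the dssuser profile nor the script above can be changed, the same two variables can in principle be set at the top of the recipe itself, as long as this happens before DocTR is imported. The sketch below assumes that DocTR reads DOCTR_CACHE_DIR from the process environment when it needs to download a model, as the referenced page suggests; `configure_doctr_cache` is a hypothetical helper and the `doctr_cache` directory name is only an illustration.

```python
import os
import tempfile


def configure_doctr_cache(cache_dir):
    """Point DocTR at a writable cache directory for the current process."""
    # Create the directory if needed; fails early if the user lacks permission
    os.makedirs(cache_dir, exist_ok=True)
    if not os.access(cache_dir, os.W_OK):
        raise PermissionError(f"{cache_dir} is not writable by the current user")
    os.environ["DOCTR_CACHE_DIR"] = cache_dir
    os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
    return cache_dir


# Hypothetical location: any directory the recipe's user can write to will do
configure_doctr_cache(os.path.join(tempfile.gettempdir(), "doctr_cache"))

# Import DocTR only after the environment is configured, e.g.:
# from doctr.models import ocr_predictor
```

This only affects the current recipe run, so every recipe that uses DocTR would need the same preamble; the instance-wide or code-env approaches avoid that repetition.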