Refer to a managed folder as a folder instead of individual files

maarten98

Hi all,

I'm running into a problem while setting up a BERT model script in a text classification flow. The Hugging Face transformers from_pretrained() methods take the path of a folder containing multiple files as input. This works fine when testing locally, but in Dataiku the language model (BERT transformer) files have to reside in a managed folder, and I see no easy way of passing the managed folder itself as input. Is there a common solution for this? Below is what I've tried:

 

from transformers import BertTokenizer, BertModel
import dataiku

# Local filesystem path of the managed folder that holds the language model
LM_FOLDER_NAME = 'LM'
LM_FOLDER = dataiku.Folder(LM_FOLDER_NAME)
LM_PATH = LM_FOLDER.get_path()

tokenizer = BertTokenizer.from_pretrained(LM_PATH + "/output_large")
bert_model = BertModel.from_pretrained(LM_PATH + "/output_large")

 

However, the code above gives the following error:

---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-4-a97560aab45e> in <module>
----> 4 tokenizer = BertTokenizer.from_pretrained(LM_PATH + "/output_large")
      5 bert_model = BertModel.from_pretrained(LM_PATH + "/output_large")

/opt/dataiku/code-env/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1668                         local_files_only=local_files_only,
   1669                         use_auth_token=use_auth_token,
-> 1670                         user_agent=user_agent,
   1671                     )
   1672 

/opt/dataiku/code-env/lib/python3.6/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
   1171             user_agent=user_agent,
   1172             use_auth_token=use_auth_token,
-> 1173             local_files_only=local_files_only,
   1174         )
   1175     elif os.path.exists(url_or_filename):

/opt/dataiku/code-env/lib/python3.6/site-packages/transformers/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
   1318         cache_dir = str(cache_dir)
   1319 
-> 1320     os.makedirs(cache_dir, exist_ok=True)
   1321 
   1322     headers = {"user-agent": http_user_agent(user_agent)}

/usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
    208     if head and tail and not path.exists(head):
    209         try:
--> 210             makedirs(head, mode, exist_ok)
    211         except FileExistsError:
    212             # Defeats race condition when another thread created the path

/usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
    208     if head and tail and not path.exists(head):
    209         try:
--> 210             makedirs(head, mode, exist_ok)
    211         except FileExistsError:
    212             # Defeats race condition when another thread created the path

/usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
    208     if head and tail and not path.exists(head):
    209         try:
--> 210             makedirs(head, mode, exist_ok)
    211         except FileExistsError:
    212             # Defeats race condition when another thread created the path

/usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
    218             return
    219     try:
--> 220         mkdir(name, mode)
    221     except OSError:
    222         # Cannot rely on checking for EEXIST, since the operating system

PermissionError: [Errno 13] Permission denied: '/home/dssuser'

get_download_stream() only works on individual files, so it is of little help here: I need to pass the entire 'output_large' folder. Any help would be welcome!
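
For reference, one possible workaround is to rebuild the sub-folder in a local temporary directory, one file at a time, and point from_pretrained() at that copy. The sketch below is only an illustration under assumptions: copy_folder_to_local is a hypothetical helper name, and it assumes the folder object offers list_paths_in_partition() and get_download_stream() the way dataiku.Folder does, with paths that look like '/output_large/config.json'.

```python
import os
import shutil
import tempfile


def copy_folder_to_local(folder, prefix, dest_root=None):
    """Copy every file under `prefix` of a managed folder to a local directory.

    `folder` is assumed to behave like dataiku.Folder, i.e. to provide
    list_paths_in_partition() and get_download_stream(path).
    Returns the local directory that mirrors `prefix`.
    """
    # Hypothetical helper: a writable temp dir is created if none is given.
    dest_root = dest_root or tempfile.mkdtemp(prefix="lm_")
    wanted = "/" + prefix.strip("/") + "/"

    for path in folder.list_paths_in_partition():
        normalized = "/" + path.lstrip("/")
        if not normalized.startswith(wanted):
            continue  # skip files outside the requested sub-folder
        local_path = os.path.join(dest_root, normalized.lstrip("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        # Stream the file contents to disk without loading them in memory.
        with folder.get_download_stream(path) as stream, \
                open(local_path, "wb") as out:
            shutil.copyfileobj(stream, out)

    return os.path.join(dest_root, prefix.strip("/"))


# Intended use inside DSS (not executed here):
#   local_dir = copy_folder_to_local(dataiku.Folder('LM'), 'output_large')
#   tokenizer = BertTokenizer.from_pretrained(local_dir)
#   bert_model = BertModel.from_pretrained(local_dir)
```

This trades disk space and copy time for not depending on get_path(); whether it avoids the permission error in a given setup is untested here.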

Thanks!

1 Reply
fchataigner2
Dataiker

Hi,

The /home/dssuser directory isn't accessible to the UNIX users that run recipes or notebooks in DSS, so you need to grant them at least traversal permission on every directory leading to the DSS datadir, starting with a `chmod 755 /home/dssuser`.
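
To see which directory is blocking traversal, a small check like the following can be run from the notebook as the same user. It is only a sketch: first_blocked_dir is a hypothetical helper, and it relies solely on the standard os.access() check, which reports what the current process is allowed to do.

```python
import os


def first_blocked_dir(path):
    """Walk from the filesystem root towards `path` (expected to be a
    directory) and return the first directory the current user cannot
    traverse or that does not exist. Returns None if all are traversable.
    """
    path = os.path.abspath(path)

    # Collect `path` and all of its ancestors, deepest first.
    chain = []
    current = path
    while True:
        chain.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent

    # Check from the root downwards; stop at the first failure, because
    # nothing below an untraversable directory can be inspected.
    for directory in reversed(chain):
        if not os.access(directory, os.X_OK):
            return directory
    return None
```

Calling it with the managed folder path, for example `first_blocked_dir(LM_PATH)`, should point at the directory whose permissions need changing.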
