Refer to a managed folder as a folder instead of individual files

Hi all,
I'm running into a problem while setting up a BERT model script in a text classification flow. The Hugging Face transformers library takes the path of a folder containing multiple files as input. This works fine when testing locally, but the architecture of Dataiku forces me to use a managed folder, in which the language model (BERT transformer) files reside, and I see no easy way of passing the managed folder as input. Is there a common solution for this? Below is a solution I've tried:
    from transformers import BertTokenizer, BertModel
    import dataiku

    # Path to the language model managed folder
    LM_FOLDER_NAME = 'LM'
    LM_FOLDER = dataiku.Folder(LM_FOLDER_NAME)
    LM_PATH = LM_FOLDER.get_path()

    tokenizer = BertTokenizer.from_pretrained(LM_PATH + "/output_large")
    bert_model = BertModel.from_pretrained(LM_PATH + "/output_large")

However, the code above gives the following error:

    ---------------------------------------------------------------------------
    PermissionError                           Traceback (most recent call last)
    <ipython-input-4-a97560aab45e> in <module>
    ----> 4 tokenizer = BertTokenizer.from_pretrained(LM_PATH + "/output_large")
          5 bert_model = BertModel.from_pretrained(LM_PATH + "/output_large")

    /opt/dataiku/code-env/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
       1668                 local_files_only=local_files_only,
       1669                 use_auth_token=use_auth_token,
    -> 1670                 user_agent=user_agent,
       1671             )
       1672

    /opt/dataiku/code-env/lib/python3.6/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
       1171             user_agent=user_agent,
       1172             use_auth_token=use_auth_token,
    -> 1173             local_files_only=local_files_only,
       1174         )
       1175     elif os.path.exists(url_or_filename):

    /opt/dataiku/code-env/lib/python3.6/site-packages/transformers/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
       1318         cache_dir = str(cache_dir)
       1319
    -> 1320     os.makedirs(cache_dir, exist_ok=True)
       1321
       1322     headers = {"user-agent": http_user_agent(user_agent)}

    /usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
        208     if head and tail and not path.exists(head):
        209         try:
    --> 210             makedirs(head, mode, exist_ok)
        211         except FileExistsError:
        212             # Defeats race condition when another thread created the path

    /usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
        208     if head and tail and not path.exists(head):
        209         try:
    --> 210             makedirs(head, mode, exist_ok)
        211         except FileExistsError:
        212             # Defeats race condition when another thread created the path

    /usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
        208     if head and tail and not path.exists(head):
        209         try:
    --> 210             makedirs(head, mode, exist_ok)
        211         except FileExistsError:
        212             # Defeats race condition when another thread created the path

    /usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
        218             return
        219     try:
    --> 220         mkdir(name, mode)
        221     except OSError:
        222         # Cannot rely on checking for EEXIST, since the operating system

    PermissionError: [Errno 13] Permission denied: '/home/dssuser'
Since get_download_stream() works on individual files, it is of little help here: I need the entire folder, 'output_large'. Any help would be welcome!
Thanks!
Answers
Hi,
/home/dssuser isn't accessible to the UNIX users running recipes or notebooks in DSS, so you need to grant at least traversal permission on the directories leading up to the DSS datadir, starting with a `chmod 755 /home/dssuser`.
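
If changing permissions on the server isn't an option, one possible workaround is to copy the files of the managed folder into a local temporary directory and point from_pretrained() at that directory instead. The following is only a sketch under assumptions, not a tested Dataiku recipe: `copy_managed_subfolder` is a hypothetical helper name, and it assumes the folder object exposes `list_paths_in_partition()` and `get_download_stream(path)` the way `dataiku.Folder` does, and that the notebook user can write to the temporary directory.

```python
import os
import shutil
import tempfile


def copy_managed_subfolder(folder, prefix, dest_dir=None):
    """Copy every file under `prefix` of a managed folder to a local directory.

    `folder` is assumed to behave like dataiku.Folder, i.e. to provide
    list_paths_in_partition() and get_download_stream(path).
    Returns the local directory that now holds the copied files.
    """
    if dest_dir is None:
        # Hypothetical choice: a fresh temp directory owned by the current user.
        dest_dir = tempfile.mkdtemp(prefix="lm_")

    # Normalise to the form "/output_large/" so prefix matching is unambiguous.
    prefix = "/" + prefix.strip("/") + "/"

    for path in folder.list_paths_in_partition():
        normalised = "/" + path.lstrip("/")
        if not normalised.startswith(prefix):
            continue  # file belongs to another part of the managed folder

        relative = normalised[len(prefix):]
        target = os.path.join(dest_dir, *relative.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)

        # Stream the file out of the managed folder and onto local disk.
        with folder.get_download_stream(path) as stream, open(target, "wb") as out:
            shutil.copyfileobj(stream, out)

    return dest_dir


# Usage inside DSS (not executed here, requires dataiku and transformers):
# import dataiku
# from transformers import BertTokenizer, BertModel
# local_dir = copy_managed_subfolder(dataiku.Folder("LM"), "output_large")
# tokenizer = BertTokenizer.from_pretrained(local_dir)
# bert_model = BertModel.from_pretrained(local_dir)
```

Because it goes through the stream API rather than get_path(), this approach should not depend on the managed folder living on the local filesystem, at the cost of duplicating the model files for each run.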