Hi all,
I'm running into a problem while setting up a BERT model script for a text classification flow. Hugging Face transformers take the path of a folder containing multiple model files as input. This works fine when testing locally, but in Dataiku the language model (BERT transformer) files reside in a managed folder, and I see no easy way of passing the managed folder as input. Is there a common solution for this? Below is what I've tried:
from transformers import BertTokenizer, BertModel
import dataiku
# Name of the managed folder holding the language model
LM_FOLDER_NAME = 'LM'
LM_FOLDER = dataiku.Folder(LM_FOLDER_NAME)
LM_PATH = LM_FOLDER.get_path()
tokenizer = BertTokenizer.from_pretrained(LM_PATH + "/output_large")
bert_model = BertModel.from_pretrained(LM_PATH + "/output_large")
However, the code above gives the following error:
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-4-a97560aab45e> in <module>
----> 4 tokenizer = BertTokenizer.from_pretrained(LM_PATH + "/output_large")
      5 bert_model = BertModel.from_pretrained(LM_PATH + "/output_large")

/opt/dataiku/code-env/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1668     local_files_only=local_files_only,
   1669     use_auth_token=use_auth_token,
-> 1670     user_agent=user_agent,
   1671 )
   1672

/opt/dataiku/code-env/lib/python3.6/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
   1171     user_agent=user_agent,
   1172     use_auth_token=use_auth_token,
-> 1173     local_files_only=local_files_only,
   1174 )
   1175 elif os.path.exists(url_or_filename):

/opt/dataiku/code-env/lib/python3.6/site-packages/transformers/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
   1318     cache_dir = str(cache_dir)
   1319
-> 1320 os.makedirs(cache_dir, exist_ok=True)
   1322 headers = {"user-agent": http_user_agent(user_agent)}

/usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)    [frame repeated three times by the recursion]
    208 if head and tail and not path.exists(head):
    209     try:
--> 210         makedirs(head, mode, exist_ok)
    211     except FileExistsError:
    212         # Defeats race condition when another thread created the path

/usr/lib64/python3.6/os.py in makedirs(name, mode, exist_ok)
    218     return
    219 try:
--> 220     mkdir(name, mode)
    221 except OSError:
    222     # Cannot rely on checking for EEXIST, since the operating system

PermissionError: [Errno 13] Permission denied: '/home/dssuser'
Since get_download_stream() works on individual files, it is of little help here: I need the entire 'output_large' folder. Any help would be welcome!
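For what it's worth, one workaround I've seen is to use get_download_stream() per file anyway: list the managed folder's contents, copy every file under the subfolder into a local temporary directory, and point from_pretrained() at that directory. Below is a minimal, untested sketch; the helper name `materialize_subfolder` is mine, and it assumes the callables you pass in behave like `dataiku.Folder.list_paths_in_partition` and `dataiku.Folder.get_download_stream`:

```python
import os
import shutil
import tempfile

def materialize_subfolder(list_paths, open_stream, subfolder):
    """Copy every file under `subfolder` of a managed folder into a local
    temporary directory and return that directory's path.

    `list_paths()` should return all file paths in the managed folder
    (e.g. dataiku.Folder.list_paths_in_partition) and `open_stream(path)`
    should return a readable binary stream for one path
    (e.g. dataiku.Folder.get_download_stream).
    """
    local_dir = tempfile.mkdtemp(prefix="lm_")
    prefix = "/" + subfolder.strip("/") + "/"
    for path in list_paths():
        if not path.startswith(prefix):
            continue  # keep only files inside the requested subfolder
        # Rebuild the relative path under the temp directory
        target = os.path.join(local_dir, *path[len(prefix):].split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open_stream(path) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return local_dir
```

In DSS this would be used roughly as `local_dir = materialize_subfolder(folder.list_paths_in_partition, folder.get_download_stream, "output_large")`, then `BertTokenizer.from_pretrained(local_dir)`. Unlike get_path(), this streaming approach also works when the managed folder is not on the local filesystem.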
Thanks!
Hi,
the /home/dssuser directory isn't accessible to the UNIX users running recipes or notebooks in DSS, so you need to grant at least traversal permission on every directory leading up to the DSS datadir, starting with a `chmod 755 /home/dssuser`
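To make the fix concrete, here is a sketch run against a throwaway directory rather than the real path; on an actual instance the target would be /home/dssuser (and any other parent directory of the datadir):

```shell
# Simulate the permission fix on a throwaway directory; on a real
# instance the target would be /home/dssuser and any parent of the
# DSS datadir that the recipe users must traverse.
demo_home=$(mktemp -d)
chmod 700 "$demo_home"      # restrictive default: only the owner can enter

# 755 keeps the owner's full access and adds the execute ("traversal")
# bit for group and others, so other UNIX users can pass through it
chmod 755 "$demo_home"
ls -ld "$demo_home"
```

The execute bit on a directory is what allows passing through it to reach files below; 755 grants that to everyone while still keeping write access owner-only.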