Hi! I might need our admin for this one, but I hoped I could manage without... To be able to use the transformers package in a Jupyter notebook / code node, a cache folder is needed. I tried to set the environment variable to the notebook-specific location, in which I have permission to create a cache subfolder, but it is not found/used by the from_pretrained function. It probably can't access it.
import os

# Create a cache folder next to the notebook (we have write permission here)
cachedir = os.path.join(os.getcwd(), 'cache')
os.makedirs(cachedir, exist_ok=True)

# Point the Hugging Face cache there *before* importing transformers
os.environ['HF_HOME'] = cachedir

from transformers import AutoTokenizer

production_tokenizer_model = 'GroNLP/bert-base-dutch-cased'
production_tokenizer = AutoTokenizer.from_pretrained(
    production_tokenizer_model, cache_dir=cachedir, max_len=512
)
# note: the path in the error below is not the specified cachedir!
PermissionError: [Errno 13] Permission denied: '/home/dssuser_jurriaand42a9a50/.huggingface/token'
My question: is there a generic location in the DSS data folder that is accessible by all, so that I can use it as a cache location? Or do I need to ask our admin to create one that I can use...
Update: I've also tried to make transformer models / tokenizers loadable via the init script of the code env, and this runs perfectly fine, but unfortunately I still get permission errors when I try to load a checkpoint in my notebook...
PermissionError: [Errno 13] Permission denied: '/data/dataiku/dss_data/code-envs/resources/python/webapp_dash/huggingface/transformers/443c1d513d458927e5883e0b1298cdb70ba4d14a55faa236d93e0598efc78fc7.3b16931b59b9aafc3e068b6cd5f0be5e02a209a299e39b1e0056d89eaa3b6a7b.lock'
Any suggestions on how to (let our admin) enable using these models in notebooks?
Extra update: the permission error only occurs when running the code in a Python notebook! When running it as a code node, the loading goes fine...
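For reference, the init script mentioned above looked roughly like this (a sketch: clear_all_env_vars and set_env_path are the helpers from Dataiku's dataiku.code_env_resources module for code env resource scripts; double-check them against the docs for your DSS version):

# Code env resources init script, runs when the code env is built
from dataiku.code_env_resources import clear_all_env_vars, set_env_path

# Clear environment variables set by a previous run of this script
clear_all_env_vars()

# Point the Hugging Face cache at the code env's resources directory
set_env_path("HF_HOME", "huggingface")

# Pre-download the tokenizer into that cache so it can be reused later
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")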
Hi,
You should be able to set the cache to the current working directory in a notebook.
import os

# Set these before importing transformers so the cache locations take effect
os.environ['TRANSFORMERS_CACHE'] = os.getcwd() + '/transformers'
os.environ['HUGGINGFACE_HUB_CACHE'] = os.getcwd() + '/transformers'
os.environ['HF_HOME'] = os.getcwd() + '/transformers'
os.environ['XDG_CACHE_HOME'] = os.getcwd() + '/huggingface'
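For example, loading a public model right after setting these should leave the files under the notebook's working directory. A quick sanity check (a sketch, reusing the model from your post):

import os

# Point the caches at the working directory *before* importing transformers
cache_root = os.getcwd() + '/transformers'
os.environ['TRANSFORMERS_CACHE'] = cache_root
os.environ['HF_HOME'] = cache_root

from transformers import AutoTokenizer
AutoTokenizer.from_pretrained('GroNLP/bert-base-dutch-cased')

# Verify that the downloaded files actually landed in the local cache
print(os.listdir(cache_root))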
To better understand why this would be failing in the first place in a notebook, can you please confirm:
1. Was the notebook kernel unloaded/reloaded (or was the "Force reload" option used) after making the updates to the code env?
2. Is the notebook running locally or in containerized execution?
3. What DSS version are you currently on?
Thanks
Hi Alex, thanks for your reply! With some support from your colleagues I found a way to get Hugging Face models and tokenizers loaded in a notebook; the trick was to add the parameter use_auth_token=False to the from_pretrained() function. Hence:
tokenizer = AutoTokenizer.from_pretrained(checkpoint, max_len=512, use_auth_token=False)
I'm still not totally sure why this blocks execution when set to (the default) True, but as far as I understand from its description, 'The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).' (source: https://huggingface.co/docs/transformers/main_classes/model), it tries to write a token somewhere it is not allowed to (in our config, anyway). Since the token is only relevant for private models, it's fine to set it to False in our setting.
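Putting it together, the variant that ended up working in the notebook looks roughly like this (a sketch; note that newer transformers releases deprecate use_auth_token in favor of a token parameter, so check the version pinned in your code env):

import os

# Writable cache next to the notebook, set before importing transformers
cachedir = os.path.join(os.getcwd(), 'cache')
os.makedirs(cachedir, exist_ok=True)
os.environ['HF_HOME'] = cachedir

from transformers import AutoTokenizer

checkpoint = 'GroNLP/bert-base-dutch-cased'
# use_auth_token=False skips reading/writing the login token in
# ~/.huggingface, which is what triggered the PermissionError here
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, max_len=512, use_auth_token=False
)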