Accessible location for cache folder

Solved!
jurriaann
Level 2

Hi! I might need our admin for this one, but I hoped I could manage without... To be able to use the transformers package in a Jupyter notebook / code node, a cache folder is needed. I tried to set the environment variable to a notebook-specific location, in which I have permission to create a cache subfolder, but it is not found/used by the from_pretrained function. Probably it can't access it.

import os

# Create a cache folder next to the notebook and point HF_HOME at it
# before importing transformers, which reads the variable at import time
cachedir = os.path.join(os.getcwd(), 'cache')
os.makedirs(cachedir, exist_ok=True)
os.environ['HF_HOME'] = cachedir

from transformers import AutoTokenizer

production_tokenizer_model = 'GroNLP/bert-base-dutch-cased'
production_tokenizer = AutoTokenizer.from_pretrained(
    production_tokenizer_model, cache_dir=cachedir, max_len=512
)

# this is not the specified cachedir!
PermissionError: [Errno 13] Permission denied: '/home/dssuser_jurriaand42a9a50/.huggingface/token'

 

My question: is there a generic location in the DSS data folder that is accessible to all, so that I can use it as a cache location? Or do I need to ask our admin to create one that I can use...
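One way to find out whether a chosen cache location is actually usable is to probe it before importing transformers. A minimal sketch, with illustrative names (`hf_cache` and `write_probe` are made up, and a temp dir stands in for the notebook's working directory):

```python
import os
import tempfile

# Hedged sketch, paths illustrative: a temp dir here; in DSS this would
# be the notebook's working directory. HF_HOME must be set *before*
# `import transformers`, which resolves its cache paths when the module
# is first imported.
base = tempfile.mkdtemp()
cache_dir = os.path.join(base, "hf_cache")
os.makedirs(cache_dir, exist_ok=True)
os.environ["HF_HOME"] = cache_dir

# Sanity check that this process can actually create files there.
probe = os.path.join(cache_dir, "write_probe")
with open(probe, "w") as f:
    f.write("ok")
```

If the write probe fails with a PermissionError, the folder is not usable as a cache location regardless of what the environment variable says.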

1 Solution
jurriaann
Level 2
Author

Hi Alex, thanks for your reply! With some support from your colleagues I found a way to get Hugging Face models and tokenizers loaded in a notebook; the trick was to add the parameter use_auth_token=False to the from_pretrained() function. Hence:

tokenizer = AutoTokenizer.from_pretrained(checkpoint, max_len=512, use_auth_token=False)

I'm still not totally sure why this blocks execution when left at the default True, but as far as I understand from its documented meaning, 'The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).' (source: https://huggingface.co/docs/transformers/main_classes/model), it tries to access the token file in a location where it has no permission (in our configuration, anyway). Since the token is only relevant for private models, it's fine to set it to False in our setting.
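For illustration, the legacy token path that the error message points at can be derived from the home directory. This is just a sketch of why a restricted home directory breaks the default behaviour, not Dataiku-specific code:

```python
import os

# huggingface-cli login historically stored the token under ~/.huggingface;
# with use_auth_token=True, from_pretrained() needs to access that file,
# which fails if the notebook process cannot touch the home directory.
legacy_token_path = os.path.join(os.path.expanduser("~"), ".huggingface", "token")
print(legacy_token_path)

# With use_auth_token=False the token file is never consulted, so anonymous
# downloads of public models work even with an inaccessible home directory.
```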


6 Replies
jurriaann
Level 2
Author

Update: I've also tried to make transformer models / tokenizers loadable via the init script of the code env, and that runs perfectly fine, but unfortunately I still get permission errors when I try to load a checkpoint in my notebook...

PermissionError: [Errno 13] Permission denied: '/data/dataiku/dss_data/code-envs/resources/python/webapp_dash/huggingface/transformers/443c1d513d458927e5883e0b1298cdb70ba4d14a55faa236d93e0598efc78fc7.3b16931b59b9aafc3e068b6cd5f0be5e02a209a299e39b1e0056d89eaa3b6a7b.lock'

Any suggestions how to (let our admin) enable using these models in notebooks? 

 
 

(Attachment: HF_INIT.PNG)

jurriaann
Level 2
Author

Extra update: the permission error only occurs when running the code in a Python notebook! When running it as a code node, the loading goes fine...

AlexT
Dataiker

Hi,

You should be able to set the cache to the current working directory in a notebook. 

import os

# Set these before importing transformers / huggingface_hub,
# since the libraries read them when first imported
os.environ['TRANSFORMERS_CACHE'] = os.path.join(os.getcwd(), 'transformers')
os.environ['HUGGINGFACE_HUB_CACHE'] = os.path.join(os.getcwd(), 'transformers')
os.environ['HF_HOME'] = os.path.join(os.getcwd(), 'transformers')
os.environ['XDG_CACHE_HOME'] = os.path.join(os.getcwd(), 'huggingface')
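To avoid repeating the working-directory lookup, the four assignments above could be wrapped in a small helper. This is a hypothetical convenience function (`point_hf_caches_at` is a made-up name, not a DSS or transformers API), shown here with a temp dir standing in for the notebook folder:

```python
import os
import tempfile

def point_hf_caches_at(base_dir):
    """Redirect the Hugging Face cache variables to one writable base dir.
    Hypothetical helper; must run before `import transformers`."""
    transformers_dir = os.path.join(base_dir, "transformers")
    os.makedirs(transformers_dir, exist_ok=True)
    for var in ("TRANSFORMERS_CACHE", "HUGGINGFACE_HUB_CACHE", "HF_HOME"):
        os.environ[var] = transformers_dir
    os.environ["XDG_CACHE_HOME"] = os.path.join(base_dir, "huggingface")
    return transformers_dir

# Illustrative usage; in a notebook the argument would be os.getcwd()
cache = point_hf_caches_at(tempfile.mkdtemp())
```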

To better understand why this is failing in a notebook in the first place, can you please confirm:
1. Was the notebook kernel unloaded/reloaded (or the "Force reload" option used) after making the updates to the code env?
2. Is the notebook running locally or in containerized execution?
3. What DSS version are you currently on?

Thanks

 

dfang
Level 1

PermissionError: [Errno 13] Permission denied: '/home/dssuser_********/.huggingface/token'

I'm getting the same error, and changing the environment variable didn't work: it still pointed to the directory above. This is strange, because I used to be able to load Hugging Face models in a notebook easily; I'm basically rerunning the same code in the same notebook under the same environment. We did recently upgrade to version 12, though, so I'm not sure whether this is related.

AlexT
Dataiker

Could you raise a support ticket for this issue? In current DSS releases we recommend loading Hugging Face models via resource scripts:

https://developer.dataiku.com/latest/tutorials/machine-learning/code-env-resources/hf-resources/inde...
