Accessible location for cache folder

jurriaann
jurriaann Registered Posts: 6
edited July 16 in Using Dataiku

Hi! I might need our admin for this one, but I hoped I could manage without... To be able to use the transformers package in a Jupyter notebook / code node, a cache folder is needed. I tried to set the environment variable to a notebook-specific location, in which I have permission to create a cache subfolder, but it is not found/used by the from_pretrained function. Probably it can't access it.

import os

# Create a cache folder next to the notebook and point Hugging Face to it
cwd = os.getcwd()
cachedir = os.path.join(cwd, 'cache')
os.makedirs(cachedir, exist_ok=True)
os.environ['HF_HOME'] = cachedir

from transformers import AutoTokenizer

production_tokenizer_model = 'GroNLP/bert-base-dutch-cased'
production_tokenizer = AutoTokenizer.from_pretrained(production_tokenizer_model, cache_dir=cachedir, max_len=512)

# this fails, and the path below is not the specified cachedir!
PermissionError: [Errno 13] Permission denied: '/home/dssuser_jurriaand42a9a50/.huggingface/token'

My question: is there a generic location in the DSS data folder that is accessible by all, so that I can use it as a cache location? Or do I need to ask our admin to create one that I can use...

Best Answer

  • jurriaann
    jurriaann Registered Posts: 6
    Answer ✓

    Hi Alex, thanks for your reply! With some support from your colleagues I found a way to get Hugging Face models and tokenizers loaded in a notebook; the trick was to add the parameter use_auth_token=False to the from_pretrained() function. Hence:

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, max_len=512, use_auth_token=False)

    Still not totally sure why this blocks execution when left at the default (True), but as far as I understand its documented meaning, 'The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).' (source: https://huggingface.co/docs/transformers/main_classes/model), it tries to write a token somewhere it is not allowed to (in our configuration, anyway). Since the token is only relevant for private models, it's fine to set it to False in our setting.
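
    For completeness, a minimal sketch of how this might look when loading both the tokenizer and a model with a writable cache folder; the checkpoint is the one from my original post, while the cache path and the AutoModel class are just illustrative assumptions:

    import os
    from transformers import AutoTokenizer, AutoModel

    checkpoint = 'GroNLP/bert-base-dutch-cased'
    cachedir = os.path.join(os.getcwd(), 'cache')  # any folder the notebook user can write to
    os.makedirs(cachedir, exist_ok=True)

    # use_auth_token=False avoids touching ~/.huggingface/token (only needed for private models)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, cache_dir=cachedir, max_len=512, use_auth_token=False)
    model = AutoModel.from_pretrained(checkpoint, cache_dir=cachedir, use_auth_token=False)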

Answers

  • jurriaann
    jurriaann Registered Posts: 6
    edited July 17

    Update: I've also tried to make transformer models / tokenizers loadable via the init script of the code env, and that runs perfectly fine, but unfortunately I still get permission errors when I try to load a checkpoint in my notebook...

    PermissionError: [Errno 13] Permission denied: '/data/dataiku/dss_data/code-envs/resources/python/webapp_dash/huggingface/transformers/443c1d513d458927e5883e0b1298cdb70ba4d14a55faa236d93e0598efc78fc7.3b16931b59b9aafc3e068b6cd5f0be5e02a209a299e39b1e0056d89eaa3b6a7b.lock'

    Any suggestions on how to (let our admin) enable using these models in notebooks?

    (Attachment: HF_INIT.PNG)

  • jurriaann
    jurriaann Registered Posts: 6

    Extra update: the permission error only occurs when running the code in a Python notebook! When run as a code node, the loading works fine...

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker

    Hi,

    You should be able to set the cache to the current working directory in a notebook.

    import os

    # Point all Hugging Face / transformers caches to a writable folder under the notebook's working directory
    cwd = os.getcwd()
    os.environ['TRANSFORMERS_CACHE'] = cwd + '/transformers'
    os.environ['HUGGINGFACE_HUB_CACHE'] = cwd + '/transformers'
    os.environ['HF_HOME'] = cwd + '/transformers'
    os.environ['XDG_CACHE_HOME'] = cwd + '/huggingface'
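
    As a quick check, a minimal sketch assuming the variables above are set before transformers is imported in the kernel (the checkpoint is the one from this thread):

    from transformers import AutoTokenizer
    tok = AutoTokenizer.from_pretrained('GroNLP/bert-base-dutch-cased')  # should now download/cache under ./transformers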

    To better understand why this is failing in a notebook in the first place, can you please confirm:
    1. Was the notebook kernel unloaded/reloaded, or did you use the Force Reload option, after making the updates to the code env?
    2. Is the notebook running locally or in containerized execution?
    3. What DSS version are you currently on?

    Thanks

  • dfang
    dfang Registered Posts: 1

    PermissionError: [Errno 13] Permission denied: '/home/dssuser_********/.huggingface/token'

    I'm getting the same error, and changing the environment variable didn't work: it still pointed to the above directory. This is strange because I used to be able to load Hugging Face models in a notebook easily; I'm basically rerunning the same code in the same notebook under the same environment. We did recently upgrade to version 12, though, so I'm not sure whether that's related.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker

    Could you raise a support ticket for this issue? In current DSS releases we recommend loading Hugging Face models via code env resource scripts:

    https://developer.dataiku.com/latest/tutorials/machine-learning/code-env-resources/hf-resources/index.html
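
    As a rough illustration, a minimal sketch of such a resources script, adapted from the tutorial above; the helpers are assumed to come from the dataiku.code_env_resources module and the checkpoint is the one discussed in this thread:

    # Resources script sketch: runs when the code env is (re)built
    from dataiku.code_env_resources import clear_all_env_vars, set_env_path

    clear_all_env_vars()
    # Cache Hugging Face downloads under the code env's managed resources directory
    set_env_path("HF_HOME", "huggingface")

    # Pre-download the model files at build time so notebooks/recipes only read them
    from transformers import AutoTokenizer, AutoModel
    AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
    AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")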
