Saving downloaded model

NR
NR Registered Posts: 14

Hi,

I'm trying to use transformers in a Python recipe.

I need to define a cache folder for downloaded models so that they are not downloaded again on every run.

How should I define cache_dir? Is it the user resources folder, and how can I access it from code?

Here is some sample code:

from transformers import AutoModelForSequenceClassification

model_name = "bert-base-uncased"
cache_dir = "/path/to/cache/dir"

model = AutoModelForSequenceClassification.from_pretrained(model_name, cache_dir=cache_dir)

Thanks

Answers

  • Miguel Angel
    Miguel Angel Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 118 Dataiker
    edited July 17

    Hi,

    The recommended way to set a cache directory for Hugging Face transformers is to use a resource initialisation script on the code environment being used.

    Go to Administration > Code Envs > <Select code env used on your recipe> > Resources.

    There you will find a code sample for Hugging Face that you can use (the exact code depends on the DSS version):

    ## Base imports
    from dataiku.code_env_resources import clear_all_env_vars
    from dataiku.code_env_resources import set_env_path
    from dataiku.code_env_resources import set_env_var
    
    # Clears all environment variables defined by previously run script
    clear_all_env_vars()
    
    ## Hugging Face
    # Set HuggingFace cache directory
    set_env_path("HF_HOME", "huggingface")
    
    # Import Hugging Face's transformers
    import transformers
    
    # Download pretrained model: automatically managed by Hugging Face,
    # does not download anything if model is already in HF_HOME
    model = transformers.DistilBertModel.from_pretrained("distilbert-base-uncased")

    The "distilbert-base-uncased" model is downloaded by default as an example. You can add the models you want to cache at this point in the script. They will be saved to DATA_DIR/code-envs/resources/python/<code-env name>/huggingface/hub
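To check from a recipe which folder the Hugging Face libraries are expected to use, the lookup order can be sketched as below. This is a simplified illustration, not the library's own code: the function name `resolve_hf_hub_cache` is hypothetical, and the sketch only considers the `HF_HUB_CACHE` and `HF_HOME` environment variables plus the usual home-directory default (it ignores other variables such as `XDG_CACHE_HOME`).

```python
import os
from pathlib import Path


def resolve_hf_hub_cache(environ=None):
    """Return the folder where Hugging Face Hub downloads are expected to land.

    Hypothetical helper: a simplified mirror of the documented precedence,
    not the actual huggingface_hub implementation.
    """
    env = os.environ if environ is None else environ

    # An explicit hub cache setting takes precedence over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])

    # Otherwise the hub cache is the "hub" subfolder of HF_HOME.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"

    # Fallback used when neither variable is set.
    return Path.home() / ".cache" / "huggingface" / "hub"
```

Calling `print(resolve_hf_hub_cache())` at the top of a recipe shows whether the recipe sees the `HF_HOME` value set by the code environment. If the printed path is under the home directory rather than the code env resources folder, the recipe is probably not running with that code environment.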

  • NR
    NR Registered Posts: 14

    Thanks for your answer. I don't have admin access. Is admin access mandatory for this?

    I can see a user resources tab on my profile. Is it possible to use this folder as a cache?

    Thanks for your assistance.

    (Attachment: profile.png)

  • Kanyewesttshirt
    Kanyewesttshirt Registered Posts: 1

    1 Initial Revenue Date

    2 Revenue End Date

    3 Daily earnings over the specified time period

    I want to determine the total daily revenue for any given day. Is there a plugin, or maybe some Python code, that would let me run this "multi time series"?

  • danb
    danb Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3
    edited July 17

    I am trying to do the same, but for SentenceTransformers:

    from dataiku.code_env_resources import clear_all_env_vars
    from dataiku.code_env_resources import set_env_path
    from dataiku.code_env_resources import set_env_var
    from sentence_transformers import SentenceTransformer
    
    # Clears all environment variables defined by previously run script
    clear_all_env_vars()
    
    ## Hugging Face
    # Set HuggingFace cache directory
    set_env_path("HF_HOME", "huggingface")
    
    # Import Hugging Face's transformers
    import transformers
    
    # Download pretrained model: automatically managed by Hugging Face,
    # does not download anything if model is already in HF_HOME
    
    model = transformers.DistilBertModel.from_pretrained("distilbert-base-uncased")
    model = transformers.MPNetModel.from_pretrained("microsoft/mpnet-base")
    model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

    However, every time I call the model from my script in the Flow, the SentenceTransformer model is downloaded again, while the transformers model is picked up without a new download. Is there something I am missing?
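One possible explanation, offered as an assumption rather than a confirmed diagnosis: depending on the installed version, sentence-transformers may not store its downloads under `HF_HOME` and may instead use the folder named by the `SENTENCE_TRANSFORMERS_HOME` environment variable. Cache variables also generally need to be set before the libraries are imported. The sketch below sets both variables under one folder using only the standard library; the helper name `point_caches_at` and the subfolder names are hypothetical. In a code env resource script, the analogous step would presumably be `set_env_path("SENTENCE_TRANSFORMERS_HOME", "sentence_transformers")` placed before `from sentence_transformers import SentenceTransformer`.

```python
import os
from pathlib import Path


def point_caches_at(resources_dir):
    """Set both cache variables to subfolders of one resources directory.

    Hypothetical helper: call it before importing transformers or
    sentence_transformers so the libraries see the values at import time.
    """
    root = Path(resources_dir)
    env = {
        # Read by transformers / huggingface_hub.
        "HF_HOME": str(root / "huggingface"),
        # Assumed to be read by the installed sentence-transformers version.
        "SENTENCE_TRANSFORMERS_HOME": str(root / "sentence_transformers"),
    }
    os.environ.update(env)
    return env
```

An alternative that does not rely on environment variables is to pass the folder explicitly when loading, for example `SentenceTransformer(model_name, cache_folder=some_dir)`, where `some_dir` is a path the recipe can read.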

  • RexWescott
    RexWescott Registered Posts: 1

    You do not need to have administrator rights to use Dataiku resources. However, if you want to install Dataiku on your computer or set it up to access external data sources, you may need administrator rights. Regarding using the resource usage folder as a cache, it is possible but not recommended. The Resource Usage folder is for storing data related to your projects in Dataiku. If you use it as a cache, you may experience problems accessing the data you need while working with Dataiku. To save the downloaded model to Dataiku, you should use the model export function. To do this, select the appropriate model file in the project menu and select "Export Model". You can then save the model in the desired format and import it in another project or application.

  • danb
    danb Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 3

    Hello Rex,

    Thank you for your answer. Could you be clearer about how to save the model, though? The model is not produced by a recipe; it is loaded in Python, fetched from sentence transformers hosted on Hugging Face...
