Saving downloaded model

NR
Level 3

Hi,

I'm trying to use transformers in a Python recipe.

I need to define a cache folder in which to save the downloaded model, so that it is not downloaded again on every run.

How should I define cache_dir? Is it the user resources folder, and how can I access it from code?

Here is some sample code:

from transformers import AutoModelForSequenceClassification

model_name = "bert-base-uncased"
cache_dir = "/path/to/cache/dir"

model = AutoModelForSequenceClassification.from_pretrained(model_name, cache_dir=cache_dir)

Thanks

6 Replies
MiguelangelC
Dataiker

Hi,

The recommended way to set a cache directory for Hugging Face transformers is to use a resource initialisation script on the code environment being used.

Go to Administration > Code Envs > <select the code env used by your recipe> > Resources.

There you will find a Hugging Face code sample that you can use (the exact code depends on the DSS version):

 

## Base imports
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## Hugging Face
# Set HuggingFace cache directory
set_env_path("HF_HOME", "huggingface")

# Import Hugging Face's transformers
import transformers

# Download pretrained model: automatically managed by Hugging Face,
# does not download anything if model is already in HF_HOME
model = transformers.DistilBertModel.from_pretrained("distilbert-base-uncased")

 

 

The "distilbert-base-uncased" model is downloaded by default as an example. You can add the models you want to cache at this point in the script. They'll be saved to DATA_DIR/code-envs/resources/python/<code-env name>/huggingface/hub
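As a rough sketch, a recipe running in that code env could check where the cache lives and which models are already in it. This assumes DSS exposes HF_HOME to the recipe (which is the purpose of set_env_path) and that the hub cache sits in the "hub" subfolder of HF_HOME, as in the path above. The helper names hf_hub_cache_dir and list_cached_models are made up for illustration.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(default="~/.cache/huggingface"):
    """Return the hub cache directory implied by HF_HOME (or the usual default)."""
    hf_home = os.environ.get("HF_HOME", default)
    return Path(hf_home).expanduser() / "hub"


def list_cached_models(hub_dir):
    """List model folders already present in the hub cache (empty list if none)."""
    hub_dir = Path(hub_dir)
    if not hub_dir.is_dir():
        return []
    # The hub cache keeps each model repo in a folder named "models--<repo id>".
    return sorted(p.name for p in hub_dir.iterdir() if p.name.startswith("models--"))


print("Hub cache:", hf_hub_cache_dir())
print("Cached models:", list_cached_models(hf_hub_cache_dir()))
```

If the model shows up in that list, from_pretrained("distilbert-base-uncased") in the recipe should load it from disk without passing cache_dir.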

 

 

 

danb
Level 2

I am trying to do the same for SentenceTransformers:

from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var
from sentence_transformers import SentenceTransformer

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## Hugging Face
# Set HuggingFace cache directory
set_env_path("HF_HOME", "huggingface")

# Import Hugging Face's transformers
import transformers

# Download pretrained model: automatically managed by Hugging Face,
# does not download anything if model is already in HF_HOME

model = transformers.DistilBertModel.from_pretrained("distilbert-base-uncased")
model = transformers.MPNetModel.from_pretrained("microsoft/mpnet-base")
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

However, every time I call the model from my script in the flow, the SentenceTransformer model is downloaded again, while the transformers models are picked up without a new download. Is there something I am missing?
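One possible cause, offered as an assumption rather than a confirmed diagnosis: some sentence-transformers releases do not cache under HF_HOME but under the folder given by the SENTENCE_TRANSFORMERS_HOME environment variable, or by the cache_folder argument. If that applies here, a line such as set_env_path("SENTENCE_TRANSFORMERS_HOME", "sentence_transformers") in the resource script (the folder name is arbitrary) would be the matching setting. The helpers below are a hypothetical sketch of passing the folder explicitly from the recipe.

```python
import os


def sentence_transformers_cache(default=None):
    """Return the folder sentence-transformers is expected to cache into."""
    # SENTENCE_TRANSFORMERS_HOME is read by the library when cache_folder is not passed.
    return os.environ.get("SENTENCE_TRANSFORMERS_HOME", default)


def load_sentence_model(name, cache_folder=None):
    """Load a SentenceTransformer, passing the cache folder explicitly."""
    # Imported lazily so the helper above also works where the package is absent.
    from sentence_transformers import SentenceTransformer

    cache_folder = cache_folder or sentence_transformers_cache()
    return SentenceTransformer(name, cache_folder=cache_folder)
```

Calling load_sentence_model("sentence-transformers/multi-qa-mpnet-base-dot-v1") in both the resource script and the recipe would then point both at the same folder, provided the environment variable is set in both places.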

NR
Level 3
Author

Thank you for your answer. I don't have admin access. Is it mandatory?

I can see a user resources tab on my profile. Is it possible to use this folder as a cache?

Thanks for your assistance.

RexWescott
Level 1

You do not need to have administrator rights to use Dataiku resources. However, if you want to install Dataiku on your computer or set it up to access external data sources, you may need administrator rights. Regarding using the resource usage folder as a cache, it is possible but not recommended. The Resource Usage folder is for storing data related to your projects in Dataiku. If you use it as a cache, you may experience problems accessing the data you need while working with Dataiku. To save the downloaded model to Dataiku, you should use the model export function. To do this, select the appropriate model file in the project menu and select "Export Model". You can then save the model in the desired format and import it in another project or application.

danb
Level 2

Hello Rex, 

Thank you for your answer. Could you be clearer on how to save the model, though? The model is not a "recipe" one; it is loaded in Python, fetched from sentence-transformers hosted on Hugging Face...

 

Kanyewesttshirt
Level 1

1 Initial Revenue Date

2 Revenue End Date

3 Daily earnings over the specified time period

I want to determine the total daily revenue for any given day. Is there a plugin, or perhaps some Python code, that would let me run this "multi time series"?
