How to load a pre-trained model into a code env (Resources Directory) on an instance with no internet access
Hi.
I would like to use a pretrained model (for example, an embeddings model) in my project. The DSS instance I am working on has no internet access. Still, I was able to retrieve the model files at some point, and now I want to reuse them.
- I was able to upload the model to a managed folder and use it in a code recipe (see the code below). It works. However, I understand that the proper way to use such models (and share them with other users) is to include them in a code env, in the Resources Directory.
- Retrieving pretrained models from Hugging Face (see "Load and re-use a Hugging Face model" in the Dataiku Developer Guide) or PyTorch Hub is not an option (no internet access).
- So I uploaded the model into the code env's Resources Directory and tried to modify the initialization script (written for Hugging Face) so that it would work with a model already present in the Resources Directory. I don't really understand this HF version of the initialization script though, and as expected my modification didn't work. This is where I need help :-)
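For readers following along, here is a minimal sketch of what the recipe side could look like once the model files sit in the Resources Directory. It is not the code from my attempt. It assumes the initialization script exposes the model folder through an environment variable; the variable name `EMBEDDINGS_MODEL_DIR` and the folder name `my_embeddings_model` are hypothetical, and the commented `set_env_path` call follows the pattern of the Developer Guide script, so please check it against your DSS version. It also assumes the model is in a format that `transformers` can load from a directory.

```python
import os
from pathlib import Path

# Hypothetical variable name. The assumption is that the code env's
# resources initialization script sets it, along the lines of the
# Developer Guide pattern (not verified here):
#     from dataiku.code_env_resources import set_env_path
#     set_env_path("EMBEDDINGS_MODEL_DIR", "my_embeddings_model")
MODEL_DIR_ENV_VAR = "EMBEDDINGS_MODEL_DIR"


def resolve_model_dir(env_var=MODEL_DIR_ENV_VAR):
    """Return the local model directory named by env_var, or raise."""
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(
            f"{env_var} is not set; is the recipe using the right code env?"
        )
    model_dir = Path(raw)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"{model_dir} is not an existing directory")
    return model_dir


def load_embeddings_model(model_dir):
    """Load tokenizer and model strictly from disk (needs transformers)."""
    # Imported lazily so the path check above works without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(str(model_dir), local_files_only=True)
    model = AutoModel.from_pretrained(str(model_dir), local_files_only=True)
    return tokenizer, model
```

In a recipe this would be used as `tokenizer, model = load_embeddings_model(resolve_model_dir())`. Passing a directory path together with `local_files_only=True` keeps `transformers` from trying to reach the Hub.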
Here is what I tried:
And the error message I got.
Thank you in advance for your help.
Comments
-
This new post clearly relates to these two earlier ones:
How to load a file to Resources directory and have available at run time — Dataiku Community
How to add a file to the Resources directory so that it is accessible at runtime — Dataiku Community
However, I couldn't find an answer there.
Thanks again
-
Turribeach (Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Neuron 2023; Posts: 2,160)
Long story short, each model is different and will have its own way of downloading and storing model files for offline use. These may or may not be compatible with code env resources. Most should work, but it depends on how the model needs to be set up for offline use.
I would advise that before you try to use code env resources, you get the model working offline in a Jupyter Notebook, and then explore whether the same setup works with resources in a code env. Personally I don't like to use code env resources because the model is then only available in that one code env. I have successfully enabled lots of offline models in Dataiku system-wide. This is a much better approach since it keeps code envs lean, everyone uses the same version of the model, and nobody needs to use a specific code env. You can also usually find lots of posts from people trying to achieve the same thing, so it's easier to set up than using code env resources in Dataiku.
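To make the "get it working offline in a notebook first" step concrete, here is a rough sketch of a smoke test for a Hugging Face style model folder. It is only an illustration: the helper names are made up, and the default list of required files (`config.json` only) is an assumption, since the files a model needs vary from one model to another. `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are the environment variables the Hugging Face libraries read to disable network calls.

```python
import os
from pathlib import Path

# Ask the Hugging Face libraries not to attempt any network call.
# These must be set before transformers is imported.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"


def missing_model_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


def smoke_test(model_dir):
    """Check the folder, then load only the model config from disk."""
    missing = missing_model_files(model_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {model_dir}: {missing}")
    # Imported lazily so the file check runs even without transformers.
    from transformers import AutoConfig

    return AutoConfig.from_pretrained(str(model_dir), local_files_only=True)
```

If `smoke_test("/path/to/model")` returns a config in the notebook, the same folder is a reasonable candidate for a code env resource or a shared location on the server. Loading only the config keeps the check cheap; loading the full model is the next step.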