I wish to upload some files (more than 200 MB) in the Library Editor but encountered a file size limit.
Can I increase this limit?
Context: I want to run code from a GitHub project, and that project comes with folders pointing to configuration files and model weights. I would like to run the existing project as-is, without recoding it to use managed folders.
It is not currently possible to configure this limit. We'll add your request to our backlog.
Please note that this folder is under Git version control, so uploading very large files is not recommended.
[EDIT] I saw that the context is saving/loading models.
You have essentially two solutions to handle this:
1. Save/load the model on a shared local filesystem folder, for instance, /home/dataiku/shared_models
2. Save/load the model on a DSS managed folder: https://doc.dataiku.com/dss/latest/connecting/managed_folders.html
The advantages of 2 over 1 are:
- access control: you can choose to share the folder with a specific project
- remote storage: you can store the models on remote filesystems such as S3, GCS, or Azure Blob Storage
- customization: the end user may be able to tune the models or add their own
- automation: the folder can be bundled into production environments (Automation and API nodes)
Having said that, if you want to retain full control of the model and provide it to all end users without them fiddling with it, then 1 may be the simplest solution. Be aware that it may create hard-to-trace dependencies when you move things to production.
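To make option 1 concrete, here is a minimal sketch of saving and loading a model via a shared filesystem folder. Everything here is illustrative: the `SHARED_MODELS_DIR` environment variable, the `save_model`/`load_model` helpers, and the dict standing in for a real model are assumptions, not DSS conventions (the demo falls back to a temp directory so it runs anywhere).

```python
import os
import pickle
import tempfile

# Hypothetical shared location; on a real DSS server this could be a path
# like /home/dataiku/shared_models (an assumption, not a DSS default).
SHARED_DIR = os.environ.get("SHARED_MODELS_DIR", tempfile.mkdtemp())

def save_model(model, name):
    """Pickle a model object into the shared folder and return its path."""
    path = os.path.join(SHARED_DIR, f"{name}.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def load_model(name):
    """Load a previously saved model from the shared folder."""
    path = os.path.join(SHARED_DIR, f"{name}.pkl")
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a stand-in "model" (a plain dict).
saved_path = save_model({"weights": [0.1, 0.2]}, "demo")
restored = load_model("demo")
print(restored["weights"])  # -> [0.1, 0.2]
```

The same pattern applies to option 2; only the open/read/write calls change to the managed folder API.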
Hope it helps,
Both of your solutions are plausible, but the objective is to quickly test GitHub projects, which typically involves cloning the repository and running its Python scripts directly. We would like to achieve this with Dataiku, which in theory can be done as follows.
The above method works only for a local Dataiku instance, though; once Docker or Kubernetes is used for container execution, I doubt step 2 will work, as the path to the library editor would be very different.
Are there any other ways to use Dataiku to quickly test github projects?
Thanks, I now understand your setup better. The core challenge is that Git is not well suited to storing large files.
One thing I have noted when reading the code of Python packages that distribute large models (tensorflow.keras.applications, torch.utils.model_zoo, huggingface.transformers, among many others) is that they never store the models in their Git repos. Instead, they have modules that reference remote object storage locations and handle downloading and caching of the models.
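The download-and-cache pattern those packages use can be sketched in a few lines of plain Python. The `fetch_model` helper and the `file://` URL standing in for remote object storage are illustrative assumptions; real libraries add checksums, versioning, and progress reporting on top of the same idea.

```python
import os
import tempfile
import urllib.request

def fetch_model(url, cache_dir):
    """Download a model file from a URL unless a cached copy already exists."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path

# Demo with a local file:// URL standing in for a remote object store.
src = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(src, "wb") as f:
    f.write(b"\x00" * 16)  # fake 16-byte "model"

cache = tempfile.mkdtemp()
first = fetch_model("file://" + src, cache)
second = fetch_model("file://" + src, cache)  # served from cache, no re-fetch
print(first == second)  # -> True
```

Keeping only such a loader in the repo, rather than the weights themselves, keeps the Git history small and works the same on local, Docker, or Kubernetes execution.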
Another possibility you could experiment with is git-lfs (https://git-lfs.github.com/), but I don't know how well it would play with Docker/Kubernetes execution.
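For reference, git-lfs works by recording tracked file patterns in a .gitattributes file at the repository root; running `git lfs track "*.bin"` generates a line like the one below, after which Git stores only small pointer files in the repo and the actual content on the LFS server. The `*.bin` pattern here is just an example; you would track whatever extensions your model weights use.

```
# .gitattributes entry written by `git lfs track "*.bin"`
*.bin filter=lfs diff=lfs merge=lfs -text
```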