Buffer size exceeded when uploading file in library editor

jax79sg
Level 2

Hi, 

I wish to upload some files (more than 200 MB) in the Library Editor but encountered a "Buffer size exceeded" error.

Can I increase this limit?

Context: I want to run the code from a GitHub project, and that project comes with folders containing configuration files and model weights. I would like to run the existing project as-is, without recoding it to use managed folders.

Clément_Stenac

Hi,

It is not currently possible to configure this limit. We'll add your request to our backlog.

Please note that this folder is under Git version control, so uploading very large files there is not recommended.

 

jax79sg
Level 2
Author

Hi, 

Thanks for the response. What would you recommend instead for this use case?

Regards,

Jax

Alex_Combessie
Dataiker Alumni

Hi,

[EDIT] I saw that the context is saving/loading models.

You have essentially two solutions to handle this:

1. Save/load the model on a shared local filesystem folder, for instance, /home/dataiku/shared_models

2. Save/load the model on a DSS managed folder (sketched at the end of this post): https://doc.dataiku.com/dss/latest/connecting/managed_folders.html

The advantages of 2 over 1 are:

- access control: you can choose to share the folder with a specific project

- remote storage: you can store the models on remote filesystems such as S3, GCS or Azure Blob Storage

- customization: end users can tune the models or add their own

- automation: the folder can be bundled and deployed to production environments (Automation and API nodes)

Having said that, if you want to retain full control of the models and provide them to all end users without letting them modify anything, then 1 may be the simplest solution. Be aware that it may create hard-to-trace dependencies when you move things to production.
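To make option 2 concrete, here is a rough sketch using the dataiku Python API; the folder name "model_weights" and the file name "yolov3.weights" are placeholders, so adapt them to your project:

import dataiku

# Assumed managed folder name - create the folder in your project first
folder = dataiku.Folder("model_weights")

# For folders backed by remote storage (S3, GCS, Azure Blob...), stream the file down:
with folder.get_download_stream("yolov3.weights") as stream:
    weights_bytes = stream.read()

# For folders on a local filesystem connection, you can also work with a plain path:
# local_path = os.path.join(folder.get_path(), "yolov3.weights")

# Option 1 would instead use a hard-coded shared path such as /home/dataiku/shared_models/yolov3.weights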

Hope it helps,

Alex

jax79sg
Level 2
Author

Hi, 

Both of your solutions are plausible, but the objective is to quickly test GitHub projects, which typically means cloning the repository and running its Python scripts directly. We would like to achieve this with Dataiku, which in theory can be done as follows:

  1. Clone the repo in the Library Editor.
  2. Replace all references to files in the project with the full path to the Library Editor (e.g. /Users/jax/Library/DataScienceStudio/dss_home/config/projects/TFYOLO/lib/python/). This is where the size limit breaks the bank.
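For illustration, taking a hypothetical file cfg/yolov3.cfg inside the cloned repo, step 2 amounts to rewriting something like:

# Original line in the project, relative to the repo root:
# cfg_file = open("cfg/yolov3.cfg")

# Rewritten against the Library Editor path of my local instance:
LIB_PATH = "/Users/jax/Library/DataScienceStudio/dss_home/config/projects/TFYOLO/lib/python/"
cfg_file = open(LIB_PATH + "cfg/yolov3.cfg")

(Deriving the path from __file__, e.g. os.path.dirname(os.path.abspath(__file__)), might avoid the hard-coding, but I have not checked how that behaves in containerized execution.)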

The above method only works on a local Dataiku instance, though. Once Docker or Kubernetes is used for containerized execution, I doubt step 2 would work, as the path to the Library Editor would be very different.

Are there any other ways to use Dataiku to quickly test GitHub projects?

Regards,

Jax

Alex_Combessie
Dataiker Alumni

Hi,

Thanks, now I understand your setup better. The core challenge is that Git is not well suited to storing large files.

One thing I have noticed when reading the code of Python packages that distribute large models (tensorflow.keras.applications, torch.utils.model_zoo, huggingface.transformers, among many others) is that they never store the models in their Git repo. Instead, they have modules referencing remote object storage locations, which handle downloading and caching the models.
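Roughly, that download-and-cache pattern looks like the sketch below; the URL and cache directory are placeholders, not real locations:

import os
import urllib.request

WEIGHTS_URL = "https://example-bucket.s3.amazonaws.com/models/yolov3.weights"  # placeholder URL
CACHE_DIR = os.path.expanduser("~/.cache/my_models")  # placeholder cache location

def fetch_weights(url=WEIGHTS_URL, cache_dir=CACHE_DIR):
    # Download the weights once, then reuse the cached copy on later calls
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path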

Another possibility you could experiment with is git-lfs (https://git-lfs.github.com/), but I don't know how well it would play with Docker/Kubernetes execution.

Cheers,

Alex
