Best practice hosting hugging face LLMs as a service?

Cory
Cory Registered Posts: 3 ✭✭

Hi all

Generally speaking, what's the optimal route for hosting, say, an instruct fine-tuned Falcon 7B model in Dataiku? Would it be building a Code Studio and using vLLM, or something along those lines? Or is there capability for this as part of the LLM Mesh? We'd like to host open-source models that are instruct fine-tuned for specific use cases, so that a downstream web application can make calls to them.
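
For reference, the kind of thing we have in mind for the Code Studio + vLLM route is roughly the sketch below (the path and prompt are placeholders, and we haven't validated this against our fine-tuned weights yet):

```python
# Rough sketch: serving a locally fine-tuned Falcon 7B with vLLM inside a
# Code Studio / container. MODEL_PATH is a placeholder for wherever the
# fine-tuned weights end up on disk.
from vllm import LLM, SamplingParams

MODEL_PATH = "/path/to/falcon-7b-instruct-ft"

llm = LLM(model=MODEL_PATH, trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```

The open question is how to wrap something like this so the web application can call it through Dataiku.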

Operating system used: Ubuntu 20.04


Answers

  • Grixis
    Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 82 ✭✭✭✭✭

    Hello @Cory ,

    If you have to deploy your model entirely on Dataiku, I'd say Dataiku's LLM Mesh is the optimal route: where available, it is designed as a dedicated infrastructure layer for managing your LLMs. Using this feature can simplify the deployment and management of large-scale models, and I don't see any problem with exposing an API service at the end of the pipeline (see the sketch at the end of this post).

    That said, your question is fairly open-ended, so I can't tell what your scalability, integration, and security constraints are.

    I think you can find some examples of this kind of use case in the gallery or in the Dataiku resources.
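
    For what it's worth, here is a minimal sketch of how a recipe or notebook could query a model registered in the LLM Mesh through the Python API (assuming DSS 12+ with an LLM connection already configured by an admin; "your-llm-id" is a placeholder for the id exposed by your connection):

    ```python
    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Placeholder id: use the LLM id exposed by your LLM Mesh connection
    llm = project.get_llm("your-llm-id")

    completion = llm.new_completion()
    completion.with_message("Summarize this support ticket in one sentence: ...")
    resp = completion.execute()

    if resp.success:
        print(resp.text)
    ```

    The same kind of query can sit behind an API service endpoint, so your web application only ever talks to Dataiku.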

  • Cory
    Cory Registered Posts: 3 ✭✭
    edited July 22

    Apologies @Grixis, I should have provided more detail.

    I know the LLM Mesh allows for a lot of flexibility. I also know that it is typically used in a workflow where Dataiku downloads a pre-trained model from Hugging Face. If our team has instruct fine-tuned an open-source model, but we cannot upload that model back to Hugging Face (for infosec reasons), and we want to use it for real-time inference in a web application not hosted on Dataiku, via API services deployed in Dataiku, what's the best route to take? Currently it seems like the LLM Mesh APIs require DKU to spin up a container for each call and to have a model ID on Hugging Face, so I'm not sure how they get hosted in a "production" setting given our constraints.

    I guess I'm just asking: what's the best way to host a custom LLM service with a couple of separate endpoints if we can't use Hugging Face? My guess is that it's code resources/container add-ons when building the environment, and then deploying Dataiku custom Python function endpoint(s) in a GPU container, roughly along the lines of the sketch below?
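
    For context, the kind of endpoint I have in mind would look roughly like this, with the fine-tuned weights shipped alongside the service rather than pulled from Hugging Face (the path is a placeholder, and Falcon may additionally need trust_remote_code=True on older transformers versions):

    ```python
    # Rough sketch of a Dataiku API node "Python function" endpoint that loads
    # a locally fine-tuned model. MODEL_PATH is a placeholder for wherever the
    # weights are made available in the (GPU) container.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_PATH = "/opt/models/falcon-7b-instruct-ft"

    # Loaded once when the service starts, then reused across calls
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch.float16,
        device_map="auto",  # place the model on the container's GPU
    )

    def generate(prompt, max_new_tokens=256):
        """Function exposed by the endpoint; called once per HTTP request."""
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Return only the newly generated tokens as a JSON-serializable dict
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return {"completion": tokenizer.decode(new_tokens, skip_special_tokens=True)}
    ```

    The downstream web application would then just POST its prompt as JSON to that endpoint on the API node.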
