August Release Notes: LLM Fine-Tuning, Govern Status Monitoring, & Data Quality Enhancements
13.1 is here! It brings a raft of improvements to the LLM Mesh, as well as governance enhancements and new ways to visualize your data - let’s dive right in!
For an overview of the changes, check out this month’s What's New, Dataiku!
LLM Fine-Tuning
Until now, applying fine-tuning techniques meant implementing them from scratch - tricky for users without extensive coding experience and difficult for others to maintain. If you’d rather hear than read about it, we have just the video for you!
In keeping with the philosophy the LLM Mesh has followed from the start, we leverage providers’ capabilities to offer LLM fine-tuning in an agnostic way: the experience is the same whether you are fine-tuning a model from a HuggingFace, OpenAI, or Bedrock connection. LLM fine-tuning in Dataiku consists of two distinct experiences, letting you tune your models in the way that best fits you. The common denominator is that both integrate seamlessly with the LLM Mesh, meaning fine-tuned models are automatically registered back into the Mesh.
- No-code recipe*: The new fine-tune recipe, available in the Early Adopter Program, is a unique low/no-code approach that opens up fine-tuning to non-coders.
- Through code: If you prefer, you can also fine-tune through code, with full flexibility and customizability when fine-tuning local LLMs from the HuggingFace Hub and access to state-of-the-art techniques from the open-source community, all while benefiting from the LLM Mesh.
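To make the "continue training from pretrained weights" idea concrete, here is a toy sketch in plain Python that fine-tunes a tiny linear model. It is purely illustrative of the fine-tuning loop, not the Dataiku or HuggingFace API:

```python
# Toy illustration of fine-tuning: start from "pretrained" weights and
# continue gradient descent on a small task-specific dataset.
# Conceptual only; not the Dataiku or HuggingFace API.

def predict(w, b, x):
    return w * x + b

def fine_tune(w, b, data, lr=0.05, epochs=200):
    """Continue training from existing (pretrained) weights w, b."""
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, b, x) - y
            w -= lr * err * x  # gradient of 0.5 * err^2 w.r.t. w
            b -= lr * err      # gradient of 0.5 * err^2 w.r.t. b
    return w, b

# The "pretrained" weights model y = 2x; the task data follows y = 2x + 1,
# so fine-tuning should mostly adjust the bias term.
task_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = fine_tune(2.0, 0.0, task_data)  # w stays near 2, b moves toward 1
```

Real LLM fine-tuning swaps this linear model for a transformer and this loop for a training framework, but the shape is the same: load existing weights, iterate over task data, update.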
LLM Mesh 13.1 Improvements
Speaking of the LLM Mesh: as usual, we have been working to ensure you can access the latest and most powerful LLMs from various providers through the LLM Mesh's managed connections.
- Guardrails: specialized local models from HuggingFace for toxicity detection*
- LLM Mesh API support for function calling and other parameters, enabling connection to external tools for advanced use cases (LLM agents, etc.)
- Support for Llama3 / Mistral (7B, 8*7B, Large) / Titan embeddings v2 / Cohere Command (R, R+) models through Bedrock
- Support for Gemma in HuggingFace connection
- Add “Clear data” option to Knowledge Banks handler
*These features are available as part of the Early Adopter Program. Please contact your CSM for details on how to get access.
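To illustrate the function-calling pattern that the LLM Mesh API now supports, here is a minimal sketch of the application side: the model emits a structured tool call, and the application executes it and feeds the result back. The tool name, call format, and helper shown here are hypothetical, not the actual LLM Mesh API:

```python
import json

# Hypothetical tool exposed by the application; name and signature are
# illustrative, not part of the LLM Mesh API.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Execute a model-emitted tool call of the (assumed) form
    {"name": ..., "arguments": "<json string>"} and return the result,
    which would be sent back to the model on the next turn."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return fn(**args)

# A model with function calling enabled might emit something like:
call = {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}
result = dispatch(call)  # "Sunny in Paris"
```

This request/execute/respond loop is the core of most LLM agent designs; the API simply standardizes how the tool call is expressed.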
Split in chunks processor in Prepare recipe for Advanced RAG
Chunking data is done prior to embedding in a vector store, and is a key step in building LLM applications for use cases like retrieval-augmented generation (RAG). Chunking techniques and parameters can have a significant influence on the end result for augmented chatbots.
Now, the Prepare recipe includes a “Split into chunks” processor. It lets you specify separators, visualize the chunks interactively, and apply post-processing steps to ensure chunks are separated as expected. By spotting chunking issues proactively, before data is stored in a vector store, you should see better performance from your RAG applications, along with greater visibility and transparency into how chunking is done, helping you trust your data.
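For intuition on separator-based chunking, here is a rough Python sketch: split on the first separator, then re-split any chunk that is still too long using the next one. The separator list, recursion strategy, and max_len parameter are illustrative assumptions and may not match the processor's exact options:

```python
def split_into_chunks(text, separators=("\n\n", "\n", ". "), max_len=80):
    """Split text on the first separator; recursively re-split any
    chunk still longer than max_len with the remaining separators.
    (Illustrative sketch only, not the Prepare recipe's implementation.)"""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= max_len:
            chunks.append(part)
        else:
            chunks.extend(split_into_chunks(part, rest, max_len))
    return [c for c in chunks if c.strip()]

doc = "First paragraph about RAG.\n\nSecond paragraph. It has two sentences."
chunks = split_into_chunks(doc, max_len=40)  # two paragraph-level chunks
```

Even in this toy version you can see why parameters matter: a smaller max_len or a different separator order produces very different chunks, which is exactly what the interactive preview helps you evaluate.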
Govern Status Indicator in Unified Monitoring
Unified Monitoring for batch projects on the Automation node has been updated with a new Govern card and status indicator. You can now see the status of both batch projects and model endpoints, with deployment status fetched from Dataiku Govern, without ever leaving the Unified Monitoring dashboard: a complete, centralized view of ML project health.
We hope this makes it easier to identify potential issues with governed objects so that you can proactively investigate them before they have a larger impact.
Dataiku Govern Enhancements
Dataiku Govern gains auditability enhancements, letting you create a centralized view of all Dataiku Govern item events, with enhanced filtering capabilities and accessibility for all users. Custom filters now let you filter on more metadata, including conditional formatting and and/or nesting.
Data Quality Updates
This release brings multi-column support on all column-based rules, the ability to publish Data Quality statuses to dashboards, and template updates: increased visibility into definitions and the ability to create and edit templates from the instance list.
Data Prep
Multi-row formula
You can now use an optional offset argument in the existing functions that access a column value. The offset argument is available in the Prepare recipe only, in all processors that support Formula. With this new functionality, you can apply a formula using values from previous rows in the dataset, enabling use cases such as iterative calculations or auto-incrementing IDs.
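Here is a minimal sketch of the offset idea in plain Python, using a hypothetical value_with_offset helper to mimic looking up a column value from a previous row (the actual Formula syntax in the Prepare recipe differs):

```python
# Mimic the offset concept: reference a column value from an earlier row.
# The helper below is a hypothetical stand-in, not Dataiku Formula syntax.
rows = [{"sales": 100}, {"sales": 120}, {"sales": 90}]

def value_with_offset(rows, i, column, offset=0):
    """Return the value of `column` from `offset` rows before row i,
    or None when the offset points before the first row."""
    j = i - offset
    return rows[j][column] if 0 <= j < len(rows) else None

# Example "formula": row-over-row change in sales.
for i, row in enumerate(rows):
    prev = value_with_offset(rows, i, "sales", offset=1)
    row["change"] = row["sales"] - prev if prev is not None else None
# change column: None, 20, -30
```

The first row yields None because there is no previous row to reference, which is the kind of boundary case an offset-based formula has to handle.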
Build flow zones from Scenario
Previously, when using a Scenario to automatically build parts of a Flow, you had to select target outputs to rebuild one at a time. In 13.1, you can build everything within a Flow Zone as a single Scenario step, drastically simplifying this workflow.
Visualization & Data Storytelling
This release brings new enhancements to help you visualize your data and present it effectively.
Dashboards:
- UX and performance enhancements including page and title settings and the ability to hide/show pages
- Performance improvements when loading visible tiles
Charts:
- Median / percentile aggregation for numeric columns
- New gauge chart type
Git merge for project branches
In the DevOps process, Git is commonly used for project management and tracking. Until now, a third-party tool was required to merge branches of DSS projects.
Now, you have the ability to merge branches of Dataiku projects from the Dataiku UI, improving the user experience and reducing the process complexity.
When the merge request is created, you can see the list of commits, see the diff (changes), resolve conflicts, and see the history of actions made on the request, all from within Dataiku.
That's all on 13.1! We hope you get great value out of this latest upgrade for Dataiku.
Cloud customers can expect to receive updates to their instance at the end of August.
You can view the full release notes here.