Hopefully, you’ve had some time to digest that big meal because, believe it or not, we’re already back with a snack: Dataiku 12.1! Read on to discover more about the many product enhancements, additions, and incremental improvements delivered in the latest update, and find out how they might be relevant to you and your work.
The Cream of the Crop
The five features highlighted below significantly improve current processes, and I suspect at least a couple of them will make their way into your day-to-day Dataiku usage patterns. Let’s take a peek!
Dataset Preview and Recipe Summaries in the Flow
When exploring a project pipeline, often the fastest way to understand a dataset is simply to examine its columns and values. Although you can review the schema of a dataset in the right-hand panel, now you can also open a preview panel to inspect the first 50 rows of the dataset, all without ever leaving the Flow.
Similarly, to quickly understand the scope of transformations occurring in any prepare, stack, group, or filter recipe, simply select the recipe and head over to the Details tab of the right-hand panel in the Flow. Here, you can review a summary of the preparation steps, join or group by conditions, and pre/post filters at a glance, all without needing to open the recipe itself. Brilliant!
Databricks Connect Support
With Databricks Connect, developers can write PySpark code in a remote environment (think: a Dataiku code recipe or notebook) and have it execute on Databricks. The integration is exposed through our Python API: Dataiku connects to your Databricks cluster by referencing an already established Dataiku connection, eliminating the need to enter credential information each time. After loading a dataset as a dataframe, you can write familiar PySpark code to perform data processing.
Left and Right Anti-Joins
In Dataiku 11.3, we introduced a “join with unmatched outputs” option in the join recipe to produce an additional output dataset containing the unmatched rows (a.k.a. an “anti-join” dataset). In the 12.1 update, you can now run a join recipe that outputs only the unmatched data. A left anti-join keeps only the rows of the left dataset that have no match in the right dataset; a right anti-join does the opposite, keeping only the rows of the right dataset that have no match in the left.
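If the anti-join semantics feel abstract, here is a minimal sketch in plain Python. It is not how the Dataiku join recipe is implemented; the function names, the `customer_id` key, and the sample rows are all hypothetical and only illustrate which rows survive each kind of anti-join.

```python
def left_anti_join(left_rows, right_rows, key):
    """Return the rows of left_rows whose key value never appears in right_rows."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


def right_anti_join(left_rows, right_rows, key):
    """Return the rows of right_rows whose key value never appears in left_rows."""
    # A right anti-join is a left anti-join with the two inputs swapped.
    return left_anti_join(right_rows, left_rows, key)


# Hypothetical sample data, for illustration only.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
    {"customer_id": 3, "name": "Edsger"},
]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 4},
]

# Customers who never placed an order (left anti-join).
customers_without_orders = left_anti_join(customers, orders, "customer_id")

# Orders that reference an unknown customer (right anti-join).
orders_without_customer = right_anti_join(customers, orders, "customer_id")
```

In this toy example, the left anti-join keeps the customers with no orders, while the right anti-join surfaces the order whose customer is missing from the customer list, which is a handy data quality check.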
Managed Labeling for Text Annotation
As you may know, Dataiku provides a managed labeling framework so teams can distribute the work of annotating large volumes of images for computer vision tasks. Now, teams can also federate text annotation workloads using the same labeling framework to generate high-quality training data for NLP tasks such as text classification, Named Entity Recognition (NER), and sentiment analysis. How does it work? Project managers set up the labeling task and track the ongoing status and annotator performance; annotators tag or categorize spans of text; and subject matter experts can validate labels and resolve labeling conflicts.
HOT TIP: As an aside, since the explosion of GPT and other large language models has disrupted the traditional ways of performing NLP tasks, I also imagine Dataiku’s managed labeling capabilities could be easily repurposed as a post-hoc validation framework for content created by Generative AI algorithms.
Time Series Decomposition and Automated Model Documentation for Forecasting
For you time series enthusiasts out there, try out the new time series decomposition in the statistics tab of your dataset or take advantage of automated model documentation for time series forecasting models in Dataiku’s Visual ML.
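For readers new to the idea, decomposition splits a series into a trend, a repeating seasonal pattern, and whatever is left over. The sketch below is a generic, textbook-style additive decomposition based on a centered moving average. It is an assumption-laden illustration, not a description of the method Dataiku uses under the hood, and the function name and sample series are hypothetical.

```python
def decompose_additive(series, period):
    """Split series into (trend, seasonal, residual) with series = trend + seasonal + residual.

    Trend and residual are None at the edges, where the centered window does not fit.
    """
    n = len(series)
    if period < 2 or n < 2 * period:
        raise ValueError("need period >= 2 and at least two full periods of data")

    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: the window spans period + 1 points, so the two
            # end points each get half weight (a "2 x m" moving average).
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[t] = total / period

    # Average the detrended values at each position within the period.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]

    # Center the seasonal pattern so it sums to zero over one period.
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]

    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Hypothetical quarterly series: a linear trend plus a repeating pattern.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [10.0 + 0.5 * t + pattern[t % 4] for t in range(16)]
trend, seasonal, residual = decompose_additive(series, period=4)
```

Because the sample series is built from an exact linear trend and an exact repeating pattern, the residual here is essentially zero. On real data, the residual is where outliers and unexplained noise show up.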
More Goodies To Peruse
Beyond those top five features, there are several more feature improvements you might be interested in. Let’s run through some quick hits in lightning-round style.
Many Enhancements to Dataiku Charting Options
Enjoy new ways to configure charts in Dataiku, including relative date filters, legend and axis font formatting (font size, font color, background, etc.), customizable reference lines for bar charts and scatter plots, regression lines on scatter plots, and the ability to create a new dashboard or dashboard slide from the publish modal.
New Performance Metrics for ML Classification Tasks
Along with the ROC curve presented in Dataiku Visual ML for each training experiment, you can now also review the Precision-Recall (PR) curve, which shows the tradeoff between precision and recall at different classification thresholds, and the average precision metric, which approximates the area under the PR curve.
The PR curve and average precision metric are similar to the ROC curve and AUC metric combination, but better suited for evaluating a model’s ability to correctly identify positive instances in the presence of a large number of negative instances (e.g., rare event prediction or anomaly detection use cases).
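To make these two metrics concrete, here is a small standard-library sketch that computes the points of a PR curve and the average precision as a step-wise sum of precision weighted by the increase in recall. This is one common definition of average precision, offered as an illustration rather than as Dataiku's exact computation; the function names and sample scores are hypothetical.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score, highest first."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Everything scored at or above the threshold is predicted positive,
        # so tied scores are consumed together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, tp / (tp + fp), tp / total_pos))
    return points


def average_precision(y_true, scores):
    """Sum of precision at each threshold, weighted by the gain in recall."""
    ap = 0.0
    previous_recall = 0.0
    for _, precision, recall in precision_recall_points(y_true, scores):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical labels (1 = positive) and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
```

Note that neither function ever counts true negatives, which is exactly why these metrics stay informative when negatives vastly outnumber positives.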
Bundle Creation Permissions for Write-Only User Profiles
Not every user who needs to create a project bundle for deployment has project admin privileges. As of Dataiku 12.1, the “write project content” permission is all you need to create a project bundle. Yay!
Search Notebooks for Elasticsearch
In Dataiku 11.2, we released the Dataset Search feature, which offers a new tab on any Elasticsearch dataset, enabling users to leverage native Elasticsearch search capabilities from within the Dataiku interface.
To let users search on multiple datasets, or across multiple Elasticsearch indices directly in the cluster without necessarily having a corresponding Dataiku dataset, we have created Search Notebooks. Search Notebooks are similar to SQL notebooks but for Elasticsearch data sources.
Spark Support for Auto Feature Generation
The new “generate features” visual recipe released in Dataiku 12.0 for SQL-based datasets now natively supports Spark as a runtime engine so that you can perform auto feature generation on a Spark cluster.
Deployment Logs, History, and a Downloadable Diagnostic Report
When deploying a project or an API, things can go wrong. Having a clear, shareable log with the full history of what happened is important for diagnosing and solving those issues. A new tab has been added to each deployment that contains all the logs of past deployments. The new ability to download those logs, along with additional data, as a “deployment diagnostic” to share with Dataiku tech support will be especially useful when troubleshooting issues in API Kubernetes deployments.
Learn More About Dataiku 12.1
As always, you can visit the official Release Notes to get more details and reference documentation on these product enhancements. But don’t get too comfy — the next minor release is already in the works, so stay tuned for another update from me soon. I encourage you to try out these new feature updates for yourself, and to let us know what you think in the comments!