Automated model production

AndyPryke · 01-17-2020

We're looking at whether Dataiku is suitable for our workload, which is something like this:

We currently have a system which builds multiple models for multiple clients, for example churn prediction and product recommendation systems. We use an automated process to produce a few hundred models each week based on new data uploaded by clients. Typically we re-train all models each week to take account of changes such as new products, new stores or changes in client data processing (e.g. they added a new custom field, or re-coded something), though we re-train some daily and "freeze" others for consistency. The models are used for batch scoring.

Different models have different fields (e.g. some clients have specific data such as sales office or distance to depot). For binary prediction, our larger datasets are ~1m customers per client (i.e. rows) with ~300 features. For recommendations, ~1m transactions.

As a newbie I have some questions about Dataiku:

How hard is it to get Dataiku to produce many models automatically, where the fields / data used vary?
With hundreds of models we don't want to validate them all by hand. Do the model monitoring features allow alerts when new models differ significantly from prior ones, or when dataset drift has occurred between train and score?
The data is relatively small, so we currently do most of our work in-memory. Is this possible in Dataiku or do we always need to write to a database/storage between steps?
Which features should we pay special attention to in the documentation?
Are other organisations using Dataiku in this way?

Thanks in advance for any advice,

Andy

lpkronek · 01-20-2020

Hello,

Dataiku DSS will provide lots of interesting features to complete this type of project.

Let me try to provide some insights on your questions:

Many features of Dataiku are based on checking and validating dataset schemas, so using datasets where fields vary requires careful design. One option is to pack the set of fields that may vary into a single JSON object so that the schema of the datasets stays fixed. The JSON object can then be unnested only when needed (for ML tasks, for instance).
You can leverage the Metrics and Checks capabilities of DSS to automatically monitor your models.
When using in-memory compute engines, DSS has to write datasets between each recipe. Dataiku supports dataset virtualization and pipeline generation when working with other computing engines like Spark or SQL.
In addition to the metrics and checks already mentioned, the features around automation scenarios, variables, and partitioning might be interesting for your project.
You can check the following presentation. There are some similarities with your use case.
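
To make the JSON idea from the first point more concrete, here is a minimal sketch in plain Python. It does not use any Dataiku API. The column names, the `extra_fields` column and the helper functions `pack_row` / `unpack_row` are all hypothetical and only illustrate the idea: every client gets the same fixed columns, and client-specific fields travel in one JSON string column.

```python
import json

# Hypothetical fixed columns shared by every client (an assumption for this sketch).
FIXED_COLUMNS = ("client_id", "customer_id", "churned")


def pack_row(row):
    """Keep the fixed columns and move every other field into one JSON string."""
    packed = {name: row.get(name) for name in FIXED_COLUMNS}
    extras = {k: v for k, v in row.items() if k not in FIXED_COLUMNS}
    packed["extra_fields"] = json.dumps(extras, sort_keys=True)
    return packed


def unpack_row(packed):
    """Unnest the JSON column again, e.g. just before an ML step."""
    row = {k: v for k, v in packed.items() if k != "extra_fields"}
    row.update(json.loads(packed["extra_fields"]))
    return row


# Two clients with different client-specific fields.
rows = [
    {"client_id": "A", "customer_id": 1, "churned": 0, "sales_office": "Leeds"},
    {"client_id": "B", "customer_id": 7, "churned": 1, "distance_to_depot": 12.5},
]

packed = [pack_row(r) for r in rows]      # every row now has the same four columns
unpacked = [unpack_row(p) for p in packed]  # original fields are recovered
```

With this layout the stored datasets share one schema regardless of client, and only the step that needs the client-specific fields has to deal with the varying columns.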

LPK

AndyPryke · 01-21-2020

Thanks LPK - very useful!
