Error when registering a custom LightGBM (5-fold) model as a Saved Model from a Python notebook

t_ueno · June 10

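Hello Dataiku Community,

I am trying to register a custom LightGBM model, built in a Jupyter Notebook (Python code recipe), as a Dataiku Saved Model in the Flow so that it can be used with Score and Evaluate recipes.

To keep categorical dtype mappings from drifting at prediction (inference) time, and to embed the averaged ensemble prediction of the five folds, I implemented the custom wrapper class below using MLflow's mlflow.pyfunc.PythonModel.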

1. Documentation I referred to

I followed "Step 3: ML Deploy and Eval" of the official tutorial. https://developer.dataiku.com/latest/tutorials/machine-learning/quickstart-tutorial/step3_ml_deploy_and_eval.html

2. Environment

Dataiku version: 14.5.1
Runtime: Jupyter Notebook running in a containerized environment (a custom code environment named catboost_env)

3. Code (excerpt)

Python

import mlflow
import mlflow.pyfunc
import lightgbm as lgb
import numpy as np
import dataiku

# Custom wrapper class definition
class LightGBMEnsembleWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, models, cat_dtypes, drop_cols):
        self.models = models
        self.cat_dtypes = cat_dtypes
        self.drop_cols = drop_cols

    def predict(self, context, model_input):
        df_features = model_input.copy()
        df_features = df_features.drop(columns=[c for c in self.drop_cols if c in df_features.columns], errors='ignore')
        for col, dtype in self.cat_dtypes.items():
            if col in df_features.columns:
                df_features[col] = df_features[col].astype(dtype)
        
        preds = np.zeros(len(df_features))
        for model in self.models:
            preds += model.predict(df_features, num_iteration=model.best_iteration)
        preds /= len(self.models)
        return np.column_stack((1 - preds, preds))

# After 5-fold cross-validation training: log to MLflow and register as a Saved Model
with mlflow.start_run() as run:
    ensemble_model = LightGBMEnsembleWrapper(models=trained_models, cat_dtypes=cat_dtypes, drop_cols=drop_cols)
    mlflow.pyfunc.log_model(artifact_path="lgb_5fold_model", python_model=ensemble_model)
    run_id = run.info.run_id

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

# Create a new empty Saved Model (BINARY_CLASSIFICATION), or fetch an existing one
# (omitted: obtaining the saved_model object)

# The error occurs here
saved_model.import_mlflow_version_from_run(
    version_id=version_id,
    run_id=run_id,
    path="lgb_5fold_model"
)
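
For reference, the averaging step inside predict can be exercised without LightGBM or NumPy. The sketch below is only an illustration under stated assumptions: StubModel and ensemble_predict_proba are hypothetical names that are not part of the code above, and the stub simply returns fixed positive-class probabilities in place of a trained booster.

```python
class StubModel:
    """Hypothetical stand-in for one trained fold model."""

    def __init__(self, probs):
        self.probs = probs
        self.best_iteration = None  # mirrors the attribute used by the wrapper

    def predict(self, rows, num_iteration=None):
        # Return one positive-class probability per input row
        return self.probs[: len(rows)]


def ensemble_predict_proba(models, rows):
    """Average the positive-class probability over all fold models,
    then return one [P(class 0), P(class 1)] pair per row."""
    totals = [0.0] * len(rows)
    for model in models:
        fold_preds = model.predict(rows, num_iteration=model.best_iteration)
        for i, p in enumerate(fold_preds):
            totals[i] += p
    means = [t / len(models) for t in totals]
    return [[1.0 - p, p] for p in means]


rows = [{"x": 1}, {"x": 2}]  # two toy input rows
models = [StubModel([0.25, 0.5]), StubModel([0.75, 1.0])]
proba = ensemble_predict_proba(models, rows)
print(proba)
```

Each output row has two columns that sum to 1, which is the same shape the wrapper builds with np.column_stack((1 - preds, preds)).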

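4. Where I am stuck (errors)

① Error when running the code

Running the registration step above (import_mlflow_version_from_run) raises the following AttributeError. Only the empty Saved Model container is created, and no model version can be loaded into it.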

AttributeError: 'DSSSavedModel' object has no attribute 'import_mlflow_version_from_run'

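② Problem when importing from the UI

After giving up on registering via code, I tried importing manually from the Saved Model's "New Model Version" dialog, via Dataiku's Experiment Tracking.
However, possibly because the notebook runs in a containerized environment (catboost_env), the Experiment and Run selectors show "No items found", and the MLflow history recorded from the notebook is not visible in the UI.

5. Questions

When MLflow Tracking is used from a containerized environment (a custom code environment on Kubernetes or similar), is any configuration (environment variables, secrets, a tracking server setting, etc.) needed for the UI to see the experiment history (Runs)?
In the current version of Dataiku, what is the correct API call for importing an MLflow model into a DSSSavedModel, or is there a best practice for avoiding the error above?

Is there anything I am overlooking?
Thank you in advance for your help.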

Dataiku version used: Version 14.5.1

TomWiley · June 10

Hi!

You are seeing this error because the method import_mlflow_version_from_run does not exist in the Dataiku API.
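
If you want to check which MLflow import methods your installed client actually exposes, you can introspect the object. Below is a minimal sketch; FakeSavedModel is a hypothetical stand-in so the snippet runs without a DSS connection, and mlflow_import_methods is an illustrative helper name, not part of the Dataiku API. In your notebook you would pass your real saved_model object instead.

```python
class FakeSavedModel:
    """Hypothetical stand-in for a saved model handle (illustration only)."""

    def import_mlflow_version_from_path(self, version_id, path):
        pass

    def import_mlflow_version_from_managed_folder(self, version_id, managed_folder, path):
        pass

    def list_versions(self):
        return []


def mlflow_import_methods(obj):
    """List the callable attributes whose names start with 'import_mlflow'."""
    return sorted(
        name
        for name in dir(obj)
        if name.startswith("import_mlflow") and callable(getattr(obj, name))
    )


available = mlflow_import_methods(FakeSavedModel())
print(available)
```

A method name that is missing from this list will raise the same AttributeError when called.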

To answer your first question, yes, you need to ensure that the code in your notebook correctly uses the DSSProject.setup_mlflow method with a managed folder in the flow. This will ensure that any mlflow experiments are correctly registered in Dataiku and are saved to your managed folder.

A quick example would look like:

import dataiku
from dataiku import api_client

client = dataiku.api_client()
project = client.get_default_project()

# This folder stores MLflow artifacts
managed_folder = project.get_managed_folder("MLFLOWF1")  # replace with your folder ID

# Context manager wires DSS + MLflow tracking together
with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
    mlflow_handle.set_experiment("my_first_experiment")  # optional, sets experiment name
    with mlflow_handle.start_run(run_name="my_first_run") as run:
        # Use the regular MLflow APIs here, called through the handle

        # Log parameters and metrics
        mlflow_handle.log_param("n_estimators", 100)
        mlflow_handle.log_metric("auc", 0.91)

A full tutorial on this can be found here.

Importantly, you need to use the mlflow_handle (the object returned by project.setup_mlflow, line 11 of this example) instead of calling the mlflow package directly.

For your second question:

I think you want to use import_mlflow_version_from_managed_folder. Docs

In your current code, this would look something like the following (untested):

import mlflow
import mlflow.pyfunc
import numpy as np
import dataiku


class LightGBMEnsembleWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, models, cat_dtypes, drop_cols):
        self.models = models
        self.cat_dtypes = cat_dtypes
        self.drop_cols = drop_cols

    def predict(self, context, model_input):
        df_features = model_input.copy()
        df_features = df_features.drop(columns=[c for c in self.drop_cols if c in df_features.columns], errors='ignore')
        for col, dtype in self.cat_dtypes.items():
            if col in df_features.columns:
                df_features[col] = df_features[col].astype(dtype)
        
        preds = np.zeros(len(df_features))
        for model in self.models:
            preds += model.predict(df_features, num_iteration=model.best_iteration)
        preds /= len(self.models)
        return np.column_stack((1 - preds, preds))

# Handle on the current project
client = dataiku.api_client()
project = client.get_default_project()

# Managed folder used by MLflow artifacts in Dataiku
experiments_folder_name = "Binary classif experiments"
experiments_folder_id = dataiku.Folder(experiments_folder_name).get_id()
experiments_folder = project.get_managed_folder(experiments_folder_id)

# Log the ensemble MLflow model
with project.setup_mlflow(managed_folder=experiments_folder) as dku_mlflow:
    with dku_mlflow.start_run() as run:
        ensemble_model = LightGBMEnsembleWrapper(
            models=trained_models,
            cat_dtypes=cat_dtypes,
            drop_cols=drop_cols,
        )
        dku_mlflow.pyfunc.log_model(
            artifact_path="lgb_5fold_ensemble",
            python_model=ensemble_model,
        )
        artifact_uri = run.info.artifact_uri

# saved_model is the Saved Model handle obtained earlier (omitted in your excerpt)
existing_versions = saved_model.list_versions()
version_id = f"ensemble-v{len(existing_versions) + 1}"

# Build the path relative to the managed folder root
model_path_in_managed_folder = (
    artifact_uri.split(experiments_folder_id, 1)[1].lstrip("/")
    + "/lgb_5fold_ensemble"
)

saved_model_version = saved_model.import_mlflow_version_from_managed_folder(
    version_id=version_id,
    managed_folder=experiments_folder,
    path=model_path_in_managed_folder,
)

The tutorial you referenced has an example of this method being used in step 4.
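
Since the relative path is the easiest part to get wrong, here is a small sketch of the same string manipulation as a standalone function that you can test without DSS. The function name model_path_in_folder, the folder ID, and the example URI are all made up for illustration; print run.info.artifact_uri in your own notebook to see the real value.

```python
def model_path_in_folder(artifact_uri, folder_id, artifact_path):
    """Return the model path relative to the managed folder root.

    Keeps everything after the folder ID in the run's artifact URI,
    then appends the artifact_path passed to log_model.
    """
    if folder_id not in artifact_uri:
        raise ValueError(f"Folder ID {folder_id!r} not found in {artifact_uri!r}")
    run_dir = artifact_uri.split(folder_id, 1)[1].strip("/")
    return f"{run_dir}/{artifact_path.strip('/')}"


# Hypothetical values, for illustration only
example_uri = "dss-managed-folder://FOLDER123/exp_1/run_abc/artifacts"
relative_path = model_path_in_folder(example_uri, "FOLDER123", "lgb_5fold_ensemble")
print(relative_path)
```

If the folder ID does not appear in the URI, the run was probably not logged through project.setup_mlflow, and the function raises instead of returning a wrong path.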

Hope this helps!

Best regards,

Tom

t_ueno · June 11

Hi Tom,

I’m sorry.

Despite your response, I was unable to resolve the issue.

I’ve tried various things since then to preserve the audit trail, but because MLflow is too difficult to use, I’ve decided to abandon the idea of using it.

Best regards,

t-ueno
