Added on February 20, 2025 10:46AM
Hi all, Dataiku novice here.
I have trained models in Python notebooks with MLflow, then deployed a model to the Flow (from Experiment Tracking) in order to score and evaluate it. I have 50 features, and when I feed the feature table with 50 columns into the scoring recipe, everything is fine and I get my predictions.
The problem is that my row identification column (let's call it customer_id) is not in the feature set, so I cannot join the predictions back to customers and do not know which prediction belongs to which customer. I managed to horizontally stack the predictions back onto my input dataframe using a new notebook, but this does not seem like an elegant solution.
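For reference, the notebook workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `attach_predictions`, `toy_predict`, and the columns `f1` and `f2` are hypothetical names, and the toy function stands in for the real model's `predict`. The point is that the predictions are pinned to the input dataframe's own index, so each one stays next to its customer_id.

```python
import numpy as np
import pandas as pd

def attach_predictions(input_df, feature_cols, predict_fn, id_col="customer_id"):
    """Score the feature columns only and return the ID next to each prediction."""
    features = input_df[feature_cols]
    # np.asarray discards any index carried by the model output, so the
    # predictions follow the row order of input_df and reuse its index.
    preds = pd.Series(
        np.asarray(predict_fn(features)),
        index=input_df.index,
        name="prediction",
    )
    return pd.concat([input_df[[id_col]], preds], axis=1)

# Toy stand-in for model.predict (an assumption, not the real LightGBM model)
def toy_predict(X):
    return (X["f1"] + X["f2"] > 1).astype(int)

scoring_df = pd.DataFrame(
    {
        "customer_id": ["c1", "c2", "c3"],
        "f1": [0.2, 0.9, 0.6],
        "f2": [0.3, 0.8, 0.1],
    },
    index=[10, 11, 12],
)
scored = attach_predictions(scoring_df, ["f1", "f2"], toy_predict)
```

This keeps the ID and the prediction in one table, but it still lives outside the scoring recipe, which is why it feels like a workaround.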
If I try to feed the feature table plus customer_id (51 columns) into the scoring recipe, it throws an error saying there is a mismatch in the number of features.
I believe I have seen people adding id/index columns to their input dataframes for score/evaluate recipes and they seem to get their predictions with id columns still present. So what am I missing here?
Thank you for your responses in advance.
Operating system used: W11
Welcome to the Dataiku community. We are glad to have you here with us.
In general you are correct: you need an ID key column in the dataset you are scoring so you can connect it back to other data you might have. (This is common, and I do it often.) So you have the right idea there.
I think that you should focus on “If i try to input the table with feature table + customer_id (51 columns) into to the scoring recipe, it throws an error saying number of features mismatch.” With this description I don’t have enough information to understand exactly how this error is coming up. For the community to help you, we will need some more details.
This description does make me wonder, though: how are you building the model? In general I've found with the visual model builders that I have to rebuild and redeploy the model whenever I add, remove, or change the type of a feature in the dataset, even if I'm ignoring a column in the model, like a customer key. (Remember to exclude the customer number from the list of features, or the model is very likely to overfit.)
Finally, if you have a key in your dataset, in general I would just pass it through the model building and scoring phases of your flow. The only reason I'd create a new ID column is if the existing data coming into the flow did not have a unique key and I needed to make a join somewhere later in my process.
Hope that helps a bit. Others in the community may have further insights, particularly if you provide further details.
Thank you for your answer.
The exact error from the scoring recipe when I try to include the customer_id column in the input dataset is:
Regarding model building, this is the code I use to train the model and log it in the Experiment Tracking tab. From there, I deploy it to the Flow by clicking the 'Deploy the Model' button.
import lightgbm as lgb

# project, managed_folder, train, selected_features and params are
# assumed to be defined earlier in the notebook
run_name = 'run_123'
experiment_name = "experiment_123"
mlflow_extension = project.get_mlflow_extension()
with project.setup_mlflow(managed_folder=managed_folder) as mlflow_client:
    mlflow_client.set_experiment(experiment_name)
    with mlflow_client.start_run(run_name=run_name) as run:
        mlflow_client.lightgbm.autolog()
        # selected_features is a list of 50 feature columns (no customer_id)
        X_train = train[selected_features]
        y_train = train['target']  # 1-D target Series
        model = lgb.LGBMClassifier(**params)
        model.fit(X_train, y_train)
        # classes_ is the public attribute holding the fitted class labels
        classes = model.classes_.tolist()
        # Record how Dataiku should interpret this run when it is deployed
        mlflow_extension.set_run_inference_info(
            run_id=run.info.run_id,
            prediction_type='BINARY_CLASSIFICATION',
            classes=classes,
            target='target'
        )
The original train dataframe does have the column 'customer_id'. I don't know how to pass it through the model building and scoring phases of the flow.
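One direction that might be worth trying, sketched here without any Dataiku or MLflow specifics, is to wrap the trained estimator so that it selects its own feature columns before predicting. A 51-column input would then no longer trip a feature-count check. Everything in the sketch is hypothetical: `FeatureSelectingModel` and `ThresholdModel` are illustrative names, and the stand-in model only imitates an estimator that insists on the trained column count. Whether a model wrapped this way can be logged and deployed so that the scoring recipe keeps customer_id in its output is an assumption that would need to be checked.

```python
import pandas as pd

class ThresholdModel:
    """Stand-in for a trained estimator that rejects unexpected column counts."""
    n_features = 2

    def predict(self, X):
        if X.shape[1] != self.n_features:
            raise ValueError("number of features mismatch")
        return (X.sum(axis=1) > 1).to_numpy().astype(int)

class FeatureSelectingModel:
    """Hypothetical wrapper: keeps only the trained feature columns, then predicts."""

    def __init__(self, model, feature_cols):
        self.model = model
        self.feature_cols = list(feature_cols)

    def predict(self, df):
        missing = [c for c in self.feature_cols if c not in df.columns]
        if missing:
            raise ValueError(f"Missing feature columns: {missing}")
        # Extra columns such as customer_id are ignored here, not passed on.
        return self.model.predict(df[self.feature_cols])

wide_df = pd.DataFrame(
    {"customer_id": ["a", "b"], "f1": [0.2, 0.9], "f2": [0.3, 0.8]}
)
wrapped = FeatureSelectingModel(ThresholdModel(), ["f1", "f2"])
wrapped_preds = wrapped.predict(wide_df)

# The bare stand-in model fails on the same three-column input.
try:
    ThresholdModel().predict(wide_df)
    raw_failed = False
except ValueError:
    raw_failed = True
```

The design choice is to keep column selection inside the model object rather than in the dataset preparation, so the same input table can carry the ID column all the way through scoring.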
@esmvy ,
I don't have a good answer for you at this time, and I don't have much time to follow up. Several thoughts.
—Tom