Can't figure out how to match predictions with customers.

esmvy
esmvy Registered Posts: 2 ✭✭
edited February 20 in Using Dataiku

Hi all, Dataiku novice here.

I have trained models in Python notebooks with MLflow and then deployed one of them to the Flow (from Experiment Tracking) in order to score and evaluate it. I have 50 features, and when I feed the feature table with 50 columns into the scoring recipe, everything works fine and I get my predictions.

The problem is that my row identifier column (let's call it customer_id) is not in the feature set, so I cannot join the predictions back to the customers and I do not know which prediction belongs to which customer. I managed to horizontally stack the predictions back onto my input dataframe in a new notebook, but that does not feel like an elegant solution.
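
The workaround notebook boils down to roughly this (just a sketch, and the dataframe/column names are only illustrative); it relies on the scoring output keeping the same row order as the input:

import pandas as pd

# full_df still carries customer_id plus the 50 features;
# preds is the scored output, assumed to be in the same row order
stacked = pd.concat(
    [full_df.reset_index(drop=True), preds.reset_index(drop=True)],
    axis=1,
)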

If I try to feed the feature table plus customer_id (51 columns) into the scoring recipe, it throws an error about a feature count mismatch.

I believe I have seen people add id/index columns to their input dataframes for score/evaluate recipes, and they seem to get their predictions with the id columns still present. So what am I missing here?

Thank you in advance for your responses.

Operating system used: W11

Answers

  • tgb417
    tgb417 Neuron Posts: 1,618

    @esmvy,

    Welcome to the Dataiku community. We are glad to have you here with us.

    In general you are correct: you need an id/key column in the dataset you are scoring so you can connect it back to any other data you might have. (This is common, and I do it often.) So the right idea on your part.

    I think you should focus on “If I try to feed the feature table plus customer_id (51 columns) into the scoring recipe, it throws an error about a feature count mismatch.” From that description alone I don't have enough information to understand exactly how the error comes up. For the community to help you, we will need some more details.

    That description does make me wonder how you are building the model. In general I've found with the visual model builders that I have to rebuild and redeploy the model whenever I add, remove, or change the type of a feature in the dataset, even if the model is ignoring a column such as a customer key. (Remember to exclude the customer id from the list of features, or the model is very likely to over-fit.)

    Finally, if you already have a key in your dataset, I would in general just pass it through the model-building and scoring phases of your flow. The only reason I'd create a new id column is if the data coming into the flow did not have a unique key and I needed to make a join somewhere later in my process.
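
    For example, once the scored output still carries the key, getting back to the rest of your customer data is just a join on that key. A rough pandas sketch (the dataframe and column names are only illustrative):

    import pandas as pd

    # scored: output of the scoring recipe, with customer_id kept
    # customers: any other table keyed on customer_id
    result = scored.merge(customers, on='customer_id', how='left')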

    Hope that helps a bit. Others in the community may have further insights, particularly if you provide more details.

  • esmvy
    esmvy Registered Posts: 2 ✭✭

    Thank you for your answer.

    The exact error from the scoring recipe, when I try to include the customer_id column in the input dataset, is:

    Job failed: Error in python process: <class 'ValueError'>: Number of features of the model must match the input. Model n_features_ is 50 and input n_features is 51
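
    As far as I can tell this error comes from LightGBM itself: the scikit-learn wrapper refuses to predict when the input has a different number of columns than it was trained on, so the scoring recipe seems to be handing all 51 columns to the model. A minimal reproduction outside Dataiku (purely illustrative):

    import numpy as np
    import lightgbm as lgb

    # train on 50 features, then try to score a table with 51 columns
    X_train = np.random.rand(200, 50)
    y_train = np.random.randint(0, 2, 200)
    model = lgb.LGBMClassifier().fit(X_train, y_train)

    model.predict(np.random.rand(10, 51))  # raises the same ValueError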

    Regarding model building, this is the code I use to train and log the model in the Experiment Tracking tab. From there, I deploy it to the Flow by clicking the 'Deploy the Model' button.

    import lightgbm as lgb

    # `project`, `managed_folder`, `train`, `params` and `selected_features`
    # are all defined earlier in the notebook
    run_name = 'run_123'
    experiment_name = "experiment_123"
    mlflow_extension = project.get_mlflow_extension()

    with project.setup_mlflow(managed_folder=managed_folder) as mlflow_client:
        mlflow_client.set_experiment(experiment_name)
        with mlflow_client.start_run(run_name=run_name) as run:
            mlflow_client.lightgbm.autolog()

            # selected_features is a list of 50 feature columns (no customer_id)
            X_train = train[selected_features]
            y_train = train[['target']]

            model = lgb.LGBMClassifier(**params)
            model.fit(X_train, y_train)

            classes = model.classes_.tolist()

            mlflow_extension.set_run_inference_info(
                run_id=run.info.run_id,
                prediction_type='BINARY_CLASSIFICATION',
                classes=classes,
                target='target',
            )

    The original train dataframe does have the 'customer_id' column. I just don't know how to pass it through the model-building and scoring phases of the flow.

  • tgb417
    tgb417 Neuron Posts: 1,618

    @esmvy ,

    I don't have a good answer for you at this time, and I don't have much time to follow up. A few thoughts:

    1. I've seen problems like this when using visual models. Whenever I added a column to a dataset prior to model inference, I had to regenerate the model to account for the new column and re-publish it before I could score the new dataset.
    2. Based on the Python you are showing, it appears that you are not using the visual model-building approach. I've not built models in Dataiku that way, so I'm not clear on what you are doing. Others may be able to understand exactly what is going on from this bit of code; however, I cannot.
    3. If I were you, I would likely open a support ticket. The technical support team at Dataiku is very good to excellent; they can gather some diagnostics and identify the root cause more accurately.

    —Tom