Predictions Made Using Dataiku Snowpark API Yield a Single Class for a Multi-Class Classification Model

Suhail

Hello community,

I am facing an issue when using Dataiku's Snowpark API to make predictions from a multi-class classification model.

The model was trained in Dataiku Visual ML on a Snowflake table.

The predictions return only one class, whereas the same table loaded as a pandas DataFrame yields correct predictions across multiple classes.

Even when the data read through Snowpark is converted to a pandas DataFrame before prediction, the predictions are still incorrect, with only one class.

Steps to Reproduce:

  1. Train a K-means multi-class classification model in Dataiku Visual ML on a Snowflake table.
  2. Deploy the model with the API Deployer.
  3. Open a Python Jupyter notebook and use the Dataiku Snowpark API to read the data.
  4. Run predictions on the deployed model using the data read through Snowpark.
  5. Observe that all predictions are of a single class.
  6. Read the same Snowflake table as a pandas DataFrame using Dataiku's get_dataframe method.
  7. Run predictions on the pandas DataFrame.
  8. Observe that predictions are as expected, with multiple classes.
  9. Convert the data read through Snowpark to a pandas DataFrame and then predict.
  10. Observe that predictions are still incorrect, all being the same class.
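
To narrow down where the inputs from steps 7 and 9 differ, the rows could be compared before they are sent to the model. The following is only a minimal sketch and not part of the code I ran: `compare_rows` is a hypothetical helper, and it assumes both DataFrames have first been turned into lists of row dicts (for example with pandas `to_dict(orient="records")`). It reports column names that differ only by case and shared columns whose value types differ, which are two possible but unconfirmed sources of a mismatch.

```python
def compare_rows(rows_a, rows_b):
    """Compare column names and first-row value types of two lists of row dicts."""
    cols_a = set(rows_a[0]) if rows_a else set()
    cols_b = set(rows_b[0]) if rows_b else set()

    # Columns present on one side only
    only_a = sorted(cols_a - cols_b)
    only_b = sorted(cols_b - cols_a)

    # Names that match only when case is ignored (e.g. "AGE" vs "age")
    lower_b = {c.lower(): c for c in cols_b}
    case_only = sorted(
        (c, lower_b[c.lower()])
        for c in cols_a
        if c not in cols_b and c.lower() in lower_b
    )

    # Shared columns whose first-row values have different Python types
    type_diffs = {
        c: (type(rows_a[0][c]).__name__, type(rows_b[0][c]).__name__)
        for c in sorted(cols_a & cols_b)
        if type(rows_a[0][c]) is not type(rows_b[0][c])
    }

    return {
        "only_in_a": only_a,
        "only_in_b": only_b,
        "case_only": case_only,
        "type_diffs": type_diffs,
    }
```

An empty result in every field would suggest the difference lies elsewhere, for instance in the values themselves rather than in the schema.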

Code to read data using the Snowpark API:

import dataiku
from dataiku.snowpark import DkuSnowpark

# Read the dataset as a Snowpark DataFrame
input_dataset = dataiku.Dataset("inference_data")
dku_snowpark = DkuSnowpark()
snowpark_session = dku_snowpark.create_session(
    connection_name="SNOWFLAKE_CONNECTION",
    project_key=dataiku.default_project_key()
)
dataset_dataframe = dku_snowpark.get_dataframe(dataset=input_dataset, session=snowpark_session)

Code to read data as a pandas DataFrame:

import dataiku

# Read the same dataset directly as a pandas DataFrame
input_dataset = dataiku.Dataset("inference_data")
dataset_dataframe = input_dataset.get_dataframe()

Code to run predictions:

import dataikuapi

# apinode_endpoint holds the URL of the API node and is defined elsewhere
client = dataikuapi.APINodeClient(apinode_endpoint, "Model_AutoMl")
prediction = client.predict_records("Clustering_Model", dataset_dataframe)['results']
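
For reference, as I understand the dataikuapi documentation, `predict_records` expects a list of records, each a dict with a `features` entry, rather than a DataFrame object. Below is a minimal sketch of building that payload from row dicts. `to_api_records` is a hypothetical helper and not part of the code above; it also checks that the payload is JSON-serializable, since values such as `Decimal` are not.

```python
import json


def to_api_records(rows):
    """Wrap each row dict as {"features": row}, the record shape that
    predict_records is documented to accept."""
    records = [{"features": dict(row)} for row in rows]
    # Fail early (TypeError) if any value cannot be serialized to JSON,
    # for example a Decimal or a datetime object
    json.dumps(records)
    return records
```

With a pandas DataFrame `df`, the rows could come from `df.to_dict(orient="records")`, and the call would then look like `client.predict_records("Clustering_Model", to_api_records(rows))`.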

This is the data and the results when using the Snowpark API:

Screenshot 2024-05-27 at 5.00.43 PM.png

Screenshot 2024-05-27 at 5.00.59 PM.png

This is the data and the results when using a pandas DataFrame:

Screenshot 2024-05-27 at 5.01.37 PM.png

Screenshot 2024-05-27 at 5.01.43 PM.png

  

Any insights or solutions to resolve this inconsistency would be greatly appreciated. Please let me know if additional information is required.

Thank you for your support.


Operating system used: Mac OS

Answers

  • Alexandru (Dataiker)

    Hi @Suhail,

    Could you please open a support ticket for this and share job diagnostics from both variants?
    Thanks
