Error while redeploying model from a notebook
I am trying to build a self-adjusting model in Dataiku, and I want to use the APIs to do it.
I already have a model deployed in the flow. After each data refresh, I want to check the model's performance and, if it is below a threshold, retrain the model. Below is the code I am using:
if trained_model_MAPE > ERROR_THRESHOLD:
    # Wait for the ML task to be ready
    mltask.wait_guess_complete()

    # Obtain settings, enable GBT, and save settings
    settings = mltask.get_settings()
    settings.set_algorithm_enabled("LEASTSQUARE_REGRESSION", True)

    # Iterate over all features in the dataset and set their use/rejection
    # settings.foreach_feature(handle_feature)
    features_to_reject = []

    def handle_feature(feature_name, feature_params):
        if feature_name not in current_features and feature_params["role"] == 'INPUT':
            features_to_reject.append(feature_name)
        return feature_params

    settings.foreach_feature(handle_feature)

    for feature_name in current_features:
        settings.use_feature(feature_name)
    for feature_name in features_to_reject:
        settings.reject_feature(feature_name)
    settings.save()

    mltask.start_train()
    mltask.wait_train_complete()

    # Get the identifiers of the trained models
    ids = mltask.get_trained_models_ids()
    mape_list = []
    for id in ids:
        details = mltask.get_trained_model_details(id)
        algorithm = details.get_modeling_settings()["algorithm"]
        mape = details.get_performance_metrics()["mape"]
        print(f"Algorithm={algorithm} MAPE={mape}")
        mape_list.append(mape)

    # Select the best model
    best_model_index = pd.Series(mape_list).idxmin()

    # Deploy the best model
    model_to_deploy = ids[best_model_index]
I have the following questions:
1) How do I use a different dataset as the test data (the "Explicit extract from two datasets" policy in the Design tab of the Lab)? Currently, I set up the mltask as shown below.
I want to use "MasterData" as my training data and another dataset, "UpcomingData", as my test data.
mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)
2) I want to use just two algorithms, GBT_REGRESSION and LEASTSQUARE_REGRESSION. Currently, there are some default algorithms as well. How do I restrict the use of algorithms?
3) I am getting the following error while redeploying the model in the flow. Why am I getting this error? I want to replace the model I built previously before the data refresh with a new model. Is it possible in Dataiku?
ret = mltask.redeploy_to_flow(model_to_deploy,saved_model_id="EXISTING_MODEL_ID")
DataikuException: com.dataiku.dip.exceptions.APIIllegalArgumentException: Saved model is not built by a recipe of this project
Operating system used: Windows
Answers
-
- You would use set_split_explicit on the split params of the mltask settings, which you obtain via get_split_params.
- You could call disable_all_algorithms before enabling the ones you want.
- This depends a bit on the specifics of your project, and troubleshooting it may require some of your project's configuration. I suggest getting in touch with Dataiku's customer support.
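To illustrate the second point, here is a minimal sketch of restricting training to two algorithms. The helper name restrict_algorithms is hypothetical; it assumes mltask is the object returned by create_prediction_ml_task, and that the algorithm names are the ones used in the question. Treat it as a starting point and check the method names against your DSS version.

```python
def restrict_algorithms(mltask, allowed=("GBT_REGRESSION", "LEASTSQUARE_REGRESSION")):
    """Disable every algorithm on the ML task, then enable only those in `allowed`.

    `restrict_algorithms` is a hypothetical helper, not part of the Dataiku API.
    """
    # Wait until DSS has finished guessing the default settings
    mltask.wait_guess_complete()

    settings = mltask.get_settings()

    # Turn off the algorithms that the guess policy enabled by default
    settings.disable_all_algorithms()

    # Re-enable only the algorithms we actually want to train
    for name in allowed:
        settings.set_algorithm_enabled(name, True)

    # Persist the changes before calling mltask.start_train()
    settings.save()
    return settings
```

Calling restrict_algorithms(mltask) before mltask.start_train() should leave only the two listed algorithms enabled for that training session.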
-
@AdrienL
Thanks for the reply.
Can you elaborate on the first point?
I'm still not able to use two different datasets, one for training and one for testing. I tried the following:
settings.get_split_params().set_split_explicit(test_dataset_name="UpcomingData_Toronto")
It gives the error below:
TypeError: set_split_explicit() missing 2 required positional arguments: 'train_selection' and 'test_selection'
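The TypeError says that train_selection and test_selection are required positional arguments, so they have to be passed before test_dataset_name. Below is a hedged sketch of one way to do that. The helper name use_separate_test_dataset is hypothetical, and the dataset_name keyword and the DSSDatasetSelectionBuilder usage shown in the trailing comment are assumptions based on my reading of the dataikuapi client, so verify them against the API reference for your DSS version.

```python
def use_separate_test_dataset(mltask, train_dataset, test_dataset,
                              train_selection, test_selection):
    """Configure an explicit train/test split that reads from two datasets.

    Hypothetical helper. `train_selection` and `test_selection` describe which
    rows to extract from each dataset and are passed through unchanged.
    """
    settings = mltask.get_settings()
    split_params = settings.get_split_params()

    # The two selections are positional; the dataset names are keywords.
    split_params.set_split_explicit(
        train_selection,
        test_selection,
        dataset_name=train_dataset,      # assumed keyword for the train dataset
        test_dataset_name=test_dataset,  # keyword taken from the question
    )

    settings.save()
    return settings


# Possible usage inside DSS (assumes these builder methods exist in your version):
# from dataikuapi.dss.utils import DSSDatasetSelectionBuilder
# use_separate_test_dataset(
#     mltask, "MasterData", "UpcomingData",
#     DSSDatasetSelectionBuilder().with_all_data_sampling(),
#     DSSDatasetSelectionBuilder().with_all_data_sampling(),
# )
```

Passing the training dataset name explicitly, rather than relying on the dataset the ML task was created with, keeps the call unambiguous about which dataset each extract comes from.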