Error while redeploying model from a notebook
I am trying to build a self-adjusting model in Dataiku, and I want to use the APIs to do it.
I already have a model deployed in the flow. After each data refresh, I want to check the model's performance and, if it is below a threshold, retrain the model. Below is the code I am using:
if trained_model_MAPE > ERROR_THRESHOLD:
    # Wait for the ML task to be ready
    mltask.wait_guess_complete()
    # Obtain settings and enable least-squares regression
    settings = mltask.get_settings()
    settings.set_algorithm_enabled("LEASTSQUARE_REGRESSION", True)
    # Iterate over all features in the dataset and collect the input features to reject
    features_to_reject = []
    def handle_feature(feature_name, feature_params):
        if feature_name not in current_features and feature_params["role"] == 'INPUT':
            features_to_reject.append(feature_name)
        return feature_params
    settings.foreach_feature(handle_feature)
    # Use the current features, reject the others, and save settings
    for feature_name in current_features:
        settings.use_feature(feature_name)
    for feature_name in features_to_reject:
        settings.reject_feature(feature_name)
    settings.save()
    mltask.start_train()
    mltask.wait_train_complete()
    # Get the identifiers of the trained models
    ids = mltask.get_trained_models_ids()
    mape_list = []
    for model_id in ids:
        details = mltask.get_trained_model_details(model_id)
        algorithm = details.get_modeling_settings()["algorithm"]
        mape = details.get_performance_metrics()["mape"]
        print(f"Algorithm={algorithm} MAPE={mape}")
        mape_list.append(mape)
    # Select the best model (lowest MAPE)
    best_model_index = pd.Series(mape_list).idxmin()
    # Deploy the best model
    model_to_deploy = ids[best_model_index]
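The "select the best model" step can also be written without pandas. The following is only a sketch: `pick_best_model` is a hypothetical helper, not part of the Dataiku API, and it assumes the MAPE values have been collected into a dict keyed by trained model id. It skips models whose metric is missing or NaN.

```python
import math


def pick_best_model(mape_by_model_id):
    """Return the model id with the lowest usable MAPE, or None if there is none.

    `mape_by_model_id` is assumed to map trained model ids to MAPE values
    (hypothetical structure, built by the caller from the training results).
    """
    usable = {
        model_id: mape
        for model_id, mape in mape_by_model_id.items()
        if mape is not None and not math.isnan(mape)
    }
    if not usable:
        return None
    # min() over the keys, ordered by their MAPE value
    return min(usable, key=usable.get)
```

Returning None when no model has a usable metric lets the caller skip the redeploy step instead of failing on an empty list.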
I have the following questions:
1) How do I use a different dataset as the test data (the "Explicit extract from two datasets" policy in the Design tab of the Lab)? Currently, I am setting up the mltask as shown below.
I want to use MasterData as my training data and another dataset, "UpcomingData", as my test data.
mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)
2) I want to use just two algorithms, GBT_REGRESSION and LEASTSQUARE_REGRESSION. Currently, some default algorithms are enabled as well. How do I restrict which algorithms are used?
3) I am getting the following error while redeploying the model to the flow. Why am I getting this error? I want to replace the model I built before the data refresh with the new model. Is this possible in Dataiku?
ret = mltask.redeploy_to_flow(model_to_deploy, saved_model_id="EXISTING_MODEL_ID")
DataikuException: com.dataiku.dip.exceptions.APIIllegalArgumentException: Saved model is not built by a recipe of this project
Operating system used: Windows
Answers
-
- You would use set_split_explicit on the split params of the mltask settings, which you obtain via get_split_params.
- You could call disable_all_algorithms before enabling the ones you want.
- This depends a bit on the specifics of your project and may need some of your project's configuration to troubleshoot. I suggest getting in touch with Dataiku's customer support.
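For the second point, a minimal sketch of "disable everything, then enable what you want". `restrict_algorithms` is a hypothetical helper name; it only uses the settings methods already mentioned in this thread (`disable_all_algorithms`, `set_algorithm_enabled`, `save`) and assumes `settings` comes from `mltask.get_settings()`.

```python
def restrict_algorithms(settings, wanted):
    """Enable only the algorithms named in `wanted` on the ML task settings.

    `settings` is assumed to be the object returned by mltask.get_settings().
    """
    # Turn off whatever the guess policy enabled by default
    settings.disable_all_algorithms()
    # Re-enable only the requested algorithms
    for name in wanted:
        settings.set_algorithm_enabled(name, True)
    # Persist the change before training
    settings.save()


# Intended usage (requires a real mltask, so left as a comment):
# restrict_algorithms(mltask.get_settings(),
#                     ["GBT_REGRESSION", "LEASTSQUARE_REGRESSION"])
```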
-
@AdrienL
Thanks for the reply.
Can you elaborate on the first?
I'm still not able to use two different datasets, one for training and one for testing. I tried the following:
settings.get_split_params().set_split_explicit(test_dataset_name="UpcomingData_Toronto")
It gives the error below:
TypeError: set_split_explicit() missing 2 required positional arguments: 'train_selection' and 'test_selection'
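The TypeError indicates that `train_selection` and `test_selection` have to be supplied along with `test_dataset_name`. A rough sketch of the call shape follows. `use_separate_test_dataset` is a hypothetical helper, and what to pass as the two selections is an assumption here: in the dataikuapi client they appear to be dataset selection objects (built with `DSSDatasetSelectionBuilder` from `dataikuapi.dss.utils`, as far as I know), so check the API reference for your DSS version.

```python
def use_separate_test_dataset(settings, train_selection, test_selection,
                              test_dataset_name):
    """Configure an explicit split whose test extract comes from a second dataset.

    `settings` is assumed to be the object returned by mltask.get_settings();
    the two selection arguments are passed through unchanged.
    """
    split_params = settings.get_split_params()
    # Both selections are required positional arguments, per the TypeError above
    split_params.set_split_explicit(
        train_selection,
        test_selection,
        test_dataset_name=test_dataset_name,
    )
    # Persist the change before training
    settings.save()


# Intended usage (requires a real mltask and real selection objects):
# use_separate_test_dataset(mltask.get_settings(), train_sel, test_sel,
#                           "UpcomingData_Toronto")
```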