I am trying to build a self-adjusting model in Dataiku, and I want to use the Python APIs to do it.
I already have a model deployed in the Flow. After each data refresh, I want to check the model's performance, and if it falls below a threshold (that is, if the MAPE exceeds ERROR_THRESHOLD), I want to retrain the model. Below is the code I am using:
import pandas as pd

if trained_model_MAPE > ERROR_THRESHOLD:
    # Wait for the ML task to be ready
    mltask.wait_guess_complete()

    # Obtain the settings and enable least-squares regression
    settings = mltask.get_settings()
    settings.set_algorithm_enabled("LEASTSQUARE_REGRESSION", True)

    # Collect every input feature that is not in current_features so it can be rejected
    features_to_reject = []

    def handle_feature(feature_name, feature_params):
        if feature_name not in current_features and feature_params["role"] == 'INPUT':
            features_to_reject.append(feature_name)
        return feature_params

    settings.foreach_feature(handle_feature)

    # Use the current features, reject the others, and save the settings
    for feature_name in current_features:
        settings.use_feature(feature_name)
    for feature_name in features_to_reject:
        settings.reject_feature(feature_name)
    settings.save()

    mltask.start_train()
    mltask.wait_train_complete()

    # Get the identifiers of the trained models
    ids = mltask.get_trained_models_ids()
    mape_list = []
    for model_id in ids:
        details = mltask.get_trained_model_details(model_id)
        algorithm = details.get_modeling_settings()["algorithm"]
        mape = details.get_performance_metrics()["mape"]
        print(f"Algorithm={algorithm} MAPE={mape}")
        mape_list.append(mape)

    # Select the best model (lowest MAPE)
    best_model_index = pd.Series(mape_list).idxmin()

    # Deploy the best model
    model_to_deploy = ids[best_model_index]
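The retraining check and the best-model selection in the code above can also be written without pandas. This is a minimal sketch, assuming the MAPE values have already been read from `get_performance_metrics()` as shown; the helper names `needs_retraining` and `pick_best_model` and the sample ids are hypothetical and not part of the Dataiku API:

```python
def needs_retraining(current_mape, error_threshold):
    """Return True when the deployed model's error exceeds the threshold."""
    return current_mape > error_threshold


def pick_best_model(mape_by_model_id):
    """Return the id of the trained model with the lowest MAPE.

    mape_by_model_id maps each trained-model id to its MAPE.
    """
    if not mape_by_model_id:
        raise ValueError("no trained models to choose from")
    return min(mape_by_model_id, key=mape_by_model_id.get)


# Illustrative values only; real ids come from mltask.get_trained_models_ids()
example = {"model-a": 0.18, "model-b": 0.11, "model-c": 0.25}
print(needs_retraining(0.20, 0.15))
print(pick_best_model(example))
```

Keying the metrics by model id avoids having to keep `ids` and `mape_list` aligned by position.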
I have the following questions:
1) How do I use a different dataset as the test data (the "Explicit extract from two datasets" policy in the Design tab of the Lab)? Currently, I am setting up the ML task as shown below.
I want to use "MasterData" as my training data and another dataset, "UpcomingData", as my test data.
mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)
2) I want to use just two algorithms, GBT_REGRESSION and LEASTSQUARE_REGRESSION. Currently, some default algorithms are enabled as well. How do I restrict training to only these two?
3) I am getting the following error while redeploying the model to the Flow. Why am I getting this error? I want to replace the model I built before the data refresh with the new one. Is that possible in Dataiku?
ret = mltask.redeploy_to_flow(model_to_deploy, saved_model_id="EXISTING_MODEL_ID")
DataikuException: com.dataiku.dip.exceptions.APIIllegalArgumentException: Saved model is not built by a recipe of this project
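One thing worth ruling out for the error in question 3 is that the `saved_model_id` passed to `redeploy_to_flow` is not the id of a saved model in this project. The sketch below performs that check first. It assumes `project` is the project handle (`p` above) and that `list_saved_models()` returns one dict per saved model with an `"id"` key; the helper name `redeploy_checked` is hypothetical, and this check is not confirmed as the cause of the exception:

```python
def redeploy_checked(project, mltask, model_to_deploy, saved_model_id):
    """Redeploy a trained model onto an existing saved model, after
    checking that the saved model id is known to the project."""
    # Assumption: each listed item is a dict that carries the model's "id"
    known_ids = [item["id"] for item in project.list_saved_models()]
    if saved_model_id not in known_ids:
        raise ValueError(
            f"{saved_model_id!r} is not a saved model of this project; "
            f"known ids: {known_ids}"
        )
    # Same call as in the question, with the id now verified
    return mltask.redeploy_to_flow(model_to_deploy, saved_model_id=saved_model_id)
```

If the id is listed and the exception still appears, the saved model may not have been produced by a training recipe in this project, which is what the message itself suggests.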
Operating system used: Windows
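For question 2, one approach is to turn every algorithm off and then enable only the two that are wanted. This is a minimal sketch, assuming `settings` is the object returned by `mltask.get_settings()` and that it provides `disable_all_algorithms()` alongside the `set_algorithm_enabled()` call already used above; the helper name `restrict_algorithms` is hypothetical:

```python
def restrict_algorithms(settings, wanted):
    """Enable only the algorithms named in `wanted` and save the settings."""
    # Start from a clean slate so the defaults picked by the guess policy are off
    settings.disable_all_algorithms()
    for name in wanted:
        settings.set_algorithm_enabled(name, True)
    settings.save()


# Usage inside the retraining branch (needs a live DSS ML task):
# restrict_algorithms(mltask.get_settings(),
#                     ["GBT_REGRESSION", "LEASTSQUARE_REGRESSION"])
```

Disabling everything first means the result does not depend on which algorithms the guess policy enabled by default.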
@AdrienL Thanks for the reply.
Can you elaborate on the first point?
I'm still not able to use two different datasets, one for training and one for testing. I tried the call below.
settings.get_split_params().set_split_explicit(test_dataset_name="UpcomingData_Toronto")
It gives the error below:
TypeError: set_split_explicit() missing 2 required positional arguments: 'train_selection' and 'test_selection'
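The `TypeError` says that `set_split_explicit()` needs a `train_selection` and a `test_selection` in addition to the dataset names. Below is a hedged sketch of a complete call, assuming the signature `set_split_explicit(train_selection, test_selection, dataset_name=None, test_dataset_name=None)` and that the selections are built with `DSSDatasetSelectionBuilder` from `dataikuapi.dss.utils`. The helper name `use_explicit_test_dataset` is hypothetical, and the builder method shown in the comment should be checked against the installed client version:

```python
def use_explicit_test_dataset(settings, train_selection, test_selection,
                              train_dataset, test_dataset):
    """Train on one dataset and test on another (explicit extract policy)."""
    split_params = settings.get_split_params()
    split_params.set_split_explicit(
        train_selection,
        test_selection,
        dataset_name=train_dataset,
        test_dataset_name=test_dataset,
    )
    settings.save()


# Usage against a live DSS ML task (not runnable outside Dataiku):
# from dataikuapi.dss.utils import DSSDatasetSelectionBuilder
# train_rows = DSSDatasetSelectionBuilder().with_all_data_sampling()
# test_rows = DSSDatasetSelectionBuilder().with_all_data_sampling()
# use_explicit_test_dataset(mltask.get_settings(), train_rows, test_rows,
#                           "MasterData", "UpcomingData_Toronto")
```

The two selections describe which rows to extract from each dataset, so the dataset names alone are not enough for this split policy.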