Hi,
I'm trying to train a model in the Lab using the API from a notebook.
I'm using the code below to set up the ML task. I'm currently using "MasterData" as my data. I want to use a different dataset, "UpcomingData", as my test data and "MasterData" as my train data (the "Explicit extract from two datasets" policy in the Design tab of the Lab).
How do I achieve this? What code changes do I need to make?
From a previous question in the Community, I understood that I could use set_split_explicit on the ML task settings' split params obtained via get_split_params.
Thanks in advance!
mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)
Operating system used: Windows
Hi,
Indeed, you would need to obtain already-created split params.
The easiest way is probably to create another ML task manually via the UI, retrieve its settings with get_settings, and apply the changes you need, e.g. a different dataset name, or reuse the whole settings.get_raw()['splitParams'] dict from the pre-created model.
After you create an analysis, you can use the splitParams dict from another visual model.
import dataiku
client = dataiku.api_client()
p = client.get_project("TUT_EXTERNAL_MODELS")
#mltask = p.get_ml_task("<analysis_id>", "<mltask_id>")
mltask = p.get_ml_task("8Nkypr4s", "AtOStv1l")
settings = mltask.get_settings()
settings.get_raw()['splitParams']['eftdTrain']['datasetSmartName'] = "train_split_b"
settings.get_raw()['splitParams']['eftdTest']['datasetSmartName'] = "train_split_c"
settings.save()
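To illustrate the "whole splitParams" option mentioned above, here is a minimal sketch that copies the dict from a task configured in the UI onto another task. The helper name copy_split_params is hypothetical (it is not part of the Dataiku API), and the sketch assumes both arguments are ML task handles such as those returned by p.get_ml_task, with get_raw() returning the live settings dict as in the snippet above.

```python
import copy


def copy_split_params(source_mltask, target_mltask):
    """Copy the whole splitParams dict from one ML task onto another.

    Hypothetical helper: both arguments are assumed to be ML task handles
    exposing get_settings(), get_raw() and save() as used above.
    """
    source_raw = source_mltask.get_settings().get_raw()
    target_settings = target_mltask.get_settings()
    # Deep copy so later edits to one task's dict never leak into the other
    target_settings.get_raw()["splitParams"] = copy.deepcopy(source_raw["splitParams"])
    target_settings.save()
    return target_settings
```

The dataset names inside the copied dict can then be adjusted in the same way as the eftdTrain and eftdTest entries above.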
Thanks
@AlexT, I didn't follow the solution you provided.
I see there is a method, set_split_explicit, that sets the train/test split to an explicit extract from one or two dataset(s).
I tried it as follows:
settings.get_split_params().set_split_explicit(train_selection, test_selection,
                                               test_dataset_name="UpcomingData_Toronto")
I'm not sure what is expected for the train_selection and test_selection arguments.
In the documentation, they are described as follows:
train_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the train dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
test_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the test dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
What should I provide as the train_selection and test_selection arguments to set_split_explicit?
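Going by the documentation quoted above, one plausible reading is that each argument is a DSSDatasetSelectionBuilder (or the dict from its build() method) describing which rows to extract. The sketch below is untested against a live DSS instance. The helper names full_extract_selections and apply_explicit_split are hypothetical, and it assumes that with_all_data_sampling() is the builder option for taking every row and that the train dataset stays the one the ML task was created with.

```python
def full_extract_selections():
    """Build two selections that each take every row of their dataset."""
    # Imported here so the helpers can be defined where dataikuapi is absent
    from dataikuapi.dss.utils import DSSDatasetSelectionBuilder

    # Assumption: "all data" sampling is the right extract for both sides
    train_selection = DSSDatasetSelectionBuilder().with_all_data_sampling()
    test_selection = DSSDatasetSelectionBuilder().with_all_data_sampling()
    return train_selection, test_selection


def apply_explicit_split(mltask, train_selection, test_selection, test_dataset_name):
    """Set an explicit train/test extract on the ML task and save the settings."""
    settings = mltask.get_settings()
    # The train extract comes from the ML task's main dataset;
    # the test extract comes from test_dataset_name
    settings.get_split_params().set_split_explicit(
        train_selection,
        test_selection,
        test_dataset_name=test_dataset_name,
    )
    settings.save()
    return settings
```

With the mltask created earlier in the thread, this would be called as train_sel, test_sel = full_extract_selections() followed by apply_explicit_split(mltask, train_sel, test_sel, "UpcomingData").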