Using two different datasets to train a model in lab using API
Hi,
I'm trying to train a model in lab using the API from a notebook.
I'm using the code below to set up the ML task. I'm currently using "MasterData" as my data. I want to use a different dataset, "UpcomingData", as my test data and "MasterData" as my train data (the "Explicit extract from two datasets" policy in the Design tab of the Lab).
How do I achieve this? What code changes do I need to make?
From a previous question in the community, I understood that I could use set_split_explicit on the mltask settings' split params obtained via get_split_params.
Thanks in advance!
mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)
Operating system used: Windows
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
Indeed, you would need to obtain already-created split params.
The easiest way is probably to create another MLTask manually via the UI, retrieve its settings with get_settings, and apply the changes you need, e.g. a different dataset name, or reuse the whole settings.get_raw()['splitParams'] dict from the pre-created model.
After you create an analysis, you can use the splitParams dict from another visual model.
import dataiku
client = dataiku.api_client()
p = client.get_project("TUT_EXTERNAL_MODELS")
#mltask = p.get_ml_task("<analysis_id>", "<mltask_id>")
mltask = p.get_ml_task("8Nkypr4s", "AtOStv1l")
settings = mltask.get_settings()
settings.get_raw()['splitParams']['eftdTrain']['datasetSmartName'] = "train_split_b"
settings.get_raw()['splitParams']['eftdTest']['datasetSmartName'] = "train_split_c"
settings.save()
Thanks
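The two assignments above can be wrapped in a small helper, sketched below. It works on a plain dict, so it runs without DSS. The helper name point_split_at and the stand-in dict are hypothetical; only the 'eftdTrain', 'eftdTest' and 'datasetSmartName' keys come from the snippet above, and a real splitParams dict will likely contain more fields than shown here.

```python
def point_split_at(split_params, train_dataset, test_dataset):
    """Set the train and test dataset names on a raw splitParams dict, in place.

    Hypothetical helper: only the key names used here appear in the answer above.
    """
    # setdefault keeps any existing extract settings and creates the entry if missing
    split_params.setdefault("eftdTrain", {})["datasetSmartName"] = train_dataset
    split_params.setdefault("eftdTest", {})["datasetSmartName"] = test_dataset
    return split_params


# Stand-in for settings.get_raw()['splitParams'] (illustrative, not a full schema)
example = {
    "eftdTrain": {"datasetSmartName": "train_split_b"},
    "eftdTest": {"datasetSmartName": "train_split_c"},
}
point_split_at(example, "MasterData", "UpcomingData")
print(example["eftdTrain"]["datasetSmartName"])  # MasterData
print(example["eftdTest"]["datasetSmartName"])   # UpcomingData

# Inside DSS this would presumably be used as:
#   settings = mltask.get_settings()
#   point_split_at(settings.get_raw()['splitParams'], "MasterData", "UpcomingData")
#   settings.save()
```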
-
@AlexT, I didn't follow the solution you provided.
I see a method set_split_explicit to set the train/test split to an explicit extract from one or two dataset(s).
I tried it as follows:
settings.get_split_params().set_split_explicit(train_selection,test_selection,
test_dataset_name="UpcomingData_Toronto")
Not sure what is expected out of train_selection and test_selection arguments.
In the documentation it is given as below.
train_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the train dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
test_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the test dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()
What should I provide as arguments for train_selection and test_selection in set_split_explicit?
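Going by the documentation quoted above, each argument is a "selection" describing which rows of the dataset to extract, passed either as a builder or as a dict. A minimal sketch follows, assuming the plain dict {"samplingMethod": "FULL"} is what DSSDatasetSelectionBuilder().with_all_data_sampling().build() produces (worth checking against your DSS version). The helper names full_selection and use_two_datasets are hypothetical; the train dataset is left as the MLTask's input dataset.

```python
def full_selection():
    # Assumption: this dict matches what
    # DSSDatasetSelectionBuilder().with_all_data_sampling().build() returns,
    # i.e. "use every row of the dataset, no sampling".
    return {"samplingMethod": "FULL"}


def use_two_datasets(settings, test_dataset_name):
    """Switch the split to an explicit extract, with a separate test dataset.

    `settings` is expected to be the object returned by mltask.get_settings().
    """
    split_params = settings.get_split_params()
    split_params.set_split_explicit(
        full_selection(),   # train_selection: all rows of the train dataset
        full_selection(),   # test_selection: all rows of the test dataset
        test_dataset_name=test_dataset_name,
    )
    return split_params


# Inside DSS this would presumably be used as:
#   settings = mltask.get_settings()
#   use_two_datasets(settings, "UpcomingData")
#   settings.save()
```

Passing None for either selection should, per the quoted documentation, leave that extract unchanged.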