Using two different datasets to train a model in lab using API

Mohammed · April 2024

Hi,
I'm trying to train a model in the lab using the API from a notebook.
I'm using the code below to set up the ML task. I'm currently using "MasterData" as my data. I want to use a different dataset, "UpcomingData", as my test data and "MasterData" as my train data (the "Explicit extract from two datasets" policy in the Design tab of the lab).
How do I achieve this? What code changes do I need to make?
From a previous question in the community, I understood that I could use set_split_explicit on the mltask settings' split params obtained via get_split_params.
Thanks in advance!

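import dataiku

client = dataiku.api_client()
p = client.get_default_project()  # assumes the notebook runs inside the target project

mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)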

Operating system used: Windows

Alexandru · April 2024

Hi,
Indeed, you would need to obtain already created split params.

The easiest way is probably to create another MLTask manually via the UI, retrieve its settings with get_settings, and apply the changes you need, e.g. a different dataset name, or reuse the whole settings.get_raw()['splitParams'] dict from the pre-created model.

After you create an analysis, you can reuse the splitParams dict from another visual model.

import dataiku

client = dataiku.api_client()
p = client.get_project("TUT_EXTERNAL_MODELS")
# mltask = p.get_ml_task("<analysis_id>", "<mltask_id>")
mltask = p.get_ml_task("8Nkypr4s", "AtOStv1l")
settings = mltask.get_settings()
# Point the train and test extracts at two different datasets
split_params = settings.get_raw()['splitParams']
split_params['eftdTrain']['datasetSmartName'] = "train_split_b"
split_params['eftdTest']['datasetSmartName'] = "train_split_c"
settings.save()
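
For a task that was not created with the two-dataset policy in the UI, the same edit can be sketched as a small helper that works on the plain splitParams dict. The helper name apply_two_dataset_split is hypothetical, and the ttPolicy value "EXPLICIT_FILTERING_TWO_DATASETS" is an assumption that should be checked against the splitParams of a task configured in the UI. Only the eftdTrain / eftdTest / datasetSmartName keys come from the snippet above.

```python
def apply_two_dataset_split(split_params, train_dataset, test_dataset):
    """Edit a splitParams dict in place so train and test use different datasets.

    Hypothetical helper: it only touches a plain dict and never calls DSS.
    """
    # Assumed policy name for "explicit extract from two datasets";
    # compare with a task configured through the UI before relying on it.
    split_params["ttPolicy"] = "EXPLICIT_FILTERING_TWO_DATASETS"
    # Create the sub-dicts if they are missing, then set the dataset names.
    split_params.setdefault("eftdTrain", {})["datasetSmartName"] = train_dataset
    split_params.setdefault("eftdTest", {})["datasetSmartName"] = test_dataset
    return split_params


# Stand-in for settings.get_raw()['splitParams'] so the sketch runs anywhere.
example_split_params = {"ttPolicy": "SPLIT_SINGLE_DATASET"}
apply_two_dataset_split(example_split_params, "MasterData", "UpcomingData")
print(example_split_params["eftdTest"]["datasetSmartName"])
```

In a notebook, the dict to pass would be settings.get_raw()['splitParams'], followed by settings.save() as in the snippet above.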

Thanks

Mohammed · April 2024

@AlexT, I didn't follow the solution you provided.
I see a method, set_split_explicit, that sets the train/test split to an explicit extract from one or two dataset(s).

I tried it as follows:
settings.get_split_params().set_split_explicit(train_selection, test_selection,
    test_dataset_name="UpcomingData_Toronto")
I'm not sure what is expected for the train_selection and test_selection arguments.
In the documentation they are described as below.
In the documentation it is given as below.

train_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the train dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()

test_selection (Union[DSSDatasetSelectionBuilder, dict]) – Builder or dict defining the settings of the extract for the test dataset. May be None (won’t be changed). A dict with the appropriate schema can be generated via dataikuapi.dss.utils.DSSDatasetSelectionBuilder.build()

What should I provide as arguments for train_selection and test_selection in the set_split_explicit ?
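
One possible reading of the documentation quoted above, as a sketch rather than a confirmed answer: each selection argument may be None (the extract is left unchanged), a dict, or a DSSDatasetSelectionBuilder. The wrapper name use_separate_test_dataset is hypothetical. The dataset_name keyword for the train dataset is an assumption about the set_split_explicit signature that the call above does not pass, so check it against your dataikuapi version.

```python
def use_separate_test_dataset(mltask, train_dataset, test_dataset,
                              train_selection=None, test_selection=None):
    """Configure an explicit train/test split across two datasets.

    Hypothetical wrapper around the calls discussed in this thread.
    `mltask` is expected to be the object returned by
    create_prediction_ml_task or get_ml_task.
    """
    settings = mltask.get_settings()
    split_params = settings.get_split_params()
    # None selections mean "do not change the extract", per the quoted docs.
    # dataset_name (the train dataset) is an assumed keyword argument.
    split_params.set_split_explicit(
        train_selection,
        test_selection,
        dataset_name=train_dataset,
        test_dataset_name=test_dataset,
    )
    settings.save()
    return settings


# Usage inside a DSS notebook (not executed here):
# from dataikuapi.dss.utils import DSSDatasetSelectionBuilder
# test_sel = DSSDatasetSelectionBuilder()  # configure sampling as needed
# use_separate_test_dataset(mltask, "MasterData", "UpcomingData",
#                           test_selection=test_sel)
```

Passing None for both selections keeps whatever extract settings the task already has, which is the smallest change that still switches the test data to the second dataset.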
