Using two different datasets to train a model in lab using API

MNOP
Level 3

Hi, 
I'm trying to train a model in lab using the API from a notebook.
I'm using the code below to set up the ML task. I'm currently using "MasterData" as my data. I want to use a different dataset, "UpcomingData", as my test data and "MasterData" as my train data (the "Explicit extract from two datasets" policy in the Design tab of the lab).
How do I achieve this? What code changes do I need to make?
From a previous question in the community, I understood that I could use set_split_explicit on the ML task settings' split params, obtained via get_split_params.
Thanks in advance!

mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='DEFAULT' # Template to use for setting default parameters
)

Operating system used: Windows

 

2 Replies
AlexT
Dataiker

Hi,
Indeed, you would need to obtain already-created split params.

The easiest way is probably to create another ML task manually in the UI, call get_settings on it, and then either change only what you need (e.g., a different dataset name) or copy the whole settings.get_raw()['splitParams'] dict from that pre-created model.

After you create an analysis, you can reuse the splitParams dict from another visual model.

import dataiku

client = dataiku.api_client()
p = client.get_project("TUT_EXTERNAL_MODELS")

# Load an ML task that was already created in the UI:
# mltask = p.get_ml_task("<analysis_id>", "<mltask_id>")
mltask = p.get_ml_task("8Nkypr4s", "AtOStv1l")
settings = mltask.get_settings()

# Point the train and test extracts at the datasets you want
settings.get_raw()['splitParams']['eftdTrain']['datasetSmartName'] = "train_split_b"
settings.get_raw()['splitParams']['eftdTest']['datasetSmartName'] = "train_split_c"

# Persist the modified settings
settings.save()

Thanks
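To make the raw-settings edit above easier to follow, here is a minimal sketch that works on a plain dict shaped like settings.get_raw(), so it can be run without a DSS instance. The helper name use_two_dataset_split is hypothetical. The eftdTrain / eftdTest / datasetSmartName keys come from the snippet above; the ttPolicy values are an assumption about how DSS stores the split policy, so compare them with a task configured in the UI before relying on them.

```python
import copy


def use_two_dataset_split(raw_settings, train_dataset, test_dataset):
    """Return a copy of raw ML task settings using an explicit two-dataset split."""
    updated = copy.deepcopy(raw_settings)
    split = updated.setdefault("splitParams", {})
    # Assumption: policy value for "Explicit extract from two datasets".
    split["ttPolicy"] = "EXPLICIT_FILTERING_TWO_DATASETS"
    # Keys taken from the reply above; created here if they are missing.
    split.setdefault("eftdTrain", {})["datasetSmartName"] = train_dataset
    split.setdefault("eftdTest", {})["datasetSmartName"] = test_dataset
    return updated


# Illustrative input only; real settings contain many more keys.
example = {"splitParams": {"ttPolicy": "SPLIT_SINGLE_DATASET"}}
result = use_two_dataset_split(example, "MasterData", "UpcomingData")
print(result["splitParams"])
```

In a notebook, the same three assignments would presumably be made directly on settings.get_raw()['splitParams'], followed by settings.save().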

MNOP
Level 3
Author

@AlexT, I didn't follow the solution you provided.
I see there is a method, set_split_explicit, that sets the train/test split to an explicit extract from one or two datasets.

I tried it as follows 
settings.get_split_params().set_split_explicit(train_selection,test_selection,
                                                         test_dataset_name="UpcomingData_Toronto") 
I'm not sure what is expected for the train_selection and test_selection arguments.
In the documentation it is given as below.
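For reference, here is a minimal sketch of what the two selection arguments might look like. It is based on my understanding of the dataikuapi client, not on the documentation excerpt mentioned above. Assumptions: a selection can be passed as a plain dict of sampling settings, {"samplingMethod": "FULL"} means "use all rows", and set_split_explicit also expects a dataset_name argument for the train dataset (the call above omits it). The helper name set_explicit_two_dataset_split is hypothetical.

```python
def set_explicit_two_dataset_split(settings, train_dataset, test_dataset):
    """Configure an explicit train/test split across two datasets and save it."""
    # Assumption: a plain dict is accepted as a selection, and "FULL"
    # requests every row of the dataset (no sampling).
    train_selection = {"samplingMethod": "FULL"}
    test_selection = {"samplingMethod": "FULL"}

    split_params = settings.get_split_params()
    split_params.set_split_explicit(
        train_selection,
        test_selection,
        dataset_name=train_dataset,       # assumed required: the train dataset
        test_dataset_name=test_dataset,   # a different name selects the two-dataset policy
    )
    settings.save()
    return train_selection, test_selection
```

With the objects from the question, this would be called as set_explicit_two_dataset_split(mltask.get_settings(), "MasterData", "UpcomingData_Toronto").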

 
