Using two different datasets to train a model in lab using API

Mohammed
Mohammed Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 43 ✭✭✭
edited July 16 in Using Dataiku

Hi,
I'm trying to train a model in the Lab using the API from a notebook.
I'm using the code below to set up the ML task. I'm currently using "MasterData" as my data. I want to use a different dataset, "UpcomingData", as my test data and "MasterData" as my train data (the "Explicit extract from two datasets" policy in the Design tab of the Lab).
How do I achieve this? What code changes do I need to make?
From a previous question in the community, I understood that I could use set_split_explicit on the ML task settings' split params, obtained via get_split_params.
Thanks in advance!

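import dataiku

client = dataiku.api_client()
p = client.get_default_project()  # assumption: the notebook's own project

mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)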

Operating system used: Windows

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker

    Hi,
    Indeed, you would need to start from split params that have already been created.

    The easiest way is probably to create another MLTask manually via the UI with the split policy you want, retrieve its settings with get_settings, and apply the changes you need, e.g. a different dataset name, or copy the whole settings.get_raw()['splitParams'] dict from that pre-created model.

    Once you have created an analysis, you can reuse the splitParams dict from another visual model.

    import dataiku
    client = dataiku.api_client()
    p = client.get_project("TUT_EXTERNAL_MODELS")
    #mltask = p.get_ml_task("<analysis_id>", "<mltask_id>")
    mltask = p.get_ml_task("8Nkypr4s", "AtOStv1l")
    settings = mltask.get_settings()
    settings.get_raw()['splitParams']['eftdTrain']['datasetSmartName'] = "train_split_b"
    settings.get_raw()['splitParams']['eftdTest']['datasetSmartName'] = "train_split_c"
    settings.save()
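
The same edit can be wrapped in a small helper that works on the raw settings dict. This is only a sketch: the helper name `use_two_dataset_split` is hypothetical, the `eftdTrain`/`eftdTest` keys are taken from the snippet above, and the `ttPolicy` value `"EXPLICIT_FILTERING_TWO_DATASETS"` is an assumption about how DSS names the two-dataset policy, so check it against the `splitParams` of a model configured in the UI. A plain dict stands in for `settings.get_raw()` so the snippet runs outside DSS.

```python
def use_two_dataset_split(raw_settings, train_dataset, test_dataset):
    """Point a raw MLTask settings dict at separate train and test datasets.

    Hypothetical helper; it only mutates the dict it is given.
    """
    split = raw_settings.setdefault("splitParams", {})
    # Assumed policy name for "explicit extract from two datasets".
    split["ttPolicy"] = "EXPLICIT_FILTERING_TWO_DATASETS"
    # Keep any existing selection/filter settings and only swap the dataset names.
    split.setdefault("eftdTrain", {})["datasetSmartName"] = train_dataset
    split.setdefault("eftdTest", {})["datasetSmartName"] = test_dataset
    return raw_settings


# Stand-in for settings.get_raw(); a real MLTask dict has many more keys.
raw = {"splitParams": {"ttPolicy": "SPLIT_SINGLE_DATASET"}}
use_two_dataset_split(raw, "MasterData", "UpcomingData")
print(raw["splitParams"])

# Inside DSS the same helper would be applied to the live settings:
#   settings = mltask.get_settings()
#   use_two_dataset_split(settings.get_raw(), "MasterData", "UpcomingData")
#   settings.save()
```

Because `get_raw()` returns the settings dict itself rather than a copy (the snippet above relies on this), mutating it and then calling `settings.save()` is what persists the change.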

    Thanks

  • Mohammed
    Mohammed Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 43 ✭✭✭

    @AlexT, I didn't follow the solution you provided.
    I see a method set_split_explicit to set the train/test split to an explicit extract from one or two dataset(s).

    I tried it as follows:
    settings.get_split_params().set_split_explicit(
        train_selection, test_selection,
        test_dataset_name="UpcomingData_Toronto")
    I'm not sure what is expected for the train_selection and test_selection arguments.
    In the documentation, it is described as below.
