Using two different datasets to train a model in lab using API

Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 44 ✭✭✭
edited July 2024 in Using Dataiku

Hi,
I'm trying to train a model in the Lab using the API from a notebook.
I'm using the code below to set up the ML task. I'm currently using "MasterData" as my data. I want to use a different dataset, "UpcomingData", as my test data and "MasterData" as my train data (the "Explicit extract from two datasets" policy in the Design tab of the Lab).
How do I achieve this? What code changes do I need to make?
From a previous question in the community, I understood that I could use set_split_explicit on the ML task settings' split params obtained via get_split_params.
Thanks in advance!

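import dataiku

client = dataiku.api_client()
# The original post did not show how p was obtained; the current project is assumed here
p = client.get_default_project()

mltask = p.create_prediction_ml_task(
    input_dataset="MasterData",
    target_variable="DV",
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='DEFAULT'  # Template to use for setting default parameters
)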

Operating system used: Windows

Answers

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,270 Dataiker

    Hi,
    Indeed, you would need to obtain split params that have already been created.

    The easiest way is probably to create another MLTask manually via the UI with the split you want, retrieve its settings with get_settings, and then apply the changes you need, e.g., a different dataset name, or reuse the whole settings.get_raw()['splitParams'] from the pre-created model.

    After you create an analysis, you can use the splitParams dict from another visual model.

    import dataiku

    client = dataiku.api_client()
    p = client.get_project("TUT_EXTERNAL_MODELS")

    # General form: p.get_ml_task("<analysis_id>", "<mltask_id>")
    mltask = p.get_ml_task("8Nkypr4s", "AtOStv1l")
    settings = mltask.get_settings()

    # Point the train and test extracts of the two-dataset split at the desired datasets
    split_params = settings.get_raw()['splitParams']
    split_params['eftdTrain']['datasetSmartName'] = "train_split_b"
    split_params['eftdTest']['datasetSmartName'] = "train_split_c"
    settings.save()

    Thanks
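
The idea in the answer above (reusing splitParams from a model configured in the UI) can be sketched as a small helper that works on the plain dicts returned by settings.get_raw(). This is only an illustration: the function name copy_split_params and its parameters are hypothetical, and the only keys taken from the answer are splitParams, eftdTrain, eftdTest and datasetSmartName.

```python
import copy


def copy_split_params(template_raw, target_raw, train_dataset=None, test_dataset=None):
    """Copy 'splitParams' from one raw settings dict into another.

    template_raw: raw settings of an ML task whose two-dataset split was set up in the UI.
    target_raw:   raw settings of the ML task to update (modified in place).
    train_dataset / test_dataset: optional dataset names to substitute.
    """
    # Deep copy so that later edits never leak back into the template settings
    split_params = copy.deepcopy(template_raw["splitParams"])
    if train_dataset is not None:
        split_params["eftdTrain"]["datasetSmartName"] = train_dataset
    if test_dataset is not None:
        split_params["eftdTest"]["datasetSmartName"] = test_dataset
    target_raw["splitParams"] = split_params
    return target_raw
```

Assuming template_settings and settings are the objects returned by get_settings() on the two ML tasks, a call such as copy_split_params(template_settings.get_raw(), settings.get_raw(), "MasterData", "UpcomingData") followed by settings.save() would apply the change.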

  • Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 44 ✭✭✭

    @AlexT, I didn't follow the solution you provided.
    I see a method set_split_explicit to set the train/test split to an explicit extract from one or two dataset(s).

    I tried it as follows:
    settings.get_split_params().set_split_explicit(train_selection, test_selection,
        test_dataset_name="UpcomingData_Toronto")
    I'm not sure what the train_selection and test_selection arguments are expected to be.
    In the documentation it is given as below.
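
The documentation excerpt itself is missing from this thread. What follows is a sketch, not the documentation text. It shows one way the set_split_explicit call might be wrapped. Several things here are assumptions to check against the API reference for your DSS version: that dataset_name is the keyword for the train dataset (only test_dataset_name appears in the post), that None is accepted for the two selection arguments, and that selections are otherwise dicts or DSSDatasetSelectionBuilder objects. The helper name use_two_dataset_split is hypothetical.

```python
def use_two_dataset_split(settings, train_dataset, test_dataset,
                          train_selection=None, test_selection=None):
    """Switch an ML task to an explicit train/test split over two datasets.

    settings: the object returned by mltask.get_settings().
    train_selection / test_selection: sampling settings for each extract;
    assumed to be optional (None), a dict, or a selection builder.
    """
    split_params = settings.get_split_params()
    split_params.set_split_explicit(
        train_selection,
        test_selection,
        dataset_name=train_dataset,      # assumed keyword for the train dataset
        test_dataset_name=test_dataset,  # keyword used in the post above
    )
    settings.save()
    return settings


# Possible usage inside DSS (not executed here; the builder import is an assumption):
#   from dataikuapi.dss.utils import DSSDatasetSelectionBuilder
#   all_rows = DSSDatasetSelectionBuilder().with_all_data_sampling()
#   use_two_dataset_split(mltask.get_settings(), "MasterData", "UpcomingData",
#                         train_selection=all_rows, test_selection=all_rows)
```

Keeping the Dataiku-specific imports out of the helper means it only relies on the get_split_params, set_split_explicit and save methods already mentioned in this thread.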
