Dataiku Scenario

Parul_ch (Partner, Registered Posts: 34)

While creating a scenario for partitioning a dataset on RunID, I'm getting a KeyError: 'RunID'. How can I rectify it?

Thanks,
Parul.

Answers

  • Alexandru (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226)

    Hi,

    Could you provide some more details on your scenario?

    1) Are you using a custom Python step?

    2) Can you share the exact code you are trying to use and the exact KeyError message you are getting?

    3) Is this a SQL or filesystem dataset?

    Thanks,

  • Parul_ch (Partner, Registered Posts: 34)

    Hi Alex,

    Yes, I'm using a custom Python script.

    The scenario that I'm creating:

    from dataiku.scenario import Scenario
    import dataiku

    scenario = Scenario()

    partitions_dataset = dataiku.Dataset("dumpdata_raw")
    partitions_list = list(partitions_dataset.get_dataframe()['RunID'].values)

    # Building a dataset
    scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)

    Attaching the error.

    Thanks,

    Parul.

  • Alexandru (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226)

    Hi Parul,

    The KeyError occurs because the dataset dumpdata_raw doesn't contain a column named RunID. If this is a filesystem dataset, please note that the partitioning column is removed when re-dispatching.

    You can list all partitions with partitions_list = dataset.list_partitions()


    from dataiku.scenario import Scenario
    import dataiku

    scenario = Scenario()

    dataset = dataiku.Dataset("input_dataset_name")
    partitions_list = dataset.list_partitions()  # partition identifiers, as strings

    scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)

    Let me know if that works for what you are trying to do.

  • Parul_ch (Partner, Registered Posts: 34)

    Hi Alex,

    Thanks for the RCA :). I've rectified my code to get the column RunID in the dataset dumpdata_raw.

    I created the scenario again to read the partitions; however, now I'm only able to get 1 partition (there are many more).

    How can I rectify this?

    Thanks,
    Parul.

  • Alexandru (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226)

    Can you share the code you are using to read the partitions?

    You will need to explicitly pass the partitions you want to read.

    Here is an example: https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/4995
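
    As a rough sketch of explicitly selecting partitions before reading (the dataset name below is an assumption, and this applies outside of a Python recipe, e.g. in a notebook or scenario step):

    import dataiku

    # The dataset name here is an assumption used for illustration
    dataset = dataiku.Dataset("dumpdata_raw")

    # list_partitions() returns every partition identifier of the dataset as strings
    all_partitions = dataset.list_partitions()

    # Explicitly register each partition to read before fetching the dataframe.
    # (Inside a Python recipe, the partitions to read are instead controlled by
    # the recipe's input configuration in the Flow.)
    for p in all_partitions:
        dataset.add_read_partitions(p)

    df = dataset.get_dataframe()
    print(len(df))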

  • Parul_ch (Partner, Registered Posts: 34)

    Hi Alex,

    What I was initially doing was saving the dump files of all the runs into a single non-partitioned dataset, and then using a Python recipe to re-dispatch them into a dataset partitioned by RunID.

    Now I've tried a different approach: I imported the dump files directly into the partitioned dataset. However, while running the scenario, it failed with this error: Exception: No column in schema of Project.my dataset. Have you set up the schema for this dataset?
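
    For reference, a minimal sketch of a Python recipe that writes the dataframe and defines the output dataset's schema at the same time (the dataset names are assumptions based on this thread, not the actual project):

    import dataiku

    # Dataset names below are assumptions used for illustration
    input_dataset = dataiku.Dataset("dumpdata_raw")
    df = input_dataset.get_dataframe()

    output_dataset = dataiku.Dataset("All_raw_dumps")

    # write_with_schema() writes the rows and also sets the output dataset's
    # schema from the dataframe's columns, which is what the
    # "No column in schema ... Have you set up the schema?" error refers to.
    output_dataset.write_with_schema(df)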

    Thanks,

    Parul.
