Dataiku Scenario

Parul_ch
Parul_ch Partner, Registered Posts: 34 Partner

While creating a scenario that partitions a dataset on RunID, I'm getting a KeyError: 'RunID'. How can I rectify it?

Thanks,
Parul.

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi,

    Could you provide some more details on your scenario?

    1) Are you using a custom Python step?

    2) Can you share the exact code you are trying to use and the full KeyError output?

    3) Is this a SQL or a filesystem dataset?

    Thanks,

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner

    Hi Alex

    Yes, I'm using a custom Python script.

    The scenario that I'm creating:

    from dataiku.scenario import Scenario
    import dataiku

    scenario = Scenario()

    partitions_dataset = dataiku.Dataset("dumpdata_raw")
    partitions_list = list(partitions_dataset.get_dataframe()['RunID'].values)

    # Building a dataset
    scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)

    Attaching the error.

    Thanks,

    Parul.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi Parul,

    The KeyError occurs because the dataset dumpdata_raw doesn't contain a column named RunID. If this is a filesystem dataset, please note that the partitioning column is removed from the data when redispatching.

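    You can list all partitions with dataset.list_partitions() and pass the result to the build step:

    from dataiku.scenario import Scenario
    import dataiku

    scenario = Scenario()

    # List every partition of the partitioned input dataset
    dataset = dataiku.Dataset("input_dataset_name")
    partitions_list = dataset.list_partitions()

    scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)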
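
    One thing to watch with the ",".join(...) call: it raises a TypeError if the partition identifiers are not strings (for example numeric RunID values), and identifiers taken from a dataframe column repeat once per row. Below is a minimal sketch of a helper that normalises the values first. The name build_partition_spec is hypothetical and not part of the Dataiku API; it is plain Python.

```python
def build_partition_spec(values):
    """Return a comma-separated partition spec from raw identifier values.

    Hypothetical helper: converts each value to a string, strips
    surrounding whitespace, drops empty entries and duplicates, and
    keeps the order of first appearance.
    """
    seen = set()
    identifiers = []
    for value in values:
        identifier = str(value).strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            identifiers.append(identifier)
    return ",".join(identifiers)
```

    The result can then be passed as the partitions argument, e.g. partitions=build_partition_spec(partitions_list).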

    Let me know if that works for what you are trying to do.

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner

    Hi Alex,

    Thanks for the RCA :). I've corrected my code so that the RunID column is now present in the dataset dumpdata_raw.

    I then recreated the scenario to read the partitions; however, I'm now only able to get one partition, although there are many more.

    How can I rectify this?

    Thanks,
    Parul.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Can you share the code you are using to read the partitions?

    You will need to explicitly pass the partitions you want to read.

    Here is an example: https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/4995
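
    As a rough sketch of what explicitly passing the partitions can look like, assuming the code runs in a notebook or a custom Python scenario step (inside a recipe, I believe the read partitions are set by the Flow instead): list_partitions(), add_read_partitions() and get_dataframe() are dataiku.Dataset methods, while read_each_partition and its dataset_factory parameter are hypothetical names used only for this illustration.

```python
def read_each_partition(dataset_name, dataset_factory=None):
    """Read a partitioned dataset one partition at a time.

    Returns a dict mapping each partition identifier to the data read
    for that partition. `dataset_factory` is a hypothetical hook that
    defaults to dataiku.Dataset.
    """
    if dataset_factory is None:
        import dataiku  # only available inside a DSS Python environment
        dataset_factory = dataiku.Dataset

    frames = {}
    for partition in dataset_factory(dataset_name).list_partitions():
        # Use a fresh handle per partition so that each read selects
        # exactly one partition.
        handle = dataset_factory(dataset_name)
        handle.add_read_partitions(partition)
        frames[partition] = handle.get_dataframe()
    return frames
```

    In DSS this would be called as read_each_partition("dumpdata_raw"); the returned dict then has one entry per partition instead of a single dataframe.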

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner

    Hi Alex,

    Initially, I was saving the dump files of all the runs in a single non-partitioned dataset and then using a Python recipe to partition them by RunID.

    I have now tried a different approach: I imported the dump files directly into the partitioned dataset. However, when I ran the scenario, it failed with this error: Exception: No column in schema of Project.my dataset. Have you set up the schema for this dataset?

    Thanks,

    Parul.
