Dataiku Scenario

Parul_ch · May 2022

While creating a scenario to partition a dataset on RunID, I'm getting a KeyError: 'RunID'. How can I fix it?

Thanks,
Parul.

Alexandru · May 2022

Hi,

Could you provide some more details on your scenario?

1) Are you using a custom Python step?

2) Can you share the exact code you are trying to use and the full KeyError output?

3) Is this an SQL or a filesystem dataset?

Thanks,

Parul_ch · May 2022

Hi Alex,

Yes, I'm using a custom Python script.

This is the scenario I'm creating:

from dataiku.scenario import Scenario
import dataiku

scenario = Scenario()

partitions_dataset = dataiku.Dataset("dumpdata_raw")
partitions_list = list(partitions_dataset.get_dataframe()['RunID'].values)

# Building a dataset
scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD',fail_fatal=False)

Attaching the error.

Thanks,

Parul.

Alexandru · May 2022

Hi Parul,

The KeyError occurs because the dataset dumpdata_raw doesn't contain a column named RunID. If this is a filesystem dataset, please note that the partitioning column is removed when redispatching.

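You can list all partitions with dataset.list_partitions() instead of reading the column:

from dataiku.scenario import Scenario
import dataiku

scenario = Scenario()

# Handle on the partitioned input dataset
dataset = dataiku.Dataset("input_dataset_name")
partitions_list = dataset.list_partitions()

# Build every listed partition of the output dataset
scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)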

Let me know if that works for what you are trying to do.
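
One more thing to watch for: the ",".join(...) in the build_dataset call assumes every partition identifier is already a string and appears only once. If the identifiers come from a dataframe column instead of list_partitions(), they may be numeric or repeated. Here is a minimal sketch of a helper that normalizes them first. The name build_partition_spec is hypothetical (it is not part of the dataiku API), and it assumes a plain comma-separated list of identifiers is the spec you want:

```python
def build_partition_spec(values):
    """Turn raw partition identifiers into a comma-separated spec string.

    Hypothetical helper, not part of the dataiku package. It converts each
    value to a string, drops blanks and duplicates, and keeps first-seen order.
    """
    seen = set()
    ordered = []
    for value in values:
        text = str(value).strip()
        if text and text not in seen:
            seen.add(text)
            ordered.append(text)
    return ",".join(ordered)


# Duplicates are collapsed and non-string values are converted.
print(build_partition_spec(["run_1", "run_2", "run_1"]))
print(build_partition_spec([101, 102]))
```

You would then pass partitions=build_partition_spec(partitions_list) to scenario.build_dataset.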

Parul_ch · May 2022

Hi Alex,

Thanks for the RCA. I've corrected my code so that the column RunId is present in the dataset dumpdata_raw.

I created the scenario again to read the partitions; however, now I only get 1 partition (there are many more).

How can I fix this?

Thanks,
Parul.

Alexandru · May 2022

Can you share the code you are using to read the partitions?

You will need to explicitly pass the partitions you want to read.

Here is an example: https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/4995
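
As a rough sketch of what "explicitly passing the partitions" can look like: create a fresh dataset handle per partition, select that partition with add_read_partitions, and then read it. The function name read_partitions_one_by_one and the make_dataset parameter are assumptions made for illustration so the loop can be read without a DSS instance; inside DSS you would pass dataiku.Dataset as the factory. Selecting read partitions manually is meant for scenarios and notebooks, since in a recipe they are set from the dependencies.

```python
def read_partitions_one_by_one(make_dataset, dataset_name, partition_ids):
    """Read each partition separately and return {partition_id: dataframe}.

    make_dataset is any callable returning a dataset handle that exposes
    add_read_partitions() and get_dataframe(); in DSS that would be
    dataiku.Dataset. The function itself is a hypothetical helper.
    """
    frames = {}
    for partition_id in partition_ids:
        # Use a fresh handle per partition so selections do not accumulate.
        handle = make_dataset(dataset_name)
        handle.add_read_partitions(partition_id)
        frames[partition_id] = handle.get_dataframe()
    return frames


# Usage inside DSS (assumed, not executed here):
#   import dataiku
#   ids = dataiku.Dataset("All_raw_dumps").list_partitions()
#   frames = read_partitions_one_by_one(dataiku.Dataset, "All_raw_dumps", ids)
```

If you only ever see one partition, it is worth checking whether the handle you read from had a single partition selected.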

Parul_ch · May 2022

Hi Alex,

Initially, I was saving the dump files of all the runs in a single non-partitioned dataset and then using a Python recipe to partition them by RunID.

Now I have tried a different approach: I imported the dump files directly into the partitioned dataset. However, when I ran the scenario, it failed with this error: Exception: No column in schema of Project.my dataset. Have you set up the schema for this dataset?

Thanks,

Parul.
