Dataiku Scenario

Parul_ch
Level 3

While creating a scenario for partitioning a dataset on RunID, I'm getting a KeyError: 'RunID'. How to rectify it.

Thanks,
Parul.

AlexT
Dataiker

Hi,

Could you provide some more details on your scenario?

1) Are you using a custom python step?

2) Can you share the exact code you are trying to use and the resulting KeyError?

3) Is this a SQL or filesystem dataset?

Thanks,
Parul_ch
Level 3
Author

Hi Alex,

Yes, I'm using a custom Python script.

The scenario that I'm creating: 

from dataiku.scenario import Scenario
import dataiku

scenario = Scenario()

# Read the partition identifiers from the RunID column of dumpdata_raw
partitions_dataset = dataiku.Dataset("dumpdata_raw")
partitions_list = list(partitions_dataset.get_dataframe()['RunID'].values)

# Build the partitioned dataset, passing the partitions as a comma-separated string
scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)

Attaching the error.

Thanks,

Parul.
AlexT
Dataiker

Hi Parul,

The KeyError is because the dataset dumpdata_raw doesn't contain a column named RunID. If this is a filesystem dataset, please note that the partitioning column is removed when re-dispatching.

You can list all partitions with partitions_list = dataset.list_partitions()

 


scenario = Scenario()

dataset = dataiku.Dataset("input_dataset_name")

# list_partitions() returns the partition identifiers defined on the dataset
partitions_list = dataset.list_partitions()

scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)

Let me know if that works for what you are trying to do.

Parul_ch
Level 3
Author

Hi Alex,

Thanks for the RCA :). I've rectified my code so that the RunID column is now present in the dumpdata_raw dataset.

I then created the scenario again to read the partitions; however, I'm only getting one partition back (there are many more).

How can I rectify this?

Thanks,
Parul.

AlexT
Dataiker

Can you share the code you are using to read the partitions?

You will need to explicitly pass the partitions you want to read. 

Here is an example: https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/499...
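For example, a rough sketch along these lines (using the dumpdata_raw dataset from earlier in this thread; add_read_partitions() restricts what get_dataframe() returns and must be called before the first read):

import dataiku

# Partition identifiers defined on the source dataset
partition_ids = dataiku.Dataset("dumpdata_raw").list_partitions()

for partition_id in partition_ids:
    # Re-instantiate the dataset so each read covers exactly one partition
    ds = dataiku.Dataset("dumpdata_raw")
    ds.add_read_partitions(partition_id)
    df = ds.get_dataframe()
    print(partition_id, df.shape)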

Parul_ch
Level 3
Author

Hi Alex, 

What I was initially doing was saving the dump files of all the runs in a single non-partitioned dataset, and then using a Python recipe to make them partitioned by RunID.

Now I've tried a different approach: I imported the dump files directly into the partitioned dataset. However, while running the scenario, it failed with this error: Exception: No column in schema of Project.my dataset. Have you set up the schema for this dataset?

Thanks,

Parul.
