Dataiku Scenario
While creating a scenario to partition a dataset on RunID, I'm getting a KeyError: 'RunID'. How can I fix it?
Thanks,
Parul.
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
Could you provide some more details on your scenario?
1) Are you using a custom Python step?
2) Can you share the exact code you are trying to use and the full KeyError output?
3) Is this an SQL or a filesystem dataset?
Thanks,
-
Hi Alex
Yes, I'm using a custom Python script.
The scenario that I'm creating:
from dataiku.scenario import Scenario
import dataiku
scenario = Scenario()
partitions_dataset = dataiku.Dataset("dumpdata_raw")
partitions_list = list(partitions_dataset.get_dataframe()['RunID'].values)
# Building a dataset
scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)
Attaching the error.
Thanks,
Parul.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi Parul,
The KeyError occurs because the dataset dumpdata_raw doesn't contain a column named RunID. If this is a filesystem dataset, please note that the partitioning column is removed when redispatching.
You can list all partitions with dataset.list_partitions() instead of reading the column:
from dataiku.scenario import Scenario
import dataiku
scenario = Scenario()
dataset = dataiku.Dataset("input_dataset_name")
partitions_list = dataset.list_partitions()
scenario.build_dataset("All_raw_dumps", partitions=",".join(partitions_list), build_mode='NON_RECURSIVE_FORCED_BUILD', fail_fatal=False)
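If the partition identifiers come from a column (as in the original script) rather than from list_partitions(), the values may be integers or may repeat across rows, and ",".join() raises a TypeError on non-string items. A minimal sketch of a helper that normalizes them first; the name to_partition_spec and the sample values are hypothetical and not part of the Dataiku API:

```python
def to_partition_spec(values):
    """Join partition identifiers into a comma-separated spec string.

    Each value is converted to str (",".join fails on ints) and duplicates
    are dropped while keeping first-seen order.
    """
    seen = set()
    ordered = []
    for value in values:
        text = str(value).strip()
        if text and text not in seen:
            seen.add(text)
            ordered.append(text)
    return ",".join(ordered)


# Hypothetical RunID values, e.g. read from a column with repeated rows.
spec = to_partition_spec([101, 102, 101, "103"])
print(spec)
```

The resulting string could then be passed as the partitions argument of scenario.build_dataset in place of ",".join(partitions_list).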
Let me know if that works for what you are trying to do.
-
Hi Alex,
Thanks for the RCA. I've corrected my code so that the column RunId is present in the dataset dumpdata_raw.
I created the scenario again to read the partitions; however, I'm now only getting 1 partition (there are many more).
How can I fix this?
Thanks,
Parul.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Can you share the code you are using to read the partitions?
You will need to explicitly pass the partitions you want to read.
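A rough sketch of what passing partitions explicitly can look like, assuming the code runs in a scenario Python step or a notebook (inside a recipe, partitions are normally chosen by the flow instead). The helper read_partitions and its dataset_factory parameter are hypothetical; the factory is expected to behave like dataiku.Dataset and offer list_partitions(), add_read_partitions() and get_dataframe():

```python
def read_partitions(dataset_factory, dataset_name):
    """Return {partition_id: dataframe}, reading one partition per handle.

    dataset_factory is assumed to behave like dataiku.Dataset: called with a
    dataset name, it returns an object with list_partitions(),
    add_read_partitions() and get_dataframe().
    """
    frames = {}
    for partition_id in dataset_factory(dataset_name).list_partitions():
        # Use a fresh handle per partition so read selections do not accumulate.
        handle = dataset_factory(dataset_name)
        handle.add_read_partitions(partition_id)
        frames[partition_id] = handle.get_dataframe()
    return frames


# Inside DSS (assumed usage, not executed here):
#   import dataiku
#   frames = read_partitions(dataiku.Dataset, "dumpdata_raw")
```

Taking the factory as a parameter keeps the sketch runnable outside DSS; in a real step you would pass dataiku.Dataset directly.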
This thread has a worked example: https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/4995
-
Hi Alex,
Initially, I was saving the dump files of all the runs in a single non-partitioned dataset and then using a Python recipe to partition them by RunID.
Now I've tried a different approach: I imported the dump files directly into the partitioned dataset. However, when I ran the scenario, it failed with this error: Exception: No column in schema of Project.my dataset. Have you set up the schema for this dataset?
Thanks,
Parul.