I work on a data flow where the last datasets of the flow are partitioned (HDFS on Hadoop).
Every time I make a change in the flow that changes the schema of the last datasets, the tables are dropped and I have to recalculate all partitions of those datasets, which takes a lot of time and resources.
This is still OK for now, since I'm in the design phase of the project, but once we launch the project in production we risk having to recalculate years of data, and the impact will be unbearable.
Is there any way to update the schema of a partitioned dataset, for example by adding the new column with empty values for all old partitions, and only recalculating the selected partitions?
I guess what I'm asking is: is there a way to update the schema of a partitioned dataset without dropping the whole dataset?
Thanks a lot for your help
Operating system used: Windows 10
I have managed to do this using Dataiku's API (see the screenshot below for the change in the number of columns across partitions).
Here is a piece of the code that I used to get the above result:
The dataiku module has the dictionary dku_flow_variables available when a Python recipe is run from the flow itself.
If the same code is run outside of the flow, dku_flow_variables won't be available.
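The availability check described above can be sketched as a small guard pattern. This is a runnable illustration, not the exact DSS code: a `SimpleNamespace` stands in for the real `dataiku` module so it works outside DSS, and `'DKU_DST_date'` assumes a partition dimension named `date` (the flow variable is named after your own dimension).

```python
from types import SimpleNamespace

# Stand-in for the real `dataiku` module so the pattern runs outside DSS
# (assumption: inside DSS you would simply `import dataiku`).
fake_dataiku = SimpleNamespace()

def destination_partition(dataiku_module, fallback):
    """Return the flow-provided destination partition when running inside
    the flow, otherwise the fallback chosen manually (the notebook case).
    'DKU_DST_date' assumes a partition dimension named 'date'."""
    flow_vars = getattr(dataiku_module, "dku_flow_variables", None)
    if flow_vars is not None:
        return flow_vars["DKU_DST_date"]
    return fallback

# Notebook-style call: no dku_flow_variables, so the fallback is used
print(destination_partition(fake_dataiku, "2023-01"))  # → 2023-01
```

The same function returns the flow-provided value when `dku_flow_variables` is present, which is why the snippet below only calls `set_write_partition` outside the flow.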
import dataiku

client = dataiku.api_client()
project = client.get_project(PROJECT_NAME)

input_dataset = dataiku.Dataset(INPUT_DATASET_NAME)
output_dataset_api = project.get_dataset(OUTPUT_DATASET_NAME)
output_dataset = output_dataset_api.get_as_core_dataset()

df = input_dataset.get_dataframe(infer_with_pandas=False)

# Setting the write partition explicitly is unnecessary inside a recipe (but
# required in a notebook!), as partitioning is configured by the flow.
# https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html#dataiku.Dataset.set_write_partition :
# --> "Setting the write partition is not allowed in Python recipes, where write is controlled by the Flow."
# See also: https://doc.dataiku.com/dss/latest/partitions/variables.html#python
# --> "Since read and write is done through Dataiku DSS, you don't need to specify the source or destination
#     partitions in your code for that, using "get_dataframe()" will automatically give you only the relevant
#     partitions. For other purposes than reading/writing dataframes, all variables are available in a
#     dictionary called dku_flow_variables in the dataiku module."
if not hasattr(dataiku, 'dku_flow_variables'):
    # running outside the flow (e.g. a notebook): pick the partition manually
    output_dataset.set_write_partition(PARTITION_NAME)

# write the dataframe (and its new schema) to the selected partition
output_dataset.write_with_schema(df)

output_dataset_api.compute_metrics()                          # compute metrics on the whole dataset
output_dataset_api.compute_metrics(partition=PARTITION_NAME)  # compute metrics on the partition
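For the original question (adding a new, empty column to old partitions without rebuilding them), the schema can also be extended directly through the public API. Below is a minimal sketch: the helper manipulates a DSS-style schema dict (`{'columns': [...]}`) in pure Python, and the commented lines show how it would be applied with `get_schema()`/`set_schema()`; the column name `"comment"` is just an example.

```python
def add_empty_column(schema, name, col_type="string"):
    """Return a copy of a DSS-style schema dict with one extra column
    appended. Existing partitions are left untouched and simply read
    as empty/NULL values for the new column."""
    return {
        **schema,
        "columns": schema["columns"] + [{"name": name, "type": col_type}],
    }

# Illustration with a toy schema:
old_schema = {"columns": [{"name": "id", "type": "bigint"}]}
new_schema = add_empty_column(old_schema, "comment")
print([c["name"] for c in new_schema["columns"]])  # → ['id', 'comment']

# Applied through the public API (assumption: client and the *_NAME
# placeholders are defined as in the snippet above):
# ds = client.get_project(PROJECT_NAME).get_dataset(OUTPUT_DATASET_NAME)
# ds.set_schema(add_empty_column(ds.get_schema(), "comment"))
```

Only the partitions you actually rebuild afterwards will carry real values in the new column; everything else stays on disk as-is.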
Hope this solves your problem!