
Update partitioned table schema without dropping old partitions

lnguyen

Hi community,

I work on a data flow whose final datasets are partitioned (HDFS on Hadoop).

Every time I make a change in the flow that changes the schema of the final datasets, the tables are dropped and I have to recalculate all of their partitions, which takes a lot of time and resources.

This is still acceptable for now since I'm in the design phase of the project, but once we launch in production we risk having to recalculate years of data, and the impact will be unbearable.

Is there any way to update the schema of the partitioned dataset, for example by adding the new column with empty values for all old partitions, and then recalculate only selected partitions?
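To make the idea concrete, here is a minimal sketch in plain Python of the behaviour I'm after. It is only an illustration, not a Dataiku or Hive API: the function `align_to_schema`, the sample rows, and the column names are all hypothetical. Rows from an old partition are read against the new schema, and any column they lack is filled with an empty value instead of forcing a rebuild.

```python
from typing import Any, Dict, List


def align_to_schema(rows: List[Dict[str, Any]], new_schema: List[str]) -> List[Dict[str, Any]]:
    """Return rows reshaped to new_schema; columns missing from a row become None."""
    # dict.get returns None when the old row has no value for the new column
    return [{col: row.get(col) for col in new_schema} for row in rows]


# Hypothetical old partition, written before the schema change
old_partition = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 12.5},
]

# Hypothetical new schema with one extra column
new_schema = ["id", "amount", "currency"]

# Old rows stay usable: the new column is simply empty for them
aligned = align_to_schema(old_partition, new_schema)
```

In this sketch the old data is never rewritten; only the partitions I explicitly rebuild would carry real values in the new column.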

I guess what I'm asking is whether there is a way to update the schema of a partitioned dataset without dropping the whole dataset.

Thanks a lot for your help.


Operating system used: Windows 10
