Schema Change and Keep Data

Darren_Gage · February 2020

Hi

I'd like to clarify the best way to update an existing recipe to a new column to an existing output dataset that contains historical data and I want to add a new column but retain the existing data.

Example:

Joining 2 datasets with a Join recipe, currently pull through 2 columns from each dataset.
Decide to pull through a 3rd column from one dataset to make 5 but want to keep all the existing data.

The usual "Schema change" message prompts for the output dataset to "Drop and recreate" but this obviously drops the data. Can this be deselected or would this cause issues?

I appreciate the slightly longer process would be to create a new Join with the 5 columns and then Stack both old and new datasets together, is this the preferred technique?

Thanks

antonstam · February 2020

Why do you want to retain the data? Would it take too long to re-compute the values, or are the original values no longer available in the input datasets? Where does the data reside?

Assuming you have a SQL table, you can process the change manually in the database, using ALTER TABLE and UPDATE commands. You can use SQL notebooks to interface directly with the database. Afterwards, you can load the updated schema in the Dataset.

If that's not an option, you can do some Flow wizardry, for example:

Create a new step after your current 4-column table where you join in the 5th column
Decouple your current 4-col table so that it becomes an orphan in your Flow and create a brand new table to take its place, and then use a Stack or Join recipe that combines their data

Darren_Gage · February 2020

Could take too long to reload as using flat files currently and data stored in HDFS only right now (this could be improved given time). Agree re the Stack option - cheers

Schema Change and Keep Data

Answers

Categories

Setup Info

Tags