
Update schema of existing hive partitioned tables (HDFS)

farhanromli
Level 3

Currently, my workplace has two servers for Designer nodes: one for the "lower environment" and another for "production".

When we want to deploy to Prod, we create the project bundle from the lower env and import it into the Prod Designer node.

Getting to my question:

In the lower env, when there is an existing partitioned table whose schema I need to update by adding new columns, I go to the Hive recipe, add the new column, and validate the recipe (see attachment manual hive recipe.png). I untick "Drop and recreate" and tick only "Synchronize metastore".

However, after importing to Prod, I am not allowed to do the same process manually. We only trigger the scenario, and when it reaches the Hive recipe, this is somehow managed automatically by DSS.

 

Synchronization script:

CREATE DATABASE IF NOT EXISTS `db_name`;
USE `db_name`;
DROP TABLE table_name;
CREATE EXTERNAL TABLE table_name(fields....)

From the log (snippet above), it seems the table is actually dropped and recreated. Normally I don't have an issue with this method, but I am concerned about tables with a large number of partitions. I noticed that since the table is recreated, DSS needs to resync the partitions as below, and this takes a long time because it runs an ALTER TABLE statement for every partition:

ALTER TABLE table_name DROP IF EXISTS PARTITION

As most of my tables are partitioned daily, the number of partitions grows significantly over time.

My question is: is there another way for me to update the schema in Prod without having DSS drop the table and resync the partitions? I was thinking I could execute the ALTER TABLE below before running the scenario in Prod:

ALTER TABLE table_name ADD COLUMNS (column_name datatype);

But I am not sure whether this will work in Prod, and I am not sure how to test it in the lower environment.
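One way to sanity-check the ALTER TABLE approach in the lower environment might be a throwaway partitioned table, outside of DSS, via beeline or the Hive CLI. This is only a sketch; the table, partition, and column names below are hypothetical placeholders, and it assumes direct Hive access in the lower env:

-- create a small external-style test table with one partition
CREATE TABLE test_schema_update (id INT) PARTITIONED BY (dt STRING);
ALTER TABLE test_schema_update ADD PARTITION (dt='2024-01-01');

-- add a column without dropping the table
ALTER TABLE test_schema_update ADD COLUMNS (new_col STRING);

-- verify the partition survived and the new column is present
SHOW PARTITIONS test_schema_update;
DESCRIBE test_schema_update;

-- clean up
DROP TABLE test_schema_update;

If SHOW PARTITIONS still lists dt='2024-01-01' after the ADD COLUMNS, the schema change did not touch the existing partitions.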

P.S. I am not that technically knowledgeable about Hive/HDFS; most of the time I let DSS handle things automatically.


Operating system used: Windows 10

1 Reply
farhanromli
Level 3
Author

After posting this, I found that someone has posted a similar question before:

Update partitioned table schema without dropping old partitions - Dataiku Community

I am not sure whether it is exactly the same scenario, but I don't fully understand the response to that question, and I would also prefer not to use the Dataiku API. Hence I will keep this question open.

 
