Hive partition question
Hi
I have created a Hive recipe which writes to a Hive table. We defined the partitioning on the output Hive dataset with the pattern `%{source}/%{country}/.*`. However, in the Hive script we wrote a SQL statement like:

`INSERT OVERWRITE TABLE <table_name> PARTITION (country='${country_name}', source='${source_name}') ...`

We triggered it via a scenario and it seems to work. However, I just want to make sure: is it really mandatory for the PARTITION dimensions to follow the same sequence as defined under the Hive dataset's partitioning? That is, is it mandatory to define our query as shown below?

`INSERT OVERWRITE TABLE <table_name> PARTITION (source='${source_name}', country='${country_name}') ...`
Regards
V.Krishna
Operating system used: linux
Answers
Hi @vkrishnachandar,

Thanks for posting.
>However just want to make sure is it really mandatory to have PARTITION dimension follow same sequence as given under hive dataset partition.
If the partitioning scheme in the output HDFS dataset looks like `%{source}/%{country}/.*`, are you asking whether it is legal to use `[...]PARTITION (country='${country_name}', source='${source_name}')[...]` in the Hive recipe (note: I have transposed the columns in the Hive recipe)?
Based on my own testing, it is legal to use a different order of partitioning columns in the Hive recipe than is defined in the output HDFS dataset's partitioning pattern. Note: the order specified in the output HDFS dataset's partitioning pattern is what determines the directory layout written to HDFS.
Here’s an example:
In the output HDFS dataset, my partitioning pattern is `%{num_cylinder}/%{test_col}/.*`.
In my Hive recipe, my PARTITION statement switches the order by placing `test_col` first.
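The recipe statement looked something like the sketch below (the exact text is in the screenshot; the input table name `dataiku.input_cars` is an assumption for illustration):

```sql
-- Sketch of the Hive recipe statement (input table name is assumed).
-- Note the PARTITION clause lists test_col before num_cylinder,
-- i.e. the reverse of the dataset's %{num_cylinder}/%{test_col}/.* pattern.
INSERT OVERWRITE TABLE dataiku.output2
PARTITION (test_col='test', num_cylinder='4')
SELECT car_id
FROM dataiku.input_cars
WHERE num_cylinder = '4' AND test_col = 'test';
```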
After running the Hive recipe, I look at the file structure in HDFS and observe that it uses the ordering I defined in the partitioning pattern of my output HDFS dataset (`%{num_cylinder}/%{test_col}/.*`):
```
[centos@hdp3 ~]$ hadoop fs -ls -R /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2
drwxr-xr-x   - hive dataiku   0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4
drwxr-xr-x   - hive dataiku   0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test
-rw-r--r--   1 hive dataiku  12 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test/000000_0
```
The output in Hive is correct:
```
0: jdbc:hive2://hdp3.local:2181/default> select * from dataiku.output2;
INFO  : Compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:output2.car_id, type:string, comment:null), FieldSchema(name:output2.num_cylinder, type:string, comment:null), FieldSchema(name:output2.test_col, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.074 seconds
INFO  : Executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2
INFO  : Completed executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.006 seconds
INFO  : OK
+-----------------+-----------------------+-------------------+
| output2.car_id  | output2.num_cylinder  | output2.test_col  |
+-----------------+-----------------------+-------------------+
| 5               | 4                     | test              |
| 6               | 4                     | test              |
+-----------------+-----------------------+-------------------+
2 rows selected (0.168 seconds)
```
In DSS, the output dataset is correct (partition columns are omitted when exploring datasets).
Here's the data in the input dataset:
```
~/Desktop$ cat test.csv
car_id,num_cylinder,test_col
1,6,"test"
2,6,"test"
3,8,"test"
4,8,"test"
5,4,"test"
6,4,"test"
```
Please let me know if I have misunderstood your question or if you observed behavior which made you curious about transposing columns in the PARTITION statement of your Hive script.
Thank you,
Mike