Hive partition question

vkrishnachandar Registered Posts: 1 ✭✭✭


i have created a hive recipe which writes to a hive table.

we have created partition on sequence %{source}/%{country}/.* on the output hive table.

However in the hive script we gave SQL statement something like.

INSERT OVERWRITE TABLE <table_name> PARTITION (country='${country_name}',source='${source_name}') ....

We triggered it via scenario it seems to work. However just want to make sure is it really mandatory to have PARTITION dimension follow same sequence as given under hive dataset partition.

Is it mandatory to have out query defined something like shown below.

INSERT OVERWRITE TABLE <table_name> PARTITION (source='${source_name}',country='${country_name}')...



Operating system used: linux



  • MikeG
    MikeG Dataiker, Registered Posts: 15 Dataiker

    Hi @vkrishnachandar

    Thanks for posting.

    >However just want to make sure is it really mandatory to have PARTITION dimension follow same sequence as given under hive dataset partition.

    If the partitioning scheme in the output HDFS dataset looks like `%{source}/%{country}/.*` are you asking if it is legal to use `[...]PARTITION (country=‘${country_name}’,source=‘${source_name}’)[...]` in the Hive recipe (note: I have transposed the columns in the Hive recipe.)?

    Based on my own testing it is legal to use a different order of partitioning columns in the Hive recipe than is defined in the output HDFS dataset’s partitioning pattern. Note: The order specified in the output HDFS dataset’s partitioning pattern is what gets written to HDFS.

    Here’s an example:

    In the output HDFS dataset my partitioning pattern is `%{num_cylinder}/%{test_col}/.*`:

    Screen Shot 2022-05-06 at 7.41.13 PM.png

    In my Hive recipe my PARTITION statement switches the order by placing `test_col` first:

    Screen Shot 2022-05-06 at 10.11.24 PM.png

    After running the Hive recipe I look at the the file structure in HDFS and I observe its using the ordering I defined in the partitioning pattern of my output HDFS dataset ( `%{num_cylinder}/%{test_col}/.*`):

    [centos@hdp3 ~]$ hadoop fs -ls -R /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2drwxr-xr-x - hive dataiku 0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4drwxr-xr-x - hive dataiku 0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test-rw-r--r-- 1 hive dataiku 12 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test/000000_0

    The output in Hive is correct:

    0: jdbc:hive2://hdp3.local:2181/default> select * from dataiku.output2;INFO : Compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2INFO : Semantic Analysis Completed (retrial = false)INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:output2.car_id, type:string, comment:null), FieldSchema(name:output2.num_cylinder, type:string, comment:null), FieldSchema(name:output2.test_col, type:string, comment:null)], properties:null)INFO : Completed compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.074 secondsINFO : Executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2INFO : Completed executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.006 secondsINFO : OK+-----------------+-----------------------+-------------------+| output2.car_id | output2.num_cylinder | output2.test_col |+-----------------+-----------------------+-------------------+| 5 | 4 | test || 6 | 4 | test |+-----------------+-----------------------+-------------------+2 rows selected (0.168 seconds)

    In DSS, the output dataset is correct (partition columns are omitted from exploring datasets:

    Screen Shot 2022-05-06 at 11.41.04 PM.png

    Here's the data in the input dataset:

    ~/Desktop$ cat test.csvcar_id,num_cylinder,test_col1,6,"test"2,6,"test"3,8,"test"4,8,"test"5,4,"test"6,4,"test"

    Please let me know if I have misunderstood your question or if you observed behavior which made you curious about transposing columns in the PARTITION statement of your Hive script.

    Thank you,

Setup Info
      Help me…