Hive partition question

vkrishnachandar
vkrishnachandar Registered Posts: 1 ✭✭✭

Hi

i have created a hive recipe which writes to a hive table.

we have created partition on sequence %{source}/%{country}/.* on the output hive table.

However in the hive script we gave SQL statement something like.

INSERT OVERWRITE TABLE <table_name> PARTITION (country='${country_name}',source='${source_name}') ....

We triggered it via scenario it seems to work. However just want to make sure is it really mandatory to have PARTITION dimension follow same sequence as given under hive dataset partition.

Is it mandatory to have out query defined something like shown below.

INSERT OVERWRITE TABLE <table_name> PARTITION (source='${source_name}',country='${country_name}')...

Regards

V.Krishna


Operating system used: linux

Tagged:

Answers

  • MikeG
    MikeG Dataiker, Registered Posts: 15 Dataiker
    edited July 17

    Hi @vkrishnachandar
    ,

    Thanks for posting.

    >However just want to make sure is it really mandatory to have PARTITION dimension follow same sequence as given under hive dataset partition.

    If the partitioning scheme in the output HDFS dataset looks like `%{source}/%{country}/.*` are you asking if it is legal to use `[...]PARTITION (country=‘${country_name}’,source=‘${source_name}’)[...]` in the Hive recipe (note: I have transposed the columns in the Hive recipe.)?

    Based on my own testing it is legal to use a different order of partitioning columns in the Hive recipe than is defined in the output HDFS dataset’s partitioning pattern. Note: The order specified in the output HDFS dataset’s partitioning pattern is what gets written to HDFS.

    Here’s an example:

    In the output HDFS dataset my partitioning pattern is `%{num_cylinder}/%{test_col}/.*`:

    Screen Shot 2022-05-06 at 7.41.13 PM.png

    In my Hive recipe my PARTITION statement switches the order by placing `test_col` first:

    Screen Shot 2022-05-06 at 10.11.24 PM.png

    After running the Hive recipe I look at the the file structure in HDFS and I observe its using the ordering I defined in the partitioning pattern of my output HDFS dataset ( `%{num_cylinder}/%{test_col}/.*`):

    [centos@hdp3 ~]$ hadoop fs -ls -R /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2
    drwxr-xr-x   - hive dataiku          0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4
    drwxr-xr-x   - hive dataiku          0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test
    -rw-r--r--   1 hive dataiku         12 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test/000000_0

    The output in Hive is correct:

    0: jdbc:hive2://hdp3.local:2181/default> select * from dataiku.output2;
    INFO  : Compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2
    INFO  : Semantic Analysis Completed (retrial = false)
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:output2.car_id, type:string, comment:null), FieldSchema(name:output2.num_cylinder, type:string, comment:null), FieldSchema(name:output2.test_col, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.074 seconds
    INFO  : Executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2
    INFO  : Completed executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.006 seconds
    INFO  : OK
    +-----------------+-----------------------+-------------------+
    | output2.car_id  | output2.num_cylinder  | output2.test_col  |
    +-----------------+-----------------------+-------------------+
    | 5               | 4                     | test              |
    | 6               | 4                     | test              |
    +-----------------+-----------------------+-------------------+
    2 rows selected (0.168 seconds)

    In DSS, the output dataset is correct (partition columns are omitted from exploring datasets:

    Screen Shot 2022-05-06 at 11.41.04 PM.png

    Here's the data in the input dataset:

     ~/Desktop$ cat test.csv
    car_id,num_cylinder,test_col
    1,6,"test"
    2,6,"test"
    3,8,"test"
    4,8,"test"
    5,4,"test"
    6,4,"test"

    Please let me know if I have misunderstood your question or if you observed behavior which made you curious about transposing columns in the PARTITION statement of your Hive script.

    Thank you,
    Mike

Setup Info
    Tags
      Help me…