
Hive partition question

vkrishnachandar
Level 1

Hi,

I have created a Hive recipe which writes to a Hive table.

We have defined the partitioning on the output Hive dataset with the pattern %{source}/%{country}/.*

However, in the Hive script we wrote a SQL statement something like:

INSERT OVERWRITE TABLE <table_name> PARTITION (country='${country_name}',source='${source_name}') ....

We triggered it via a scenario and it seems to work. However, I just want to make sure: is it really mandatory for the PARTITION dimensions to follow the same sequence as given under the Hive dataset partitioning?

That is, is it mandatory to define our query something like shown below?

INSERT OVERWRITE TABLE <table_name> PARTITION (source='${source_name}',country='${country_name}')...

 Regards

V.Krishna


Operating system used: linux

1 Reply
MikeG
Dataiker

Hi @vkrishnachandar ,

Thanks for posting.

>However just want to make sure is it really mandatory to have PARTITION dimension follow same sequence as given under hive dataset partition.

If the partitioning scheme in the output HDFS dataset looks like `%{source}/%{country}/.*`, are you asking whether it is legal to use `[...]PARTITION (country='${country_name}',source='${source_name}')[...]` in the Hive recipe (note: I have transposed the columns in the Hive recipe)?

Based on my own testing, it is legal to use a different order of partitioning columns in the Hive recipe than is defined in the output HDFS dataset's partitioning pattern. Note: the order specified in the output HDFS dataset's partitioning pattern is what gets written to HDFS.
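To make that concrete, here is a minimal HiveQL sketch (the table, columns, and values are hypothetical, modeled on the test data below) showing that a static PARTITION spec is matched by column name, not by position:

```sql
-- Hypothetical table: partitioned by num_cylinder first, then test_col
CREATE TABLE output2 (car_id STRING)
PARTITIONED BY (num_cylinder STRING, test_col STRING);

-- Both statements below are accepted: Hive matches the entries of a
-- static PARTITION spec by column name, so their order does not have
-- to follow the PARTITIONED BY order...
INSERT OVERWRITE TABLE output2 PARTITION (num_cylinder='4', test_col='test')
SELECT car_id FROM input_cars;

INSERT OVERWRITE TABLE output2 PARTITION (test_col='test', num_cylinder='4')
SELECT car_id FROM input_cars;

-- ...and in both cases the files land under the directory layout
-- defined by the table (and by the DSS partitioning pattern):
--   .../output2/num_cylinder=4/test_col=test/
```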

Here’s an example:

In the output HDFS dataset my partitioning pattern is `%{num_cylinder}/%{test_col}/.*`:

[Screenshot: Screen Shot 2022-05-06 at 7.41.13 PM.png]

In my Hive recipe my PARTITION statement switches the order by placing `test_col` first:

[Screenshot: Screen Shot 2022-05-06 at 10.11.24 PM.png]

After running the Hive recipe I look at the file structure in HDFS and I observe it is using the ordering I defined in the partitioning pattern of my output HDFS dataset (`%{num_cylinder}/%{test_col}/.*`):
```
[centos@hdp3 ~]$ hadoop fs -ls -R /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2
drwxr-xr-x   - hive dataiku          0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4
drwxr-xr-x   - hive dataiku          0 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test
-rw-r--r--   1 hive dataiku         12 2022-05-07 02:35 /user/dataiku/dss_managed_datasets/COMMUNITY_24946/output2/num_cylinder=4/test_col=test/000000_0
```
The output in Hive is correct:


```
0: jdbc:hive2://hdp3.local:2181/default> select * from dataiku.output2;
INFO  : Compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:output2.car_id, type:string, comment:null), FieldSchema(name:output2.num_cylinder, type:string, comment:null), FieldSchema(name:output2.test_col, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.074 seconds
INFO  : Executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7): select * from dataiku.output2
INFO  : Completed executing command(queryId=hive_20220507051315_1b868675-e01b-45f1-b924-c45f8a1cb8c7); Time taken: 0.006 seconds
INFO  : OK
+-----------------+-----------------------+-------------------+
| output2.car_id  | output2.num_cylinder  | output2.test_col  |
+-----------------+-----------------------+-------------------+
| 5               | 4                     | test              |
| 6               | 4                     | test              |
+-----------------+-----------------------+-------------------+
2 rows selected (0.168 seconds)
```


In DSS, the output dataset is correct (partition columns are omitted when exploring datasets):

[Screenshot: Screen Shot 2022-05-06 at 11.41.04 PM.png]

Here's the data in the input dataset:


```
~/Desktop$ cat test.csv
car_id,num_cylinder,test_col
1,6,"test"
2,6,"test"
3,8,"test"
4,8,"test"
5,4,"test"
6,4,"test"
```


Please let me know if I have misunderstood your question or if you observed behavior which made you curious about transposing columns in the PARTITION statement of your Hive script.

Thank you,
Mike
