
Create partitions on Hive data

Level 3

I have a Hive dataset containing raw data, with a timestamp stored as a string type.

 

[Screenshot: hive.PNG — the raw dataset]

The data in the table looks like this:

[Screenshot: data.PNG]

I want to partition this table on the timestamp column, at the day level, into Parquet files.

 

How can we do this?

 

5 Replies
Level 1

1) Go to the dataset's connection settings and click "Activate partitioning".

2) In "Configure dimensions", select dimension type "Time range".

3) In "Range", choose "Day".

In the SQL query, change the date format, e.g.:

select *
from example
where DATE_FORMAT(effective_dt, 'yyyy-MM-dd') = '${day}'

In the query listing partitions:

SELECT day FROM (
  SELECT DATE_FORMAT(effective_dt, 'yyyy-MM-dd') AS day FROM example
) subq
GROUP BY day;

In "Partition for preview", give one date, e.g. 2017-12-31.
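To make the mechanism concrete: the ${day} variable in the query above is substituted at build time with the identifier of the partition being built. Assuming you build the 2017-12-31 partition, the query that presumably reaches Hive would look like this:

```sql
-- Hypothetical expansion of the dependent query for one partition:
-- '${day}' is replaced by the partition identifier being built.
SELECT *
FROM example
WHERE DATE_FORMAT(effective_dt, 'yyyy-MM-dd') = '2017-12-31'
```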

 

 

Level 3
Author

The query is not returning any data.

[Screenshot: query.PNG]

 

The time_stamp column is in string format; what should we do to convert it to a date?

Level 5

So, just to be sure: besides partitioning the data at the day level, do you want to move it into Parquet files, or just do the DSS partitioning?

In any case, the solution given by @GMR is correct once you deal with the date-parsing issue: you first have to transform the time_stamp string into a date.
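As a sketch of that conversion in HiveQL, assuming the time_stamp string looks like '2017-12-31 10:15:00' (adjust the pattern to your actual format), the where-clause from @GMR's query could become:

```sql
-- Parse the string to a Unix timestamp, then format it back to a day string.
-- The pattern 'yyyy-MM-dd HH:mm:ss' is an assumption about your data.
SELECT *
FROM example
WHERE DATE_FORMAT(
        FROM_UNIXTIME(UNIX_TIMESTAMP(time_stamp, 'yyyy-MM-dd HH:mm:ss')),
        'yyyy-MM-dd'
      ) = '${day}'
```

The same parsing would also need to go into the partition-listing query, so both agree on the day values.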

 

Level 3
Author

I want to partition the table and move the data into Parquet files.

 

If I use a Prepare recipe to parse the date, can I set the output as a partitioned dataset stored in an HDFS managed folder?

Level 5

Hi @sasidharp ,

Sorry for the delayed answer. Here is a recording of the process I followed to do what you are asking for (a video is worth 1,000 pictures, and a picture is worth...):

There are a couple of things to notice:

1) The dataset is a copy of your screenshot, so there is very little data!

2) When I create the Prepare recipe to parse the date, I choose an HDFS dataset for the output (it will later be synchronized automatically to the Hive metastore), and then I select CSV (Hive). That is only because our Hadoop ecosystem does not have the Parquet libraries installed; you should be able to select Parquet instead.

3) When I run the recipe, you can see that I select one specific day at random. For some reason I can't explain (any Dataikuers around who could help?), that is enough to generate the partitions for all days in your source data. You can verify this at the end.

If something fails, I'm not sure I'd be able to help, and I'd recommend asking Dataiku support directly. With HDFS and Hive, every installation has its own specificities.

Hope this helps!