Create Partitions on hive data.
I have a Hive Dataset consisting RAW DATA with timestamp in string type
RAW DATA
Data table in it is as below
I want to partition this table based on the time stamp column. to the day level.into parquet files.
how can we do it?
Best Answer
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
Hi @sasidharp
,Sorry for the delayed answer. Found here a recording of the process I followed to do what you are asking for (a video is worth 1000 pictures and a picture is worth...):
There is a couple of things to notice:
1) The dataset is a copy of your screenshot, so there is very few data!
2) When I create the preparation recipe to parse the data, in the output I choose an HDFS dataset (later it will be synchronized automatically to the Hive metastore), and then I select CSV (Hive): this is only the case because our hadoop ecosystem doesn't have installed the parquet libraries. But you should be able to select Parquet instead.
3) When I run the recipe, you can see that I select a random specific day. For some reason I can't explain (some dataikuer around that could help?) that is enough to generate all the partitions for all days in your source data. You can check that at the end.
If something fails, I'm not sure I would be able to help, and I would recommend to ask Dataiku directly for support. With HDFS and Hive there are always specificities for each installation.
Hope this helps!
Answers
-
1) GO to connections and click on activate partitioning
2)In Configure dimensions select Dimension type as Time Range
3) In Range Choose Day
In SQL Query change date format
Eg: select *
from example
where DATE_FORMAT(effective_dt, 'yyyy-MM-dd') = '${day}'In Query listing partitions:
SELECT day FROM (
SELECT DATE_FORMAT(effective_dt, 'yyyy-MM-dd') as day FROM example) subq group by day;In Partition for preview
give one date eg: 2017-12-31
-
The Query is not giving any data
the time_stamp is in string format, what should we do to change it to date ?
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
So, just to be sure, besides partitioning the data at the day level, you want to move it into parquet files? Or just do the DSS partition?
In any case the solution given by @GMR
is perfect, once you deal with the Date parsing issue: you have to transform first the time_stamp string into a Date. -
I want to partition the table and move them into paraquet files,
if i use prepare recipe to parse the date, can i give the output as partitioned dataset? stored to hdfs managed folder.