Create partitions on Hive data

sasidharp Registered Posts: 27 ✭✭✭✭

I have a Hive dataset, RAW DATA, whose timestamp column is stored as a string.

[Screenshot: hive.PNG — the RAW DATA dataset]

The data in the table looks like this:

[Screenshot: data.PNG]

I want to partition this table on the timestamp column, at the day level, into Parquet files.

How can we do this?

Best Answer

  • Ignacio_Toledo Neuron, Registered Posts: 411
    Answer ✓

    Hi @sasidharp,

    Sorry for the delayed answer. You'll find here a recording of the process I followed to do what you are asking for (a video is worth 1000 pictures, and a picture is worth...):

    There are a couple of things to notice:

    1) The dataset is a copy of your screenshot, so there is very little data!

    2) When I create the preparation recipe to parse the data, I choose an HDFS dataset as the output (it will later be synchronized automatically to the Hive metastore), and then I select CSV (Hive) as the format: that is only because our Hadoop ecosystem doesn't have the Parquet libraries installed. You should be able to select Parquet instead (see the sketch after this list).

    3) When I run the recipe, you can see that I select one specific day. For some reason I can't explain (is there a Dataiker around who could help?), that is enough to generate the partitions for all days in your source data. You can check that at the end.
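    For reference, once the partitioned HDFS dataset is synchronized to the Hive metastore, the table definition ends up looking roughly like the sketch below. This is only an illustration with placeholder names (raw_data_partitioned, the columns and the LOCATION path are not what DSS will actually generate); the point is the PARTITIONED BY clause, one value per day, and the storage format:

    -- hypothetical DDL for a day-partitioned, Parquet-backed table
    -- (all names are placeholders; the metastore sync creates the real one)
    CREATE EXTERNAL TABLE raw_data_partitioned (
      id         STRING,
      value      DOUBLE,
      time_stamp STRING
    )
    PARTITIONED BY (day STRING)   -- one partition per day
    STORED AS PARQUET             -- or a CSV serde if Parquet isn't available
    LOCATION '/user/dss/raw_data_partitioned';

    -- in Hive's convention each partition then maps to a directory such as
    -- .../raw_data_partitioned/day=2017-12-31/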

    If something fails, I'm not sure I would be able to help, and I would recommend asking Dataiku support directly. With HDFS and Hive there are always specificities to each installation.

    Hope this helps!

Answers

  • GMR Registered Posts: 2 ✭✭✭✭

    1) Go to the dataset's connection settings and click Activate partitioning.

    2) In Configure dimensions, set the Dimension type to Time range.

    3) In Range, choose Day.

    In the SQL query, change the date format, e.g.:

    select *
    from example
    where DATE_FORMAT(effective_dt, 'yyyy-MM-dd') = '${day}'

    In the query listing partitions:

    SELECT day FROM (
      SELECT DATE_FORMAT(effective_dt, 'yyyy-MM-dd') AS day FROM example
    ) subq
    GROUP BY day;

    In Partition for preview, give one date, e.g. 2017-12-31.
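    To make the variable substitution concrete: when DSS builds the partition entered in Partition for preview (2017-12-31 above), the ${day} placeholder in the partitioning query is replaced by that value, so the query that actually runs should be roughly:

    select *
    from example
    where DATE_FORMAT(effective_dt, 'yyyy-MM-dd') = '2017-12-31'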

  • sasidharp Registered Posts: 27 ✭✭✭✭

    The query is not returning any data.

    [Screenshot: query.PNG]

    The time_stamp column is in string format; what should we do to change it to a date?

  • Ignacio_Toledo Neuron, Registered Posts: 411

    So, just to be sure: besides partitioning the data at the day level, do you want to move it into Parquet files, or just do the DSS partitioning?

    In any case, the solution given by @GMR is perfect once you deal with the date parsing issue: you first have to transform the time_stamp string into a Date.
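    If you want to handle that parsing directly inside the partitioning queries from the earlier answer, one option is to convert the string on the fly. This is only a sketch: raw_data is a placeholder table name and 'yyyy-MM-dd HH:mm:ss' is an assumption about how your time_stamp strings look, so adjust the pattern to the real format:

    -- partition dependency query: parse the string, then compare against ${day}
    SELECT *
    FROM raw_data
    WHERE from_unixtime(unix_timestamp(time_stamp, 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd') = '${day}'

    -- query listing partitions: same parsing, grouped by day
    SELECT day FROM (
      SELECT from_unixtime(unix_timestamp(time_stamp, 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd') AS day
      FROM raw_data
    ) subq
    GROUP BY day;

    Alternatively, parsing the date upstream in a Prepare recipe (as in the recording above) achieves the same thing.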

  • sasidharp Registered Posts: 27 ✭✭✭✭

    I want to partition the table and move the data into Parquet files.

    If I use a Prepare recipe to parse the date, can I set the output as a partitioned dataset stored in an HDFS managed folder?
