Hive Big Data

Houssam_2000
Houssam_2000 Dataiku DSS Core Designer, Registered Posts: 4

Hello,

I have been facing an issue for a while now with a big table stored in Hive. I have tried to read it using multiple methods, but none of them works (it keeps loading with no result):

- I tried using plain Python pandas but had no success, then used an SQL pre-filter on a date column to get only the max(date) rows, and also tried Spark, but that failed too.

Does anyone have an idea how to successfully read this dataset, knowing that I only need to keep the max(date) rows? I appreciate your responses.

Thank you and have a good day.

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    We need more details to help you out. Post details about the dataset (how many rows and how many columns), post your Python code, and post the different errors you get when trying to read it with each method. Thanks

  • Houssam_2000
    Houssam_2000 Dataiku DSS Core Designer, Registered Posts: 4
    edited July 17

    Thanks for your reply, here are more details about the data:
    - it is a Hive table with 280 columns and 100M rows.

    - trying to read the table using pandas:

    import dataiku

    # Read recipe inputs
    my_data = dataiku.Dataset("my_dataset")
    my_data_df = my_data.get_dataframe()
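One way to avoid loading everything at once is to process the table chunk by chunk and keep only the rows matching the running max(date). A minimal sketch of that pattern, using small synthetic DataFrames as stand-ins for the chunks (in DSS the chunks would come from something like `my_data.iter_dataframes(chunksize=...)`; the column name `date_creation` is taken from the query below, the rest is illustrative):

```python
import pandas as pd

# Synthetic stand-in for the chunk iterator; in DSS these would come from
# my_data.iter_dataframes(chunksize=...) instead of a hard-coded list.
chunks = [
    pd.DataFrame({"date_creation": ["2024-01-01", "2024-03-01"], "v": [1, 2]}),
    pd.DataFrame({"date_creation": ["2024-03-01", "2024-02-01"], "v": [3, 4]}),
]

best_date = None
kept = []
for chunk in chunks:
    chunk_max = chunk["date_creation"].max()
    if best_date is None or chunk_max > best_date:
        # A newer date appeared: discard everything kept so far.
        best_date = chunk_max
        kept = [chunk[chunk["date_creation"] == best_date]]
    elif chunk_max == best_date:
        # Same max date: keep the matching rows from this chunk too.
        kept.append(chunk[chunk["date_creation"] == best_date])

result = pd.concat(kept, ignore_index=True)
print(result)
```

This keeps at most one chunk's worth of candidate rows in memory at a time, rather than the full 100M-row table.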

    - after that, I tried to filter the data in Settings -> Connection -> SQL query using this SQL command:

    SELECT * FROM my_table
    WHERE date_creation=(SELECT max(date_creation) FROM my_table)

    when trying to run the Python recipe with this pre-filter, the log has been stuck on this message for hours:

    [11:29:00] [INFO] [dip.input.sql] running compute_9e3LzF5i_NP - Executing detect statement : SELECT * FROM (SELECT * FROM my_table
    WHERE date_creation==(SELECT max(date_creation) FROM my_table)) `subQuery`
    LIMIT 1
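The correlated subquery above forces Hive to compute `max(date_creation)` against the full table and then scan it a second time for the filter, which can be very slow. An alternative worth trying is a window function, which tags every row with the global max in a single pass. A sketch of the rewritten query, demonstrated here against an in-memory SQLite database as a small stand-in (the `MAX(...) OVER ()` syntax is the same in Hive; table and column names follow the post above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date_creation TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("2024-01-01", 1), ("2024-03-01", 2), ("2024-03-01", 3)],
)

# Single-pass alternative to the correlated subquery: attach the global
# max(date_creation) to every row via a window function, then keep matches.
rows = conn.execute("""
    SELECT date_creation, val FROM (
        SELECT *, MAX(date_creation) OVER () AS max_date
        FROM my_table
    ) t
    WHERE date_creation = max_date
""").fetchall()
print(rows)
```

Whether this actually runs faster depends on the Hive version and execution engine, so it is a candidate to benchmark rather than a guaranteed fix.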

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    That is a very big table; are you sure your server has enough RAM to handle this data? When you read with pandas you are effectively loading the entire result set into the pandas DataFrame in memory. Can you please paste the error messages you get?
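A rough back-of-envelope estimate of what loading the full table into pandas would cost, assuming ~8 bytes per cell (a numeric dtype; object/string columns in pandas typically cost far more):

```python
rows = 100_000_000      # 100M rows, from the post above
cols = 280              # 280 columns, from the post above
bytes_per_cell = 8      # assumption: 64-bit numeric dtype per cell

gib = rows * cols * bytes_per_cell / 2**30
print(f"~{gib:.0f} GiB")  # roughly 209 GiB
```

Even under this optimistic assumption the DataFrame alone would need on the order of 200 GiB of RAM, which is why a full `get_dataframe()` read of this table is unlikely to ever finish on a typical server.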
