Hive Big Data

Houssam_2000

Hello,

I have been facing an issue for a while now with a big table stored in Hive. I am trying to read it using multiple methods, but none of them works (it keeps loading with no result):

- I tried reading it with plain Python pandas but had no success, then used an SQL pre-filter on a date column to get only the max(date) rows, and also tried Spark, but both failed.


Does anyone have an idea on how to successfully read this dataset, knowing that I only need to keep the max(date) rows? I appreciate your responses.
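For reference, the intended filter (keeping only the rows with the latest date) can be sketched in plain pandas on a toy frame; the column names here are illustrative, not taken from the real table:

```python
import pandas as pd

# Toy data; "date_creation" mirrors the date column described in this thread.
df = pd.DataFrame({
    "date_creation": ["2023-01-01", "2023-01-02", "2023-01-02"],
    "value": [1, 2, 3],
})

# Keep only the rows carrying the maximum date.
latest = df[df["date_creation"] == df["date_creation"].max()]
print(latest["value"].tolist())  # → [2, 3]
```

This works in memory once the data fits; the hard part with a 100M-row table is getting it small enough to load at all.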

Thank you and have a good day.

Turribeach

We need more details to help you out. Post details about the dataset: how many rows and how many columns. Post your Python code, and post the different errors you get when trying to read it with each method. Thanks

Houssam_2000

Thanks for your reply. Here are more details about the data:
- It is a Hive table with 280 columns and 100M rows.

- Trying to read the table using pandas:

import dataiku

# Read the recipe input into a pandas DataFrame
my_data = dataiku.Dataset("my_dataset")
my_data_df = my_data.get_dataframe()

- After that, I tried to pre-filter the data under Settings -> Connection -> SQL Query, using the SQL command:

SELECT * FROM my_table
WHERE date_creation=(SELECT max(date_creation) FROM my_table)

When running the Python recipe with this pre-filter, the log has been stuck on this message for hours:

[11:29:00] [INFO] [dip.input.sql] running compute_9e3LzF5i_NP - Executing detect statement : SELECT * FROM (SELECT * FROM my_table
WHERE date_creation==(SELECT max(date_creation) FROM my_table)) `subQuery`
LIMIT 1
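The pre-filter itself is logically sound; here is the same subquery pattern reproduced on SQLite with toy data (table and column names are illustrative) to show the expected result. On Hive, though, the uncorrelated `max(date_creation)` subquery still requires an extra full scan of the 100M-row table, which is a plausible reason the detect statement takes so long:

```python
import sqlite3

# Illustrative only: same WHERE-subquery pattern as the Hive pre-filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date_creation TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("2023-01-01", 1), ("2023-01-02", 2), ("2023-01-02", 3)],
)

rows = conn.execute(
    "SELECT value FROM my_table "
    "WHERE date_creation = (SELECT max(date_creation) FROM my_table)"
).fetchall()
print(rows)  # → [(2,), (3,)]
```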


Turribeach

That is a very big table. Are you sure your server has enough RAM to handle this data? When you read with pandas, you are effectively loading the entire result set into the pandas DataFrame. Can you please paste the error messages you get?
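A rough back-of-the-envelope check makes the RAM concern concrete. Assuming only 8 bytes per cell (i.e. purely numeric columns; strings cost considerably more), the unfiltered table would need on the order of 200 GiB in memory before any pandas overhead:

```python
# Back-of-the-envelope memory estimate for the full table described above.
rows, cols, bytes_per_cell = 100_000_000, 280, 8  # 8 bytes assumes numeric dtypes
gib = rows * cols * bytes_per_cell / 2**30
print(f"~{gib:.0f} GiB")  # → ~209 GiB
```

So the filter really has to happen on the Hive side (or in chunks) before anything reaches pandas.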
