Empty Hive table when processing with pyspark

Houssam_2000

Hello,

I have a problem with a Hive table:

- When I process the table with PySpark (e.g. df.count()), I get 0 rows, i.e. an empty DataFrame.

- When I investigate with a Hive query (SELECT COUNT(*) FROM table), I see all the data in the table.

 

Does anyone have a solution, or know why it behaves like this?

Thank you 

AlexT
Dataiker

Hi @Houssam_2000 ,

How are you creating the df? Please try to reproduce the issue in a test PySpark recipe and share the job diagnostics over a support ticket.

I couldn't reproduce the issue you mentioned:


import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the Spark context provided by the DSS job
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
dataset = dataiku.Dataset("dataset_name")
df = dkuspark.get_dataframe(sqlContext, dataset)

print(df.count())

This returns the expected number of rows in a PySpark notebook.
