Empty Hive table when processing with pyspark

Houssam_2000

Hello,

I have a problem with a Hive table:

- When I process the table with PySpark (e.g. df.count()), I get 0 rows, i.e. an empty DataFrame.

- When I investigate with a Hive query (SELECT COUNT(*) FROM table), I see all the data in the table.

 

Does anyone have a solution, or know why it behaves like this?

Thank you 

AlexT
Dataiker

Hi @Houssam_2000 ,

How are you creating the df? Please try to reproduce the issue in a test PySpark recipe and share the job diagnostics over a support ticket.

I couldn't reproduce the issue you mentioned:


import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the Spark context provided by the DSS job
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
dataset = dataiku.Dataset("dataset_name")
df = dkuspark.get_dataframe(sqlContext, dataset)

print(df.count())

This returns the expected number of rows in a PySpark notebook.
