Empty Hive table when processing with pyspark
Hello,
I have a problem with a Hive table:
- When I try to process the table using PySpark (e.g. df.count()), I get 0 rows, i.e. an empty DataFrame.
- When I investigate with a Hive query (SELECT COUNT(*) FROM table), I get all the data in the table.
Does anyone have a solution, or know why it behaves like this?
Thank you
Answers
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker
Hi @Houssam_2000,
How are you creating the df? Please try to reproduce the issue in a test PySpark recipe and share the job diagnostics via a support ticket.
I couldn't reproduce the issue mentioned:
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
dataset = dataiku.Dataset("dataset_name")
df = dkuspark.get_dataframe(sqlContext, dataset)
print(df.count())
This returns the expected number of rows in a PySpark notebook.
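As an additional cross-check (this is a sketch, not part of the original answer), you can query the Hive table directly through a Hive-enabled Spark session and compare the count with what Hive itself reports. If the counts differ at this level too, the problem is in Spark's view of the Hive metastore (e.g. missing partition metadata or serde settings) rather than in the Dataiku dataset layer. The table name "my_db.my_table" is a placeholder; replace it with your own database.table.

```python
def hive_count_sql(table):
    """Build a COUNT(*) statement for a fully-qualified Hive table name."""
    return "SELECT COUNT(*) FROM {}".format(table)

def count_via_spark(table):
    # Imported here so the helper above stays usable without pyspark installed.
    # Assumes Spark was built with Hive support and can reach the metastore.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    return spark.sql(hive_count_sql(table)).collect()[0][0]

# Example (placeholder table name):
# count_via_spark("my_db.my_table")
```

If this count is 0 while Hive's own SELECT COUNT(*) is not, a common cause is partition metadata that Spark cannot see; running MSCK REPAIR TABLE in Hive is a typical first step to try.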