Empty Hive table when processing with pyspark

Houssam_2000 Dataiku DSS Core Designer, Registered Posts: 4

Hello,

I have a problem with a Hive table:

- When I process the table with PySpark (e.g. df.count()), I get 0 rows, which means the DataFrame is empty.

- When I investigate with a Hive query (SELECT COUNT(*) FROM TABLE), the query counts all the data in that table.

Does anyone have a solution, or know why it behaves like this?

Thank you

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @Houssam_2000,

    How are you creating the DataFrame? Please try to reproduce the issue in a test PySpark recipe and share the job diagnostics in a support ticket.

    I couldn't reproduce the issue mentioned:


    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Reuse the running Spark context if there is one, otherwise create it
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read recipe inputs
    dataset = dataiku.Dataset("dataset_name")
    df = dkuspark.get_dataframe(sqlContext, dataset)

    # Print the number of rows in the DataFrame
    print(df.count())

    This returns the expected number of rows in a PySpark notebook.
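
To narrow down where the two counts diverge, one option is to take both counts from inside the same Spark session and compare them. The helper below is only a sketch under assumptions: `compare_row_counts` and the table name `my_db.my_table` are hypothetical, and it assumes the Spark session in your setup can query the Hive table by name, which may not hold for every configuration. It accepts any object with a `.sql(query)` method, such as the `sqlContext` from the snippet above.

```python
def compare_row_counts(sql_runner, table_name, df):
    """Count rows two ways and report whether the results agree.

    sql_runner: any object with a .sql(query) method, e.g. a SQLContext
                or a SparkSession.
    table_name: the Hive table name as Spark would see it (hypothetical).
    df:         the DataFrame whose count looks wrong.
    """
    # Count through the DataFrame handed to the recipe or notebook.
    dataframe_count = df.count()

    # Count through a SQL query that Spark runs against the same table.
    query = "SELECT COUNT(*) AS n FROM {}".format(table_name)
    sql_count = sql_runner.sql(query).collect()[0]["n"]

    return {
        "dataframe_count": dataframe_count,
        "sql_count": sql_count,
        "match": dataframe_count == sql_count,
    }


# Usage sketch, with the names from the snippet above and a made-up table name:
# counts = compare_row_counts(sqlContext, "my_db.my_table", df)
# print(counts)
```

If both counts are 0 while the Hive query run outside Spark still returns rows, the difference is more likely in how Spark reads the table than in how the DataFrame was built. The resulting dictionary is also a compact thing to attach to a support ticket alongside the job diagnostics.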
