Writing PySpark DataFrames into a managed folder

Sreyasha · Registered Posts: 1

How does one dump Parquet files into a managed folder when the data is a PySpark DataFrame?

Answers

  • JordanB · Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer · Registered Posts: 293
    edited July 17

    Hi @Sreyasha,

    You can achieve this by converting the Spark DataFrame to a local pandas DataFrame with the toPandas() method and then writing it out with to_csv(). Note that toPandas() collects the entire DataFrame onto the driver, so this works best when the data fits in memory. I've provided sample code below, which you can execute in a notebook.

    import dataiku
    import dataiku.spark as dkuspark
    import pyspark
    from pyspark.sql import SQLContext
    
    # Load PySpark
    sc = pyspark.SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    
    # Example: Read the descriptor of a Dataiku dataset
    mydataset = dataiku.Dataset("Dataset1")
    
    # And read it as a Spark dataframe
    df = dkuspark.get_dataframe(sqlContext, mydataset)
    
    
    # Get a handle on the managed folder (replace "folder-id" with your folder's id)
    folder = dataiku.Folder("folder-id")
    
    # Convert to pandas and write the CSV bytes into the folder
    filename = "testfile.csv"
    with folder.get_writer(filename) as w:
        w.write(df.toPandas().to_csv().encode('utf-8'))
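
    Since the question asks specifically about Parquet, here is a minimal variant of the same pattern (not part of the original answer) using pandas' to_parquet(), which returns the serialized file as bytes when no path is given. This sketch assumes pyarrow or fastparquet is installed in the code environment:

    # Variant: write Parquet instead of CSV (assumes pyarrow or fastparquet)
    parquet_filename = "testfile.parquet"
    # to_parquet() returns the Parquet file contents as bytes when no path is passed
    parquet_bytes = df.toPandas().to_parquet()
    with folder.get_writer(parquet_filename) as w:
        w.write(parquet_bytes)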

    If you have any questions, please let us know.

    Thanks,

    Jordan
