Writing PySpark DataFrames into a managed folder

Sreyasha (Registered Posts: 1)

How does one write Parquet files into a managed folder when the data is a PySpark DataFrame?

Answers

  • JordanB (Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 293)

    Hi @Sreyasha,

    You can achieve this by converting the Spark DataFrame to a local Pandas DataFrame with the toPandas() method and then writing it out with to_csv(). Note that toPandas() collects the entire DataFrame onto the driver, so this approach is only suitable for data that fits in driver memory. I've provided sample code below, which you can execute in a notebook.

    import dataiku
    import dataiku.spark as dkuspark
    import pyspark
    from pyspark.sql import SQLContext

    # Load PySpark
    sc = pyspark.SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Example: Read the descriptor of a Dataiku dataset
    mydataset = dataiku.Dataset("Dataset1")

    # And read it as a Spark dataframe
    df = dkuspark.get_dataframe(sqlContext, mydataset)

    folder = dataiku.Folder("folder-id")
    filename = "testfile.csv"
    with folder.get_writer(filename) as w:
        w.write(df.toPandas().to_csv().encode('utf-8'))
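    Since the question asks specifically about Parquet, a minimal variation of the same pattern is sketched below. It reuses the df and folder objects from the snippet above and assumes pandas has a Parquet engine such as pyarrow installed: the DataFrame is serialized to an in-memory buffer with to_parquet() and the resulting bytes are written through the folder writer.

    import io

    # Assumes `df` and `folder` from the snippet above, and that a Parquet
    # engine (e.g. pyarrow) is available for pandas' to_parquet().
    buf = io.BytesIO()
    df.toPandas().to_parquet(buf, index=False)  # serialize to Parquet in memory

    with folder.get_writer("testfile.parquet") as w:
        w.write(buf.getvalue())

    As with the CSV example, this collects the data to the driver first, so it is intended for modestly sized DataFrames.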

    If you have any questions, please let us know.

    Thanks,

    Jordan
