Writing Pyspark Dataframes into a managed folder

Sreyasha
Level 1
How does one write Parquet files into a managed folder when the data is a PySpark DataFrame?

JordanB
Dataiker

Hi @Sreyasha,

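You can achieve this by converting the Spark DataFrame to a local pandas DataFrame with the toPandas() method and then serializing it with to_csv(). Note that toPandas() collects the whole dataset onto the driver, so this approach suits data that fits in driver memory. I've provided sample code below, which you can execute in a notebook.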

import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext

# Load PySpark
sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("Dataset1")

# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)


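# Handle on the target managed folder (replace "folder-id" with your folder's ID)
folder = dataiku.Folder("folder-id")

# Collect to pandas on the driver, serialize to CSV, and write the bytes to the folder
filename = "testfile.csv"
with folder.get_writer(filename) as w:
    w.write(df.toPandas().to_csv().encode('utf-8'))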

 

If you have any questions, please let us know.

Thanks,

Jordan

 
