Writing PySpark DataFrames into a managed folder
Sreyasha
How does one write Parquet files into a managed folder when the data is a PySpark DataFrame?
Answers
JordanB (Dataiker)
Hi @Sreyasha,

You can achieve this by converting the Spark DataFrame to a local pandas DataFrame with the toPandas method and then using to_csv. I've provided sample code below, which you can execute in a notebook.

import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext

# Load PySpark
sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("Dataset1")

# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)

# Write the dataframe as CSV into the managed folder
folder = dataiku.Folder("folder-id")
filename = "testfile.csv"
with folder.get_writer(filename) as w:
    w.write(df.toPandas().to_csv().encode('utf-8'))
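Since the question mentions Parquet specifically, here is a minimal sketch of the same pattern writing Parquet instead of CSV. It assumes pandas' to_parquet (which returns the file contents as bytes when no path is given) with a pyarrow or fastparquet engine installed; the folder ID and filename are placeholders.

import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the dataset as a Spark dataframe, then collect it locally
df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("Dataset1"))

folder = dataiku.Folder("folder-id")  # placeholder folder ID
with folder.get_writer("testfile.parquet") as w:
    # to_parquet() with no path returns bytes, which the folder writer accepts
    w.write(df.toPandas().to_parquet())

Note that both snippets collect the full DataFrame onto the driver via toPandas, so they are only suitable for data that fits in memory.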
If you have any questions, please let us know.
Thanks,
Jordan