How does one dump parquet files into a managed folder when the type is a DataFrame (pyspark df)?
Hi @Sreyasha,
You can achieve this by converting the Spark DataFrame to a local Pandas DataFrame with the toPandas method and then using to_csv. I've provided sample code below, which you can execute in a notebook.
import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext
# Load PySpark
sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("Dataset1")
# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)
# Write the DataFrame as a CSV file into a managed folder
folder = dataiku.Folder("folder-id")
filename = "testfile.csv"
with folder.get_writer(filename) as w:
    w.write(df.toPandas().to_csv().encode('utf-8'))
If you have any questions, please let us know.
Thanks,
Jordan