Save PySpark DataFrame from Jupyter Notebook
Hi,
I prefer developing with Jupyter, as I was doing before with Zeppelin.
So what is the best way to save a PySpark DataFrame to one of my datasets?
I've tried to create an empty one and push the data in with:
import dataiku
import dataiku.spark as dkuspark
out = dataiku.Dataset("dataset_empty")
dkuspark.write_with_schema(out, my_df)
But it's not working.
Thx guys.
Answers
-
Hi,
Using a notebook to create datasets in DSS isn't recommended and isn't best practice. If you want to create a new dataset, use a PySpark recipe instead. The reason is that a core concept of DSS is idempotence: performing the same action should always lead to the same result. That is why creating new datasets and updating a dataset's output should be handled through recipes incorporated in your Flow rather than through notebooks.
With that said, you can find more information about creating new datasets and writing DataFrame results to a recipe's output via a PySpark recipe in our documentation here:
https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html
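As a rough sketch of what that documentation describes, a minimal PySpark recipe could look like the following. The dataset names "input_dataset" and "output_dataset" are placeholders for the items in your own Flow, and the transformation step is left to you:

import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the recipe's input dataset as a Spark DataFrame
input_ds = dataiku.Dataset("input_dataset")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# ... your transformations on df ...

# Write the DataFrame (and its schema) to the recipe's output dataset
output_ds = dataiku.Dataset("output_dataset")
dkuspark.write_with_schema(output_ds, df)

Because the output dataset is declared as a recipe output in the Flow, rebuilding it will always rerun this same logic, which is what keeps the Flow idempotent.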
Best,
Andrew
-
OK, but I cannot call the libraries I have created in my project; I get an error message...
DSS is for low-level dev data scientists... The documentation needs improvement.
Thank you for your help, best regards guys.
-
Hi, sorry to hear that. What is the error message exactly?