
Save PySpark DataFrame from Jupyter Notebook

Level 2

Hi,

I prefer developing in Jupyter, as I did before with Zeppelin.

So what is the best way to save a PySpark DataFrame to one of my datasets?

I've tried to create an empty one and push the data in with:

import dataiku
import dataiku.spark as dkuspark

out = dataiku.Dataset("dataset_empty")
dkuspark.write_with_schema(out, my_df)

But it's not working.

Thanks, guys.

3 Replies
Dataiker

Hi,

Creating datasets from a notebook is not recommended in DSS and is not considered best practice. If you want to create a new dataset, use a PySpark recipe instead. The reason is that idempotence is a core concept of DSS: performing the same action should always lead to the same results. That is why creating new datasets or updating the contents of an output dataset should be handled by recipes that are part of your Flow rather than by notebooks.
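To illustrate the idempotence point, here is a toy sketch in plain Python. It is not DSS code: the `store` dictionary and both function names are hypothetical stand-ins for a dataset store and two ways of writing to it. It only shows why a step that overwrites its output can be rerun safely, while a step that appends cannot.

```python
# Toy illustration only: "store" stands in for a dataset store,
# and both writer functions are hypothetical, not part of any DSS API.

def overwrite_output(store, name, rows):
    """Replace the whole output, so reruns give the same result."""
    store[name] = list(rows)
    return store


def append_output(store, name, rows):
    """Add to the existing output, so every rerun changes the result."""
    store.setdefault(name, []).extend(rows)
    return store


rows = [{"id": 1}, {"id": 2}]

# Running the overwriting step twice leaves the same two rows.
store = {}
overwrite_output(store, "out", rows)
overwrite_output(store, "out", rows)

# Running the appending step twice duplicates the data.
appended = {}
append_output(appended, "out", rows)
append_output(appended, "out", rows)

print(len(store["out"]), len(appended["out"]))
```

A step that fully rebuilds its output from its inputs behaves like `overwrite_output` here, which is what makes it safe to rerun.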

That said, you can find more information about creating new datasets and writing DataFrame results to an output dataset with a PySpark recipe in our documentation:

https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html
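As a rough idea of what such a recipe can look like, here is a minimal sketch. It assumes the code runs inside a DSS PySpark recipe whose input and output datasets are already declared in the Flow; the `run_recipe` wrapper, the dataset names passed to it, and the `dropDuplicates()` step are placeholders for illustration, not something taken from this thread.

```python
def run_recipe(input_name, output_name):
    """Read one dataset, transform it, and write the result to another."""
    # The imports sit inside the function only so that this sketch can be
    # loaded outside DSS; in a real recipe they usually go at the top.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read the recipe's input dataset as a Spark DataFrame.
    source = dataiku.Dataset(input_name)
    df = dkuspark.get_dataframe(sqlContext, source)

    # Placeholder transformation; replace with your own logic.
    result = df.dropDuplicates()

    # Write the DataFrame and its schema to the recipe's output dataset.
    out = dataiku.Dataset(output_name)
    dkuspark.write_with_schema(out, result)
```

Inside the recipe you would then call something like `run_recipe("my_input", "my_output")`, where both names are hypothetical and must match the datasets declared as the recipe's input and output.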

Best,

Andrew 

Level 2
Author

OK, but then I cannot call the libraries I have created in my project. I get an error message...

DSS is for low level dev data-scientists... The documentation needs improvements.

Thank you for your help, best regards guys.

Dataiker

Hi, sorry to hear that. What exactly is the error message?

Mattsco