Save PySpark DataFrame from Jupyter Notebook

Tate_fr · April 2020

Hi,

I prefer developing in Jupyter, as I did before with Zeppelin.

So what is the best way to save a PySpark DataFrame to one of my datasets?

I've tried to create an empty one and push the data in with:

out = dataiku.Dataset("dataset_empty")
dkuspark.write_with_schema(out, my_df)

But it's not working.

Thx guys.

ATsao · April 2020

Hi,

You shouldn't really use a notebook to create datasets in DSS; it is neither recommended nor best practice. If you wish to create a new dataset, use a PySpark recipe instead. The reason is that a core concept of DSS is idempotence: performing the same action always leads to the same results. That is why creating new datasets or updating the output of a dataset should be handled through recipes that are part of your Flow rather than through notebooks.

With that being said, you can find more information about creating new datasets and writing DataFrame results to them via a PySpark recipe in our documentation here:

https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html
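
For orientation, here is a minimal sketch of what the body of such a recipe might look like, reusing the dkuspark.write_with_schema call from the question. It is an illustration, not the exact template DSS generates: the function name write_spark_output and the dataset names are hypothetical, the SQLContext pattern is assumed to match your DSS/Spark version, and the output dataset is assumed to be declared as an output of the recipe in the Flow.

```python
def write_spark_output(input_name, output_name):
    """Read a DSS dataset as a Spark DataFrame and write it to an output dataset.

    Hypothetical helper for illustration; the dataset names are placeholders.
    """
    # Imported inside the function because these modules are only available
    # in a DSS Spark environment. In a real recipe they would usually sit at
    # the top of the script.
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Assumption: an SQLContext-based setup; newer setups may differ.
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Load the recipe input as a Spark DataFrame.
    df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset(input_name))

    # ... apply your own transformations to df here ...

    # Write the DataFrame and its schema to the output dataset.
    # Assumption: output_name is declared as an output of this recipe.
    out = dataiku.Dataset(output_name)
    dkuspark.write_with_schema(out, df)
    return df
```

Inside the recipe you would then call something like write_spark_output("my_input", "dataset_empty"), with both names replaced by the datasets attached to the recipe.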

Best,

Andrew

Tate_fr · April 2020

OK, but I cannot call the libraries I have created in my project; I get an error message...

DSS is for low level dev data-scientists... The documentation needs improvements.

Thank you for your help, best regards guys.

Mattsco · April 2020

Hi, sorry to hear that. What is the error message exactly?
