Save PySpark DataFrame from Jupyter Notebook

Tate_fr · 04-03-2020

Hi,

I prefer developing in Jupyter, as I did before with Zeppelin.

So what is the best way to save a PySpark DataFrame to one of my datasets?

I've tried to create an empty one and push the data in with:

import dataiku
from dataiku import spark as dkuspark

out = dataiku.Dataset("dataset_empty")
dkuspark.write_with_schema(out, my_df)

But it's not working.

Thx guys.

ATsao · 04-03-2020

Hi,

You shouldn't really use a notebook to create datasets in DSS, as this is neither recommended nor best practice. If you wish to create a new dataset, you should use a PySpark recipe instead. The reason is that a core concept of DSS is idempotence: performing the same action always leads to the same results. That is why creating new datasets and/or updating the output of a dataset should be handled through recipes that are part of your Flow rather than through notebooks.

That said, you can find more information about creating new datasets and writing DataFrame results to them from a PySpark recipe in our documentation here:

https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html
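
For reference, here is a minimal sketch of what such a PySpark recipe might look like, loosely following the pattern on that documentation page. The function name `copy_dataset` and the dataset names are hypothetical, and the imports sit inside the function only so the snippet can be loaded outside DSS; in a real recipe they would normally go at the top of the file.

```python
def copy_dataset(input_name, output_name):
    """Read one DSS dataset as a Spark DataFrame and write it to another.

    Hypothetical helper: both names are placeholders for datasets that are
    assumed to be declared as the recipe's input and output in the Flow.
    """
    # Imports are deferred so this sketch can be loaded outside DSS.
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Load the input dataset as a Spark DataFrame.
    df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset(input_name))

    # ... any transformations on df would go here ...

    # Write the DataFrame and set the output dataset's schema from it.
    dkuspark.write_with_schema(dataiku.Dataset(output_name), df)
    return df
```

Inside the recipe you would then call something like `copy_dataset("my_input", "my_output")`, assuming both datasets are already attached to the recipe as its input and output.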

Best,

Andrew

Tate_fr · 04-04-2020

OK, but I cannot call the libraries I have created in my project; I get an error message...

DSS is for low level dev data-scientists... The documentation needs improvements.

Thank you for your help, best regards guys.

Mattsco · 04-04-2020

Hi, sorry to hear that. What is the error message exactly?

Mattsco
