Save PySpark DataFrame from Jupyter Notebook

Tate_fr Registered Posts: 8 ✭✭✭✭

Hi,

I prefer developing in Jupyter, as I did before with Zeppelin.

So what is the best way to save a PySpark DataFrame to one of my datasets?

I've tried to create an empty one and push the data in with:

import dataiku
import dataiku.spark as dkuspark
out = dataiku.Dataset("dataset_empty")
dkuspark.write_with_schema(out, my_df)

But it's not working.

Thx guys.

Answers

  • ATsao
    ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭

    Hi,

    Using a notebook to create datasets in DSS is neither recommended nor best practice. If you wish to create a new dataset, you should use a PySpark recipe instead. The reason is that a core concept of DSS is idempotence: performing the same action always leads to the same result. That is why creating new datasets or updating a dataset's contents should be handled through recipes that are part of your Flow rather than through notebooks.

    That said, our documentation explains how to create new datasets and write DataFrame results to them from a PySpark recipe:

    https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html

    Best,

    Andrew

  • Tate_fr
    Tate_fr Registered Posts: 8 ✭✭✭✭

    OK, but I cannot call the libraries I have created in my project; I get an error message...

    DSS is for low level dev data-scientists... The documentation needs improvements.

    Thank you for your help, best regards guys.

  • Mattsco
    Mattsco Dataiker, Registered Posts: 125 Dataiker

    Hi, sorry to hear that. What exactly is the error message?
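
For anyone landing on this thread, the recipe-based approach recommended earlier can be sketched roughly as follows. This is only a minimal outline modeled on the pattern in the linked documentation, not a drop-in solution. It assumes the code runs inside a DSS PySpark recipe, the function name `run_recipe` and the dataset names passed to it are placeholders, and `dropDuplicates()` merely stands in for whatever transformation you actually need. The imports are deferred into the function only so that the snippet can be loaded outside DSS.

```python
def run_recipe(input_name, output_name):
    """Copy one DSS dataset to another through a Spark DataFrame."""
    # Imported here because these modules are only available inside a
    # DSS code environment with Spark configured (assumption).
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Reuse the running Spark context if one already exists.
    sc = SparkContext.getOrCreate()
    sql_context = SQLContext(sc)

    # Read the recipe input as a Spark DataFrame.
    input_ds = dataiku.Dataset(input_name)
    df = dkuspark.get_dataframe(sql_context, input_ds)

    # Placeholder transformation: drop exact duplicate rows.
    result_df = df.dropDuplicates()

    # Write the rows and set the output schema from the DataFrame.
    output_ds = dataiku.Dataset(output_name)
    dkuspark.write_with_schema(output_ds, result_df)
```

Inside a recipe you would then call something like `run_recipe("my_input", "dataset_empty")`, where `my_input` is a hypothetical input name and both datasets are declared as the recipe's input and output in the Flow.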
