PySpark Notebook - Insert column with the Partition ID

PJK

Hello,

I am working with a PySpark Notebook.

I have a partitioned Dataset and I would like to create a column in this Dataset with the partition ID value.

The result I want is the same dataset, no longer partitioned, but with an "id_partition" column, which I can't get by simply importing the Dataset in the Notebook.

The goal is also to avoid changing the Flow; I only want to work in the Notebook.

Thanks in advance!

Answers

  • Ignacio_Toledo

    Maybe this part of the documentation might help:

    https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#listing-partitions

    But I think this option is only available when you connect to a dataset with dataiku.Dataset. If you are using dataiku.spark.get_dataframe(sqlContext, dataset), I'm not sure what the solution could be.
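
    As a quick illustration (an untested sketch; "mydataset" is a placeholder name), listing the partition identifiers from a notebook could look like this:

        import dataiku

        # List the partition identifiers of a partitioned dataset,
        # as described in the documentation linked above
        ds = dataiku.Dataset("mydataset")
        print(ds.list_partitions())  # e.g. ['2020-01-01', '2020-01-02', ...]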

    Hope this helps a bit

  • PJK

    Thanks for your answer. I am quite sure that I have to use the DSS Dataset library rather than a Spark function.

    However, I am really struggling to use this library to go from:

    row1

    row2

    row3

    to:

    row1 | partition_name 1

    row2 | partition_name 1

    row3 | partition_name 2
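
    Conceptually, I have in mind something like this untested sketch, using list_partitions() and add_read_partitions() from the documentation ("mydataset" is a placeholder), but I can't get it working:

        import dataiku
        import pandas as pd

        ds = dataiku.Dataset("mydataset")
        chunks = []
        for partition_id in ds.list_partitions():
            part = dataiku.Dataset("mydataset")      # fresh handle per partition
            part.add_read_partitions(partition_id)   # restrict the read to one partition
            df = part.get_dataframe()
            df["id_partition"] = partition_id        # tag every row with its partition
            chunks.append(df)

        full_df = pd.concat(chunks, ignore_index=True)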

  • Ignacio_Toledo

    Maybe it is something you can do before starting to work in the PySpark notebook. There is a thread explaining how to "enrich" your partitioned dataset with the partition ID or name:

    https://community.dataiku.com/t5/Using-Dataiku-DSS/How-to-parse-filename-for-file-based-datasets-partitions/m-p/9267

    In my case, I created a dataset using a connection to an HDFS dataset partitioned by 'day', and this is reflected in the path of the data: /home/data/day=Y-M-D/data.csv

    When creating the dataset, I didn't get a column with the 'day', so I used the Prepare recipe as recommended in that thread:

    [Screenshot: Prepare recipe step extracting the partition value from the file path]

    After running the recipe, I had the data columns, plus a column with the day:

    [Screenshot: resulting dataset with the extracted 'day' column]

    If your case is similar, that might help. I couldn't find a solution using pyspark or the dataiku Python API directly in the notebook.
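
    That said, for anyone who wants to experiment further on the Spark side: pyspark can expose the source file of each row, and in a layout like the one above the partition value is part of that path. An untested sketch, assuming a DataFrame df backed directly by the day=... files:

        from pyspark.sql import functions as F

        # input_file_name() returns the path of the file each row was read from;
        # the partition value can then be extracted from the "day=..." segment
        df_with_part = (
            df.withColumn("source_file", F.input_file_name())
              .withColumn("id_partition",
                          F.regexp_extract("source_file", "day=([^/]+)", 1))
              .drop("source_file")
        )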
