Keep data ordering after pyspark

scholaschl Dataiku DSS Core Concepts, Registered Posts: 8 ✭✭✭

Hello,

I would like to sort a table by a column in ascending order and then run PySpark code on the sorted table.
To do this, I'm using a Sort recipe, but the problem is that the output of my PySpark recipe doesn't keep the sort order.

What could be the solution?
I saw that there is a "Write ordering" option to preserve the order physically.
Where is this option enabled? In the settings of the Sort recipe's output dataset?

Thank you in advance

Best Answer

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer Posts: 753 Dataiker
    Answer ✓

    Hi,

    Datasets do not really have a concept of ordering. This is particularly true when using PySpark: Spark is a distributed computation engine and does not manage ordering, because maintaining a global order goes significantly against distributing the computation.

    You could add a sort statement in Spark at the end of your PySpark recipe (but before writing the output). Please note that it will significantly slow down your PySpark recipe.
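
    As a rough illustration of that first option, here is a minimal sketch. The helper name sort_for_output and the column name passed to it are hypothetical, chosen only for this example; the one Spark call it relies on is DataFrame.orderBy, which performs a global sort and therefore a full shuffle of the data.

```python
def sort_for_output(df, sort_column):
    """Return a DataFrame globally sorted by sort_column, ascending.

    `df` is expected to be a pyspark.sql.DataFrame. The helper name and
    the column name are illustrative assumptions, not part of any API.
    """
    # orderBy triggers a global sort across all partitions, which is
    # the expensive step mentioned above.
    return df.orderBy(sort_column, ascending=True)
```

    In a recipe, this would be the last transformation applied to the DataFrame, immediately before the call that writes the output dataset. Whether the written dataset keeps that order when it is read back still depends on the storage and on the engine doing the reading.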

    You can also reapply a sort recipe after the Pyspark recipe, but this requires re-reading and re-writing the data.
