Keep data ordering after pyspark
Hello,
I would like to sort a table by a column in ascending order and then run Pyspark code on the sorted table.
To do this, I'm using a Sort recipe, but the problem is that my Pyspark recipe doesn't keep the order of my columns.
What could be the solution?
I saw that there was a Write ordering option to keep the order physically.
Where is this option activated? In the parameters of the output dataset during the sort recipe?
Thank you in advance
Best Answer
Hi,
Datasets do not really have a concept of ordering. This is particularly true when using Pyspark, which is a distributed computation engine and does not manage ordering: the data is split into partitions that are processed in parallel, so maintaining a global order works against distribution.
You could add a sort statement in Spark at the end of your Pyspark recipe (but before writing the output). Please note that it will significantly slow down your Pyspark recipe.
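As a rough sketch of what that sort step could look like: the helper name sort_before_write, the column name "id" and the toy data are made up for illustration, and the demo uses a local Spark session rather than your actual recipe inputs and outputs. Only DataFrame.orderBy is the real Spark call here.

```python
def sort_before_write(df, column):
    """Return `df` globally sorted by `column` in ascending order.

    `df` is expected to be a pyspark.sql.DataFrame. orderBy triggers a
    shuffle across partitions, which is where the extra cost comes from.
    """
    return df.orderBy(column, ascending=True)


def demo():
    # Hypothetical local session and toy data, for illustration only.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("sort-demo").getOrCreate()
    )
    df = spark.createDataFrame([(3, "c"), (1, "a"), (2, "b")], ["id", "label"])

    # ... the rest of the recipe's transformations would go here ...

    # Sort as the very last step, just before writing the output.
    result = sort_before_write(df, "id")
    result.show()  # in a real recipe, write `result` to the output dataset instead
    spark.stop()
```

Calling demo() on a machine with Pyspark installed shows the toy rows sorted by id. In your recipe you would apply the same orderBy call to your final DataFrame, right before the write. Whether the stored output keeps that order still depends on how the output dataset is written and read back.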
You can also reapply a sort recipe after the Pyspark recipe, but this requires re-reading and re-writing the data.