Partition usage in PySpark/SparkSQL

sunith992 · July 2022

I have an input dataset with 5 partitions, and I would like to use all of them in PySpark/SparkSQL. All five partitions are grouped together to get an overall count, and then I need one specific partition (out of these five) so that I can report that partition along with the overall count.

Can anyone please help: is there a way to refer to this specific partition as a column (maybe a partition identifier) from the code itself?

When connecting a dataset to a recipe, we are generally asked for the list of partitions to use as input. Here I need to input all of the partitions and then pick out any one of them, which could also help performance/efficiency, since that is why I am considering partitioning.
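
To make the expected result concrete, here is a minimal sketch in plain Python (not Spark). It assumes each row carries its partition identifier in a hypothetical `partition_id` column; the rows, the partition names `p1` to `p5`, and the `report` helper are made up for illustration only.

```python
from collections import Counter

# Made-up rows; `partition_id` is a hypothetical column naming the
# partition each row came from (five partitions, p1..p5).
rows = [
    {"partition_id": "p1", "value": 10},
    {"partition_id": "p1", "value": 11},
    {"partition_id": "p2", "value": 20},
    {"partition_id": "p3", "value": 30},
    {"partition_id": "p4", "value": 40},
    {"partition_id": "p5", "value": 50},
]


def report(rows, wanted_partition):
    """Return the chosen partition together with the count over ALL partitions."""
    per_partition = Counter(row["partition_id"] for row in rows)
    if wanted_partition not in per_partition:
        raise ValueError(f"unknown partition: {wanted_partition}")
    return {
        "partition_id": wanted_partition,
        # The count spans every partition, not just the chosen one.
        "overall_count": sum(per_partition.values()),
    }


result = report(rows, "p2")
print(result)
```

In other words, the count is taken over every partition, while the reported identifier belongs to just one of them.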
