Partition usage in PySpark/SparkSQL

sunith992 Dataiku DSS Core Designer, Registered Posts: 20 ✭✭✭✭

I have an input dataset that is partitioned (5 partitions), and I would like to use all of it in PySpark/SparkSQL. All five partitions are grouped to compute an overall count, and then I need one specific partition (out of the five) so I can report that partition alongside the overall count.

Can anyone please help: is there any way to refer to this specific partition as a column (maybe a partition identifier) from the code itself?

When connecting recipes, DSS generally asks for the list of partitions to use as input, but here I need all of them as input and then to use any one partition from the set. This could also benefit performance/efficiency, since I am relying on the partitioning scheme.
