
Partition usage in Pyspark/SparkSQL

I have an input dataset that is partitioned into 5 partitions, and I would like to use all of them in PySpark/SparkSQL. All five partitions are grouped together to get the overall count. I then need to pick one specific partition (out of the five) and report that partition along with the overall count.

Can anyone please help: is there a way to refer to this specific partition as a column (perhaps a partition identifier) from the code itself?

When connecting inputs to a recipe, it generally asks for the list of partitions to use as input. Here, though, I need to read all partitions and then single out one of them. This could also help performance and efficiency, which is why I am considering partitioning in the first place.
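To make the goal concrete, here is a minimal sketch of the query I have in mind. It assumes the partition value is available as an ordinary column, which Spark does for Hive-style partitioned tables; whether a given platform exposes it that way is something to check. The helper name `build_overall_count_query`, the table name `events`, and the column name `part` are all hypothetical and only there for illustration.

```python
def build_overall_count_query(table: str, partition_col: str, wanted: str) -> str:
    """Build a Spark SQL query that counts rows over all partitions,
    then keeps only the row for the one partition named `wanted`.

    Assumption: `partition_col` is a regular column holding the partition value.
    """
    # Keep the sketch simple: refuse values that would need SQL escaping.
    if "'" in wanted or "\\" in wanted:
        raise ValueError("partition value must not contain quotes or backslashes")

    return (
        f"SELECT {partition_col}, part_count, overall_count\n"
        f"FROM (\n"
        # Window with an empty OVER () sums the per-partition counts
        # across every partition, giving the overall count on each row.
        f"  SELECT {partition_col}, part_count,\n"
        f"         SUM(part_count) OVER () AS overall_count\n"
        f"  FROM (\n"
        # Inner query: one row per partition with its own row count.
        f"    SELECT {partition_col}, COUNT(*) AS part_count\n"
        f"    FROM {table}\n"
        f"    GROUP BY {partition_col}\n"
        f"  ) per_partition\n"
        f") with_total\n"
        # Filter last, so the overall count still covers all partitions.
        f"WHERE {partition_col} = '{wanted}'"
    )


query = build_overall_count_query("events", "part", "2023-01")
print(query)
# With an active SparkSession named `spark`, this could be run as:
# spark.sql(query).show()
```

The filter is applied in the outermost query on purpose: filtering earlier would restrict the count to the single partition instead of all five.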

0 Replies
