
Partition usage in Pyspark/SparkSQL

sunith992
Level 3

I have an input dataset that is partitioned (5 partitions), and I would like to use all of them in PySpark/SparkSQL. All of the partitions are grouped to get the overall count. I then need one specific partition (out of these five) so that I can report that partition along with the overall count.

Can anyone please help: is there a way to refer to this specific partition as a column (maybe a partition identifier) from the code itself?

When configuring a recipe, it generally asks for the list of partitions to use as input, but here I need to input all of the partitions and then use any one of them. This should also help performance/efficiency, which is why I am considering the partitioning method.
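
To make the intended result concrete, here is a minimal sketch in plain Python (not Spark), just to show the logic. It assumes each row carries its partition in a hypothetical `partition_id` field. Whether and how such a column can be obtained in a PySpark/SparkSQL recipe is exactly the open question above, and the sample rows and the helper name `report_partition` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample rows: "partition_id" stands in for the partition
# identifier column asked about above (five partitions, seven rows in total).
rows = [
    {"partition_id": "2023-01", "value": 10},
    {"partition_id": "2023-01", "value": 12},
    {"partition_id": "2023-02", "value": 7},
    {"partition_id": "2023-03", "value": 3},
    {"partition_id": "2023-03", "value": 9},
    {"partition_id": "2023-04", "value": 5},
    {"partition_id": "2023-05", "value": 8},
]


def report_partition(rows, target_partition):
    """Report one partition together with the count over all partitions."""
    # Count rows per partition across the whole input (all five partitions).
    per_partition = Counter(row["partition_id"] for row in rows)
    # The overall count is the sum over every partition.
    overall_count = sum(per_partition.values())
    return {
        "partition_id": target_partition,
        "partition_count": per_partition.get(target_partition, 0),
        "overall_count": overall_count,
    }


result = report_partition(rows, "2023-03")
print(result)
```

The point of the sketch is that the overall count needs every partition as input, while the reported row only needs the identifier of one of them.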

