Redispatching partitioning dataset

Solved!
Bader
Level 3

Is it possible to use the Spark engine after checking "redispatch partitioning"? If yes, please describe the configuration.

 

1 Solution
ATsao
Dataiker

Hi Bader,

It is important to first note that the redispatch partitioning feature takes a non-partitioned dataset and splits it into different partitions, so that the files of each partition only contain rows pertaining to that partition. In DSS, you can redispatch partitioning using either a Sync or a Prepare recipe. However, these recipes are not Spark-compatible in this mode.

If you wish to use Spark, you would need to write the code yourself by leveraging the Spark APIs through a Spark code recipe, such as Spark-Scala or PySpark. In short, the code would need to handle the splitting by using Spark's partitionBy() method and writing to the partitioned output dataset accordingly. Note as well that the parallelization on the cluster will be constrained by the level of parallelism Spark can achieve on the non-partitioned input dataset, which depends on the number of input files and the format being used (Parquet and ORC being parallelizable, but not CSV).
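For illustration, here is a minimal Spark-Scala sketch of that approach. It is only a sketch, not the exact DSS recipe code: the input/output paths and the "country" column below are placeholders you would adapt, and in an actual DSS recipe you would normally read and write through the Dataiku Spark integration rather than raw paths.

import org.apache.spark.sql.SparkSession

// Placeholder app name, paths and column; adapt to your datasets.
val spark = SparkSession.builder().appName("redispatch-sketch").getOrCreate()

// Read the non-partitioned input. Parquet/ORC inputs parallelize well;
// a single CSV file will be read by a single task.
val df = spark.read.parquet("/path/to/non_partitioned_input")

// partitionBy() creates one sub-directory per distinct value of the column,
// e.g. country=FR/, country=US/, each containing only that partition's rows.
df.write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("/path/to/partitioned_output")

Since the column holds discrete values, each distinct value becomes its own partition directory in the output.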

I hope that this helps!

Best,

Andrew


3 Replies

Bader
Level 3
Author

Many thanks ATsao,

Yes, my source and target datasets are not partitioned. That's why I used the redispatch feature. Unfortunately, I got a "DSS went out of memory" error.

Let's say I were to use a Spark-Scala code recipe. Could you please show an example of how to partition, assuming my partition column has discrete values?

 

Thanks

Bader

Bader
Level 3
Author

Hi @ATsao 

I have tried the code below; it runs, but the dataset in Dataiku is not populated.

I have done the following:

1- Created a Spark-Scala recipe

2- In the Spark-Scala code:

df.write.format("parquet").partitionBy("column").mode("append").saveAsTable("dataiku.tname")

3- I checked the target dataset; it has no data.
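A possible explanation (my assumption, not confirmed): saveAsTable() writes the data into the Spark/Hive metastore warehouse, not into the storage that backs the DSS output dataset, so the DSS dataset stays empty. A minimal sketch of what I may try instead, writing the partitioned files to the path that backs the output dataset (the path and column name below are placeholders, untested):

// Assumption: "/data/dss/output_dataset" is the folder backing the partitioned
// output dataset in DSS; replace with the actual managed dataset location.
df.write
  .format("parquet")
  .partitionBy("column")
  .mode("append")
  .parquet("/data/dss/output_dataset")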

 
