Control write partitioning with Spark

jmccartin
Level 3

There does not appear to be a way to write Spark dataframes to disk using a chosen partition scheme. This is normally done via dataframe.write.parquet(<path>, partitionBy=['year']), if one wants to partition the data by year, for example. I am looking at the API page here: https://doc.dataiku.com/dss/latest/python-api/pyspark.html, specifically the function write_with_schema.
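For illustration, a minimal sketch of the plain Spark write being described, next to the DSS write_with_schema call (the path, column names and dataset name are placeholders):

```python
from pyspark.sql import SparkSession

import dataiku
from dataiku import spark as dkuspark

spark = SparkSession.builder.getOrCreate()

# Toy dataframe standing in for the recipe's real output.
df = spark.createDataFrame(
    [(2019, "a", 1.0), (2020, "b", 2.0)],
    ["year", "key", "value"],
)

# Plain Spark: Parquet files laid out in one directory per year.
df.write.parquet("hdfs:///data/events", partitionBy=["year"], mode="overwrite")

# DSS: write_with_schema takes no partitionBy argument; the on-disk layout
# is driven by the output dataset's settings ("events_by_year" is a placeholder).
output_ds = dataiku.Dataset("events_by_year")
dkuspark.write_with_schema(output_ds, df)
```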



What are my options here? Since this is an important requirement for us, what's to stop me from simply using the sqlContext to write to a fixed path in HDFS with the command I gave above? Can this be hacked somehow, perhaps with a plugin?

3 Replies
Alex_Combessie
Dataiker Alumni

Hi,

In order to use partitioning in Dataiku, you need to specify it on the output (and possibly input) dataset. You can find more details on this page: https://doc.dataiku.com/dss/latest/partitions/fs_datasets.html 

Once configured, this file system partitioning will be applied to all recipes, including those running on Spark.
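To make this concrete, here is a minimal sketch of a PySpark recipe whose output dataset has file-based partitioning configured as described above. Dataset names are placeholders, and the partition to write is selected when the recipe is built, not in the code:

```python
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

input_ds = dataiku.Dataset("events")
output_ds = dataiku.Dataset("events_by_year")  # partitioned on the FS connection

df = dkuspark.get_dataframe(sqlContext, input_ds)

# No partitionBy here: DSS writes the dataframe into the folder of the
# partition currently being built, as defined in the dataset settings.
dkuspark.write_with_schema(output_ds, df)
```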

Hope it helps,

Alex

bkostya
Level 1

You can go to the recipe's Advanced -> Spark configuration tab and define
spark.sql.shuffle.partitions = 5
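For reference, the same property can also be set in recipe code. Note that it controls the number of shuffle partitions (and therefore the number of output files), rather than directory-style partitioning by column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap the number of partitions produced by shuffles (joins, aggregations);
# this bounds the number of output files but does not create per-column
# directories the way partitionBy does.
spark.conf.set("spark.sql.shuffle.partitions", "5")
```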

Cartiernan
Level 1

Did you end up finding a workaround for this? In your stated example, DSS's Time Dimension partitioning will work, but if you have a Discrete Dimension with many values, there does not seem to be a way to build all of them at once.

In general, there doesn't seem to be wildcard support for discrete dimensions.

