Control write partitioning with Spark

jmccartin Registered Posts: 19 ✭✭✭✭

There does not appear to be a way to write Spark dataframes to disk using a set partition scheme. This is normally done via dataframe.write.parquet(&lt;path&gt;, partitionBy=['year']), if one wants to partition the data by year, for example. I am looking at the API page here, specifically the function write_with_schema.

What are my options here? Since this is an important requirement for us, what's to stop me from simply using the sqlContext to write to a fixed path in HDFS, using the command I gave above? Could this be hacked somehow, perhaps via a plugin?


  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭


    In order to use partitioning in Dataiku, you need to specify it on the output (and possibly input) dataset. You can find more details on this page:

    Once set up this way, the file-system partitioning is applied to all recipes writing that dataset, including those running on Spark.

    Hope it helps,


  • bkostya Registered Posts: 4 ✭✭✭

    You may go to recipe -> Advanced -> Spark configuration and define
    spark.sql.shuffle.partitions = 5

  • Cartiernan Dataiku DSS Core Concepts, Registered Posts: 1 ✭✭✭

    Did you end up finding a workaround for this? In your stated example, DSS's Time Dimension partitioning will work, but if you have a Discrete Dimension with many values there does not seem to be a way to run all of them.

    In particular, there doesn't seem to be wildcard support for discrete dimensions.
