Control write partitioning with Spark


There does not appear to be a way to write a Spark DataFrame to disk using a chosen partition scheme. In plain Spark this is normally done via dataframe.write.parquet(<path>, partitionBy=['year']), if one wants to partition the data by year, for example. I am looking at the API page here: https://doc.dataiku.com/dss/latest/python-api/pyspark.html, specifically the function write_with_schema.
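
For concreteness, here is a minimal sketch of the plain-Spark write I have in mind (the DataFrame contents and the output path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for the real data; column names are placeholders
df = spark.createDataFrame([(2019, "a"), (2020, "b")], ["year", "value"])

# Plain Spark writes one subdirectory per distinct year value under the
# output path, e.g. .../year=2019/ and .../year=2020/
df.write.parquet("hdfs:///tmp/events", partitionBy=["year"])
```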

What are my options here? Since this is an important requirement for us, what is to stop me from simply using the sqlContext to write to a fixed path in HDFS with the command above? Could this be hacked somehow, perhaps with a plugin?
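
For what it's worth, a sketch of the workaround I am considering: read the input through the DSS API, then write straight to a fixed HDFS path, bypassing write_with_schema. The dataset name and output path below are hypothetical, and I assume DSS would not manage or track files written this way:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the input dataset through the DSS API (dataset name is hypothetical)
input_dataset = dataiku.Dataset("events")
df = dkuspark.get_dataframe(sqlContext, input_dataset)

# Write directly to HDFS with a year-based partition scheme, sidestepping
# write_with_schema (path is hypothetical; DSS will not track this output)
df.write.mode("overwrite").partitionBy("year").parquet("hdfs:///user/me/events_by_year")
```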
