
Save spark dataset to .parquet in managed folder

sagar_dubey
Level 1

Hi,

We have a managed folder on S3. As part of our process, we want to use a PySpark recipe to read a dataset into a Spark DataFrame, perform some basic operations, and write multiple output files in Parquet format into different subfolders of the managed folder.

We are able to perform the operations, but we have not been able to write the output files to the managed folder.

Is there a way to write a dataset directly to a managed folder in Parquet format?

Could you please help us with the same?

Thanks a lot for your help!

 

Best, Sagar

3 Replies
fchataigner2
Dataiker

Hi,

There is no integration of Spark DataFrame read/write with DSS managed folders (I'm assuming you're using Spark's DataFrame API). A couple of remarks, though:

- If you need to output the results of your Spark computations as Parquet files, why not simply write to a DSS dataset, on the same S3 connection and with Parquet as the format?

- You can always make a folder point to the files of another dataset: simply adjust the path in the managed folder's Settings > Connection.
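To illustrate the first suggestion, here is a minimal sketch of a PySpark recipe that writes its result to a DSS output dataset (dataset and column names are hypothetical; the output's S3 connection and Parquet format are configured in the dataset's settings, not in code), based on the standard `dataiku.spark` recipe pattern:

```python
# PySpark recipe: read a DSS dataset into a Spark DataFrame,
# transform it, and write it back to a Parquet-backed S3 dataset.
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "main_input" and "parquet_output" are hypothetical dataset names
input_dataset = dataiku.Dataset("main_input")
df = dkuspark.get_dataframe(sqlContext, input_dataset)

# Any basic operation; "amount" is a hypothetical column
result = df.filter(df["amount"] > 0)

# The output dataset is declared on the same S3 connection with
# Parquet format in its settings; DSS handles the file layout.
output_dataset = dataiku.Dataset("parquet_output")
dkuspark.write_with_schema(output_dataset, result)
```

This only runs inside a DSS PySpark recipe, where the input and output datasets have been declared in the recipe's settings.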

sagar_dubey
Level 1
Author

Thanks @fchataigner2 for the response.

In our case, we will be slicing our main dataset into multiple datasets, and we want to write these as output files in the managed folder (in different subfolders). That way, a user can create a dataset on the desired file based on their needs.

We don't want to write to a single dataset every time, since all the outputs are different and the volume of data is high.

fchataigner2
Dataiker

Hi,

It sounds like you want a partitioned dataset as output, not a folder. The question then becomes how you go from your unpartitioned input dataset to a partitioned one.

If the pattern is that each time you run the recipe, the output is rebuilt from scratch (all subfolders), then you should consider using a Sync recipe to the partitioned dataset with "redispatch" checked (in that case, the target partition name doesn't matter as long as it's not empty).

If the pattern is that each run adds a new file to the output, in a new subfolder, then you can write a Spark recipe in which you use the output partition value to filter the input and produce the DataFrame. DSS passes the partition value as variables to the SparkSQL or Scala code.
