Is there a good way to take a large input file and split it into more manageable files, ideally by size?
For example, if we are given one huge 40GB .csv.gz file in S3, is there a recipe which will allow us to read it in spark and split it into 250MB .csv.gz files written back to S3?
One option for splitting input data is to use a Sync recipe to redispatch the dataset. This lets you "split" the dataset into separate output files based on the partition value. However, redispatching cannot run on the Spark engine, so that might not be an option for a 40GB input.
If you need to stick with Spark due to the data size, you'll likely want a Spark code recipe such as PySpark, and perform the split and writing within the recipe.
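If you go the PySpark route, a common approach is to estimate how many output files you need from a target file size and then repartition before writing. The sketch below is only illustrative: the 250MB target comes from the question above, and the helper function, bucket path, and dataframe name are assumptions rather than Dataiku or Spark specifics.

```python
import math

def partition_count(total_size_bytes, target_size_bytes=250 * 1024 ** 2):
    """Estimate how many roughly equal output files are needed so that
    each one lands near the target size (250MB by default)."""
    return max(1, math.ceil(total_size_bytes / target_size_bytes))

# Example: a 40GB (uncompressed) input split into ~250MB pieces
n_files = partition_count(40 * 1024 ** 3)  # -> 164

# Inside the PySpark recipe you would then repartition before writing,
# e.g. (names are illustrative; base the estimate on the *uncompressed*
# size, since Spark sizes partitions before gzip compression is applied):
#
#   df.repartition(n_files) \
#     .write.option("compression", "gzip") \
#     .csv("s3://your-bucket/output/")
```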
Another option would be to use a Python recipe that reads and writes the data in chunks, so you don't need to load the entire dataset at once. Here's an example that reads 5000 rows at a time and writes each 5000-row chunk as a separate output file to S3, by writing to a managed folder that points to an S3 connection.
# -*- coding: utf-8 -*-
import dataiku

# Read recipe inputs
input_dataset = dataiku.Dataset("INPUT_DATASET")

# Managed folder pointing to an S3 connection
output_folder = dataiku.Folder("viRZ38Wu")

counter = 1
for df in input_dataset.iter_dataframes(chunksize=5000):
    # Write each 5000-row chunk to its own CSV file in the folder
    with output_folder.get_writer("myoutputfile_" + str(counter) + ".csv") as w:
        w.write(df.to_csv(index=False).encode('utf-8'))
    counter += 1
In S3, I now have many small files, each containing a split of the data.
I hope this information is helpful. Let me know if you have any questions about this!