Splitting very large files

JonB123 Registered Posts: 1 ✭✭✭


Is there a good way to take a large input file and split it into more manageable files, ideally by size?

For example, if we are given one huge 40GB .csv.gz file in S3, is there a recipe that will allow us to read it in Spark and split it into 250MB .csv.gz files written back to S3?



  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 315 Dataiker
    edited July 17

    Hi @JonB123

    One option for splitting input data would be to use a Sync recipe to redispatch the dataset. This will allow you to "split" the dataset into separate output files based on the partition value. However, re-dispatching cannot be run on the Spark engine, so that might not be an option for you here.

    If you need to stick with Spark because of the data size, you'll likely want a Spark code recipe such as PySpark, and perform the split and the writes inside the recipe. Note that Spark writes one file per partition, so repartitioning the DataFrame is how you control the number (and therefore the rough size) of the output files.

    Another option would be a Python recipe that reads and writes the data in chunks, so you never need to hold the whole dataset in memory. Here's an example that reads 5000 rows at a time and writes each chunk as its own output file to S3, via a managed folder that points to an S3 connection.

    # -*- coding: utf-8 -*-
    import dataiku

    # Read recipe inputs
    input_dataset = dataiku.Dataset("INPUT_DATASET")
    # Managed folder pointing to an S3 connection
    output_folder = dataiku.Folder("viRZ38Wu")

    counter = 1
    # Stream the dataset 5000 rows at a time instead of loading it all at once
    for df in input_dataset.iter_dataframes(chunksize=5000):
        # Write each chunk as its own CSV file in the folder
        with output_folder.get_writer("myoutputfile_" + str(counter) + ".csv") as w:
            w.write(df.to_csv(index=False).encode("utf-8"))
        counter += 1

    In S3, I now have many small files, each containing a slice of the data:

    (Screenshot: S3 bucket listing showing the split output files.)
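    Note that this splits by row count rather than by size. If you specifically need parts of roughly 250 MB, the same chunked idea can be sketched with just the Python standard library; the function name, file names, and threshold below are illustrative, and the threshold is measured on uncompressed bytes, so the gzipped parts will come out smaller:

```python
import gzip
import os

def split_csv_gz(src_path, out_dir, max_bytes=250 * 1024 * 1024):
    """Split a .csv.gz file into parts of roughly max_bytes of
    uncompressed CSV each, repeating the header in every part.
    Returns the list of part file paths."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with gzip.open(src_path, "rt", newline="") as src:
        header = next(src)  # keep the header so each part is a valid CSV
        part_num, out, written = 0, None, 0
        for line in src:
            # Start a new part before the first row and whenever the
            # current part has reached the size threshold
            if out is None or written >= max_bytes:
                if out is not None:
                    out.close()
                part_num += 1
                path = os.path.join(out_dir, f"part_{part_num:05d}.csv.gz")
                out = gzip.open(path, "wt", newline="")
                out.write(header)
                written = 0
                parts.append(path)
            out.write(line)
            written += len(line)
        if out is not None:
            out.close()
    return parts
```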

    I hope this information is helpful. Let me know if you have any questions about this!

