Split Dataset Into Files of n Records Each, Deliver via SFTP

MossandRoy Dataiku DSS Core Designer, Registered Posts: 8

I have a business requirement to deliver a dataset via SFTP, but the dataset must be split into files each containing n records. There can be a different number of records each time, which means there can be a different number of files each time. This requirement is based on a size restriction of records per file in the destination system and is not something I can control.

In my dataset I have created a column called GroupNumber which I can use to split the data: the first n records have a value of 1, the next n records have 2, and so on.
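(For anyone landing here with the same problem: a grouping column like this can also be derived in plain pandas. This is just a sketch with made-up data; `n` and the column names are placeholders.)

```python
import pandas as pd

# Hypothetical example: 10 records, target file size n = 3 records
df = pd.DataFrame({"id": range(1, 11)})
n = 3

# Row 0..n-1 get GroupNumber 1, rows n..2n-1 get 2, and so on
df["GroupNumber"] = [i // n + 1 for i in range(len(df))]

print(df["GroupNumber"].tolist())
# → [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]
```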

It seems as if a managed file or folder on the SFTP connection would only output the entire dataset, with no option to split it. I've been searching these boards and turning the puzzle over in my mind, but I'm unsure how to solve it. I keep coming back to Python as the most likely solution, but I've only scratched the surface of Python and haven't really used it with Dataiku, so I don't know whether it's the right tool for the job.

Thank you in advance for any help you can offer in solving this problem. I don't need someone to do this for me, but would appreciate being pointed in the right direction.


Operating system used: Windows 11 Enterprise


Best Answer

  • MossandRoy
    MossandRoy Dataiku DSS Core Designer, Registered Posts: 8
    Answer ✓

    It may not be pretty, and I'm currently going to a non-SFTP managed folder, but here is the code I've got working! Feedback is most definitely welcome!

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd
    # Import the class that allows us to execute SQL on the Studio connections
    from dataiku.core.sql import SQLExecutor2

    # Get a handle on the input dataset
    groupeddata = dataiku.Dataset("MY_DATASET")
    # Get a handle on the output folder
    outfolder = dataiku.Folder("FOLDER_ID")

    # Create an executor and pass it the dataset instance, so the
    # executor knows which SQL database should be targeted
    executor = SQLExecutor2(dataset=groupeddata)

    # Find the max value of the group column to find out how many files we're working with
    max_files = executor.query_to_df("""
        SELECT MAX(GROUP_COLUMN) AS MaxGroupNumber
        FROM MY_TABLE
    """)
    maxGroups = int(max_files["MaxGroupNumber"].iloc[0])

    # Write one file per group
    for group in range(1, maxGroups + 1):
        outData_df = executor.query_to_df(f"""
            SELECT MY_COLUMNS_TO_SEND
            FROM MY_TABLE
            WHERE GROUP_COLUMN = {group}
        """)
        filename = f"Test_{group}.txt"
        outfolder.upload_data(filename, outData_df.to_csv(index=False).encode("utf-8"))
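    (Editor's note on the "feedback welcome" request: one variant worth considering pulls the data once and splits it locally with pandas `groupby`, instead of issuing one SQL query per group. The sketch below isolates the splitting logic in a plain function so it runs standalone; in Dataiku each `(filename, data)` pair it yields would be passed to `outfolder.upload_data(filename, data)`. All names and the data are placeholders.)

```python
import pandas as pd

def split_into_files(df, group_col, name_pattern="Test_{}.txt"):
    """Yield (filename, csv_bytes) pairs, one per distinct group value."""
    for group, chunk in df.groupby(group_col, sort=True):
        yield name_pattern.format(group), chunk.to_csv(index=False).encode("utf-8")

# Hypothetical example: 5 records pre-assigned to groups of at most 2
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "GroupNumber": [1, 1, 2, 2, 3]})

files = dict(split_into_files(df, "GroupNumber"))
print(sorted(files))  # → ['Test_1.txt', 'Test_2.txt', 'Test_3.txt']
```

    This trades one query per group for a single full read, which may or may not be preferable depending on dataset size and database round-trip cost.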

