Split Dataset Into Files of n Records Each, Deliver via SFTP

MossandRoy
MossandRoy Dataiku DSS Core Designer, Registered Posts: 8

I have a business requirement to deliver a dataset via SFTP, but the dataset must be split into files each containing n records. There can be a different number of records each time, which means there can be a different number of files each time. This requirement is based on a size restriction of records per file in the destination system and is not something I can control.

In my dataset I have created a column called GroupNumber which I can use to split the data. The first n records have a value of '1', the next n records have '2', etc...

It seems as if a managed file or folder on the SFTP connection would only output the entire dataset without the option to split it. I've been searching these boards and turning over this puzzle in my mind. I'm unsure of how to solve this. My mind keeps coming back to Python as the most likely solution, but I've only scratched the surface of Python and not really used it at all with Dataiku, so I don't know if it's the right tool for the job or not.

Thank you in advance for any help you can offer in solving this problem. I don't need someone to do this for me, but would appreciate being pointed in the right direction.

Cheers!


Operating system used: Windows 11 Enterprise

Tagged:

Best Answer

  • MossandRoy
    MossandRoy Dataiku DSS Core Designer, Registered Posts: 8
    edited July 2024 Answer ✓

    It may not be pretty, and I'm currently going to a non-SFTP managed folder, but here is the code I've got working! Feedback is most definitely welcome!

    %pylab inline
    
    # -*- coding: utf-8 -*-
    import dataiku
    from dataiku import pandasutils as pdu
    import pandas as pd
    
    # Import the class that allows us to execute SQL on the Studio connections
    from dataiku.core.sql import SQLExecutor2
    
    # Get a handle on the input dataset
    groupeddata = dataiku.Dataset("MY_DATASET")
    
    # Write recipe outputs
    outfolder = dataiku.Folder("FOLDER_ID")
    
    # We create an executor. We pass to it the dataset instance. This way, the
    # executor knows which SQL database should be targeted
    executor = SQLExecutor2(dataset=groupeddata)
    
    # Find the max value of the column to find out how many files we're working with
    max_files = executor.query_to_df("""
    SELECT MAX(GROUP_COLUMN) AS MaxGroupNumber
    FROM MY_TABLE
    """)
    
    for index, maxGroup in max_files.itertuples():
    maxGroups = maxGroup
    
    for group in range (1, maxGroups + 1):
    
    outData_df = executor.query_to_df(f"""
    SELECT MY_COLUMNS_TO_SEND
    FROM MY_TABLE
    WHERE GROUP_COLUMN = {group}
    """)
    
    filename = f"Test_{group}.txt"
    
    outfolder.upload_data(filename, outData_df.to_csv(index=False).encode("utf-8"))

Answers

Setup Info
    Tags
      Help me…