Split Dataset Into Files of n Records Each, Deliver via SFTP
I have a business requirement to deliver a dataset via SFTP, but the dataset must be split into files of n records each. The number of records can differ each time, which means the number of files can differ each time. This requirement comes from a limit on the number of records per file in the destination system and is not something I can control.
In my dataset I have created a column called GroupNumber which I can use to split the data. The first n records have a value of '1', the next n records have '2', etc...
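(For illustration only, and not necessarily how the column was built here: a minimal sketch of one way a GroupNumber like this can be derived in a Python recipe, with a placeholder dataset name and chunk size.)

import dataiku
import numpy as np

n = 500  # placeholder: records-per-file limit imposed by the destination system

# Read the input dataset into a pandas DataFrame (dataset name is a placeholder)
df = dataiku.Dataset("MY_DATASET").get_dataframe()

# Number rows 0, 1, 2, ..., integer-divide by n, and shift so groups start at 1
df["GroupNumber"] = (np.arange(len(df)) // n) + 1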
It seems as if a managed file or folder on the SFTP connection would only output the entire dataset without the option to split it. I've been searching these boards and turning over this puzzle in my mind. I'm unsure of how to solve this. My mind keeps coming back to Python as the most likely solution, but I've only scratched the surface of Python and not really used it at all with Dataiku, so I don't know if it's the right tool for the job or not.
Thank you in advance for any help you can offer in solving this problem. I don't need someone to do this for me, but would appreciate being pointed in the right direction.
Cheers!
Operating system used: Windows 11 Enterprise
Best Answer
-
It may not be pretty, and I'm currently going to a non-SFTP managed folder, but here is the code I've got working! Feedback is most definitely welcome!
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
# Import the class that allows us to execute SQL on the Studio connections
from dataiku.core.sql import SQLExecutor2

# Get a handle on the input dataset
groupeddata = dataiku.Dataset("MY_DATASET")

# Get a handle on the output managed folder
outfolder = dataiku.Folder("FOLDER_ID")

# Create an executor and pass it the dataset instance, so the
# executor knows which SQL database should be targeted
executor = SQLExecutor2(dataset=groupeddata)

# Find the max value of the group column to find out how many files we're working with
max_files = executor.query_to_df("""
    SELECT MAX(GROUP_COLUMN) AS MaxGroupNumber
    FROM MY_TABLE
""")
max_groups = int(max_files.iat[0, 0])

# Query each group separately and upload it as its own file
for group in range(1, max_groups + 1):
    outData_df = executor.query_to_df(f"""
        SELECT MY_COLUMNS_TO_SEND
        FROM MY_TABLE
        WHERE GROUP_COLUMN = {group}
    """)
    filename = f"Test_{group}.txt"
    outfolder.upload_data(filename, outData_df.to_csv(index=False).encode("utf-8"))
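A pure-pandas variant of the same idea, sketched below with the same placeholder names, avoids one SQL round trip per group; it assumes the output folder can later be repointed at (or synced to) the SFTP connection.

import dataiku

groupeddata = dataiku.Dataset("MY_DATASET")
outfolder = dataiku.Folder("FOLDER_ID")

# Read everything once, then split on the precomputed group column
df = groupeddata.get_dataframe()
for group, chunk in df.groupby("GROUP_COLUMN"):
    filename = f"Test_{group}.txt"
    # Drop the helper column before writing, then upload the CSV bytes to the folder
    payload = chunk.drop(columns=["GROUP_COLUMN"]).to_csv(index=False).encode("utf-8")
    outfolder.upload_data(filename, payload)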
Answers
-
tgb417
Welcome to the Dataiku community we are glad you have joined us.
I have a similar need that has recently come up, which I have not yet worked out.
If you can program a bit in Python, I’m fairly sure that you can save your files to a local managed folder and then sync to the SFTP server.
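As a rough sketch of that idea (all names are placeholders, and this assumes the managed folder API's get_writer; the local folder's contents would then be synced to the SFTP connection):

import dataiku

# Local managed folder; a Sync recipe can then push its contents to the SFTP connection
folder = dataiku.Folder("LOCAL_FOLDER_ID")
df = dataiku.Dataset("MY_DATASET").get_dataframe()

# Stream one CSV file into the folder
with folder.get_writer("Test_1.txt") as writer:
    writer.write(df.to_csv(index=False).encode("utf-8"))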
There may, however, be a cleverer way to get this done. Anyone out there have any ideas?
-
tgb417
You might find this set of instructions to be a helpful starting place.
https://community.dataiku.com/t5/Using-Dataiku/Create-a-file-in-sftp/m-p/8967
@Ignacio_Toledo authored a solution on that thread. You may also find it helpful to familiarize yourself with SFTP connections coming out of Dataiku:
https://doc.dataiku.com/dss/latest/connecting/scp-sftp.html
-
I think the SQL/Python portion of the answer is to be found in this course from Dataiku Academy. Being able to execute SQL directly from Python will allow me to query my dataset as needed. Then the next step will be writing out the resulting files to SFTP.
-
tgb417
Sounds like a plan. Please let us know how this goes.