Split Dataset Into Files of n Records Each, Deliver via SFTP

Solved!
MossandRoy
Level 2
Split Dataset Into Files of n Records Each, Deliver via SFTP

I have a business requirement to deliver a dataset via SFTP, but the dataset must be split into files each containing n records. There can be a different number of records each time, which means there can be a different number of files each time. This requirement is based on a size restriction of records per file in the destination system and is not something I can control.

In my dataset I have created a column called GroupNumber which I can use to split the data. The first n records have a value of '1', the next n records have '2', etc...
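For illustration, here is one way such a GroupNumber column could be derived in pandas. This is a minimal sketch, not the poster's actual recipe; the DataFrame, the column name `record`, and the chunk size `n` are assumptions:

```python
import pandas as pd

def add_group_number(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Assign 1-based group numbers so each group holds at most n rows."""
    out = df.copy()
    # Integer-divide the positional row index by n, then shift to start at 1
    out["GroupNumber"] = [i // n + 1 for i in range(len(out))]
    return out

df = pd.DataFrame({"record": range(1, 8)})  # 7 records
grouped = add_group_number(df, n=3)
# First 3 rows get group 1, next 3 get group 2, the last row gets group 3
```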

It seems as if a managed file or folder on the SFTP connection would only output the entire dataset without the option to split it. I've been searching these boards and turning over this puzzle in my mind. I'm unsure of how to solve this. My mind keeps coming back to Python as the most likely solution, but I've only scratched the surface of Python and not really used it at all with Dataiku, so I don't know if it's the right tool for the job or not.

Thank you in advance for any help you can offer in solving this problem. I don't need someone to do this for me, but would appreciate being pointed in the right direction.

 

Cheers!


Operating system used: Windows 11 Enterprise

1 Solution
MossandRoy
Level 2
Author

It may not be pretty, and I'm currently going to a non-SFTP managed folder, but here is the code I've got working! Feedback is most definitely welcome!

 

# -*- coding: utf-8 -*-
import dataiku

# Import the class that allows us to execute SQL on the Studio connections
from dataiku.core.sql import SQLExecutor2

# Get a handle on the input dataset
groupeddata = dataiku.Dataset("MY_DATASET")

# Managed folder that will receive the output files
outfolder = dataiku.Folder("FOLDER_ID")

# Create an executor and pass it the dataset instance so it
# knows which SQL database should be targeted
executor = SQLExecutor2(dataset=groupeddata)

# Find the max value of the group column to find out how many files we're working with
max_files = executor.query_to_df("""
SELECT MAX(GROUP_COLUMN) AS MaxGroupNumber
FROM MY_TABLE
""")
max_group = int(max_files["MaxGroupNumber"].iloc[0])

# Query each group separately and write it out as its own file
for group in range(1, max_group + 1):
    outData_df = executor.query_to_df(f"""
    SELECT MY_COLUMNS_TO_SEND
    FROM MY_TABLE
    WHERE GROUP_COLUMN = {group}
    """)

    filename = f"Test_{group}.txt"
    outfolder.upload_data(filename, outData_df.to_csv(index=False).encode("utf-8"))


5 Replies
tgb417

@MossandRoy ,

Welcome to the Dataiku community we are glad you have joined us.

I have a similar need which has recently come up that I have not yet worked out.

If you can program a bit in Python, I'm fairly sure that you can save your files to a local managed folder and then sync to the SFTP server.
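As a minimal sketch of that approach, the splitting step can be done in plain pandas, producing one CSV payload per chunk of `n` rows; the file naming scheme, the column name `id`, and the chunk size are assumptions for illustration. In Dataiku, each payload could then be written to a managed folder (e.g. with `Folder.upload_data`) and synced to SFTP:

```python
import pandas as pd

def split_to_csv_chunks(df: pd.DataFrame, n: int) -> dict:
    """Split df into chunks of at most n rows; return {filename: csv_bytes}."""
    files = {}
    # Step through the DataFrame n rows at a time, numbering files from 1
    for i, start in enumerate(range(0, len(df), n), start=1):
        chunk = df.iloc[start:start + n]
        files[f"Test_{i}.txt"] = chunk.to_csv(index=False).encode("utf-8")
    return files

df = pd.DataFrame({"id": range(1, 8)})  # 7 records -> 3 files with n=3
files = split_to_csv_chunks(df, n=3)
# In a Dataiku recipe, each entry could then be uploaded, e.g.:
# for name, data in files.items():
#     outfolder.upload_data(name, data)
```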

There may, however, be a more clever way to get this done. Anyone out there have any ideas?

--Tom
tgb417

@MossandRoy ,

You might find this set of instructions to be a helpful starting place.

https://community.dataiku.com/t5/Using-Dataiku/Create-a-file-in-sftp/m-p/8967

@Ignacio_Toledo authored a solution on this thread.

https://community.dataiku.com/t5/Setup-Configuration/File-from-SFTP-to-Dataiku-csv-excel-on-daily-ba... 

You may also find it helpful to familiarize yourself with SFTP connections coming out of Dataiku.

https://doc.dataiku.com/dss/latest/connecting/scp-sftp.html

 

--Tom
MossandRoy
Level 2
Author

I think the SQL/Python portion of the answer is to be found in this course from Dataiku Academy. Being able to execute SQL directly from Python will allow me to query my dataset as needed. Then the next step will be writing out the resulting files to SFTP.

 

https://academy.dataiku.com/python-and-dataiku-dss/506834

tgb417

@MossandRoy 

Sounds like a plan.  Please let us know how this goes.

--Tom