Hi Team,
I have around 50-60 CSV files inside a Dataiku managed folder. How can I stack all the ".txt" files using Python/PySpark? The files are dynamic in nature, and each one is about 2 GB. Any help in this regard would be appreciated.
File Name Pattern: TB_LEVEL0_*
Hello @SuhasTalanki,
I didn't have time to test it yet, but this script should do the trick (if I understood your problem well), provided that:
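import os
import dataiku
import pandas as pd

folder = dataiku.Folder("your_dataiku_folder")
folder_path = folder.get_path()  # only works for folders stored on the local filesystem
out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")

writer = None
try:
    for file in folder.list_paths_in_partition():
        if 'TB_LEVEL0_' in file:
            # listed paths are relative to the folder root and start with "/"
            tmpdf = pd.read_csv(os.path.join(folder_path, file.lstrip('/')))  # add all other options
            if writer is None:
                # we need to write the schema from the first chunk, before opening the writer
                out_stacked_dataset.write_schema_from_dataframe(tmpdf)
                writer = out_stacked_dataset.get_writer()
            # every chunk goes through the same writer, so the files are stacked
            writer.write_dataframe(tmpdf)
finally:
    if writer is not None:
        writer.close()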
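A note on the file filter: the substring test 'TB_LEVEL0_' in file also accepts paths where that text appears in a subfolder name or in the middle of a file name. Below is a minimal sketch of a stricter filter that compares only the file name against the TB_LEVEL0_* pattern from the question. The helper name matching_paths and the sample paths are made up for illustration; the sample paths only imitate the "/"-prefixed style of managed-folder listings.

```python
import fnmatch
import posixpath


def matching_paths(paths, pattern="TB_LEVEL0_*"):
    """Return the paths whose file name (not its directory) matches the glob pattern."""
    return [p for p in paths if fnmatch.fnmatchcase(posixpath.basename(p), pattern)]


# Hypothetical listing, in the "/"-prefixed style of managed-folder paths.
sample_paths = [
    "/TB_LEVEL0_001.csv",
    "/archive/TB_LEVEL0_002.csv",
    "/OLD_TB_LEVEL0_003.csv",
    "/TB_LEVEL0_backup/notes.txt",
]
selected = matching_paths(sample_paths)
print(selected)
```

Only the first two sample paths are kept. The other two contain the text but their file names do not start with it.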
Let me know if something doesn't work or is not clear
Cheers
Hi @Ignacio_Toledo ,
The above solution worked fine after a few tweaks. We need to check the append option on the output dataset if we want to retain the data from all files.
Because the files are huge, the code takes a long time to read all 30-40 CSV files into Dataiku tables.
Thank you.
Happy to have been able to help a bit! You are right, the recipe needs some tweaks to append instead of overwriting. Good that you found a way.
As for working with huge CSV files, pandas has options to read them in chunks. Here is a short example on how to do it, and of course there is more in the pandas documentation. I don't know, though, how this will interact with the dataset writer, but I think some speed improvement could be gained that way.
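As a rough sketch of the chunked approach, assuming plain pandas and nothing Dataiku-specific: passing chunksize to pd.read_csv returns an iterator of DataFrames instead of loading the whole file at once. The helper name iter_csv_chunks, the demo file names and the demo data are all made up for illustration, and the comment about handing chunks to the dataset writer is an untested assumption.

```python
import os
import tempfile

import pandas as pd


def iter_csv_chunks(paths, chunksize=100_000, **read_csv_kwargs):
    """Yield DataFrames of at most `chunksize` rows from each CSV file in turn."""
    for path in paths:
        # With chunksize set, read_csv returns an iterator rather than one big DataFrame.
        for chunk in pd.read_csv(path, chunksize=chunksize, **read_csv_kwargs):
            yield chunk


# Small demo with two throwaway files (hypothetical names and data).
tmpdir = tempfile.mkdtemp()
demo_paths = []
for i in range(2):
    path = os.path.join(tmpdir, f"TB_LEVEL0_{i}.csv")
    pd.DataFrame({"id": range(i * 5, i * 5 + 5), "value": [i] * 5}).to_csv(path, index=False)
    demo_paths.append(path)

chunk_sizes = []
total_rows = 0
for chunk in iter_csv_chunks(demo_paths, chunksize=2):
    # In a recipe, each chunk could be handed to the dataset writer here,
    # e.g. writer.write_dataframe(chunk) (assumption, not tested in this sketch).
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)

print(chunk_sizes, total_rows)
```

This keeps memory use bounded by the chunk size rather than the file size. One caveat: pandas infers column types per chunk, so a column can come out as integer in one chunk and float in another. Passing an explicit dtype to read_csv avoids that.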