I have around 50-60 CSV files inside a Dataiku managed folder. How can I stack all the ".txt" files using Python/PySpark? The files are dynamic in nature, and each one is about 2 GB. Any help would be appreciated.
File Name Pattern: TB_LEVEL0_*
I didn't have time to test it yet, but this script should do the trick (if I understood your problem well), provided that:
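import os

import dataiku
import pandas as pd

folder = dataiku.Folder("your_dataiku_folder")
folder_path = folder.get_path()  # only works for folders on the local filesystem
out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")

writer = None
try:
    # list_paths_in_partition() returns paths relative to the folder, starting with "/"
    for file in folder.list_paths_in_partition():
        if 'TB_LEVEL0_' in file:
            tmpdf = pd.read_csv(os.path.join(folder_path, file.lstrip('/')))  # add all other options
            if writer is None:
                # we need to write the schema before writing the first chunk
                out_stacked_dataset.write_schema_from_dataframe(tmpdf)
                writer = out_stacked_dataset.get_writer()
            writer.write_dataframe(tmpdf)
finally:
    # the writer must always be closed, even if reading a file fails
    if writer is not None:
        writer.close()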
Let me know if something doesn't work or is not clear.
Hi @Ignacio_Toledo,
The above solution worked fine after a few tweaks. We need to check the "append" option on the dataset if we want to retain the data from all files.
Given that the files are huge, the code is taking a lot of time to read all the 30-40 CSV files into Dataiku tables.
Happy to have been able to help a bit! You are right, the recipe needs some tweaks to append instead of overwriting; good that you found a way.
About working with huge CSV files, there are options to read them in chunks with pandas. Here is a short example on how to do it, and of course there is more in the pandas documentation. I don't know, though, how this will interact with the dataset writer, but I think some speed improvement could be made that way.
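To make the chunked-reading idea concrete, here is a minimal sketch using the chunksize option of pandas.read_csv, which returns an iterator of DataFrames instead of loading a whole file at once. The function name stack_csv_in_chunks, the write_chunk callback and the default chunk size are my own assumptions for illustration, not part of Dataiku or pandas, and I have not timed this against the 2 GB files.

```python
import glob
import os

import pandas as pd


def stack_csv_in_chunks(folder_path, pattern, write_chunk, chunksize=100_000):
    """Read every file matching `pattern` in `folder_path` chunk by chunk.

    `write_chunk` is a hypothetical callback that receives each DataFrame
    chunk (for example, a function that appends it to an output dataset).
    Returns the total number of rows passed to the callback.
    """
    total_rows = 0
    for path in sorted(glob.glob(os.path.join(folder_path, pattern))):
        # With chunksize set, read_csv yields DataFrames of at most
        # `chunksize` rows, so only one chunk is held in memory at a time.
        for chunk in pd.read_csv(path, chunksize=chunksize):
            write_chunk(chunk)
            total_rows += len(chunk)
    return total_rows
```

In a Dataiku recipe, write_chunk could presumably be the write_dataframe method of the writer returned by get_writer(), once the schema has been written from the first chunk, but that combination is untested here. Chunking mainly keeps memory use bounded; whether it also shortens the total run time would need to be measured.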