Loading Files from Managed Folder using Python
Hi Team,
I have around 50-60 CSV files inside a Dataiku managed folder. How can I stack all the ".txt" files using Python/PySpark? These files are dynamic in nature, and each file is about 2 GB in size. Any help would be appreciated.
File Name Pattern: TB_LEVEL0_*
Answers
-
Ignacio_Toledo
Hello @SuhasTalanki,
I didn't have time to test it yet, but this script should do the trick (if I understood your problem well), provided that:
- all files with the given pattern are readable as csv (and no other files with the same prefix exist)
- all files have the same number of columns (same schema if you want)
- the dataset "dataset_to_stack_files" must already have been created within your project.
    import dataiku
    import pandas as pd

    folder = dataiku.Folder("your_dataiku_folder")
    out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")

    # keep only the files that match the expected name pattern
    paths = [p for p in folder.list_paths_in_partition() if "TB_LEVEL0_" in p]

    writer = None
    try:
        for path in paths:
            # stream each file out of the managed folder
            with folder.get_download_stream(path) as stream:
                tmpdf = pd.read_csv(stream)  # add all other options
            if writer is None:
                # we need to write the schema before the first chunk
                out_stacked_dataset.write_schema_from_dataframe(tmpdf)
                writer = out_stacked_dataset.get_writer()
            writer.write_dataframe(tmpdf)
    finally:
        if writer is not None:
            writer.close()
Let me know if something doesn't work or is not clear
Cheers
-
SuhasTalanki
Hi @Ignacio_Toledo,
The above solution worked fine after a few tweaks. We need to check the append to dataset option if we want to retain the data from all files.
Given the huge file size, the code takes a long time to read all 30-40 CSV files into Dataiku tables.
Thank you.
-
Ignacio_Toledo
Happy to have been able to help a bit! You are right, the recipe would need some tweaks to append instead of overwriting; good that you found a way.
About working with huge CSV files: pandas has options to read them in chunks. Here is a short example of how to do it, and of course there is more in the pandas documentation. I don't know, though, how this will interact with the dataset writer, but I think some speed improvement could be made that way.
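A minimal sketch of chunked reading with pandas, in case it helps. The helper name `iter_csv_chunks` and the chunk size are my own choices for illustration, not something from Dataiku or pandas. It relies only on the `chunksize` argument of `pd.read_csv`, which makes pandas return an iterator of DataFrames instead of loading the whole file at once. The demo uses small in-memory files so it runs anywhere.

```python
import io

import pandas as pd


def iter_csv_chunks(sources, chunksize=100_000, **read_csv_kwargs):
    """Yield DataFrames of at most `chunksize` rows from each CSV source in turn.

    `sources` can be file paths or file-like objects (hypothetical helper,
    not part of pandas or the Dataiku API).
    """
    for source in sources:
        # With chunksize set, read_csv returns an iterator of DataFrames
        for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
            yield chunk


# Small in-memory demo: 3 rows in the first "file", 1 row in the second
demo_sources = [
    io.StringIO("x,y\n1,2\n3,4\n5,6\n"),
    io.StringIO("x,y\n7,8\n"),
]
chunk_sizes = [len(chunk) for chunk in iter_csv_chunks(demo_sources, chunksize=2)]
print(chunk_sizes)  # [2, 1, 1]
```

In a recipe, each yielded chunk could presumably be handed to the dataset writer's `write_dataframe` one at a time, so that only one chunk is in memory at once. I have not tested that combination, so treat it as an assumption. Chunking mainly limits memory use; whether it also speeds things up depends on where the time is actually spent.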