Loading Files from Managed Folder using Python

SuhasTalanki
Level 1

Hi Team,

I have around 50-60 CSV files inside a Dataiku managed folder. How can I stack all of these files using Python/PySpark? The files are dynamic in nature, and each file is about 2 GB in size, so any help would be appreciated.

File Name Pattern: TB_LEVEL0_*

3 Replies
Ignacio_Toledo

Hello @SuhasTalanki,

I haven't had time to test it yet, but this script should do the trick (if I understood your problem correctly), provided that:

  • all files with the given pattern are readable as CSV (and no other files with the same prefix exist)
  • all files have the same number of columns (same schema if you want)
  • the dataset "dataset_to_stack_files" must already have been created within your project.
import dataiku
import os
import pandas as pd

folder = dataiku.Folder("your_dataiku_folder")
folder_path = folder.get_path()
out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")

first_file = True
for file in folder.list_paths_in_partition():
    if 'TB_LEVEL0_' in file:
        # paths returned by list_paths_in_partition() are relative to the folder root
        tmpdf = pd.read_csv(os.path.join(folder_path, file.lstrip("/")))  # add all other read_csv options here
        if first_file:
            # write the schema together with the first file
            out_stacked_dataset.write_with_schema(tmpdf)
            first_file = False
        else:
            out_stacked_dataset.write_dataframe(tmpdf)

Let me know if something doesn't work or isn't clear.

Cheers

SuhasTalanki
Level 1
Author

Hi @Ignacio_Toledo ,

The above solution worked fine after a few tweaks. We need to check the append option on the output dataset if we want to retain the data from all files.

Given that the files are huge, the code is taking a lot of time to read all the 30-40 CSV files into Dataiku tables.

Thank you. 

Ignacio_Toledo

Happy to have been able to help a bit! You are right, the recipe would need some tweaks to append instead of overwriting; good that you found the way.

About working with huge CSV files, there are options to read them in chunks with pandas. Here is a short example of how to do it, and of course there is more in the pandas documentation. I don't know, though, how this will interact with the dataset writer, but I think some speed improvement could be gained that way.
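Something along these lines should do it (untested; it reuses the same hypothetical folder and dataset names as the script above, and the chunk size is only an example value to tune to your memory):

import dataiku
import os
import pandas as pd

folder = dataiku.Folder("your_dataiku_folder")
folder_path = folder.get_path()
out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")

chunk_rows = 500000  # example value -- tune to the available memory

first_chunk = True
for file in folder.list_paths_in_partition():
    if 'TB_LEVEL0_' in file:
        full_path = os.path.join(folder_path, file.lstrip("/"))
        # chunksize makes read_csv return an iterator of smaller DataFrames
        # instead of loading a whole 2 GB file into memory at once
        for chunk in pd.read_csv(full_path, chunksize=chunk_rows):
            if first_chunk:
                out_stacked_dataset.write_with_schema(chunk)
                first_chunk = False
            else:
                out_stacked_dataset.write_dataframe(chunk)

As with the script above, keeping the rows from every write call depends on the append option you mentioned.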

 

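And since you mentioned PySpark in the original question: given the file sizes, a Spark recipe might also be worth a try. A rough, untested sketch, assuming the managed folder sits on a filesystem the Spark executors can read, and reusing the same dataset name as above:

import os
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# local path of the managed folder (assumes Spark can read this filesystem)
folder_path = dataiku.Folder("your_dataiku_folder").get_path()

# a glob pattern lets Spark read and stack all matching CSV files in one call
stacked_df = (sqlContext.read
              .option("header", "true")
              .csv(os.path.join(folder_path, "TB_LEVEL0_*")))

out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")
dkuspark.write_with_schema(out_stacked_dataset, stacked_df)

The glob pattern makes Spark read and union all matching files in a single call, so there is no explicit loop over files.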