Loading Files from a Managed Folder using Python

SuhasTalanki

Hi Team,

I have around 50-60 CSV files inside a Dataiku managed folder. How can I stack all the ".txt" files using Python/PySpark? The files are dynamic in nature, and each file is about 2 GB in size. Any help would be appreciated in this regard.

File Name Pattern: TB_LEVEL0_*

Answers

  • Ignacio_Toledo

    Hello @SuhasTalanki,

    I haven't had time to test it yet, but this script should do the trick (if I understood your problem correctly), provided that:

    • all files with the given pattern are readable as CSV (and no other files with the same prefix exist)
    • all files have the same columns (the same schema, if you want)
    • the dataset "dataset_to_stack_files" has already been created within your project.
    import os

    import dataiku
    import pandas as pd

    folder = dataiku.Folder("your_dataiku_folder")
    folder_path = folder.get_path()  # local filesystem path of the managed folder
    out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")

    # paths returned by list_paths_in_partition() are relative to the folder root
    csv_files = [p for p in folder.list_paths_in_partition() if 'TB_LEVEL0_' in p]

    # use the first file to define the output schema
    first_df = pd.read_csv(os.path.join(folder_path, csv_files[0].lstrip('/')))  # add all other options
    out_stacked_dataset.write_schema_from_dataframe(first_df)

    with out_stacked_dataset.get_writer() as writer:
        writer.write_dataframe(first_df)
        for rel_path in csv_files[1:]:
            tmpdf = pd.read_csv(os.path.join(folder_path, rel_path.lstrip('/')))
            writer.write_dataframe(tmpdf)

    Let me know if something doesn't work or isn't clear.

    Cheers

  • SuhasTalanki

    Hi @Ignacio_Toledo,

    The above solution worked fine after a few tweaks. We needed to check the append option on the output dataset to retain the data from all files.

    Given that the files are huge, the code takes a lot of time to read all 30-40 CSV files into Dataiku datasets.

    Thank you.

  • Ignacio_Toledo

    Happy to have been able to help a bit! You are right, the recipe needs some tweaks to append instead of overwrite; good that you found the way.

    About working with huge CSV files, there are options to read them in chunks with pandas. Here is a short example of how to do it, and of course there is more in the pandas documentation. I don't know how this will interact with the dataset writer, though, but I think some speed improvement could be gained that way.
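
    (A rough sketch of what that could look like, reusing the placeholder folder and dataset names from the script above; the chunksize and nrows values are only illustrative, and I haven't checked how this performs with the writer:)

    import os

    import dataiku
    import pandas as pd

    folder = dataiku.Folder("your_dataiku_folder")
    folder_path = folder.get_path()
    out_stacked_dataset = dataiku.Dataset("dataset_to_stack_files")

    csv_files = [p for p in folder.list_paths_in_partition() if 'TB_LEVEL0_' in p]

    # define the output schema from a small sample of the first file
    sample_df = pd.read_csv(os.path.join(folder_path, csv_files[0].lstrip('/')), nrows=1000)
    out_stacked_dataset.write_schema_from_dataframe(sample_df)

    with out_stacked_dataset.get_writer() as writer:
        for rel_path in csv_files:
            full_path = os.path.join(folder_path, rel_path.lstrip('/'))
            # read each large file in chunks instead of loading it whole into memory
            for chunk in pd.read_csv(full_path, chunksize=500_000):
                writer.write_dataframe(chunk)

    This way memory usage stays bounded by one chunk at a time rather than a whole 2 GB file.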
