Read CSVs from a folder

bored_panda Registered Posts: 11 ✭✭✭✭
edited July 16 in Using Dataiku

I have a folder with CSVs in it (by "folder" I mean what you get from +Dataset -> Folder in the Flow). They are named "dataset_01", "dataset_02", and so on.

I'm trying to read one of them in a Python recipe. What code should I use?

I tried something like this, but it wants me to add "path_of_csv" to the recipe inputs, so it's not what I'm looking for.


# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os

# Recipe inputs
folder_path = dataiku.Folder("FuShmlsH").get_path()

path_of_csv = os.path.join(folder_path, "dataset_01.csv")
my_dataset = dataiku.Dataset(path_of_csv).get_dataframe()

# Recipe outputs
test = dataiku.Dataset("test")
test.write_with_schema(my_dataset)

Thanks.

Best Answer

  • cperdigou
    cperdigou Alpha Tester, Dataiker Alumni Posts: 115 ✭✭✭✭✭✭✭
    edited July 17 Answer ✓

    Hello,

    "dataiku.Dataset("xx").get_dataframe()" only works for datasets that are declared as inputs of your recipe, which is why it asks you to add "path_of_csv" to the inputs.

    In your case, the input is not a dataset, it's a folder! So you correctly used "dataiku.Folder("xx")" already and you're done.

    Now you can just read some files from it!


    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    import os

    # Recipe inputs
    folder_path = dataiku.Folder("FuShmlsH").get_path()

    path_of_csv = os.path.join(folder_path, "dataset_01.csv")

    my_dataset = pd.read_csv(path_of_csv)
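
The answer above relies on get_path(), which only works when the managed folder lives on the local file system. As a minimal sketch of an alternative, the file can be streamed through the folder handle with get_download_stream(). The helper name read_csv_from_folder is hypothetical, and the commented usage assumes the folder id "FuShmlsH" from the question is declared as a recipe input.

```python
import io

import pandas as pd


def read_csv_from_folder(folder, filename, **read_csv_kwargs):
    """Read one CSV from a managed-folder handle into a DataFrame.

    `folder` is expected to behave like dataiku.Folder, i.e. to expose
    get_download_stream(path). The helper name is hypothetical.
    """
    # Stream the file through the folder API instead of building a disk
    # path, so the code does not depend on where the folder is stored.
    with folder.get_download_stream(filename) as stream:
        raw = stream.read()
    return pd.read_csv(io.BytesIO(raw), **read_csv_kwargs)


# Inside a DSS Python recipe (folder declared as a recipe input):
# import dataiku
# my_dataset = read_csv_from_folder(dataiku.Folder("FuShmlsH"), "dataset_01.csv")
```

Extra keyword arguments are passed straight to pd.read_csv, so a call such as read_csv_from_folder(folder, "dataset_01.csv", sep=";") should handle semicolon-separated files.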

Answers

  • bored_panda
    bored_panda Registered Posts: 11 ✭✭✭✭
    Thanks.

    Could you also give me the code to write a CSV to a folder, please?
  • bored_panda
    bored_panda Registered Posts: 11 ✭✭✭✭
    In case it's of interest to anyone:

    your_pandas_dataframe.to_csv(os.path.join(write_path, "name_of_file"), sep=";")
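
The one-liner above writes to a path on disk, so it assumes write_path came from get_path() on a local output folder. A hedged sketch of the same idea through the folder handle, using upload_data(), might look like the following. The helper name write_csv_to_folder is hypothetical, index=False is a choice made here rather than something from the post, and the output folder name in the usage comment is a placeholder.

```python
import pandas as pd


def write_csv_to_folder(folder, filename, df, sep=";"):
    """Serialize a DataFrame to CSV and upload it into a managed folder.

    `folder` is expected to behave like dataiku.Folder, i.e. to expose
    upload_data(path, data). The helper name is hypothetical.
    """
    # Build the CSV in memory; index=False drops the pandas row index,
    # which is an assumption of this sketch, not part of the original post.
    payload = df.to_csv(sep=sep, index=False).encode("utf-8")
    folder.upload_data(filename, payload)
    return len(payload)  # number of bytes uploaded


# Inside a DSS Python recipe (folder declared as a recipe output):
# import dataiku
# write_csv_to_folder(dataiku.Folder("my_output_folder"), "name_of_file.csv", df)
```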
  • Aditya1
    Aditya1 Registered Posts: 1 ✭✭✭

    Hi, I am trying to use a CSV file from a folder as input in a Python recipe:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    import os

    # Recipe inputs
    folder_path = dataiku.Folder("xx/x/x/x").get_path()
    path_of_csv = os.path.join(folder_path, "xxxx.csv")
    my_dataset = pd.read_csv(path_of_csv)

    # Recipe outputs
    df_Import = dataiku.Dataset("df_Import")
    df_Import.write_with_schema(my_dataset)

    my_dataset

    This gives me an error in the Python process: "Managed folder xx/x/x/x cannot be used: declare it as input or output of your recipe."

  • tgb417
    tgb417 Neuron Posts: 1,595
    edited July 17

    @Aditya1

    Welcome to the Dataiku Community.

    This confused me for a while with Dataiku. A Managed folder in Dataiku is not exactly like a folder on disk. It is sort of a handle designed to work with a variety of data storage connections like SFTP or S3 as well as the local file system if you choose.

    You have to create the managed folder first from the UI, then you can use it from your Python recipe. The name of the managed folder is the name you gave it when you created it in DSS, something like My_Folder. (It is not referenced by its path on the local disk.)

    Then, when you create your Python recipe, you need to connect the managed folder to it, that is, declare the folder as an input or output of the recipe.

    For example, from your code segment you can use

    folder_path = dataiku.Folder("xx/x/x/x").get_path()

    with "xx/x/x/x" replaced by the name of your managed folder. Note that get_path() returns an actual path only when the managed folder is stored on the local file system.

    This level of indirection is designed (I think) to help abstract away some of the issues you will run into when moving a project from one node to the next.
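
To illustrate working through the handle rather than a disk path, here is a minimal sketch that lists the files in a managed folder with list_paths_in_partition() and keeps only the CSVs. The helper name list_csv_paths and the "dataset_" prefix (taken from the file names in the original question) are assumptions made for the example.

```python
import posixpath


def list_csv_paths(folder, prefix="dataset_"):
    """Return the sorted paths of CSV files whose name starts with `prefix`.

    `folder` is expected to behave like dataiku.Folder, i.e. to expose
    list_paths_in_partition(). The helper name is hypothetical.
    """
    selected = []
    for path in folder.list_paths_in_partition():
        # Paths are relative to the folder root and start with "/",
        # so compare against the file name only.
        name = posixpath.basename(path)
        if name.startswith(prefix) and name.lower().endswith(".csv"):
            selected.append(path)
    return sorted(selected)


# Inside a DSS Python recipe (folder declared as a recipe input):
# import dataiku
# for path in list_csv_paths(dataiku.Folder("My_Folder")):
#     print(path)
```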

    Here is the managed folder Python API documentation.

    https://doc.dataiku.com/dss/latest/python-api/managed_folders.html

    However, you might find a tutorial on the subject a bit more helpful.

    https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html

    Here is a community thread as well.

    https://community.dataiku.com/t5/Using-Dataiku/Listing-and-Reading-all-the-files-in-a-Managed-Folder/m-p/8140

    Let us know how you are getting on with this.
