Read CSVs from a folder

Solved!
bored_panda
Level 2
I have a folder with CSVs in it (by "folder" I mean what you get from +dataset -> Folder in the flow). They are named "dataset_01", "dataset_02", and so on.

I'm trying to read one of them in a Python recipe. What's the code?

I tried something like this, but it wants me to add "path_of_csv" to the inputs, so it's not what I'm looking for.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os

# Recipe inputs
folder_path = dataiku.Folder("FuShmlsH").get_path()

path_of_csv = os.path.join(folder_path, "dataset_01.csv")
my_dataset = dataiku.Dataset(path_of_csv).get_dataframe()

# Recipe outputs
test = dataiku.Dataset("test")
test.write_with_schema(my_dataset)


Thanks.

1 Solution
cperdigou
Dataiker Alumni

Hello,



"dataiku.Dataset("xx").get_dataframe()" only works for datasets that are declared as inputs of your recipe.

In your case, the input is not a dataset, it's a folder! You already used "dataiku.Folder("xx")" correctly to get its path, so that part is done.

Now you can read files from it directly with pandas:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os

# Recipe inputs
folder_path = dataiku.Folder("FuShmlsH").get_path()

path_of_csv = os.path.join(folder_path, "dataset_01.csv")

my_dataset = pd.read_csv(path_of_csv)
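
Since the files follow a "dataset_01", "dataset_02" naming pattern, it may be handy to collect all of their paths rather than hard-coding one. The following is only a sketch: list_csv_files and its prefix parameter are hypothetical names, and a temporary directory stands in for the path returned by get_path() so the snippet runs outside DSS.

```python
import os
import tempfile


def list_csv_files(folder_path, prefix="dataset_"):
    """Return sorted full paths of CSV files in folder_path whose names start with prefix."""
    names = [
        name
        for name in os.listdir(folder_path)
        if name.startswith(prefix) and name.lower().endswith(".csv")
    ]
    return [os.path.join(folder_path, name) for name in sorted(names)]


# Demo: a temporary directory stands in for dataiku.Folder(...).get_path().
with tempfile.TemporaryDirectory() as folder_path:
    for name in ("dataset_02.csv", "dataset_01.csv", "notes.txt"):
        with open(os.path.join(folder_path, name), "w") as handle:
            handle.write("id,value\n1,10\n")
    csv_names = [os.path.basename(path) for path in list_csv_files(folder_path)]

print(csv_names)
```

In a recipe, each returned path could then be passed to pd.read_csv in the same way as path_of_csv.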


 

5 Replies
bored_panda
Level 2
Author
Thanks.

Could you also give me the code to write a CSV to a folder, please?
bored_panda
Level 2
Author
In case it's of interest to anyone (write_path being the path returned by get_path() on the output folder):

your_pandas_dataframe.to_csv(os.path.join(write_path, "name_of_file"), sep=";")
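
One thing to watch with sep=";" is that the file has to be read back with the same separator, and that "name_of_file" gets no ".csv" extension unless you add one. The sketch below illustrates both points using Python's standard csv module instead of pandas so that it runs without extra dependencies; write_rows and read_rows are hypothetical helpers, and a temporary directory stands in for write_path.

```python
import csv
import os
import tempfile


def write_rows(write_path, file_name, rows, delimiter=";"):
    """Write rows to write_path/file_name, appending ".csv" if it is missing."""
    if not file_name.lower().endswith(".csv"):
        file_name += ".csv"
    out_path = os.path.join(write_path, file_name)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
    return out_path


def read_rows(path, delimiter=";"):
    """Read every row of a delimited text file into a list of lists."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))


rows = [["id", "value"], ["1", "10"]]

# Demo: a temporary directory stands in for the output folder's path.
with tempfile.TemporaryDirectory() as write_path:
    out_path = write_rows(write_path, "name_of_file", rows)
    round_trip = read_rows(out_path)                       # same delimiter
    wrong_delimiter = read_rows(out_path, delimiter=",")   # mismatched delimiter

print(round_trip)
print(wrong_delimiter)
```

With pandas, the matching read would be pd.read_csv(path, sep=";"). Note also that to_csv writes the DataFrame index as an extra column unless index=False is passed.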
Aditya1
Level 1

Hi, I am trying to use a CSV file from a folder as input in a Python recipe:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os

# Recipe inputs
folder_path = dataiku.Folder("xx/x/x/x").get_path()
path_of_csv = os.path.join(folder_path, "xxxx.csv")
my_dataset = pd.read_csv(path_of_csv)

# Recipe outputs
df_Import = dataiku.Dataset("df_Import")
df_Import.write_with_schema(my_dataset)

my_dataset

This gives me an error in the Python process: "Managed folder xx/x/x/x cannot be used: declare it as input or output of your recipe."

tgb417

@Aditya1 

Welcome to the Dataiku Community.

This confused me for a while with Dataiku. A managed folder in Dataiku is not exactly like a folder on disk. It is more of a handle designed to work with a variety of storage connections such as SFTP or S3, as well as the local file system if you choose.

You have to create the managed folder from the UI first; then you can use it from your Python recipe. The name of the managed folder is the name you gave it when you created it in DSS, something like My_Folder. (It is not referenced by its path on the local disk.)

Then, when you create your Python recipe, you need to connect the managed folder to it as an input or output. That is what the error message is asking for.

For example, from your code segment you can use

folder_path = dataiku.Folder("xx/x/x/x").get_path()

with "xx/x/x/x" replaced by the name of the managed folder. If that folder happens to be on the local file system, this returns its actual path on disk.

This level of indirection is designed (I think) to help abstract away some of the issues you will run into when moving a project from one node to the next.
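
Because the folder is a handle rather than a directory, get_path() is generally only usable when the folder lives on the local file system. For other connections, the managed folder API offers list_paths_in_folder() and get_download_stream(), which read files through the handle itself. The sketch below shows that pattern under some assumptions: InMemoryFolder is a hypothetical stand-in that mimics those two methods so the code runs outside DSS, read_csv_rows is a hypothetical helper, and the standard csv module is used in place of pandas.

```python
import csv
import io


class InMemoryFolder:
    """Hypothetical stand-in exposing the two managed-folder methods used below."""

    def __init__(self, files):
        self._files = files  # maps a path such as "/name.csv" to its bytes

    def list_paths_in_folder(self):
        return sorted(self._files)

    def get_download_stream(self, path):
        return io.BytesIO(self._files[path])


def read_csv_rows(folder, path, delimiter=","):
    """Read one CSV through the folder handle, without needing a local path."""
    with folder.get_download_stream(path) as stream:
        text = io.TextIOWrapper(stream, encoding="utf-8", newline="")
        return list(csv.reader(text, delimiter=delimiter))


# Demo with the stand-in; in a recipe, folder would be dataiku.Folder("My_Folder").
folder = InMemoryFolder({"/dataset_01.csv": b"id,value\n1,10\n2,20\n"})
rows = read_csv_rows(folder, folder.list_paths_in_folder()[0])
print(rows)
```

In a real recipe, the stream could instead be handed to pd.read_csv to get a DataFrame.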

Here is the managed folder Python API documentation.

https://doc.dataiku.com/dss/latest/python-api/managed_folders.html

However, you might find a tutorial on the subject a bit more helpful.

https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html

Here is a community thread as well. 

https://community.dataiku.com/t5/Using-Dataiku/Listing-and-Reading-all-the-files-in-a-Managed-Folder...

Let us know how you are getting on with this.  

--Tom