Read CSVs from a folder
I have a folder containing CSV files (by "folder" I mean what you get from +Dataset -> Folder in the Flow). They are named "dataset_01", "dataset_02", and so on.
I'm trying to read one of them in a Python recipe. What's the code?
I tried something like this, but it wants me to add "path_of_csv" to inputs, so it's not what I'm looking for.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os
# Recipe inputs
folder_path = dataiku.Folder("FuShmlsH").get_path()
path_of_csv = os.path.join(folder_path, "dataset_01.csv")
my_dataset = dataiku.Dataset(path_of_csv).get_dataframe()
# Recipe outputs
test = dataiku.Dataset("test")
test.write_with_schema(my_dataset)
Thanks.
Best Answer
-
Hello,
"dataiku.Dataset("xx").get_dataframe()" expects the name of a dataset, which is why DSS asks you to add "path_of_csv" to the recipe inputs.
In your case, the input is not a dataset, it's a folder! You already used "dataiku.Folder("xx")" correctly, so that part is done.
Now you can just read files from it with pandas:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os
# Recipe inputs
folder_path = dataiku.Folder("FuShmlsH").get_path()
path_of_csv = os.path.join(folder_path, "dataset_01.csv")
my_dataset = pd.read_csv(path_of_csv)
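One caveat on the snippet above: as far as I know, get_path() only works when the managed folder is stored on the DSS server's local filesystem. For folders on other connections, the Folder API also offers get_download_stream(). Here is a minimal sketch of that variant, assuming the folder is declared as an input of the recipe. The helper name read_csv_from_folder is made up for illustration; the folder ID and file name are the ones from the question.

```python
import pandas as pd


def read_csv_from_folder(folder, file_name, **read_csv_kwargs):
    """Read one CSV from a managed folder through its download stream.

    `folder` is expected to be a dataiku.Folder (or any object exposing
    get_download_stream(path)). Extra keyword arguments are passed on
    to pd.read_csv.
    """
    with folder.get_download_stream(file_name) as stream:
        return pd.read_csv(stream, **read_csv_kwargs)


# Inside a DSS recipe (the dataiku package only exists there):
#   import dataiku
#   folder = dataiku.Folder("FuShmlsH")  # folder ID from the question
#   my_dataset = read_csv_from_folder(folder, "dataset_01.csv")
```

Because the helper only relies on get_download_stream(), it does not care where the folder is actually stored.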
Answers
-
Thanks.
Could you also give me the code to write a CSV to a folder, please?
-
In case it's of interest to anyone:
your_pandas_dataframe.to_csv(os.path.join(write_path, "name_of_file"), sep=";")
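Note that write_path is not defined in the line above; presumably it is the get_path() of an output folder. Here is a minimal sketch of the same idea that does not need a local path, assuming the folder is declared as an output of the recipe. The helper name write_csv_to_folder and the folder name "my_output_folder" are placeholders, not names from this thread.

```python
import io

import pandas as pd


def write_csv_to_folder(folder, file_name, df, sep=";"):
    """Serialize `df` as CSV and upload it to a managed folder.

    `folder` is expected to be a dataiku.Folder (or any object exposing
    upload_stream(path, file_like)).
    """
    # to_csv() without a target path returns the CSV text as a string.
    csv_bytes = df.to_csv(index=False, sep=sep).encode("utf-8")
    folder.upload_stream(file_name, io.BytesIO(csv_bytes))


# Inside a DSS recipe (the dataiku package only exists there):
#   import dataiku
#   folder = dataiku.Folder("my_output_folder")  # placeholder name
#   write_csv_to_folder(folder, "name_of_file.csv", your_pandas_dataframe)
```

Unlike the one-liner, this sketch passes index=False so the DataFrame index is not written as an extra column; drop that argument if you want the index in the file.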
-
Hi, I am trying to use a CSV file from a folder as input in a Python recipe:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os
# Recipe inputs
folder_path = dataiku.Folder("xx/x/x/x").get_path()
path_of_csv = os.path.join(folder_path, "xxxx.csv")
my_dataset = pd.read_csv(path_of_csv)
# Recipe outputs
df_Import = dataiku.Dataset("df_Import")
df_Import.write_with_schema(my_dataset)
my_dataset
This gives me an error in the Python process: "Managed folder xx/x/x/x cannot be used: declare it as input or output of your recipe."
-
tgb417
Welcome to the Dataiku Community.
This confused me for a while with Dataiku. A managed folder in Dataiku is not exactly like a folder on disk. It is more of a handle, designed to work with a variety of storage connections such as SFTP or S3, as well as the local file system if you choose.
You have to create the managed folder from the UI first; then you can use it from your Python recipe. The name of the managed folder is the name you gave it when you created it in DSS, something like My_Folder. (It is not referenced by its path on the local disk.)
Then, when you create your Python recipe, you need to connect the managed folder to it as an input or output.
For example, from your code segment you can use
folder_path = dataiku.Folder("xx/x/x/x").get_path()
with "xx/x/x/x" replaced by the name of the managed folder. If that folder happens to be on the local file system, this returns its actual path on disk.
This level of indirection is designed (I think) to help abstract away some of the issues you will run into when moving a project from one node to the next.
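To make this a bit more concrete, here is a minimal sketch for checking what a managed folder contains before reading from it. It assumes the folder is attached to the recipe and is referenced by its name or ID, not by a disk path. The helper name list_csv_paths is made up for illustration, and "My_Folder" is just the example name from above.

```python
def list_csv_paths(folder):
    """Return the paths of the CSV files in a managed folder, sorted.

    `folder` is expected to be a dataiku.Folder (or any object exposing
    list_paths_in_partition()), which returns paths relative to the
    folder root.
    """
    return sorted(
        path
        for path in folder.list_paths_in_partition()
        if path.lower().endswith(".csv")
    )


# Inside a DSS recipe (the dataiku package only exists there):
#   import dataiku
#   folder = dataiku.Folder("My_Folder")  # name or ID, not a disk path
#   print(list_csv_paths(folder))
```

If the dataiku.Folder(...) line raises the "declare it as input or output" error, the folder is not yet connected to the recipe. If it works, the printed list shows the exact file names to use.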
Here is the managed folder Python API documentation.
https://doc.dataiku.com/dss/latest/python-api/managed_folders.html
However, you might find a tutorial on the subject a bit more helpful.
https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html
Here is a community thread as well.
Let us know how you are getting on with this.