Import multiple files stored in SharePoint at one time
Hello,
As shown in the image below, I have several Excel files stored in a SharePoint site. My goal is to import all the files and later stack them into one dataset. Is this possible, and if so, how?
There are too many files to import individually, and two new files are added every week, so I want to automate this.
Thanks for any help.
Best Answer
-
Alexandru (Dataiker)
Hi,
You can do this with a "Files in Folder" dataset.
Create a folder pointing at the SharePoint location where these files are added. Then create a "Files in Folder" dataset (+Dataset > Internal > Files in Folder). Once you select the folder, go to the advanced options and use a regex/glob to match your file pattern.
Now, every time you build your flow, this dataset will read all files that match the pattern, including any new files that were added.
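If you then want to read that combined dataset from a Python recipe, a minimal sketch could look like this (the dataset name "sharepoint_weekly_files" is just a placeholder for whatever you call yours):

```python
import dataiku

# Read the Files in Folder dataset ("sharepoint_weekly_files" is a placeholder name).
# Dataiku stacks every file matching the folder's regex/glob filter into this one dataset.
ds = dataiku.Dataset("sharepoint_weekly_files")
df = ds.get_dataframe()

print(df.shape)  # rows from all matched files, including newly added ones
```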
Let me know if this works for you!
Answers
-
dot101
Hi Alex, this is what I was looking for, thanks a lot!
-
What if there are multiple files in the SharePoint and I want all of them as different datasets?
-
Turribeach (Neuron)
The number of files is not relevant; what matters is the number of distinct file structures. For each different structure you need a separate dataset. You can use a Files in Folder dataset to import multiple files that share the same structure.
-
There are different CSV files in SharePoint, and all of them have different columns, schemas, etc., so I cannot combine them. I have to read all of them for certain tasks, but in the flow I want them to look like a single entity/folder, because new files might be added.
-
Turribeach (Neuron)
Just like in an RDBMS database, you will need one dataset per schema. You can actually create a Dataiku managed folder on SharePoint if you want, and it will show all your files in a single folder. Then you can add as many Files in Folder datasets as needed to match all the different file types you have (see the sketch below). Each Files in Folder dataset can filter for the files it needs to load (e.g. EU_health_data_new_* would be one filter). You can also put these datasets in a separate flow zone to keep them apart from the rest of the flow. If this doesn't meet your requirement, then you need to explain why and what exactly you are trying to achieve.
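To make the idea concrete, here is a rough sketch of how the filter patterns would split one managed folder into groups, one per Files in Folder dataset (the folder id "sharepoint_landing" and all patterns except EU_health_data_new_* are made up for illustration):

```python
import dataiku
from fnmatch import fnmatch

# Managed folder pointing at the SharePoint location ("sharepoint_landing" is a placeholder id).
folder = dataiku.Folder("sharepoint_landing")

# One glob per file structure -- each would back its own Files in Folder dataset.
# Only "EU_health_data_new_*" comes from the example above; the others are invented.
patterns = ["EU_health_data_new_*", "sales_extract_*", "inventory_snapshot_*"]

for pattern in patterns:
    matched = [p for p in folder.list_paths_in_partition() if fnmatch(p.lstrip("/"), pattern)]
    print(pattern, "->", matched)
```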
-
I understand that I can create as many datasets as I want, but that is what I want to avoid. I don't want to create them in my flow, since it would add around 20 datasets to the flow.
I have multiple files in my SharePoint location. I have created a managed folder using the SharePoint plugin, and now any time a new file is added, the folder is updated.
Now I want to use these files as input in a recipe. We can do this with a local folder, which I have already tried and it works; I want to achieve the same with the SharePoint folder, but it does not seem to work. With local files we can get the path of the folder with dataiku.Folder("folder_id_or_folder_name").get_path() and then use those files.
But the problem with the SharePoint folder is that we cannot use get_path() and have to use the stream reader instead.
So my requirement is to read those files from the SharePoint location and use them in a recipe; new files may or may not be added.
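For reference, this is roughly what the stream-based read looks like in a Python recipe (a minimal sketch assuming CSV files and a placeholder folder id "sharepoint_folder"):

```python
import dataiku
import pandas as pd

# Remote managed folders (e.g. via the SharePoint plugin) have no local filesystem path,
# so get_path() is not available; files are read through download streams instead.
folder = dataiku.Folder("sharepoint_folder")  # placeholder id/name

dataframes = {}
for path in folder.list_paths_in_partition():
    if path.lower().endswith(".csv"):
        with folder.get_download_stream(path) as stream:
            dataframes[path] = pd.read_csv(stream)

for path, df in dataframes.items():
    print(path, df.shape)
```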
-
Turribeach (Neuron)
Just like with your API service question, it seems that you prefer to go down a path that is an anti-pattern. The whole point of the Dataiku flow is that it visually represents your data pipeline, making it very easy for you and others looking at your project to understand your inputs, your transformations and your outputs. Having a Python recipe that loads files behind the scenes, which do not exist as datasets or as an input, is an anti-pattern. Like I said on the other post, sometimes there are valid reasons to go against a pattern, but in both of your posts you haven't really provided an actual business reason for what you want to do. Adding 20 datasets to a flow doesn't seem like a problem to me, especially since you can add a flow zone and collapse it to keep the flow de-cluttered. If this setup doesn't work for you, then explain why, rather than just saying you don't want to do it ("I don't want to create them in my flow") or you can't do it ("I cannot combine them"). For the record, the Stack recipe can stack datasets with different columns.
-
Hi Alex,
I have a very similar situation where I have multiple files in a SharePoint folder. One file has two sheets with different schemas.
I am trying to combine them together to load into a SQL table.
The issue I am facing is that even though the dataset says 2 files are selected, when I explore the dataset I see data from only one file, and the final SQL table has data from one file only. Let me know what I am missing here.
-
Turribeach (Neuron)
Please start a new thread. This thread has been marked as resolved already.