Import multiple files stored in SharePoint at one time

dot101 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 ✭✭✭

Hello,

As shown in the image below, I have several Excel files stored on a SharePoint site. My goal is to import all the files and later stack them into one dataset. Is this possible, and if so, how?
There are too many files to import individually, and two new files are added every week, so I want to automate this.

Thanks for any help.

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Answer ✓

    Hi,

    You can do this with a Files from Folder dataset.

    Create a folder pointing at the SharePoint location where these files are added. Then create a "Files from Folder" dataset (+Dataset > Internal > Files from Folder). Once you have selected the folder, go to the Advanced options and use a regex/glob pattern to match your file names.

    Now, every time you build your flow, this dataset will read all files that match the pattern, including newly added ones.
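    To illustrate how such a glob filter behaves, here is a minimal sketch in plain Python. The file names are hypothetical, and `fnmatch` is only an approximation of the dataset's pattern matching, used here to show which files a glob like this would pick up:

    ```python
    from fnmatch import fnmatch

    # Hypothetical file names as they might appear in the SharePoint folder
    files = [
        "sales_2024-W01.xlsx",
        "sales_2024-W02.xlsx",
        "inventory_2024-W01.xlsx",
        "readme.txt",
    ]

    # Glob pattern meant to match only the weekly sales workbooks
    pattern = "sales_*.xlsx"

    matched = [f for f in files if fnmatch(f, pattern)]
    print(matched)  # only the two sales workbooks match
    ```

    New weekly files that follow the same naming convention would match the pattern automatically, which is what makes the dataset pick them up on each build.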

    Let me know if this works for you!

Answers

  • dot101
    dot101 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 ✭✭✭

    Hi Alex, this is what I was looking for, thanks a lot!

  • UserKp
    UserKp Registered Posts: 20

    What if there are multiple files in SharePoint and I want each of them as a different dataset?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    The number of files is not relevant; what matters is the number of distinct file structures. Each file structure needs its own dataset. You can use the Files in Folder dataset to import multiple files that share the same structure.

  • UserKp
    UserKp Registered Posts: 20

    There are different CSV files in SharePoint, all with different columns and schemas, so I cannot combine them. I have to read all of them for certain tasks, but in the flow I want them to look like a single entity/folder, because new files might be added.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    Just like in an RDBMS database, you will need one dataset per schema. You can actually create a Dataiku Managed Folder on SharePoint if you want, and it will show all your files in a single folder. Then you can add as many Files in Folder datasets as needed to match all the different file types you have (see sample below). Each Files in Folder dataset can filter to match the relevant files it needs to load (e.g. EU_health_data_new_* would be one filter). You can also put these datasets in a separate flow zone to keep them apart from the rest of the flow. If this doesn't meet your requirement, then please explain why and what exactly you are trying to achieve.

    [Screenshot: sample flow with one managed folder feeding multiple Files in Folder datasets]
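    The routing described above, one Files in Folder dataset per filter, amounts to a pattern-to-bucket mapping. A minimal sketch; apart from the EU_health_data_new_* filter mentioned in the reply, all names and patterns here are made up:

    ```python
    from fnmatch import fnmatch

    # One glob filter per Files in Folder dataset (dataset names are illustrative)
    dataset_filters = {
        "eu_health": "EU_health_data_new_*",
        "us_health": "US_health_data_*",
    }

    files = [
        "EU_health_data_new_2024.csv",
        "US_health_data_2024.csv",
        "notes.txt",
    ]

    # Assign each file to the first dataset whose filter matches it;
    # unmatched files (like notes.txt) are simply not loaded anywhere
    buckets = {name: [] for name in dataset_filters}
    for f in files:
        for name, pattern in dataset_filters.items():
            if fnmatch(f, pattern):
                buckets[name].append(f)
                break

    print(buckets)
    ```

    Newly added files that follow one of the existing naming conventions land in the right dataset with no flow changes; only a genuinely new schema requires a new dataset and filter.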

  • UserKp
    UserKp Registered Posts: 20

    I understand that I can create as many datasets as I want, but that is what I want to avoid: it would add around 20 datasets to the flow.

    I have multiple files in my SharePoint location. I have created a managed folder using the SharePoint plugin, and the folder is now updated any time a new file is added.

    Now I want to use these files as input to a recipe. This works with a local folder, which I have already tried: with local files you can get the path with dataiku.Folder("folder_id_or_folder_name").get_path() and then use the files directly. I want to achieve the same with the SharePoint folder, but it does not work, because a SharePoint folder does not support get_path(); you have to use a stream reader instead.

    So my requirement is to read those files from the SharePoint location and use them in a recipe, whether or not new files are added.
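    For reference, stream-based reading of the kind described above looks roughly like the sketch below. `io.BytesIO` stands in here for the byte stream a remote managed folder would hand back (in Dataiku that would come from the folder's `get_download_stream()`, which this self-contained sketch does not call), and the CSV content is made up:

    ```python
    import csv
    import io

    # Stand-in for the bytes a remote managed folder's download stream would
    # return; with a remote (e.g. SharePoint) folder there is no local path
    # to pass around, only a stream like this one.
    raw = io.BytesIO(b"id,value\n1,10\n2,20\n")

    # Wrap the byte stream in a text layer and parse it without touching disk
    rows = list(csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8")))
    print(rows)  # [{'id': '1', 'value': '10'}, {'id': '2', 'value': '20'}]
    ```

    The design point is that any code written against a stream works for both local and remote folders, whereas code written against `get_path()` only works locally.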

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    Just like with your API service question, it seems that you prefer to go in a direction that is an anti-pattern. The whole point of the Dataiku flow is that it visually represents your data pipeline, making it very easy for you and others looking at your project to understand your inputs, your transformations, and your outputs. Having a Python recipe that loads files behind the scenes, without them existing as datasets or as inputs, is an anti-pattern.

    Like I said on the other post, sometimes there are valid reasons to go against a pattern. But in both of your posts you haven't really provided an actual business reason for doing what you want to do. Adding 20 datasets to a flow doesn't seem like a problem to me, especially since you can add a flow zone and collapse it to keep the flow de-cluttered. If this setup doesn't work for you, then explain why, rather than just saying you don't want to do it ("I don't want to create them in my flow") or you can't do it ("I cannot combine them"). For the record, the Stack recipe can stack datasets with different columns.

  • subinpius4u
    subinpius4u Dataiku DSS Core Designer, Registered Posts: 1

    Hi Alex,

    I have a very similar situation, with multiple files in a SharePoint folder. One file has two sheets with different schemas.
    I am trying to combine them and load the result into a SQL table.
    The issue I am facing is that even though the dataset says two files are selected, when I explore the dataset I only see data from one file, and the final SQL table also contains data from one file only.

    Let me know what I am missing here.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    Please start a new thread. This thread has been marked as resolved already.
