Need to import multiple PDF files from a SharePoint library

While exploring, I found that an admin has to fulfil certain prerequisites before I can connect to either a SharePoint library or a OneDrive folder, but I would like a brief step-by-step method to follow so that I can get this done.
If there is a better option or method available that does not need admin intervention, please let me know.
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron
To connect Dataiku to SharePoint or OneDrive, you will need an admin to set up the SharePoint plugin or the OneDrive plugin. In Dataiku v13 you can also use the built-in SharePoint connector, but this also requires an administrator to set up the connection. Once the connection is set up, you can easily add a new managed folder in that connection or point to an existing one. Finally, you could use a Python recipe to load your PDFs.
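As a rough illustration of that last step, here is a minimal sketch of a Python recipe that reads every PDF from a managed folder. It assumes the folder is exposed through the Dataiku managed folder API (`list_paths_in_partition` and `get_download_stream`) and that the `pypdf` package is installed in the code environment. The folder name `sharepoint_pdfs` and the helper names are hypothetical, not something from this thread.

```python
import io


def list_pdf_paths(folder):
    """Return the sorted paths of all PDF files in a managed folder."""
    return sorted(
        p for p in folder.list_paths_in_partition() if p.lower().endswith(".pdf")
    )


def read_file_bytes(folder, path):
    """Download one file from the managed folder into memory."""
    with folder.get_download_stream(path) as stream:
        return stream.read()


def extract_text(pdf_bytes):
    """Extract plain text from PDF bytes (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily so the helpers above work without it

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() may return nothing for image-only pages, so fall back to ""
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def load_pdfs(folder, extractor=extract_text):
    """Map each PDF path in the folder to its extracted text."""
    return {
        path: extractor(read_file_bytes(folder, path))
        for path in list_pdf_paths(folder)
    }


# Inside a Dataiku Python recipe (not runnable outside DSS), usage might look like:
#   import dataiku
#   folder = dataiku.Folder("sharepoint_pdfs")  # hypothetical managed folder name
#   texts = load_pdfs(folder)
```

Scanned PDFs without a text layer would come back empty with this approach and would need OCR instead.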
Answers
-
Hariharan Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 5 ✭✭✭
Can you please elaborate on your final comment about using a Python recipe to import the PDF documents? In that case, should I be using pypdf or pdfdirectory to list the files?
Should the code stop after reading the PDFs, so that the output can then be used as input for Dataiku's predefined recipes, or
should the entire process, from tokenization to vector store creation, be done through code?
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron
Using Python to read PDFs is just one way of doing it. The best way will depend on what you are trying to achieve (i.e. your requirements), which you haven't clearly specified in your post. "import multiple pdf files" is just not clear enough.
-
Hariharan Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 5 ✭✭✭
OK, let me elaborate on the requirement.
I have a repository of PDF documents on a specific subject in a SharePoint library.
My requirement is to connect to (ingest) this data in Dataiku and create a RAG application, so that I have a dedicated chatbot which answers purely based on the context in the ingested documents.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron
Well, you are in luck then, as Dataiku has built-in functionality for this. Have a look at this documentation page, which describes the Embed documents recipe:
There is also a Knowledge base article: