Queries on PDF Data Integration and Knowledge Bank Deployment in Dataiku 12.4.1
Dear Dataiku community,
After reviewing several informative articles on your platform, I have two specific queries:
About the LLM-Mesh-RAG tutorial (Link), I seek clarity on the best practices for integrating PDF files as data sources. Is there a way to directly utilize PDF files, or is extracting data via Python scripts recommended?
Concerning the deployment of a knowledge bank created from an embedded recipe in Dataiku 12.4.1, I am interested in understanding deployment options beyond the Prompt Studio. Specifically, are there ways to deploy it for inference through platforms like Gradio or Streamlit? I have looked into several resources, including the plugin development guide (Link), tutorials on prompt engineering (Link), question answering using RAG (Link), and the concept of RAG (Link), but couldn't find any deployment options. Could you provide any guidance or direct me to relevant documentation?
Thank you for your assistance and the comprehensive resources provided.
Best regards
Shantanu
Answers
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @Shantanu_dave,
The recently updated plugin (called Text Extraction and OCR) has two OCR models/engines: TESSERACT and EASYOCR.
https://www.dataiku.com/product/plugins/tesseract-ocr/
While you can do it in code, you can also do it with a plugin recipe, as mentioned here:
https://gallery.dataiku.com/projects/EX_LLM_STARTER_KIT/wiki/1/Project%20description#loading-splitting-and-vectorizing-documents-1
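For the "do it in code" route, a minimal sketch might look like the following. It assumes the third-party pypdf package is available in your code environment and that the PDFs contain a text layer (scanned PDFs would still need OCR, for example through the plugin above). The function names, chunk size, and overlap are illustrative choices, not something taken from the tutorial or the plugin.

```python
def extract_pdf_text(pdf_source):
    """Return the text of every page in a PDF, joined with newlines.

    pdf_source may be a file path or a binary file-like object.
    Assumption: the pypdf package is installed in the code environment.
    """
    from pypdf import PdfReader  # imported lazily so the helper below has no dependency

    reader = PdfReader(pdf_source)
    # extract_text() may return an empty string for scanned (image-only) pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def split_into_chunks(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks that overlap slightly.

    The overlap keeps sentences that straddle a boundary visible in both chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

The resulting chunks could then be written to a dataset with one row per chunk, which is the kind of input an embedding step typically expects.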
For web apps, you can find an example in the project above, and some additional solutions should be published in the future.
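Until then, a rough sketch of what a Gradio front end for retrieval-augmented answers could look like is shown below. Everything Dataiku-specific is deliberately left out: `retrieve` and `generate` are hypothetical callables that you would implement yourself against your knowledge bank and your LLM connection, and the prompt wording is only an illustration. The only external call used is `gradio.Interface(...).launch()`, and it assumes the gradio package is installed.

```python
def build_prompt(question, passages):
    """Assemble a simple RAG prompt from the question and retrieved passages."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def make_answer_fn(retrieve, generate, top_k=3):
    """Wire a retriever and a generator into a single question -> answer function.

    retrieve(question, top_k) -> list of passage strings (hypothetical, user-supplied)
    generate(prompt) -> answer string (hypothetical, user-supplied)
    """
    def answer(question):
        passages = retrieve(question, top_k)
        return generate(build_prompt(question, passages))
    return answer


def launch_demo(answer_fn):
    """Serve answer_fn behind a one-textbox Gradio interface."""
    import gradio as gr  # imported lazily; assumes gradio is installed

    gr.Interface(fn=answer_fn, inputs="text", outputs="text").launch()
```

Keeping retrieval and generation behind plain callables means the same `answer` function could be reused from a Streamlit app or any other front end without changes.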
Thanks