OCR/Tabula integration


Status: New

Submitted by natejgardner on 07-17-2023 09:10 PM

I often need to extract tables from PDF documents. I've found some useful OSS tools for this, such as Tabula. Tabula has been helpful, but it isn't very accessible to my team, and extractions can be difficult to keep up to date as documents are revised, since the process isn't as simple as uploading the new version of the document to Dataiku.

While there are Python libraries that also try to extract tables from PDFs, they don't provide the interactive feedback loop needed to capture large tables accurately. It would be really helpful if Dataiku could support an interface similar to Tabula for this use case, perhaps as a plugin.


Labels: Data Exploration and Preparation

