OCR/Tabula integration

I often need to extract tables from PDF documents. I've found some useful OSS tools for this, such as Tabula. Tabula has been helpful, but it isn't very accessible to my team, and the extracted tables are difficult to keep up to date as documents are revised, since the process isn't as simple as uploading the new version of the document to Dataiku.

While there are Python libraries that also try to extract tables from PDFs, they don't provide the interactive feedback loop needed to accurately capture large tables. It would be really helpful if Dataiku could support an interface similar to Tabula for this use case, perhaps as a plugin.