Read PDF with tabula on S3

EdBerth Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 15

Hi,

I am following this tutorial on working with PDFs and managed folders:

https://knowledge.dataiku.com/latest/code/managed-folders/tutorial-managed-folders.html

But reading the PDF with tabula doesn't work. I get this error message:

UnsupportedOperation: seek

My managed folder is on S3. How can I read this file?

Best Answer

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    Answer ✓

    Can you try adding the following import at the beginning of the code:

    import io

    and then reading the PDF as follows:

    tables = read_pdf(io.BytesIO(stream.read()), pages="12-26", multiple_tables=True)

    instead of:

    tables = read_pdf(stream, pages="12-26", multiple_tables=True)
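For anyone wondering why this helps: the error message suggests that the stream coming from the S3-backed folder cannot seek, while the PDF reader wants to jump around in the file. Wrapping the bytes in io.BytesIO gives a seekable in-memory copy. Below is a minimal sketch of that idea using only the standard library. NonSeekableStream and to_seekable are hypothetical names made up for this illustration; NonSeekableStream only imitates a remote download stream and is not part of the Dataiku or tabula APIs.

```python
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a remote download stream that cannot seek."""

    def __init__(self, payload: bytes):
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # Copy the next chunk of the payload into the caller's buffer.
        data = self._inner.read(len(b))
        b[:len(data)] = data
        return len(data)


def to_seekable(stream):
    """Read the whole stream into memory and return a seekable buffer."""
    return io.BytesIO(stream.read())


if __name__ == "__main__":
    raw = NonSeekableStream(b"%PDF-1.4 fake content")
    try:
        raw.seek(0)
    except io.UnsupportedOperation as exc:
        print("raw stream:", type(exc).__name__, exc)

    buffered = to_seekable(raw)
    print("buffered stream seekable:", buffered.seekable())
```

In the recipe itself, the same wrapping would be applied to the stream from the managed folder, as in the accepted answer: read_pdf(io.BytesIO(stream.read()), ...). Note that this loads the whole PDF into memory, which is fine for ordinary documents but worth keeping in mind for very large files.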