Help needed with Python package installation and usage

sreejithkm
sreejithkm Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer Posts: 12 ✭✭✭✭

Hi,

I am taking the Developer course. As an occasional coder (mostly a non-coder), I am stuck on installing and using Python libraries/packages. One of the course examples reads a PDF in a Python recipe using the tabula library, but I cannot get it to run: the output shows an error saying that no module named tabula was found. I installed it via pip from the /bin folder, but I cannot see it in the list of packages for the code environment under Admin > codeenv. Please help with how to install the required Python libraries in general, and with setting up a code env that uses conda and another Python version, say 3.7, though I was able to create a code env under Administration > codeenv and select it in the project settings.

Thanks & Best Regards

Sreejith

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker
    Answer ✓

    Hi,

    Adding the requirements to a separate code environment is usually preferred over installing packages directly via pip or into your base code env.

    See: https://doc.dataiku.com/dss/latest/python/packages.html

    In this case, you would want to

    1) Add the requirement "tabula" to a code environment under packages to install and Save and Update.

    2) Change your recipe or Notebook to use this code environment

    Also, if you changed a code environment and are trying to use it in a notebook, you will need to restart the notebook kernel so that it detects the latest changes.
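To confirm that a recipe or notebook is really running in the code environment you selected, a quick check like the one below can help. It is only a minimal sketch using the Python standard library: the helper name `describe_module` is hypothetical, and the module name "tabula" is taken from the question above.

```python
import importlib.util
import sys


def describe_module(name):
    """Return (found, location) for a top-level module in the current interpreter."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return False, None
    return True, spec.origin


# The interpreter path shows which environment this kernel or recipe runs in.
print("Interpreter:", sys.executable)

found, location = describe_module("tabula")
print("tabula importable:", found, "at", location)
```

If the interpreter path does not point into the code environment you expect, or the module is reported as not importable, the recipe or notebook is probably still using a different environment, or the kernel has not been restarted since the update.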

Answers

  • sreejithkm
    sreejithkm Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer Posts: 12 ✭✭✭✭

    Dear Alex,

    Thank you so much, it worked.

    Best Regards

    Sreejith

  • JS
    JS Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1 Partner

    Hello @sreejithkm,

    I believe you are using tabula to read tables from PDFs. When I try to use tabula.read_pdf, I get an import error. Did you have to install any other packages, given that tabula is a Java-based library?

    I would also like to know whether tabula.read_pdf can be used to read tables from PDFs.

    Thank you!


    Regards,

    J

  • sreejithkm
    sreejithkm Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer Posts: 12 ✭✭✭✭

    Hi,

    Please see below.

    This is working code:

    import dataiku
    import pandas as pd
    from tabula.io import read_pdf

    # Sample PDF from the tabula-py test resources
    pdf_path = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"

    # read_pdf is imported directly above, so call it without the tabula.io prefix
    dfs = read_pdf(pdf_path, stream=True)
    # read_pdf returns a list of DataFrames
    print(len(dfs))
    dfs[0]

    You need tabula and tabula-py in the code environment. Also make sure that this code environment is selected for the recipe or notebook.
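On the Java question raised above: tabula-py wraps the Java library tabula-java, so a Java runtime has to be available in addition to the Python packages. The snippet below is a rough sketch for checking that from Python; the helper name `java_version` is hypothetical, and it assumes that the PATH seen by the recipe or notebook process is the one that matters.

```python
import shutil
import subprocess


def java_version():
    """Return the `java -version` banner, or None if no java executable is on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


banner = java_version()
if banner is None:
    print("No Java runtime found on PATH; tabula-py needs one to run read_pdf.")
else:
    print(banner)
```

If this reports that no Java runtime was found, installing the Python packages alone will likely not be enough for read_pdf to work.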

    Best Regards

    Sreejith
