How to use NLTK in DSS

Alex_Combessie
Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
edited July 16 in Knowledge Base

Greetings fellow Linguists,

You can start by installing NLTK (Natural Language Toolkit) like any other Python package in DSS: create a code environment and add "nltk" to its package requirements. To do so, follow this documentation.

However, some NLTK functionality, such as text corpora and language-specific models, relies on resources that are not bundled with the library itself. The full list of resources is available here.

To use these resources, you need an additional download step. This step can cause problems on shared DSS nodes, where users typically do not have write access to shared locations on the server (see User Isolation Framework).
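
Before attempting the download, it can help to check whether the account running your code is able to write to a given data directory. The sketch below uses only the Python standard library. The helper name `can_write_nltk_data` is hypothetical, and the directories are simply the ones mentioned in this article.

```python
import os


def can_write_nltk_data(target_dir):
    """Return True if the current user could create or write to target_dir.

    If target_dir does not exist yet, the nearest existing parent directory
    is checked instead, since that is where it would have to be created.
    """
    probe = os.path.abspath(os.path.expanduser(target_dir))
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:  # reached the filesystem root
            break
        probe = parent
    # Creating entries in a directory needs both write and search permission.
    return os.access(probe, os.W_OK | os.X_OK)


# Directories mentioned in this article; adjust them to your own setup.
for candidate in ("/usr/share/nltk_data", "/usr/local/share/nltk_data", "~/nltk_data"):
    print(candidate, "writable:", can_write_nltk_data(candidate))
```

If the shared locations are reported as not writable, option 1 below will need an administrator, while option 2 only needs your own home directory.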

1. Download NLTK Data for all users (recommended)

WARNING: this procedure needs command-line access and administrative privileges on the machine hosting DSS. You may need to speak to your DSS admin and/or Linux admin. Assuming you are on a Linux machine and have administrative privileges, run:
pip install nltk
sudo python -m nltk.downloader -d /usr/share/nltk_data all
For macOS, the path is slightly different: /usr/local/share/nltk_data.
To test that it worked correctly, run the following code in a notebook using your code environment with nltk.
from nltk.corpus import brown
print(brown.words())
For further details, please refer to this NLTK documentation.
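
To check several resources at once rather than a single corpus, a small helper can wrap `nltk.data.find`, which raises `LookupError` when a resource is not found in any of NLTK's search directories. This is only a sketch: the helper name `missing_resources` and its `find` parameter are hypothetical, and the resource identifiers in the example are illustrative.

```python
def missing_resources(resource_paths, find=None):
    """Return the subset of resource_paths that NLTK cannot locate.

    `find` defaults to nltk.data.find; it is a parameter only so that the
    lookup can be swapped out, for instance in a test.
    """
    if find is None:
        # Imported lazily so the helper can be defined without nltk installed.
        import nltk.data
        find = nltk.data.find

    missing = []
    for resource in resource_paths:
        try:
            find(resource)
        except LookupError:
            missing.append(resource)
    return missing


# Example, to run in a notebook that uses your nltk code environment:
#   print(missing_resources(["corpora/brown", "corpora/wordnet"]))
```

An empty list means every requested resource was found. If a resource is reported missing, printing `nltk.data.path` shows the directories NLTK actually searched.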
2. Download NLTK Data for yourself
WARNING: This code will not work for other users if your DSS node is configured with the User Isolation Framework. Run this command without sudo, pointing to your Linux home directory:
python -m nltk.downloader -d /home/<yourLinuxUserName>/nltk_data all
In your Python code, you will then need to set the environment variable "NLTK_DATA" before importing nltk, because NLTK reads this variable when it is imported.
import os
os.environ['NLTK_DATA'] = '/home/<yourLinuxUserName>/nltk_data'
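
If nltk may already have been imported, for example earlier in a notebook session, changing the environment variable alone has no effect, because the `nltk.data.path` list is built at import time. The following is a minimal sketch that handles both cases. The helper name `configure_nltk_data` is hypothetical, and the path is the placeholder used above, to be replaced with your own directory.

```python
import os
import sys


def configure_nltk_data(data_dir):
    """Make data_dir visible to NLTK, whether or not nltk is already imported.

    Returns the resulting value of the NLTK_DATA environment variable.
    """
    # NLTK_DATA may hold several directories separated by os.pathsep.
    existing = os.environ.get("NLTK_DATA", "")
    parts = [p for p in existing.split(os.pathsep) if p]
    if data_dir not in parts:
        parts.insert(0, data_dir)
    os.environ["NLTK_DATA"] = os.pathsep.join(parts)

    # If nltk was imported earlier, its search list is already built,
    # so the directory has to be added to that list directly.
    nltk_module = sys.modules.get("nltk")
    if nltk_module is not None and data_dir not in nltk_module.data.path:
        nltk_module.data.path.insert(0, data_dir)

    return os.environ["NLTK_DATA"]


# Placeholder path from this article; replace it with your own directory.
configure_nltk_data("/home/<yourLinuxUserName>/nltk_data")
```

Calling the helper at the top of a recipe or notebook, before any NLTK resource is loaded, keeps the path configuration in one place.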
Happy natural language processing!

Comments

  • Peter_van_Klave
    Peter_van_Klave Partner, Registered Posts: 10 Partner

    How can I download the NLTK Data if I install the 'nltk' package in a (Dataiku controlled) virtual environment? If I just use the 'sudo python -m nltk.download ...' from the command-line, the nltk-package is not found.

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Hi,

    The procedure highlighted above is indeed designed for use in Dataiku-managed code environments. The command you used is correct (see NLTK doc). Have you checked that `python` is not pointing to Python 2? NLTK dropped support for Python 2 recently.

    If that does not work, we will need the full output of your command to diagnose.

    Best regards,

    Alex

  • Peter_van_Klave
    Peter_van_Klave Partner, Registered Posts: 10 Partner

    Hi Alex,
    the problem also occurs when I run the command (in a terminal) with python3; I get an error: (ModuleNotFoundError: No module named 'nltk'). So, I think I need to run the command in the Dataiku code environment in which I installed the nltk. How can I do that? Should I just navigate in a terminal to the folder containing the code environment and run the command?

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Thanks, that's clearer. You are just missing the nltk package in your main Python environment.

    Can you try:

    - pip install nltk

    - sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

    No need to navigate to the Dataiku code-env folder. It's a central installation, it should be picked up by the code-env automatically.

    If that doesn't work, please send us the full output of the commands you executed.

  • Peter_van_Klave
    Peter_van_Klave Partner, Registered Posts: 10 Partner

    Hi Alex,

    I missed the installation of nltk in the main Python environment. That indeed solved the problem.

    Many thanks.

    Regards,

    Peter

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    No problem, I will add that to the original article for completeness.

  • gmanoj_reddy
    gmanoj_reddy Registered Posts: 1 ✭✭✭✭

    Hi Alex,

    I installed NLTK using pip in a code env, but when I use wordnet in a Python recipe, it points to a different location, so I get an error saying wordnet is not installed.

    How can I change the path in the Python recipe so that it points to the location where the data was downloaded?

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Hi @gmanoj_reddy,

    Have you applied the step "1. Download NLTK Data for all users (recommended)" from the article?

    Cheers,

    Alex
