Python memory issue with the Dataiku Sentence Embedding plugin (Word2vec)

Rama
Rama Partner, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 2 Partner

Hi,

I am trying to use the Sentence Embedding plugin to compute numerical representations of sentences so that they can be fed to ML algorithms in my project. I downloaded the pre-trained Word2vec word vectors using the macro. When I run the Sentence Embedding plugin, I get a Python memory error; a screenshot is attached. Please let me know of possible solutions. Thank you!

Regards,

Rama

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi Rama,

    The error suggests that the Python process was killed by the OS. This could be because the DSS server ran low on memory and the OOM killer terminated the process, or because of a limit set in the cgroup configuration.

    To start with, you can check:

    1) The memory available on the DSS server, using free -g

    2) The cgroup configuration
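
Both checks can also be scripted, for example from a Python notebook on the DSS server. The following is a minimal sketch, assuming a Linux host: it reads /proc/meminfo and looks for a cgroup memory limit. The two cgroup paths are the usual top-level locations for cgroup v2 and v1 and are assumptions here; the cgroup that DSS actually places the recipe process in may sit in a subdirectory, so adjust them to your setup.

```python
from pathlib import Path

# Assumed default locations; the cgroup used for DSS jobs may differ.
CGROUP_LIMIT_PATHS = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def parse_meminfo(text):
    """Parse the content of /proc/meminfo into {field: value}.

    Most values are reported in kB; a few (such as HugePages_Total)
    are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def read_cgroup_limit(paths=CGROUP_LIMIT_PATHS):
    """Return the first cgroup memory limit found, in bytes.

    Returns None when no limit file is readable or when cgroup v2
    reports "max" (no limit). Note that cgroup v1 reports "unlimited"
    as a very large number instead.
    """
    for path in paths:
        try:
            raw = Path(path).read_text().strip()
        except OSError:
            continue
        if raw == "max":
            return None
        if raw.isdigit():
            return int(raw)
    return None


if __name__ == "__main__":
    try:
        meminfo = parse_meminfo(Path("/proc/meminfo").read_text())
    except OSError:
        meminfo = {}
    available_kb = meminfo.get("MemAvailable")
    if available_kb is not None:
        print(f"MemAvailable: {available_kb / 1024**2:.1f} GiB")
    else:
        print("MemAvailable: not found (is this a Linux host?)")
    print("cgroup memory limit (bytes):", read_cgroup_limit())
```

If the reported limit or the available memory is smaller than what the recipe needs, the process will be killed regardless of how small the input dataset is.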

    It's hard to say exactly how much memory will be required to run the actual recipe. For this type of issue, it's usually advised to raise a ticket with support and share the job diagnostics.

  • Rama
    Rama Partner, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 2 Partner

    Hi AlexT,

    Thank you for your quick response.

    Just wanted to share some additional information.

    I tried ELMo on a dataset with 10 records and then on another dataset with 300 records; in both cases the Sentence Embedding plugin ran successfully and produced the embeddings.

    I then ran Word2vec on the same datasets; even with 10 records it failed with an OOM (out of memory) error.

    I also reviewed the CPU utilization and disk space metrics with our infrastructure engineer. He told me that whenever I run the Sentence Embedding plugin with Word2vec, CPU utilization spikes to 100%.

    Any thoughts on why Word2vec errors out while other embeddings such as ELMo do not? Did anything change in the Word2vec plugin?

    Regards,

    Rama

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    ELMo and Word2vec are different.

    Word2vec can be memory- and CPU-intensive. There have been no recent changes to the plugin. Essentially, you will need to ensure that enough memory is available on the DSS server and that the cgroup configuration allows the process to use it.

    It's difficult to say for sure how much memory you need in your case, as the plugin doesn't have a memory estimation option. You can, however, check by using Word2vec directly in Python code:

    https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.estimate_memory

    This explains in more detail how the memory requirement can be calculated:

    https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#memory
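
As a rough illustration of the arithmetic described in the gensim tutorial, the size of the embedding matrices can be estimated with plain Python as vocabulary size times vector dimension times 4 bytes per float. This is only a back-of-the-envelope sketch, not what the plugin computes. The function names are made up for this example, and the pre-trained model size used below (about 3 million entries with 300 dimensions, as in the widely distributed Google News vectors) is an assumption; the model downloaded by the macro may differ.

```python
def embedding_matrix_bytes(vocab_size, vector_size, bytes_per_float=4, n_matrices=1):
    """Approximate RAM needed to hold the word-vector matrices, in bytes."""
    return vocab_size * vector_size * bytes_per_float * n_matrices


def to_mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / 1024**2


def to_gib(n_bytes):
    """Convert bytes to gibibytes."""
    return n_bytes / 1024**3


# Example from the gensim tutorial: training with 100,000 unique words and
# 200-dimensional vectors keeps three such matrices in RAM.
training = embedding_matrix_bytes(100_000, 200, n_matrices=3)
print(f"training example: {to_mib(training):.0f} MiB")

# Assumption: a pre-trained model with about 3,000,000 entries and
# 300 dimensions, loaded as a single float32 matrix.
pretrained = embedding_matrix_bytes(3_000_000, 300)
print(f"pre-trained vectors: {to_gib(pretrained):.2f} GiB")
```

Under these assumptions the vectors alone take several GiB once loaded, and that cost depends on the model's vocabulary rather than on the number of input records. This may be why a 10-record dataset can still run out of memory.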
