
Python memory issue with a Dataiku word2vec embedding plugin

Rama
Level 1

Hi,

I am trying to use the Sentence Embedding plugin to compute numerical representations of sentences so that they can be fed to ML algorithms in my project. I downloaded the Word2vec pre-trained word vectors using the macro. When I run the Sentence Embedding plugin, I get a Python memory issue; a screenshot is attached. Please let me know of possible solutions. Thank you!

Regards,

Rama

 

 

3 Replies
AlexT
Dataiker

Hi Rama,

The error suggests the Python process was killed by the operating system. This can happen when the DSS server runs low on memory and the OOM killer terminates the process, or when a cgroup memory limit is exceeded.

To start with, you can check:

1) The memory available on the DSS server, using free -g

2) The cgroup configuration
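
If it helps, both checks can also be done from a Python notebook or recipe on the DSS server. This is only a minimal sketch, assuming a Linux host: it reads /proc/meminfo (the same source free uses) and looks for a memory limit in the default cgroup v2 and v1 locations. The helper names are made up for this example, and if DSS places processes in a sub-cgroup, the limit file will sit in a different directory than the ones listed here.

```python
from pathlib import Path

# Assumed default locations; a DSS-specific cgroup may live in a subdirectory.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> bytes."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        # Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def read_cgroup_limit(paths=CGROUP_LIMIT_FILES):
    """Return the first memory limit found, in bytes, or None if there is none."""
    for candidate in paths:
        limit_file = Path(candidate)
        if not limit_file.is_file():
            continue
        try:
            raw = limit_file.read_text().strip()
        except OSError:
            continue
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        try:
            return int(raw)
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.is_file():
        info = parse_meminfo(meminfo.read_text())
        gib = 1024 ** 3
        print(f"MemTotal:     {info.get('MemTotal', 0) / gib:.1f} GiB")
        print(f"MemAvailable: {info.get('MemAvailable', 0) / gib:.1f} GiB")
    print(f"cgroup memory limit (bytes): {read_cgroup_limit()}")
```

On cgroup v1, an unlimited group reports a very large number rather than "max", so a huge value there usually means no limit is set.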

It's hard to say exactly how much memory the recipe will require. For this type of issue, it's usually best to raise a ticket with support and share the job diagnostics.

 

Rama
Level 1
Author

Hi AlexT,

Thank you for your quick response.

Just wanted to share some additional information.

I tried ELMo on a dataset with 10 records and then on another dataset with 300 records. In both cases the Sentence Embedding plugin ran successfully and produced the embeddings.

I then ran the same datasets with Word2vec. Even with 10 records, it failed with an OOM (out of memory) error.

I also worked with our infrastructure person on the CPU utilization and disk space metrics. He told me that whenever I run the Sentence Embedding plugin with Word2vec, CPU utilization spikes to 100%.

Any thoughts on why Word2vec errors out when other embeddings like ELMo do not? Did anything change in the Word2vec plugin?

 

Regards,

Rama

AlexT
Dataiker

ELMo and Word2vec are different.

Word2vec can be memory- and CPU-intensive. There have been no recent changes to the plugin. Essentially, you need to ensure that enough memory is available on the DSS server and that the cgroup configuration allows the process to use it.

It's difficult to say for sure how much memory you actually need in your case, as the plugin doesn't have a memory estimation option. You can, however, check by using Word2vec directly in Python code:

https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.estimate_memory 

This explains in a bit more detail how the memory requirement can be calculated:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#memory
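
As a rough back-of-the-envelope version of the calculation described in that tutorial, the dominant cost is the embedding matrix: vocabulary size × vector size × 4 bytes per float32 value (× 3 matrices when training, per the tutorial). The sketch below is only arithmetic; the function names are made up for illustration, and the 3,000,000 × 300 figures assume the downloaded pre-trained vectors are the Google News Word2vec model, which may not match what the macro actually fetched.

```python
def embedding_matrix_bytes(vocab_size, vector_size, bytes_per_value=4, n_matrices=1):
    """Approximate RAM needed for the embedding matrices, in bytes.

    bytes_per_value=4 assumes float32 storage; n_matrices=3 mirrors the
    training-time estimate in the gensim tutorial, while 1 is closer to
    simply loading pre-trained vectors.
    """
    return vocab_size * vector_size * bytes_per_value * n_matrices


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 1024 ** 3


# Assumption: Google News vectors, 3,000,000 words/phrases x 300 dimensions.
pretrained = embedding_matrix_bytes(3_000_000, 300)
print(f"Pre-trained vectors: ~{to_gib(pretrained):.2f} GiB")

# The gensim tutorial's training example: 100,000 words x 200 dims x 3 matrices.
tutorial = embedding_matrix_bytes(100_000, 200, n_matrices=3)
print(f"Tutorial example:    ~{tutorial / 1024 ** 2:.0f} MiB")
```

Under these assumptions, the whole matrix has to fit in memory before the first sentence is embedded, regardless of how many records the dataset has. That would be consistent with the failure showing up even on 10 records, though the actual peak usage during loading may be higher than this estimate.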

 
