Sentence Embedding - Machine/Deep Learning Model

vinhdiesal · ‎05-06-2020

Using a Python 2.7 code environment, I was able to use the macro's pre-trained word embeddings and create sentence embeddings with the plugin for a column in my dataset that holds a corpus of text data. The sentence embeddings are stored in a separate column. What I'm trying to figure out is how to use DSS to classify the text in that column using the sentence embeddings. I would also like to map back to the actual words once the clustering is complete so I can do text analysis.

What I tried was running the available K-means algorithm on the sentence embedding column to create clusters, but I often got list index errors while fitting the model.

Can you advise me on how to use the sentence embeddings plugin's output with existing deep/machine learning algorithms?

Thanks,

Alex_Combessie · ‎05-07-2020

Hi,

In order to use the embedding column for Machine Learning / Deep Learning models, you can choose the "Vector" feature handling for that column.
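For anyone doing the same thing in a code recipe or notebook instead, here is a minimal sketch of roughly what treating the column as a vector amounts to. It assumes the embeddings are stored as JSON-style strings such as "[0.1, 0.2]"; the column names, the sample data, and the `parse_vector` helper are made up for illustration, and this is not a description of what DSS does internally.

```python
import json

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: each embedding is assumed to be stored as a JSON-style string.
df = pd.DataFrame({
    "text": ["cats purr", "dogs bark", "stocks fell", "markets rose"],
    "sentence_embedding": [
        "[0.9, 0.1, 0.0]",
        "[0.8, 0.2, 0.1]",
        "[0.0, 0.9, 0.8]",
        "[0.1, 0.8, 0.9]",
    ],
})


def parse_vector(cell):
    """Turn a string such as "[0.1, 0.2]" into a 1-D float array."""
    return np.asarray(json.loads(cell), dtype=float)


vectors = df["sentence_embedding"].map(parse_vector)

# K-means needs a rectangular matrix, so every vector must have the same length.
lengths = vectors.map(len)
if lengths.nunique() != 1:
    raise ValueError("Embeddings have inconsistent lengths: %s" % sorted(lengths.unique()))

# Stack the per-row vectors into an (n_rows, n_dimensions) matrix and cluster it.
X = np.vstack(vectors.tolist())
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X)
```

The length check is there because empty or malformed cells would produce vectors of different sizes, which is one plausible (but unconfirmed) source of errors at fit time.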

Hope it helps,

Alex

vinhdiesal · ‎05-07-2020

Thanks Alex for that information, it really helps.

Right now I have the cluster number and cluster ID in a different column. How do I convert the vectors back to words so that each cluster is shown as a string of words instead?

For example, instead of cluster ID I want it to display the topics.

Thanks,

Vinh

Alex_Combessie · ‎05-07-2020

Hi,

The plugin outputs vectors at the sentence level, so there's no direct way to map them back to words.

If your use case is about understanding clusters of documents, I would suggest using the 'Topic modeling' predefined notebook: https://doc.dataiku.com/dss/latest/notebooks/predefined-notebooks.html
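Another option, if the original text column is still available next to the cluster column, is to describe each cluster by its most characteristic words. The sketch below is a swapped-in technique (average TF-IDF score per cluster with scikit-learn), not what the predefined notebook does; the column names, sample data, and the `top_terms_per_cluster` helper are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: original text plus the cluster ID assigned to each row.
df = pd.DataFrame({
    "text": [
        "the cat sat on the mat",
        "my cat chased another cat",
        "stock markets fell sharply",
        "investors sold stock today",
    ],
    "cluster": [0, 0, 1, 1],
})


def top_terms_per_cluster(frame, text_col, cluster_col, n_terms=3):
    """Return {cluster_id: "term1, term2, ..."} using mean TF-IDF per cluster."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(frame[text_col])

    # Order the vocabulary by column index so terms[i] matches column i.
    vocab = vectorizer.vocabulary_
    terms = np.array(sorted(vocab, key=vocab.get))

    labels = {}
    # groupby(...).indices maps each cluster ID to the positions of its rows.
    for cluster_id, rows in frame.groupby(cluster_col).indices.items():
        mean_scores = np.asarray(tfidf[rows].mean(axis=0)).ravel()
        top = mean_scores.argsort()[::-1][:n_terms]
        labels[cluster_id] = ", ".join(terms[top])
    return labels


labels = top_terms_per_cluster(df, "text", "cluster")

# Show the top terms next to the cluster ID.
df["cluster_terms"] = df["cluster"].map(labels)
```

The words come from the original text, not from the embeddings, so the result is a description of each cluster and not a decoding of the vectors.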

Best regards,

Alex
