Sentence Embedding - Machine/Deep Learning Model
Using a Python 2.7 code environment, I was able to use the macro's pre-trained word embeddings and create sentence embeddings with the plugin for a column in my dataset that contains a corpus of text data. The sentence embeddings are written to a separate column. What I'm trying to figure out is how I can use DSS to classify the text data in that column using the sentence embeddings. I would also like to map back to the actual words once the clustering is complete so I can analyze the text.
What I tried was running the available K-means algorithm on the sentence embedding column to create clusters, but I often got list out of index errors while fitting the model.
Can you advise me on how to use the output of the sentence embeddings plugin with existing deep/machine learning algorithms?
Thanks,
Answers
-
Hi,
In order to use the embedding column for Machine Learning / Deep Learning models, you can choose the "Vector" feature handling for that column in the model's feature settings.
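If you prefer to check the same idea in a Python recipe or notebook, here is a minimal sketch. It assumes the embedding column stores each vector as a JSON array string and that scikit-learn and NumPy are available in the code environment. The helper name `parse_embeddings` and the sample values are hypothetical, and this is not how DSS implements the "Vector" handling internally. Rows that are empty or whose vector length differs from the rest are skipped, since inconsistent rows are one plausible cause of index errors during fit.

```python
import json

import numpy as np
from sklearn.cluster import KMeans


def parse_embeddings(cells):
    """Parse JSON-array strings into a 2-D float matrix.

    Returns (matrix, kept_row_indices). Cells that are empty, unparsable,
    or whose length differs from the most common length are skipped.
    """
    parsed = []
    for i, cell in enumerate(cells):
        try:
            vec = json.loads(cell) if cell else None
        except (ValueError, TypeError):
            vec = None
        if isinstance(vec, list) and vec:
            parsed.append((i, vec))
    if not parsed:
        return np.empty((0, 0)), []
    lengths = [len(v) for _, v in parsed]
    dim = max(set(lengths), key=lengths.count)  # most common vector length
    kept = [(i, v) for i, v in parsed if len(v) == dim]
    matrix = np.array([v for _, v in kept], dtype=float)
    return matrix, [i for i, _ in kept]


# Hypothetical sample of an embedding column (2-D vectors for readability).
cells = ['[0.1, 0.2]', '[0.0, 0.3]', '', '[5.0, 5.1]', '[5.2, 4.9]', '[1.0]']
X, kept = parse_embeddings(cells)

# Cluster only the rows that survived the cleaning step.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The `kept` list holds the original row positions, so the cluster labels can be joined back to the text column afterwards.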
Hope it helps,
Alex
-
Thanks Alex for that information, it really helps.
Right now I have the cluster number and cluster ID in a separate column. How do I convert the vectors back to words, so that each cluster is described by a string of words instead?
For example, instead of cluster ID I want it to display the topics.
Thanks,
Vinh
-
Hi,
The plugin outputs vectors at the sentence level, so there's no direct way to map them back to words.
If your use case is about understanding clusters of documents, I would suggest using the 'Topic modeling' predefined notebook: https://doc.dataiku.com/dss/latest/notebooks/predefined-notebooks.html
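As a lighter alternative to full topic modeling, you could label each cluster with the words that are most distinctive in its original text, since the text column and the cluster column sit side by side. The sketch below is only an illustration under that assumption: it is a simple frequency heuristic, not the method used by the plugin or by the notebook, and the function name `top_terms_per_cluster`, the stop-word list, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "on", "of", "and", "to", "in", "is"}


def top_terms_per_cluster(texts, labels, n_terms=3):
    """Return {cluster_label: [most distinctive terms]}.

    Score = (count in cluster) * (share of the term's total count that
    falls in this cluster), so words spread across clusters rank lower.
    Ties are broken alphabetically.
    """
    per_cluster = defaultdict(Counter)
    total = Counter()
    for text, label in zip(texts, labels):
        tokens = [t for t in re.findall(r"[a-z']+", text.lower())
                  if t not in STOP_WORDS]
        per_cluster[label].update(tokens)
        total.update(tokens)
    result = {}
    for label, counts in per_cluster.items():
        ranked = sorted(
            counts,
            key=lambda t: (-counts[t] * counts[t] / float(total[t]), t),
        )
        result[label] = ranked[:n_terms]
    return result


# Hypothetical text column and the cluster labels assigned to each row.
texts = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the stock market fell",
    "the market rallied on stock news",
]
cluster_ids = [0, 0, 1, 1]
topics = top_terms_per_cluster(texts, cluster_ids, n_terms=2)
```

The resulting dictionary can then be used to replace each cluster ID with a joined string of its top terms.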
Best regards,
Alex