Using a Python 2.7 code environment, I was able to use the macro's pre-trained word embeddings and create sentence embeddings with the plugin for a column in my dataset that holds a corpus of text data. I'm trying to figure out how I can use DSS to classify the text in that column using the sentence embeddings, which are stored in a separate column. I would also like to map back to the actual words once the clustering is complete so I can do text analysis.
I tried running the available K-means algorithm on the sentence embedding column to create clusters, but I often got list index errors while fitting the model.
Can you give me some advice on how to use the output of the sentence embeddings plugin with existing deep/machine learning algorithms?
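One possible cause of index errors during fit is that some rows in the embedding column are empty or have a different length than the rest, so the values cannot be stacked into a rectangular matrix. The sketch below assumes each cell holds the vector as a JSON array string (for example "[0.1, 0.2, 0.3]"); that storage format and the helper name parse_embeddings are assumptions for illustration, not something the plugin documents here. It keeps only rows that parse cleanly and share the most common dimension, and returns their original row positions so the clusters can be joined back to the dataset.

```python
import json

# Python 2.7 and 3 compatibility for the string check below.
try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str  # Python 3


def parse_embeddings(cells):
    """Turn embedding cells into equal-length float vectors.

    Assumes each cell is a JSON array string or an already-parsed list.
    Returns (kept_row_indices, vectors); empty, unparseable, or
    odd-length rows are skipped.
    """
    parsed = []
    for i, cell in enumerate(cells):
        try:
            vec = json.loads(cell) if isinstance(cell, string_types) else cell
            vec = [float(x) for x in vec]
        except (TypeError, ValueError):
            continue  # missing or malformed cell
        if vec:
            parsed.append((i, vec))
    if not parsed:
        return [], []

    # Treat the most common vector length as the expected dimension.
    lengths = {}
    for _, vec in parsed:
        lengths[len(vec)] = lengths.get(len(vec), 0) + 1
    dim = max(lengths, key=lengths.get)

    kept = [(i, vec) for i, vec in parsed if len(vec) == dim]
    return [i for i, _ in kept], [vec for _, vec in kept]


# Hypothetical column values, including an empty cell, a missing value,
# and a vector with the wrong length.
cells = ["[0.1, 0.2, 0.3]", "", None, "[0.4, 0.5]", "[0.7, 0.8, 0.9]"]
rows, vectors = parse_embeddings(cells)
```

The resulting list of vectors is rectangular, so it can be passed to a clustering implementation such as scikit-learn's KMeans(n_clusters=...).fit_predict(vectors), and rows tells you which dataset rows each cluster label belongs to.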
Thanks Alex for that information, it really helps.
Right now I have the cluster number and cluster ID in a separate column. How do I convert the vectors back to words so that each cluster is shown as a string of words instead?
For example, instead of the cluster ID, I want it to display the topics.
The plugin outputs vectors at the sentence level, so there's no direct way to map them back to words.
If your use case is about understanding clusters of documents, I would suggest using the 'Topic modeling' predefined notebook: https://doc.dataiku.com/dss/latest/notebooks/predefined-notebooks.html
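If a rough label per cluster is enough, one workaround is to ignore the vectors at this stage and go back to the original text column: group the texts by cluster ID and pick the words that are frequent in a cluster but rare across clusters. The sketch below is a minimal illustration of that idea using a simple cluster-level TF-IDF style score; the function name top_terms_per_cluster, the tokenization, and the scoring formula are all assumptions for the example, not something the plugin or DSS provides.

```python
import math
import re
from collections import Counter, defaultdict


def top_terms_per_cluster(texts, cluster_ids, n_terms=3):
    """Return {cluster_id: [most distinctive words]} from the raw texts."""
    # Word counts per cluster, using a very simple lowercase tokenizer.
    counts = defaultdict(Counter)
    for text, cid in zip(texts, cluster_ids):
        counts[cid].update(re.findall(r"[a-z']+", text.lower()))

    # In how many clusters does each word appear at least once?
    n_clusters = len(counts)
    cluster_freq = Counter()
    for counter in counts.values():
        cluster_freq.update(counter.keys())

    labels = {}
    for cid, counter in counts.items():
        # Frequent in this cluster and rare across clusters scores highest.
        scored = {
            word: count * math.log((1.0 + n_clusters) / cluster_freq[word])
            for word, count in counter.items()
        }
        # Sort by score descending, then alphabetically to break ties.
        ranked = sorted(scored, key=lambda w: (-scored[w], w))
        labels[cid] = ranked[:n_terms]
    return labels


# Hypothetical texts with the cluster IDs produced by the clustering step.
texts = [
    "the battery drains fast",
    "battery life is short",
    "the screen is cracked",
    "screen glass broke",
]
cluster_ids = [0, 0, 1, 1]
labels = top_terms_per_cluster(texts, cluster_ids, n_terms=2)
```

Joining the words in labels[cluster_id] gives a display string you can use in place of the cluster ID. For real data you would probably also want to remove stop words, which this sketch does not do.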