Conundrum 20: Fashion Reviews - What's in a word?

MichaelG · ‎08-21-2020

Today’s conundrum is the second in our ‘Fashion Reviews’ series. Today we take a look at the reviews we cleaned up last time and aim to analyse their sentiment. With a little luck we can begin to identify some common themes.

Using the cleaned data you generated in Conundrum 17 and the Sentiment analysis plugin, can you create one word cloud for the positive reviews and one for the negative reviews?

What jumps out at you from each cloud? Are there any insights you can draw on what things people like and what might need improvement?

For bonus points: try making your word cloud code (R or Python) into a plugin!

Good luck!

PS: Need more practice using the Sentiment analysis plugin? Give Conundrum 18 a go!

tgb417 · ‎08-22-2020

Katie Gross did a nice presentation on NLP last week. However, I'm having difficulties finding the Word Cloud Plugin.

I found this on GitHub.

https://github.com/dataiku/dss-plugin-nlp-visualization

But it does not look complete.

And this GitHub repo for the word cloud plugin is empty at this time.

https://github.com/dataiku/dss-plugin-wordcloud

The NLP BrightTalk presentation of earlier this week references a word cloud plugin at

https://www.brighttalk.com/webcast/17108/430154?

@MichaelG I could build a Shiny app to do a word cloud. However, that feels like more work than I want to do for a Conundrum.

--Tom

AnitaC · ‎08-28-2020

Hi @tgb417 ,

There is no specific plugin for word clouds in DSS as of today. What @MichaelG was suggesting was to write your own code (in R or Python) that would, for example, save the word cloud image in a managed folder. You could then turn that code into a plugin (could be useful for your 'non-coder' colleagues or even yourself if you think you will want to generate more word clouds in the future!).

You might want to check the wordcloud package (same name in R and Python). A word cloud can be generated in very few lines of code with either one!
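As a rough illustration of the Python route, here is a minimal sketch. The stop-word list and the helper names `word_frequencies` and `save_word_cloud` are made up for this example, and the output path is whatever you choose (for instance a file inside a managed folder). The rendering step assumes the third-party `wordcloud` package is installed; the counting step uses only the standard library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "i", "this", "to", "of", "but", "in", "was"}

def word_frequencies(reviews):
    """Count how often each non-stop-word appears across all reviews."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

def save_word_cloud(frequencies, path):
    """Render the frequencies as an image file (requires the `wordcloud` package)."""
    from wordcloud import WordCloud  # imported lazily so counting works without it
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(dict(frequencies))
    cloud.to_file(path)

reviews = [
    "Love this dress, the fit is perfect",
    "The dress runs small but I love the color",
]
freqs = word_frequencies(reviews)
# save_word_cloud(freqs, "wordcloud.png")  # uncomment once `wordcloud` is installed
```

Running `word_frequencies` separately on the positive and the negative reviews would give one cloud per sentiment.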

tgb417 · ‎08-28-2020

Thanks for the insight.

Here is a first attempt at a word cloud from the data. This does not answer the whole conundrum: it sizes every word by its number of occurrences across all reviews, and the words are not yet split by sentiment.

@AnitaC I don't see a lot about creating a chart-type plugin in the documentation. Saving the image files to a folder does not seem to be a very DSS-like way to do this. I know there is a way to package a chart into a plugin, because there are other charts in the app store. However, I'm not finding much documentation on setting this up. Can you point me to the details?

P.S. A big Thanks to Katie Gross of Dataiku for a lot of help here.

--Tom

tgb417 · ‎08-28-2020

Here are two word clouds:

Words in positive reviews, sized by the number of times each word appears.

Words in negative reviews, sized by the number of times each word appears.

However, there is a huge overlap between the two. I'm starting to think about ways to pull out the words that show up only in the positive word cloud and those that show up only in the negative one.
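One possible way to separate the two sets, sketched below, is to rank words by a smoothed log-ratio of their relative frequency in each class, so that words common to both clouds score near zero. This is only one option among several, and the function name `distinctive_words` and the sample counts are invented for illustration.

```python
import math
from collections import Counter

def distinctive_words(pos_counts, neg_counts, top_n=5):
    """Rank words by the smoothed log-ratio of their relative frequencies.

    Both arguments must be Counters (missing words count as 0).
    Returns (most positive words, most negative words).
    """
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)  # add-one smoothing
    neg_total = sum(neg_counts.values()) + len(vocab)
    score = {
        w: math.log((pos_counts[w] + 1) / pos_total)
        - math.log((neg_counts[w] + 1) / neg_total)
        for w in vocab
    }
    ranked = sorted(vocab, key=lambda w: score[w])  # most negative first
    return ranked[-top_n:][::-1], ranked[:top_n]

# Made-up counts, purely for illustration.
pos = Counter({"dress": 50, "love": 40, "perfect": 30, "fit": 20})
neg = Counter({"dress": 45, "returned": 25, "fit": 22, "cheap": 15})
most_positive, most_negative = distinctive_words(pos, neg, top_n=2)
```

With these sample counts, "dress" and "fit" appear in both classes at similar rates and so drop out of both lists.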

--Tom

tgb417 · ‎08-28-2020

Here are two more.

Words in positive posts (predicted confidence of a positive post >= 0.95, n = 15,009 posts)

Words in negative posts (predicted confidence of a negative post >= 0.95, n = 3,332 posts)

In these word clouds, we are removing all of the posts for which we were unsure if the post was positive or negative.
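The filtering step can be sketched as follows. The field names `text`, `prediction` and `confidence` are assumptions for this example and may not match the columns the sentiment plugin actually outputs.

```python
def confident_reviews(rows, label, threshold=0.95):
    """Keep the text of reviews predicted as `label` with confidence >= threshold."""
    return [
        r["text"]
        for r in rows
        if r["prediction"] == label and r["confidence"] >= threshold
    ]

# Made-up rows, purely for illustration.
rows = [
    {"text": "Love it", "prediction": "positive", "confidence": 0.99},
    {"text": "It is okay", "prediction": "positive", "confidence": 0.61},
    {"text": "Sent it back", "prediction": "negative", "confidence": 0.97},
]
sure_positive = confident_reviews(rows, "positive")
sure_negative = confident_reviews(rows, "negative")
```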

That's all for me for now.

--Tom

tgb417 · ‎08-30-2020

So, today I've been playing with this a bit.

I ended up building a logistic regression model that would predict what words (and tri-grams and bi-grams) drive a Positive or Negative review.

In the visualizations below I am weighting the size of the words by the regression coefficients of that model.

Most positive words, sized by regression coefficient.

Most negative words, sized by regression coefficient.

I'd like to invite others to jump in here with some further ideas.

--Tom

tgb417 · ‎08-31-2020

In looking at the negative phrases I'm seeing all of these phrases like

really wanted ____________
beautiful but _____________
cute but ___________
not worth ___________

In all of these cases I'd really like to know what follows these most indicative phrases about a potential problem. Have not figured out a way to pull this information yet. Anyone got an idea?
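One idea, offered as a sketch rather than a tested solution: use a regular expression to capture the few words that follow each cue phrase, then count the most common continuations. The function name `continuations` and the sample reviews are made up, and the three-word window is an arbitrary choice.

```python
import re
from collections import Counter

PHRASES = ["really wanted", "beautiful but", "cute but", "not worth"]

def continuations(reviews, phrases=PHRASES, n_words=3):
    """Count the (cue phrase, next up-to-n_words words) pairs found in the reviews."""
    cue = "|".join(re.escape(p) for p in phrases)
    # Group 1 is the cue phrase; group 2 is the 1..n_words words right after it.
    pattern = re.compile(
        rf"\b({cue})\b((?:\s+[\w']+){{1,{n_words}}})", re.IGNORECASE
    )
    found = Counter()
    for text in reviews:
        for m in pattern.finditer(text):
            found[(m.group(1).lower(), m.group(2).strip().lower())] += 1
    return found

# Made-up reviews, purely for illustration.
sample = [
    "Cute but runs very small in the bust.",
    "Beautiful but the fabric is itchy",
    "Not worth the price",
]
hits = continuations(sample)
```

The captured continuations could then feed their own word cloud, or simply be listed with `hits.most_common()`.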

--Tom

tgb417 · ‎12-31-2020

I've come back to look at this Conundrum.

I want to pull out any named entities from these reviews: things like dresses, skirts, slacks, tops... In some ways I've already gotten at things like condition, sizes, colors, and quality. However, I'd like to be able to pull these out more explicitly.

I tried the Named Entity Recognition plugin, but the results using spaCy did not seem to be that good. So I'm wondering about getting hold of some of the Flair models. However, I've not figured out:

- How to get hold of the Flair model files
- Which model files to use. There seem to be a bunch of these model files, maybe even ones designed for review parsing.

Is anyone familiar with using this plugin able to comment?

Thanks for any help you can share.

cc: @duphan

--Tom

taraku · ‎08-28-2020

Thanks for reminding me about the sentiment analysis plugin!
