Developing Text Clustering Algorithm

Cluster of Wine Drinkers discussion

Cluster of Wine Drinkers discussion

Intuition:
The idea is to be able to identify sentences in the given text that talk about more or less similar topic. For people, it’s easy to realize a similar topic in something like the following:

  1. “I picked up a bottle of this at a local NC grocery store based on a San Francisco Chronicle gold medal rating, and an excellent price.”
  2. “I have got one at the local wine store in Central NJ for $23.00I will definitely get few cases here.”

To be able to recognize such similarity algorithmically, we need a way to compare meaning in those sentences.

Of course, there is a technique of topic models, which is effective on large corpuses of data. However, it is not as effective if we are trying to find topics within a small given dataset, because there is simply not enough data to form those topics from pure statistics.

My approach:

Words Similarity algorithm“: I used neural network word embedding to measure how semantically similar words are. A simple cosine similarity between neural network vectors representing given words gives an accurate quantative measure of similarity.

  1. I ignored Parts Of Speech that carries little meaining (‘a’, ‘but’, ‘and’, ‘He’ etc.)
  2. I found frequent n-grams from the dataset
  3. I clustered those n-grams using my Words Similarity algorithm to provide a distance metric between elements:

That approach allowed to form clusters of discussion within a small dataset (such as user reviews) effectively.

See an example of such analysis.