Developing a Text Clustering Algorithm

Cluster of Wine Drinkers discussion

The idea is to identify sentences in a given text that talk about more or less the same topic. For people, it’s easy to recognize a shared topic in a pair like the following:

  1. “I picked up a bottle of this at a local NC grocery store based on a San Francisco Chronicle gold medal rating, and an excellent price.”
  2. “I have got one at the local wine store in Central NJ for $23.00I will definitely get few cases here.”

To be able to recognize such similarity algorithmically, we need a way to compare meaning in those sentences.

Of course, there is the technique of topic modeling, which is effective on large corpora. However, it is less effective when we are trying to find topics within a small dataset, because there is simply not enough data to form those topics from pure statistics.

My approach:

“Words Similarity algorithm”: I used neural network word embeddings to measure how semantically similar words are. A simple cosine similarity between the embedding vectors of two words gives an accurate quantitative measure of their similarity.
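As a rough illustration of the cosine similarity step, here is a minimal Python sketch. The three-dimensional vectors in `toy_vectors` are made up for the example and stand in for real neural network embeddings, and the names `cosine_similarity` and `word_similarity` are mine, not part of any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only.
# A real model would supply vectors with hundreds of dimensions.
toy_vectors = {
    "grocery": [0.9, 0.1, 0.0],
    "store": [0.8, 0.2, 0.1],
    "medal": [0.0, 0.1, 0.9],
}

def word_similarity(w1, w2, vectors=toy_vectors):
    """Similarity of two words, looked up in the given embedding table."""
    return cosine_similarity(vectors[w1], vectors[w2])

print(word_similarity("grocery", "store"))
print(word_similarity("grocery", "medal"))
```

With these toy vectors, “grocery” scores much closer to “store” than to “medal”, which is the behavior we want from the measure.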

  1. I ignored parts of speech that carry little meaning (‘a’, ‘but’, ‘and’, ‘He’, etc.).
  2. I found frequent n-grams in the dataset.
  3. I clustered those n-grams, using my Words Similarity algorithm to provide a distance metric between elements.
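The three steps can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the exact implementation: a small hand-written stop-word list stands in for part-of-speech filtering, `TOY_VECTORS` holds made-up two-dimensional embeddings, an n-gram is represented by the average of its word vectors, and the clustering is a simple greedy threshold scheme with a hypothetical `max_distance` parameter.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stop-word list replaces real part-of-speech tagging.
STOP_WORDS = {"a", "an", "the", "and", "but", "he", "i", "at", "of", "here"}

# Hypothetical 2-d embeddings, invented for illustration only.
TOY_VECTORS = {
    "local": [0.9, 0.1], "grocery": [1.0, 0.0],
    "wine": [0.8, 0.3], "store": [0.9, 0.2],
    "excellent": [0.1, 0.9], "price": [0.2, 1.0],
}

def tokenize(sentence):
    """Step 1: lowercase, split into words, drop low-meaning words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def frequent_ngrams(sentences, n=2, min_count=2):
    """Step 2: n-grams that occur at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [gram for gram, count in counts.items() if count >= min_count]

def ngram_vector(ngram, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in ngram if w in vectors]
    if not known:
        return None
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]

def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norms == 0.0 else 1.0 - dot / norms

def cluster_ngrams(ngrams, vectors, max_distance=0.2):
    """Step 3: put each n-gram into the first cluster that has a member
    within max_distance, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of (ngram, vector) pairs
    for ngram in ngrams:
        vec = ngram_vector(ngram, vectors)
        if vec is None:
            continue  # no embedding available for any word in this n-gram
        for members in clusters:
            if any(cosine_distance(vec, other) <= max_distance for _, other in members):
                members.append((ngram, vec))
                break
        else:
            clusters.append([(ngram, vec)])
    return [[ngram for ngram, _ in members] for members in clusters]

sentences = ["a local grocery store", "the local wine store", "an excellent price"]
for group in cluster_ngrams(frequent_ngrams(sentences, min_count=1), TOY_VECTORS):
    print(group)
```

On this toy input, the store-related bigrams end up in one cluster and the price-related bigram in another. A library implementation of hierarchical clustering would be a reasonable replacement for the greedy loop on real data.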

This approach made it possible to form clusters of discussion effectively within a small dataset (such as user reviews).

See an example of such analysis.