The idea is to be able to identify sentences in the given text that talk about more or less similar topic. For people, it’s easy to realize a similar topic in something like the following:
- “I picked up a bottle of this at a local NC grocery store based on a San Francisco Chronicle gold medal rating, and an excellent price.”
- “I have got one at the local wine store in Central NJ for $23.00I will definitely get few cases here.”
To be able to recognize such similarity algorithmically, we need a way to compare meaning in those sentences.
Of course, there is a technique of topic models, which is effective on large corpuses of data. However, it is not as effective if we are trying to find topics within a small given dataset, because there is simply not enough data to form those topics from pure statistics.
“Words Similarity algorithm“: I used neural network word embedding to measure how semantically similar words are. A simple cosine similarity between neural network vectors representing given words gives an accurate quantative measure of similarity.
- I ignored Parts Of Speech that carries little meaining (‘a’, ‘but’, ‘and’, ‘He’ etc.)
- I found frequent n-grams from the dataset
- I clustered those n-grams using my Words Similarity algorithm to provide a distance metric between elements:
That approach allowed to form clusters of discussion within a small dataset (such as user reviews) effectively.
See an example of such analysis.