2014 - N-gram-based Low-dimensional Representation for Document Classification

Remi Lebret, Ronan Collobert

Hypothesis / Main methods

  • An n-gram is represented by averaging or summing the vectors of its words. K-means clustering then groups semantically similar n-grams (concepts) into K clusters. Each n-gram is assigned to exactly one cluster, and each document is represented by a K-dimensional feature vector whose elements are count-based features over the clusters.
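
The pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the 2-d word vectors are made-up toy values, the function names (embed, kmeans, doc_features) are hypothetical, and the farthest-first centroid initialization is an assumption chosen only to keep the toy run deterministic.

```python
from collections import Counter

# Toy 2-d word vectors (made-up values, for illustration only).
WORD_VECS = {
    "good": (1.0, 0.9), "great": (0.9, 1.0), "movie": (0.5, 0.5),
    "bad": (-1.0, -0.9), "awful": (-0.9, -1.0), "film": (0.5, 0.4),
}

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def embed(ngram):
    """Represent an n-gram by the average of its word vectors."""
    vecs = [WORD_VECS[w] for w in ngram]
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))

def kmeans(points, k, iters=20):
    # Farthest-first initialization (an assumption, not from the paper).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def doc_features(tokens, centroids, n=2):
    """K-dimensional vector: number of the document's n-grams per cluster."""
    counts = Counter(nearest(embed(g), centroids) for g in ngrams(tokens, n))
    return [counts.get(k, 0) for k in range(len(centroids))]

docs = [
    "good movie great film".split(),
    "bad movie awful film".split(),
]
points = [embed(g) for d in docs for g in ngrams(d, 2)]
centroids = kmeans(points, k=2)
features = [doc_features(d, centroids) for d in docs]
print(features)  # [[3, 0], [0, 3]]
```

On this toy data the three bigrams of each document fall into a single cluster, so the two documents get disjoint count vectors. The resulting K-dimensional vectors would then be passed to any standard classifier.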

Results

  • Achieves better results than LDA and LSA.