The work presented here focuses on feature selection for text categorization, in particular on using feature weights from linear models (obtained, e.g., on a subset of the training set) to decide which features to retain and which to discard. The experiments were performed on large subsets of the Reuters Corpus, Volume 1. When comparing different feature selection methods, we use "sparsity", i.e. the average number of nonzero features per document, as the independent variable, rather than the number of features. This is because training methods such as the SVM are not very sensitive to the total number of features as long as the documents themselves remain sparse.
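To make the approach concrete, here is a minimal sketch (not the code used in the papers below) of weight-based feature selection: train a linear SVM, rank the features by the absolute value of their weights, and keep the top k. The toy corpus, vectorizer settings, and the value of k are illustrative assumptions.

    # Sketch of SVM-weight-based feature selection; all data is placeholder.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    docs = ["placeholder document one", "placeholder document two", "more placeholder text"]
    labels = [1, 0, 1]  # binary category membership (placeholder)

    X = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix

    # "Sparsity" in the sense used above: average number of nonzero
    # features per document.
    sparsity = X.getnnz() / X.shape[0]
    print(f"average nonzero features per document: {sparsity:.1f}")

    # Train the linear model (possibly on just a subset of the training
    # set) and rank features by the absolute value of their weights.
    svm = LinearSVC().fit(X, labels)
    order = np.argsort(-np.abs(svm.coef_[0]))

    k = 3                      # assumed number of features to retain
    keep = order[:k]
    X_reduced = X[:, keep]     # discard features whose weights are close to 0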
J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic: Feature selection using support vector machines. Proceedings of the Third International Conference on Data Mining Methods and Databases for Engineering, Finance, and Other Fields, Bologna, Italy, 25--27 September 2002. Slides from this talk are available, as are slides from a presentation of this work given at Redmond in July 2002.
(Discusses several aspects of using the weights from a linear SVM model, trained on a subset of the training set, as a guide for feature selection, where one discards the features whose weights are closest to 0. If the training set does not fit into the available memory, it is likely better to combine discarding some documents with feature selection, which reduces the memory footprint of the remaining documents, than to satisfy the memory constraint by discarding documents alone. Feature selection does not significantly improve the categorization performance of the SVM, but it can save a lot of memory without significantly degrading that performance.)
J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic: Feature selection using linear support vector machines. Microsoft Research Technical Report MSR-TR-2002-63, 12 June 2002.
(A longer version of the above paper, recommended in preference to it. It contains additional experimental data, as well as some statistics about the Reuters Corpus, Volume 1.)
J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic: Interaction of feature selection methods and linear classification models. Workshop on Text Learning (TextML-2002), 19th International Conference on Machine Learning (ICML-2002), Sydney, Australia, July 8, 2002. Slides from this talk and an earlier version of the slides are available.
(Presents experiments in which weights from perceptron and linear SVM models are used for feature selection, after which the final model is trained using naive Bayes, the perceptron, or a linear SVM. Conclusions: weights from SVM models are good for feature selection even when the final training (after feature selection) is done with naive Bayes or the perceptron. Feature selection strongly improves the categorization performance of naive Bayes, improves that of the SVM to a lesser extent, but severely degrades the performance of the perceptron.)
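As a hedged illustration of this cross-model setup (the classifiers, toy data, and cutoff below are assumptions, not the paper's exact configuration), one can select features by linear-SVM weights and then train naive Bayes on the reduced representation:

    # Sketch of the cross-model pipeline: linear-SVM weights drive the
    # feature selection, naive Bayes is the final classifier.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    docs = ["toy document one", "toy document two", "yet another toy document"]
    labels = [1, 0, 0]          # placeholder binary labels

    X = CountVectorizer().fit_transform(docs)   # counts, as naive Bayes expects

    # Step 1: feature selection using the weights of a linear SVM.
    selector = LinearSVC().fit(X, labels)
    keep = np.argsort(-np.abs(selector.coef_[0]))[:3]   # assumed cutoff

    # Step 2: the final model is trained on the selected features only.
    nb = MultinomialNB().fit(X[:, keep], labels)
    print(nb.predict(X[:, keep]))

Substituting sklearn.linear_model.Perceptron for MultinomialNB in step 2 gives the perceptron variant of the comparison.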