Text Categorization

The work presented here focuses on feature selection for text categorization, and in particular on using feature weights from linear models (obtained, e.g., by training on a subset of the training set) to decide which features to retain and which to discard. The experiments were performed on large subsets of the Reuters Corpus, Volume 1. When comparing feature selection methods, we treat "sparsity", i.e., the average number of nonzero features per document, as the independent variable, rather than the number of features retained. This is because training methods such as the SVM are not very sensitive to the number of features when the documents are sparse.
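
The idea can be illustrated with a small sketch. It is not the code used in the experiments: a plain perceptron stands in for the linear model (the work itself refers to models such as the SVM), the four toy documents are invented, and all function names (train_perceptron, select_features, project, sparsity) are hypothetical. Features are ranked by the absolute value of their weight, the top k are kept, and sparsity is measured before and after.

```python
def train_perceptron(docs, labels, epochs=10):
    """Train a simple perceptron (a stand-in for any linear model).

    docs   -- list of sparse documents, each a dict {feature: value}
    labels -- list of +1 / -1 class labels
    Returns a dict {feature: weight}.
    """
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(docs, labels):
            score = b + sum(w.get(f, 0.0) * v for f, v in x.items())
            if y * score <= 0:  # misclassified: move the weights toward y
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + y * v
                b += y
    return w


def select_features(weights, k):
    """Keep the k features with the largest absolute weight (ties broken by name)."""
    ranked = sorted(weights, key=lambda f: (-abs(weights[f]), f))
    return set(ranked[:k])


def project(docs, kept):
    """Discard every feature that is not in the retained set."""
    return [{f: v for f, v in x.items() if f in kept} for x in docs]


def sparsity(docs):
    """Average number of nonzero features per document."""
    return sum(sum(1 for v in x.values() if v != 0) for x in docs) / len(docs)


# Invented toy data: two "oil" documents (+1) and two "sports" documents (-1).
docs = [
    {"oil": 1, "price": 1, "the": 1},
    {"oil": 1, "barrel": 1, "the": 1},
    {"match": 1, "goal": 1, "the": 1},
    {"goal": 1, "team": 1, "the": 1},
]
labels = [1, 1, -1, -1]

weights = train_perceptron(docs, labels)
kept = select_features(weights, k=4)
reduced = project(docs, kept)

print("retained features:", sorted(kept))
print("sparsity before:", sparsity(docs))
print("sparsity after:", sparsity(reduced))
```

In this toy run the uninformative word "the" ends up with zero weight and is discarded, so the sparsity drops although the classifier still separates the two classes. In a comparison of selection methods, k would be varied for each method and the results plotted against the resulting sparsity rather than against k itself.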

Publications:

Project members:

Janez Brank