Efficient text categorization
Marko Grobelnik, Dunja Mladenic
We present an approach to text categorization using machine learning
techniques. The approach is developed and tested on a large text hierarchy
named Yahoo that is available on the Web.
We handle the large number of features and training examples by taking
into account the hierarchical structure of the examples and by using feature
subset selection for large text data. The large number of categories is handled
separately for each testing example by pruning unpromising categories.
In this way, the number of categories to be considered is cut to less
than half without degrading system performance.
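As an illustration of per-example category pruning, the minimal sketch below
keeps only the categories that share at least one selected feature with the
document being classified. The overlap criterion, the function name
prune_categories, the min_overlap parameter, and the toy categories are all
assumptions made for this example; the text above does not specify how
unpromising categories are identified.

```python
def prune_categories(doc_features, category_features, min_overlap=1):
    """Return the categories sharing at least min_overlap features with the document.

    Assumed criterion: a category is "unpromising" when none (or too few) of
    its selected features occur in the document.
    """
    doc = set(doc_features)
    return [name for name, feats in category_features.items()
            if len(doc & feats) >= min_overlap]


# Hypothetical selected features per category (toy data, not from the paper).
category_features = {
    "Science/Physics": {"quantum", "particle", "energy"},
    "Recreation/Sports": {"football", "league", "goal"},
    "Computers/Software": {"compiler", "source code", "energy"},
}

doc = ["quantum", "energy", "levels"]
kept = prune_categories(doc, category_features)
print(kept)
```

Only the categories in kept would then be scored by the classifier, so the
cost per testing example grows with the number of surviving categories rather
than with the size of the whole hierarchy.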
Our experiments are performed using a naive Bayesian classifier on text data,
with a feature-vector document representation that includes n-grams
instead of just single words (unigrams).
Experimental evaluation on three domains constructed from the Yahoo hierarchy
shows that, among several hundred categories, the correct category is assigned
a probability over 0.99 when a rather small number of features is used.
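
The combination of n-gram features and a naive Bayesian classifier can be
sketched as follows. This is a minimal illustration, not the system described
above: the multinomial model with add-one (Laplace) smoothing, the whitespace
tokenization, the function names, and the four toy training documents are
assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """All contiguous word sequences of length 1..max_n, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def train(labeled_docs, max_n=2):
    """Count n-gram occurrences and documents for each category."""
    feature_counts = {}
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        feats = ngrams(text.lower().split(), max_n)
        feature_counts.setdefault(category, Counter()).update(feats)
    return feature_counts, doc_counts


def posterior(text, feature_counts, doc_counts, max_n=2):
    """Normalized naive Bayes posterior over categories (assumed add-one smoothing)."""
    vocab = set().union(*feature_counts.values())
    total_docs = sum(doc_counts.values())
    feats = [f for f in ngrams(text.lower().split(), max_n) if f in vocab]
    log_scores = {}
    for category, counts in feature_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for f in feats:
            score += math.log((counts[f] + 1) / (total + len(vocab)))
        log_scores[category] = score
    # Normalize in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(unnorm.values())
    return {c: v / norm for c, v in unnorm.items()}


# Toy training set (hypothetical, for illustration only).
docs = [
    ("machine learning for text", "Computers"),
    ("neural network learning", "Computers"),
    ("football league results", "Sports"),
    ("league cup final", "Sports"),
]
feature_counts, doc_counts = train(docs)
probs = posterior("machine learning", feature_counts, doc_counts)
print(max(probs, key=probs.get))
```

Here the bigram "machine learning" is a feature in its own right alongside the
unigrams "machine" and "learning", which is the sense in which the document
representation goes beyond single words.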