Efficient text categorization

Marko Grobelnik, Dunja Mladenic

We present an approach to text categorization using machine learning techniques. The approach is developed and tested on a large text hierarchy, Yahoo, that is available on the Web. We handle the large number of features and training examples by taking into account the hierarchical structure of the examples and by using feature subset selection for large text data. The large number of categories is handled separately for each testing example by pruning unpromising categories. In this way, the number of categories to be considered is cut to less than half without degrading system performance. Our experiments are performed using a naive Bayesian classifier on text data, with a feature-vector document representation that includes n-grams instead of just single words (unigrams). Experimental evaluation on three domains constructed from the Yahoo hierarchy shows that, among several hundred categories, the correct category is assigned a probability over 0.99 when a rather small number of features is used.
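
As an illustration only, the following minimal Python sketch shows how a naive Bayesian classifier over word n-gram features might be combined with per-document category pruning. It is not the authors' implementation. The function names (`ngrams`, `train`, `classify`), the Laplace smoothing, and in particular the pruning rule (a category is dropped when it shares no feature with the document) are assumptions made for the example; the abstract does not specify how unpromising categories are identified. Feature subset selection and the use of the hierarchy are omitted.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=2):
    """Return word n-grams of length 1..max_n as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def train(docs, max_n=2):
    """docs is a list of (text, category) pairs.

    Returns per-category feature counts, per-category document counts,
    and the vocabulary of all n-grams seen in training.
    """
    counts = defaultdict(Counter)
    doc_totals = Counter()
    for text, category in docs:
        counts[category].update(ngrams(text, max_n))
        doc_totals[category] += 1
    vocab = {f for c in counts.values() for f in c}
    return counts, doc_totals, vocab


def classify(text, counts, doc_totals, vocab, max_n=2):
    """Return a dict mapping each surviving category to its posterior."""
    feats = [f for f in ngrams(text, max_n) if f in vocab]

    # Assumed pruning rule (hypothetical, not taken from the paper):
    # keep only categories that share at least one feature with the document.
    candidates = [c for c in counts if any(counts[c][f] > 0 for f in feats)]
    if not candidates:
        candidates = list(counts)  # nothing matched, so fall back to all

    n_docs = sum(doc_totals.values())
    log_scores = {}
    for c in candidates:
        total = sum(counts[c].values())
        score = math.log(doc_totals[c] / n_docs)  # log prior
        for f in feats:
            # Laplace-smoothed log likelihood of feature f given category c
            score += math.log((counts[c][f] + 1) / (total + len(vocab)))
        log_scores[c] = score

    # Normalize in log space over the surviving categories only.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}
```

In this sketch the posterior is normalized over the categories that survive pruning, so the cost per test document scales with the number of candidate categories rather than with the size of the whole hierarchy.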