Feature selection for classification based on text hierarchy
Dunja Mladenic, Marko Grobelnik
This paper describes automatic document categorization based on a large text hierarchy.
We handle the large number of features and training examples by taking
into account the hierarchical structure of the examples and by using feature selection
for large text data.
We experimentally evaluate feature subset selection
on real-world text data collected from the existing Web hierarchy named Yahoo.
In our learning experiments, a naive
Bayesian classifier was applied to text data using a feature-vector document representation
that includes n-grams instead of just single words (unigrams).
Experimental evaluation on real-world data collected from the Web
shows that our approach gives promising results and can potentially be
used for document categorization on the Web.
Additionally, the best result on our data is achieved for a relatively
small feature subset, while for larger subsets the performance drops substantially.
Among the six feature scoring measures tested, the best performance was achieved
by Odds ratio, a measure known from information retrieval.
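
As a rough illustration of Odds ratio scoring over an n-gram representation, the Python sketch below ranks features for a single category against the remaining documents. It uses the commonly cited form log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos))P(W|neg))]. The function names, the whitespace tokenization, the use of document frequencies to estimate the probabilities, and the clipping constant eps that avoids division by zero are all assumptions of this sketch, not details taken from the experiments described here.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def odds_ratio_scores(pos_docs, neg_docs, max_n=2, eps=1e-3):
    """Score every n-gram by its log odds ratio between two document sets.

    Probabilities are estimated as the fraction of documents containing the
    n-gram and clipped to [eps, 1 - eps] (an assumption of this sketch).
    """
    def doc_freq(docs):
        df = Counter()
        for doc in docs:
            # Count each n-gram at most once per document.
            df.update(set(ngrams(doc.lower().split(), max_n)))
        return df

    pos_df, neg_df = doc_freq(pos_docs), doc_freq(neg_docs)
    scores = {}
    for feature in set(pos_df) | set(neg_df):
        p_pos = min(max(pos_df[feature] / len(pos_docs), eps), 1 - eps)
        p_neg = min(max(neg_df[feature] / len(neg_docs), eps), 1 - eps)
        scores[feature] = math.log(
            p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top(scores, k):
    """Keep the k highest-scoring features (ties broken alphabetically)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]
```

Calling select_top(odds_ratio_scores(pos_docs, neg_docs), k) with a small k corresponds to keeping a relatively small feature subset; the selected n-grams could then serve as the vocabulary of a naive Bayesian classifier.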