Domain-Tailored Machine Learning

This thesis addresses two problems in machine learning: (1) automated setting of the parameters of a learning algorithm, and (2) generation of additional explanations in decision tree induction.

Most machine learning algorithms are domain independent, usually providing some parameters that enable the user to adapt the algorithm to a particular domain (e.g., the noise level). Setting these parameters is often non-trivial, especially if the user is not familiar with the machine learning tool or does not know all the necessary properties of his or her domain. The first part of this work aims to help in such situations by offering automated setting of program parameters. Parameter setting is defined as an optimization problem, where the quality of the induced knowledge (its classification accuracy) is estimated by cross-validation. The approach is applied to the problem of setting the m-value in m-estimate post-pruning of decision trees. Different optimization algorithms were tested on several well-known real-world machine learning domains. For most domains, all optimization algorithms achieved similar classification accuracy. The only significant differences in accuracy appear in the domains `lymphography' and `hepatitis': enumerative search and the default setting perform worst on the former, and genetic search on the latter. Because of the small differences between the algorithms, no categorical choice among them can be recommended. Search space analysis reveals plateaus of classification accuracy (regions where small changes in the m-value leave accuracy unchanged), which is consistent with the similarity of the results obtained by the tested optimization algorithms. It also shows an almost monotonic decrease of tree size and information score with increasing m-value. The relation between m and accuracy often appears rather unstable, in the sense that different partitionings of the learning data produce highly dissimilar curves of accuracy versus m-value. Despite this instability, some regularities in the relation between classification accuracy and m-value were detected.

When machine learning is used for knowledge acquisition, consultation with a domain expert is almost unavoidable, since the success of learning is mainly evaluated by the expert. In this evaluation process a domain expert is often disappointed with the parsimony of induced decision trees and wants additional explanation. Frequently asked questions concern additional properties of the generated data subsets (e.g., the examples in tree nodes). The second part of this work proposes a method for decision tree explanation, where for each node of the tree under consideration a new problem is defined and a decision tree is induced. The proposed method is tested on the analysis of a discrete event simulator, where a domain expert evaluated the induced knowledge and found it very useful for understanding the simulated system.
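For concreteness, the m-estimate of probability that underlies m-estimate pruning interpolates between the relative frequency observed in a node, n_c / n, and the prior probability of the class, p_prior, with m controlling the strength of the prior: p = (n_c + m * p_prior) / (n + m). A minimal sketch follows; the function name and demo values are illustrative, not taken from the thesis.

    def m_estimate(n_c, n, p_prior, m):
        """m-estimate of the probability of class c in a node.

        n_c     -- number of examples of class c in the node
        n       -- total number of examples in the node
        p_prior -- prior probability of class c
        m       -- strength of the prior; m -> 0 gives the relative
                   frequency n_c / n, large m approaches p_prior
        """
        return (n_c + m * p_prior) / (n + m)

    # Illustration: 3 of 4 examples in a small node belong to class c,
    # but the prior for c is only 0.5; a larger m pulls the estimate
    # toward the prior, discouraging overly specific (unpruned) nodes.
    for m in (0, 2, 10, 100):
        print(m, round(m_estimate(3, 4, 0.5, m), 3))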
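The parameter-setting loop itself can be sketched as enumerative search over candidate parameter values, each scored by cross-validated classification accuracy. The sketch below uses scikit-learn, whose trees expose cost-complexity pruning (ccp_alpha) rather than m-estimate pruning, so ccp_alpha merely stands in for m as an analogous pruning-strength parameter; the dataset and search grid are likewise illustrative.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    best_alpha, best_acc = None, -1.0
    for alpha in np.linspace(0.0, 0.05, 26):        # enumerative search grid
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        acc = cross_val_score(tree, X, y, cv=10).mean()  # 10-fold CV accuracy
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc

    print(f"best pruning strength: {best_alpha:.4f}, CV accuracy: {best_acc:.3f}")

The same loop accommodates the other tested optimizers (e.g., genetic search) by replacing the grid with a different candidate-generation strategy; the accuracy plateaus noted above explain why such replacements rarely change the outcome.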
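One plausible reading of the node-explanation method, offered here only as an assumption rather than the thesis's exact formulation: for a chosen node, every learning example is relabelled by whether it reaches that node, and a fresh tree is induced on this derived two-class problem, describing the node's subset in terms of attributes the original path may not have used. All names and the choice of node below are illustrative.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    names = load_iris().feature_names

    # Original tree over the domain.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Hypothetical reading of the explanation step: pick one node,
    # relabel every example by whether it lands in that node, and
    # induce a new tree on this derived problem.
    node_of_interest = 3                    # an arbitrary node id
    in_node = tree.decision_path(X).toarray()[:, node_of_interest]

    explainer = DecisionTreeClassifier(max_depth=2, random_state=0)
    explainer.fit(X, in_node)
    print(export_text(explainer, feature_names=names))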