Domain-Tailored Machine Learning
This thesis addresses two problems of machine learning:
(1) automated setting of the parameters of a learning algorithm,
and (2) generation of additional explanations in decision tree
induction.
Most machine learning algorithms are domain independent, usually
having some parameters that enable the user to adapt the algorithm
to a particular domain (e.g. to its noise level).
Setting these parameters is often non-trivial, especially
if the user is not familiar with the machine learning tool or does not know
all the necessary properties of her/his domain.
The first part of this work aims to help in such situations
by offering automated setting of program parameters.
Parameter setting is defined as an optimization problem, where
the quality of the induced knowledge (its classification accuracy) is
estimated by cross-validation.
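As a minimal sketch of this framing, assuming an enumerative search over
a grid of candidate values (scikit-learn's DecisionTreeClassifier stands
in for the learner; its cost-complexity parameter ccp_alpha plays the
role of a pruning parameter, since m-estimate pruning is not available
there, and the breast-cancer dataset stands in for the thesis domains):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in learning problem (the thesis uses its own domains).
    X, y = load_breast_cancer(return_X_y=True)

    def cv_accuracy(alpha):
        """Objective function: 10-fold cross-validated accuracy for
        one setting of the pruning parameter."""
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        return cross_val_score(tree, X, y, cv=10).mean()

    # Enumerative search over a grid of candidate parameter values.
    candidates = [0.0, 0.001, 0.005, 0.01, 0.02, 0.05]
    best = max(candidates, key=cv_accuracy)
    print("best alpha =", best, "CV accuracy =", round(cv_accuracy(best), 3))

Other optimization algorithms (e.g. genetic search) differ only in how
they generate candidate values for the same cross-validated objective.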
The approach is applied to the problem of setting the m-value in
m-estimate post-pruning
of decision trees.
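For reference, m-estimate pruning is based on the m-estimate of class
probability (Cestnik, 1990); the notation below is ours, not taken from
the thesis:

    \[
      p(c) = \frac{n_c + m \cdot p_a(c)}{N + m}
    \]

where n_c is the number of examples of class c in a node, N the number
of all examples in the node, and p_a(c) the prior probability of class
c. Setting m = 0 reduces the estimate to the relative frequency n_c/N,
while a large m pulls the estimate toward the prior, which results in
heavier pruning.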
Different optimization algorithms were tested
on several well-known real-world machine learning domains.
For most domains, all optimization algorithms
achieved similar classification accuracy.
The only significant differences in accuracy appear in the domains
`lymphography' and `hepatitis': enumerative search
and the default setting perform worst in the former, and genetic search
in the latter.
Because of the small differences between the algorithms, we cannot
categorically recommend any one of them.
Search space analysis shows plateaus of classification accuracy
(regions where, for small changes in the m-value, accuracy remains the
same), which is consistent with the similarity of the results obtained
by the tested optimization algorithms. It also shows
an almost monotonic decrease of tree size
and information score with increasing m-value.
The relation between m and accuracy often appears rather unstable, in
the sense that different partitionings of the learning data produce highly
dissimilar behaviours of accuracy vs. m-value.
Despite this instability, some
regularities in the relation between classification accuracy and m-value
were detected.
When using machine learning for knowledge acquisition,
consultation with a domain expert is almost unavoidable, since
the success of learning is mainly evaluated by the expert.
In this evaluation process the
domain expert is often disappointed with the
parsimony of the induced decision trees
and wants additional
explanation. Frequently asked questions concern additional
properties of the generated data subsets (e.g. the examples in tree nodes).
The second part of this work proposes a method for
decision tree explanation, where for each node of the tree under
consideration a new learning problem is defined and a decision tree is induced.
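As an illustration only, the sketch below shows one plausible
formulation of the per-node problem (an assumption; the exact
definition in the thesis may differ): the auxiliary tree separates the
examples reaching a node from the rest of the learning set, using
attributes not yet tested on the path, and thus describes additional
properties of the subset. The function explain_node and the use of
scikit-learn are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def explain_node(X, node_mask, path_attrs, max_depth=2):
        """Induce a small auxiliary tree characterizing the examples
        that reach one node of the original tree.

        X          -- attribute matrix of the whole learning set
        node_mask  -- boolean array: which examples reach the node
        path_attrs -- attributes already tested on the path to the
                      node, excluded so the explanation rests on
                      properties not used by the original tree
        """
        remaining = [a for a in range(X.shape[1]) if a not in path_attrs]
        # New two-class problem: examples in the node vs. the rest,
        # described only by the remaining attributes.
        aux = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        aux.fit(X[:, remaining], node_mask.astype(int))
        return aux, remaining

Keeping the auxiliary trees shallow (max_depth=2 here) keeps each
per-node explanation small enough for an expert to inspect.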
The proposed method is tested on the analysis of a discrete event
simulator, where a domain expert evaluated the induced knowledge and
found it very useful for understanding the simulated system.