The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network
Bartlett, P.L.
Information Theory, IEEE Transactions on
Volume 44, Issue 2, Mar 1998 Page(s):525 - 536
Digital Object Identifier 10.1109/18.661502
Summary: Sample complexity results from computational learning theory, when
applied to neural network learning for pattern classification problems,
suggest that for good generalization performance the number of training
examples should grow at least linearly with the number of adjustable
parameters in the network. Results in this paper show that if a large
neural network is used for a pattern classification problem and the
learning algorithm finds a network with small weights that has small
squared error on the training patterns, then the generalization
performance depends on the size of the weights rather than the number of
weights. For example, consider a two-layer feedforward network of
sigmoid units, in which the sum of the magnitudes of the weights
associated with each unit is bounded by A and the input dimension is n.
We show that the misclassification probability is no more than a certain
error estimate (that is related to squared error on the training set)
plus A^3 √((log n)/m) (ignoring log A and log m
factors), where m is the number of training patterns. This may explain
the generalization performance of neural networks, particularly when the
number of training examples is considerably smaller than the number of
weights. It also supports heuristics (such as weight decay and early
stopping) that attempt to keep the weights small during training. The
proof techniques appear to be useful for the analysis of other pattern
classifiers: when the input domain is a totally bounded metric space, we
use the same approach to give upper bounds on misclassification
probability for classifiers with decision boundaries that are far from
the training examples.
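
As a rough illustration of the bound quoted in the summary, the sketch below evaluates only the complexity term A^3 √((log n)/m). It is not taken from the paper: the function name `complexity_term` is hypothetical, constants and the log A and log m factors are dropped exactly as the summary drops them, and the training-set error estimate that the term is added to is not modeled. Note that the number of weights in the network does not appear as an argument.

```python
import math


def complexity_term(A: float, n: int, m: int) -> float:
    """Evaluate A^3 * sqrt(log(n) / m).

    A: bound on the sum of weight magnitudes per unit (assumed > 0)
    n: input dimension (assumed >= 2 so that log n > 0)
    m: number of training patterns (assumed >= 1)

    Illustrative only: constants and log A, log m factors are omitted.
    """
    if A <= 0 or n < 2 or m < 1:
        raise ValueError("expected A > 0, n >= 2, m >= 1")
    return A ** 3 * math.sqrt(math.log(n) / m)


if __name__ == "__main__":
    # The term shrinks like 1/sqrt(m) and grows like A^3.
    for m in (400, 1600, 6400):
        print(m, complexity_term(2.0, 100, m))
```

With A = 2, n = 100 and m = 400 the term is roughly 0.86. Quadrupling m halves it, while doubling A multiplies it by 8, which is why keeping the weights small matters so much here. A value at or above 1 carries no information, since it is added to an estimate of a probability.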