Pattern Recognition and Machine Learning

by
Christopher M. Bishop

1 Introduction


Consider the example of recognizing handwritten digits. Each digit corresponds to a 28×28 pixel image and so can be represented by a vector x comprising 784 real numbers. The goal is to build a machine that will take such a vector x as input and that will produce the identity of the digit 0, . . . , 9 as the output.
By adopting a machine learning approach, a large set of N digits {x1, . . . , xN}, called a training set, is used to tune the parameters of an adaptive model. The categories of the digits in the training set are known in advance; there is one target vector t for each digit image x.
The result of running the machine learning algorithm can be expressed as a function y(x), which takes a new digit image x as input and generates an output vector y (encoded in the same way as the target vectors). The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data.

The original input variables are typically pre-processed to transform them into some new space of variables which makes it much easier for a subsequent pattern recognition algorithm to distinguish between the different classes. This pre-processing stage is sometimes also called feature extraction. Pre-processing might also be performed in order to speed up computation.

For supervised learning, each input vector has a corresponding output target vector in the training dataset.
Depending on the nature of the desired target, the task is called:
  • classification: assign each input vector to one of a finite number of discrete categories.
  • regression: the desired output consists of one or more continuous variables.

The technique of reinforcement learning is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward.

1.1 Example: Polynomial Curve Fitting


Consider a training set:
  • input: N observations of x, written x = (x1, . . . , xN)
  • target: t = (t1, . . . , tN), obtained by first computing the corresponding values of the function sin(2πx) and then adding a small level of Gaussian random noise to each such point to obtain the corresponding value tn.

Plot of a training data set of N =10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. Our goal is to predict the value of t for some new value of x, without knowledge of the green curve.
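As a minimal sketch, a data set like this could be generated as follows (assuming NumPy; the noise level of 0.3 and the random seed are arbitrary choices, not values from the text):

import numpy as np

rng = np.random.default_rng(0)           # fixed seed for reproducibility
N = 10                                   # number of training points
x = np.linspace(0.0, 1.0, N)             # inputs spaced over [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # targets: sin(2πx) plus Gaussian noise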

Let's fit the curve with a polynomial model:
y(x, W) = w0 + w1 x + w2 x² + ... + wM x^M
Note that, although the polynomial function y(x, W) is a nonlinear function of x, it is a linear function of the coefficients W.
The values of the coefficients will be determined by fitting the polynomial to the training data.
This can be done by minimizing an error function that measures the misfit between the function y(x, W) and the training set data points, for any given value of W; a standard choice is the sum of squares of the errors, E(W) = ½ Σn { y(xn, W) − tn }².


We can solve the curve fitting problem by choosing the value of W for which E(W) is as small as possible.
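For a fixed order M, minimizing the sum-of-squares error has a closed-form least-squares solution. A sketch with NumPy (the synthetic data generation is repeated so the snippet stands alone; M = 3 is just one choice):

import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix with columns 1, x, x^2, ..., x^M
X = np.vander(x, M + 1, increasing=True)
# Coefficients W minimizing the sum-of-squares error E(W)
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)         # fitted coefficients w0 ... wM
print(X @ w)     # fitted values y(xn, W) at the training inputs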
There remains the problem of choosing the order M of the polynomial.
The following plot shows 4 examples of the results of fitting polynomials having orders M = 0, 1, 3, and 9 to the data set:

  • M = 0 and M = 1 give poor representations of the function sin(2πx).
  • M = 3 gives the best fit to the function sin(2πx).
  • M = 9 gives an excellent fit to the training data but a very poor representation of the function sin(2πx).
  • This behaviour is called over-fitting.
Graphs of the root-mean-square error, E_RMS = sqrt(2 E(W*) / N), evaluated on the training set and on an independent test set, show that the test error rises sharply once M becomes too large.

Using the M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot):

We see that, for a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.
Examining the values of the coefficients shows that, as M increases, their magnitudes typically become much larger:
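A rough numerical check of both effects, again on synthetic data (a sketch using np.polyfit; the exact numbers depend on the noise and seed):

import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

for M in (0, 1, 3, 9):
    w = np.polyfit(x, t, deg=M)                       # least-squares polynomial fit of order M
    rms = np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))
    print(f"M={M}: training RMS error={rms:.3f}, max |coefficient|={np.abs(w).max():.1f}")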

One technique that is often used to control the over-fitting phenomenon in such cases is regularization which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values.
The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form Ẽ(W) = E(W) + (λ/2) ||W||², where ||W||² = w0² + w1² + · · · + wM² and the coefficient λ governs the relative importance of the regularization term.

Techniques such as this are known in the statistics literature as shrinkage methods.
In the context of neural networks, this approach is known as weight decay.
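A sketch of this regularized least-squares (ridge) solution under the same synthetic-data assumptions; lam stands for the regularization coefficient λ, and its value here is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 10, 9, 1e-3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

X = np.vander(x, M + 1, increasing=True)             # design matrix: 1, x, ..., x^M
# Regularized least squares: w = (X^T X + lam*I)^(-1) X^T t
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ t)
w_plain = np.linalg.lstsq(X, t, rcond=None)[0]
print(np.abs(w_plain).max(), np.abs(w_ridge).max())  # the penalty shrinks the coefficient magnitudes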


1.2 Probability Theory


A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets.


Bayes’ theorem:
p(Y|X) = p(X|Y) p(Y) / p(X)


We can view the denominator in Bayes’ theorem as being the normalization constant required to ensure that the sum of the conditional probability over all values of Y equals one.
An illustration of a distribution over two variables: X, which takes 9 possible values, and Y, which takes 2 possible values.
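A small numerical illustration of Bayes' theorem for two discrete variables shaped like this example (the joint probabilities are randomly generated for illustration, not taken from the figure):

import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((9, 2))            # unnormalized table over X (9 values) and Y (2 values)
joint /= joint.sum()                  # joint distribution p(X, Y)

p_x = joint.sum(axis=1)               # marginal p(X): sum over Y
p_y = joint.sum(axis=0)               # marginal p(Y): sum over X
p_x_given_y = joint / p_y             # p(X|Y) = p(X, Y) / p(Y)

# Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X)
p_y_given_x = p_x_given_y * p_y / p_x[:, None]
print(p_y_given_x.sum(axis=1))        # each row sums to one, as the normalization requires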



1.2.1 Probability densities


The concept of probability for discrete variables can be extended to that of a probability density p(x) over a continuous variable x: if the probability of x falling in the interval (x, x + δx) is given by p(x)δx for δx → 0, then p(x) is called the probability density function (PDF) over x.
The probability density can be expressed as the derivative of a cumulative distribution function P(x).

The probability that x lies in the interval (−∞, z) is given by the cumulative distribution function defined by P(z) = ∫_{−∞}^{z} p(x) dx, which satisfies P′(x) = p(x).
Unlike a probability, a probability density function can take values greater than one; for example, the uniform distribution on the interval [0, 1/2] has probability density
f(x) = 2 for 0 ≤ x ≤ 1/2, and f(x) = 0 elsewhere.
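A quick numerical check of this example, assuming SciPy is available (uniform on [0, 1/2] corresponds to scipy.stats.uniform(loc=0, scale=0.5)): the density exceeds one, the total probability is still one, and a finite difference of the CDF recovers the density, consistent with P′(x) = p(x).

from scipy.stats import uniform

dist = uniform(loc=0.0, scale=0.5)     # uniform distribution on [0, 1/2]
print(dist.pdf(0.25))                  # 2.0 -- a density can exceed one
print(dist.cdf(0.5))                   # 1.0 -- total probability is still one
eps = 1e-6
print((dist.cdf(0.25 + eps) - dist.cdf(0.25)) / eps)   # about 2.0, since P'(x) = p(x)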

1.2.2 Expectations and covariances


The average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x), written E[f] = Σ p(x) f(x) for a discrete distribution (with the sum replaced by an integral for a continuous one).

The variance of f(x), defined by var[f] = E[(f(x) − E[f(x)])²], provides a measure of how much variability there is in f(x) around its mean value.

For two random variables x and y, the covariance is defined by cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] and expresses the extent to which x and y vary together. If x and y are independent, then their covariance vanishes.
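These quantities can be estimated from samples; a minimal NumPy sketch (the particular distribution and the function f(x) = x² are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)           # samples of x
y = 2.0 * x + rng.normal(size=100_000) # y depends on x, so cov[x, y] should be nonzero
f = x ** 2                             # some function f(x)

print(f.mean())                        # sample estimate of E[f]
print(((f - f.mean()) ** 2).mean())    # sample estimate of var[f] = E[(f - E[f])^2]
print(np.mean((x - x.mean()) * (y - y.mean())))   # sample estimate of cov[x, y]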


1.2.3 Bayesian probabilities


In statistics,
  • The likelihood function (often simply called the likelihood) expresses the probability of a sample of data given a set of model parameter values; it is typically viewed as a function of the parameters.
  • Maximum likelihood estimation (MLE) estimates the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data are most probable.
Bayes’ theorem, which takes the form p(w|D) = p(D|w) p(w) / p(D), lets us evaluate the uncertainty in w after the data D have been observed.

p(D|w) is the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w.
In maximum likelihood, w is set to the value that maximizes the likelihood function p(D|w).
In machine learning, under a Gaussian noise model, maximizing the likelihood is equivalent to minimizing a sum-of-squares error function (the negative log likelihood).
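A sketch of this equivalence for a one-parameter model t = w·x plus Gaussian noise (the data, noise level, and parameter grid are invented for illustration): the w that maximizes the Gaussian likelihood is the same w that minimizes the sum-of-squares error.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
t = 3.0 * x + rng.normal(scale=0.2, size=x.size)     # true w = 3.0

ws = np.linspace(0.0, 6.0, 601)                      # candidate values of w
loglik = np.array([norm.logpdf(t, loc=w * x, scale=0.2).sum() for w in ws])
sq_err = np.array([((t - w * x) ** 2).sum() for w in ws])

print(ws[np.argmax(loglik)], ws[np.argmin(sq_err)])  # the two criteria pick the same w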

1.2.4 The Gaussian distribution

1.2.5 Curve fitting re-visited

1.2.6 Bayesian curve fitting

1.3 Model Selection

1.4 The Curse of Dimensionality

1.5 Decision Theory

1.5.1 Minimizing the misclassification rate

1.5.2 Minimizing the expected loss

1.5.3 The reject option

1.5.4 Inference and decision

1.5.5 Loss functions for regression

1.6 Information Theory

1.6.1 Relative entropy and mutual information


2 Probability Distributions


A common modeling problem involves how to estimate a joint probability distribution for a dataset.


2.1 Binary Variables

2.1.1 The beta distribution

2.2 Multinomial Variables

2.2.1 The Dirichlet distribution

2.3 The Gaussian Distribution

2.3.1 Conditional Gaussian distributions

2.3.2 Marginal Gaussian distributions

2.3.3 Bayes’ theorem for Gaussian variables

2.3.4 Maximum likelihood for the Gaussian

2.3.5 Sequential estimation

2.3.6 Bayesian inference for the Gaussian

2.3.7 Student’s t-distribution

2.3.8 Periodic variables

2.3.9 Mixtures of Gaussians

2.4 The Exponential Family

2.4.1 Maximum likelihood and sufficient statistics

2.4.2 Conjugate priors

2.4.3 Noninformative priors

2.5 Nonparametric Methods

2.5.1 Kernel density estimators

2.5.2 Nearest-neighbour methods


3 Linear Models for Regression


Unsupervised learning includes topics such as density estimation and data clustering.
We turn now to a discussion of supervised learning, starting with regression.
We use correlation to denote association between two quantitative variables.
Regression involves estimating the best line to summarize the association.

The goal of regression is to predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables.
Finding an appropriate function y(x) whose values are predictions of t for new inputs x is closely related to modelling the predictive distribution p(t|x).

3.1 Linear Basis Function Models


The linear model for regression takes the form y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = wᵀφ(x), where the φj(x) are fixed basis functions.

By using nonlinear basis functions, we allow the function y(x, w) to be a non-linear function of the input vector x.
There are many other possible choices for the basis functions, for example sigmoidal or ‘tanh’ functions.
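A minimal sketch of a linear basis-function model using Gaussian basis functions (the number of basis functions, their centres, and their width are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

centres = np.linspace(0.0, 1.0, 9)     # centres mu_j of the Gaussian basis functions
s = 0.1                                # common width

def design_matrix(x):
    # phi_0(x) = 1 (bias), phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), phi])

Phi = design_matrix(x)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares weights
y = design_matrix(x) @ w                      # y(x, w) = w^T phi(x): nonlinear in x, linear in w
print(np.sqrt(np.mean((y - t) ** 2)))         # training RMS error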

Linear regression is a standard modeling method.
It maps one or more numerical inputs to a numerical output.
The model is defined in terms of parameters called coefficients: one coefficient per input plus an additional coefficient that provides the bias.
The prediction for a given input is the weighted sum of the inputs, with the coefficients as weights, plus the bias.
The training sample is finite and is expected to contain measurement error or statistical noise in the observations.

The parameters of the model (coefficients) must be estimated from the sample of observations.
There are many ways to estimate the parameters; the two most common frameworks are:
  • Least squares optimization: seek the set of parameters that gives the smallest squared error between the predictions of the model and the known outputs, averaged over all known inputs in the dataset (the mean squared error).
  • Maximum likelihood estimation: seek the set of parameters for the model that maximizes a likelihood function.
Both frameworks are optimization procedures that involve searching over the model parameters.
Different optimization algorithms may be used, such as the BFGS algorithm (or its variants) and general methods like stochastic gradient descent.
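A sketch comparing the two routes on invented toy data, assuming SciPy's general-purpose BFGS optimizer: the closed-form least-squares solution and a numerical minimization of the mean squared error agree.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.random(100)])    # bias column plus one input
t = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)

# Closed-form least squares
w_ls, *_ = np.linalg.lstsq(X, t, rcond=None)

# The same objective (mean squared error) minimized numerically with BFGS
mse = lambda w: np.mean((X @ w - t) ** 2)
w_bfgs = minimize(mse, x0=np.zeros(2), method="BFGS").x

print(w_ls, w_bfgs)    # the two estimates should agree closely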

3.1.1 Maximum likelihood and least squares


Estimation means assuming a probability distribution and then finding parameters for it such that the known samples are predicted as well as possible.
Density estimation involves selecting:
  • a probability distribution function
  • the parameters of that distribution
that best explain the joint probability distribution of the observed data.
This problem is made more challenging when the sample drawn from the population is small and noisy.
There are many techniques for solving this problem, two common approaches are:
  • Maximum a Posteriori (MAP), a Bayesian method.
  • Maximum Likelihood Estimation (MLE), a frequentist method.
In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data under a specific probability distribution and its parameters Θ,
P(X | Θ).
This is referred to as the likelihood of observing the data given the model parameters.
The objective of Maximum Likelihood Estimation is to find the set of parameters Θ that maximizes the likelihood function; assuming the samples are independent and identically distributed (i.i.d.), the likelihood factorizes as
P(X | Θ)
=   P(x1, x2, ..., xn | Θ)
=   P(x1 | Θ) * P(x2 | Θ) * ... * P(xn | Θ)

Because this product of many small probabilities is awkward to work with (and can underflow numerically), it is common to maximize the log-likelihood instead:
  log( P(X | Θ) )
= log( P(x1 | Θ) ) + log( P(x2 | Θ) ) + ... + log( P(xn | Θ) )
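A sketch of this recipe for an assumed Gaussian model, where SciPy's norm.logpdf plays the role of log P(xi | Θ) and the data are synthetic: summing the log-densities over the i.i.d. samples and maximizing recovers the familiar sample mean and standard deviation.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1000)       # i.i.d. samples

def neg_log_likelihood(theta):
    mu, log_sigma = theta                           # parameters Θ = (mu, sigma), sigma kept positive
    return -norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)).sum()

theta_hat = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="BFGS").x
print(theta_hat[0], np.exp(theta_hat[1]))           # MLE of mu and sigma
print(X.mean(), X.std())                            # closed-form MLE for comparison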


3.1.2 Geometry of least squares

3.1.3 Sequential learning

3.1.4 Regularized least squares

3.1.5 Multiple outputs

3.2 The Bias-Variance Decomposition

3.3 Bayesian Linear Regression

3.3.1 Parameter distribution

3.3.2 Predictive distribution

3.3.3 Equivalent kernel

3.4 Bayesian Model Comparison

3.5 The Evidence Approximation

3.5.1 Evaluation of the evidence function

3.5.2 Maximizing the evidence function

3.5.3 Effective number of parameters

3.6 Limitations of Fixed Basis Functions


4 Linear Models for Classification

4.1 Discriminant Functions

4.1.1 Two classes

4.1.2 Multiple classes

4.1.3 Least squares for classification

4.1.4 Fisher’s linear discriminant

4.1.5 Relation to least squares

4.1.6 Fisher’s discriminant for multiple classes

4.1.7 The perceptron algorithm

4.2 Probabilistic Generative Models

4.2.1 Continuous inputs

4.2.2 Maximum likelihood solution

4.2.3 Discrete features

4.2.4 Exponential family

4.3 Probabilistic Discriminative Models

4.3.1 Fixed basis functions

4.3.2 Logistic regression


Unlike linear regression, which predicts a numeric response, logistic regression is used to solve a classification problem.
Logistic regression predicts a probability, and we can label the result according to whether that probability exceeds 50%.
The output of a linear model ranges from negative infinity to positive infinity, so the sigmoid (logistic) function is introduced to map it into the interval (0, 1).
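A minimal logistic-regression sketch with NumPy (one input feature, plain gradient descent on the cross-entropy error, and an invented dataset; this is not the book's iterative reweighted least squares algorithm):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
labels = (x + 0.5 * rng.normal(size=200) > 0).astype(float)   # binary targets

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))          # squashes (-inf, inf) into (0, 1)

w, b = 0.0, 0.0
for _ in range(2000):                         # gradient descent on the cross-entropy error
    p = sigmoid(w * x + b)
    w -= 0.1 * np.mean((p - labels) * x)
    b -= 0.1 * np.mean(p - labels)

probs = sigmoid(w * x + b)                    # predicted probabilities
pred = (probs > 0.5).astype(float)            # label according to the 50% threshold
print((pred == labels).mean())                # training accuracy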


4.3.3 Iterative reweighted least squares

4.3.4 Multiclass logistic regression

4.3.5 Probit regression

4.3.6 Canonical link functions

4.4 The Laplace Approximation

4.4.1 Model comparison and BIC

4.5 Bayesian Logistic Regression

4.5.1 Laplace approximation

4.5.2 Predictive distribution



