November 11, 2019

Deep Learning with Python : A Hands-on Introduction

by
Nikhil Ketkar

2 Machine Learning Fundamentals

Binary Classification

Consider an abstract problem:

We have a dataset D that maps inputs to outputs: x --> y
We have access to only a subset of this dataset: S
Our task is to build a computational procedure that implements a function f such that f(x) = y for each (x, y) in S.
We can then use f to make predictions over the unseen data U, the part of D that is not in S.
The performance on this task can be measured by the mean squared error of the predictions.

If y is either 1 or -1, this is called binary classification.
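As a small illustration of the setup above, the sketch below builds a hypothetical dataset with y in {-1, 1}, keeps part of it as the seen subset S, and scores a simple threshold rule on the unseen part U. The data, the rule `f`, and the use of accuracy alongside the mean squared error are assumptions made for illustration, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset D: the label is 1 when x > 0, otherwise -1
x = rng.uniform(-1, 1, 200)
y = np.where(x > 0, 1, -1)

# S is the seen subset, U is the unseen remainder
x_seen, y_seen = x[:150], y[:150]
x_unseen, y_unseen = x[150:], y[150:]

# "Learn" a threshold halfway between the largest x labeled -1
# and the smallest x labeled 1 in S
threshold = (x_seen[y_seen == -1].max() + x_seen[y_seen == 1].min()) / 2

def f(values):
    """Predict 1 or -1 by comparing each value with the learned threshold."""
    return np.where(values > threshold, 1, -1)

predictions = f(x_unseen)
mse = np.mean((predictions - y_unseen) ** 2)
accuracy = np.mean(predictions == y_unseen)
print("MSE on U:", mse)
print("Accuracy on U:", accuracy)
```

With labels of 1 and -1, every wrong prediction contributes (±2)² = 4 to the squared error, so the mean squared error here is simply 4 times the fraction of misclassified points.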

Regression

If y is a real value, this is called a regression problem.
We measure performance on this task as the root mean squared error (RMSE) over unseen data.

Consider a toy dataset:

Inputs x are 100 equally spaced values between -1 and 1.
Observed outputs y are generated using y = 2 + x + 2 * x*x + ε, where ε is Gaussian noise with mean 0 and standard deviation 0.1.
The first 80 data points are used as seen data and the rest as unseen data.

Investigate the prepared dataset:


import numpy
import matplotlib.pyplot as plt

x = numpy.linspace(-1,1,100)
signal = 2 + x + 2 * x * x
noise = numpy.random.normal(0, 0.1, 100)
y = signal + noise

plt.plot(x, signal, 'b')
plt.plot(x, y,'g')
plt.plot(x, noise, 'r')
plt.xlabel("x")
plt.ylabel("y")
plt.legend(["Without Noise", "With Noise", "Noise"], loc = 2)
plt.title('dataset')
plt.show()

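To derive the matrix form of the prediction model, start from a single prediction:

$$
y = w_0 + w_1 x + w_2 x^2
  = \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix}
    \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}
$$

Stacking all data points (x_0, y_0), ..., (x_n, y_n) gives:

$$
Y = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{bmatrix}
  = \begin{bmatrix}
      1 & x_0 & x_0^2 \\
      1 & x_1 & x_1^2 \\
      \vdots & \vdots & \vdots \\
      1 & x_n & x_n^2
    \end{bmatrix}
    \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}
  = X W
$$

X is not square, and non-square matrices (m-by-n matrices for which m ≠ n) do not have an inverse, so we cannot solve for W by inverting X directly. Multiplying both sides by the transpose of X produces the square matrix X^T X:

$$
X^T Y = X^T (X W) = (X^T X) W
$$

Assuming X^T X is invertible, we get:

$$
W = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = (X^T X)^{-1} X^T Y
$$

In NumPy terms, W = dot( inverse( dot(transpose(X), X) ), dot(transpose(X), Y) ). With noisy data, this W is the least-squares solution.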
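The closed-form solution above can be checked numerically. The sketch below builds the matrix X for a degree-2 polynomial on noise-free data, so the recovered weights should match the generating coefficients, and compares the result with `numpy.linalg.lstsq`. The variable names (`x_demo`, `y_demo`) and the noise-free setup are assumptions chosen for illustration.

```python
import numpy

x_demo = numpy.linspace(-1, 1, 100)
y_demo = 2 + x_demo + 2 * x_demo * x_demo  # noise-free signal

# Matrix X with columns 1, x, x^2
X = numpy.column_stack([numpy.ones_like(x_demo), x_demo, x_demo ** 2])

# Closed form: W = inverse(X^T X) X^T Y
W = numpy.dot(numpy.linalg.inv(numpy.dot(X.T, X)), numpy.dot(X.T, y_demo))

# Least-squares solver, which avoids forming the inverse explicitly
W_lstsq, *_ = numpy.linalg.lstsq(X, y_demo, rcond=None)

print(W)
print(W_lstsq)
```

Both approaches should agree here. For higher polynomial degrees X^T X becomes poorly conditioned, which is one reason a least-squares solver is often preferred over an explicit inverse.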

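# Example: compute the inverse of a square matrix
# (separate names are used so the dataset variables x and y are not overwritten)

import numpy as np

m = np.array([[1, 2], [3, 4]])
print(m)
m_inv = np.linalg.inv(m)
print(m_inv)
print(np.dot(m, m_inv))  # approximately the identity matrix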

[[1 2]
 [3 4]]
[[-2.   1. ]
 [ 1.5 -0.5]]
[[1.00000000e+00 1.11022302e-16]
 [0.00000000e+00 1.00000000e+00]]

Train polynomial models W with degrees from 0 to 9:


x_train = x[0:80]
y_train = y[0:80]

for degree in range(0, 10):
    plt.figure()
    plt.title("Polynomial with degree=%s" % degree)
    plt.xlabel("x")
    plt.ylabel("y")

    # Matrix with columns x^0, x^1, ..., x^degree for the training inputs
    x_train_power = numpy.column_stack([numpy.power(x_train, i) for i in range(0, degree + 1)])

    # Closed-form solution: model = inverse(X^T X) X^T y
    A = numpy.linalg.inv(numpy.dot(x_train_power.transpose(), x_train_power))
    B = numpy.dot(x_train_power.transpose(), y_train)
    model = numpy.dot(A, B)

    # Predict over all 100 points, seen and unseen
    predicted = numpy.dot(model, [numpy.power(x, i) for i in range(0, degree + 1)])
    plt.plot(x, y, 'g')
    plt.plot(x, predicted, 'r')
    plt.legend(["Actual", "Predicted"], loc="upper left")

    # Note: these are square roots of the summed squared errors (no division
    # by the number of points), so they equal the RMSE scaled by sqrt(n).
    train_error = y_train - predicted[0:80]
    test_error = y[80:] - predicted[80:]
    train_rmse1 = numpy.sqrt(numpy.dot(train_error, train_error))
    test_rmse1 = numpy.sqrt(numpy.dot(test_error, test_error))
    print("Train RMSE (Degree =", degree, ")", train_rmse1)
    print("Test RMSE (Degree =", degree, ")", test_rmse1)
    print(model)
    plt.show()

Train RMSE (Degree = 0 ) 3.6448822133795784
Test RMSE (Degree = 0 ) 8.499796101735102
[2.32905563]
Train RMSE (Degree = 1 ) 3.5232418065459656
Test RMSE (Degree = 1 ) 7.499222240632106
[2.37426529 0.22378783]
Train RMSE (Degree = 2 ) 0.8223450051988356
Test RMSE (Degree = 2 ) 0.4911435665380988
[2.02625045 1.01902081 1.96820162]
Train RMSE (Degree = 3 ) 0.8133570004137999
Test RMSE (Degree = 3 ) 0.7701412446917815
[ 2.03830761 1.06481831 1.86510255 -0.17011346]
Train RMSE (Degree = 4 ) 0.8072520424521991
Test RMSE (Degree = 4 ) 0.6570466270147962
[2.04356742 0.99859357 1.75720871 0.10714783 0.34311084]
Train RMSE (Degree = 5 ) 0.7971522010821985
Test RMSE (Degree = 5 ) 3.650434931831632
[ 2.05962616 1.02120819 1.370635 -0.23591122 1.43750972 1.0834549 ]
Train RMSE (Degree = 6 ) 0.7882856453181875
Test RMSE (Degree = 6 ) 3.5416715812139725
[ 2.05843726 0.89456001 1.36952875 1.14558097 2.12845576 -1.94193724 -2.49594851]
Train RMSE (Degree = 7 ) 0.778397460512095
Test RMSE (Degree = 7 ) 14.639323705429433
[ 2.04332978 0.94205968 2.07830611 0.75848315 -2.90083785 -3.21085645 6.68650312 6.49330508]
Train RMSE (Degree = 8 ) 0.7757913639404729
Test RMSE (Degree = 8 ) 37.47704429743608
[ 2.03881617 0.86695982 2.31977351 2.30182838 -4.3104382 -11.54742806 6.08837646 19.81491111 8.24274373]
Train RMSE (Degree = 9 ) 0.775246943739547
Test RMSE (Degree = 9 ) 11.779680592275328
[ 2.03631828 0.90042238 2.51542746 1.58305393 -6.91584564 -8.33112949 17.83965102 18.95676124 -8.77531865 -9.35993431]
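The values printed above are square roots of summed squared errors, so they grow with the number of points, and the train (80 points) and test (20 points) figures are not on the same scale. A minimal sketch of a helper that divides by the number of points is shown below; the function name `rmse` and the sample arrays are hypothetical.

```python
import numpy

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    residual = numpy.asarray(actual, dtype=float) - numpy.asarray(predicted, dtype=float)
    return numpy.sqrt(numpy.mean(residual ** 2))

# Hypothetical values: every prediction is off by exactly 0.1
actual = numpy.array([1.0, 2.0, 3.0, 4.0])
predicted = actual + 0.1

# Mean-based RMSE stays at the per-point error scale
print(rmse(actual, predicted))

# The sum-based version grows with sqrt(number of points)
print(numpy.sqrt(numpy.sum((actual - predicted) ** 2)))
```

Applied to the results above, dividing the degree-2 values by sqrt(80) and sqrt(20) gives roughly 0.092 for the train data and 0.110 for the test data, both close to the noise standard deviation of 0.1.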

The true coefficients are [2 1 2].

Among the degrees tried, the RMSE on the test dataset is lowest when degree=2.
The RMSE on the train dataset is lowest when degree=9.

The data was generated using a second-order polynomial (degree = 2) with some noise. We then tried to approximate it using models of degree 0, 1, ..., 9.
The results show that a higher-degree model ends up fitting the noise in addition to the signal: the training error keeps shrinking as the degree grows, while the test error rises sharply beyond degree 4.

In the real world, we don't know the underlying mechanism that generates the data.
This leads to the fundamental challenge in machine learning: does the model truly generalize?
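One common way to probe generalization is to hold out part of the data and pick the model that does best on it. The sketch below does this for the toy dataset using a random split and `numpy.linalg.lstsq`. The random split, the seed, and the helper names `fit_polynomial` and `rmse` are assumptions for illustration. The experiment above holds out the last 20 points instead, which tests extrapolation to roughly x > 0.6.

```python
import numpy

rng = numpy.random.default_rng(0)
x_all = numpy.linspace(-1, 1, 100)
y_all = 2 + x_all + 2 * x_all * x_all + rng.normal(0, 0.1, 100)

# Random split into 80 training and 20 validation points (an assumption)
indices = rng.permutation(100)
train_idx, valid_idx = indices[:80], indices[80:]

def fit_polynomial(x, y, degree):
    """Least-squares fit; returns weights for 1, x, ..., x^degree."""
    X = numpy.vander(x, degree + 1, increasing=True)
    weights, *_ = numpy.linalg.lstsq(X, y, rcond=None)
    return weights

def rmse(x, y, weights):
    """Root mean squared error of the polynomial given by weights."""
    X = numpy.vander(x, len(weights), increasing=True)
    return numpy.sqrt(numpy.mean((X @ weights - y) ** 2))

valid_errors = []
for degree in range(10):
    weights = fit_polynomial(x_all[train_idx], y_all[train_idx], degree)
    valid_errors.append(rmse(x_all[valid_idx], y_all[valid_idx], weights))

best_degree = int(numpy.argmin(valid_errors))
print("Validation RMSE per degree:", numpy.round(valid_errors, 3))
print("Degree with lowest validation RMSE:", best_degree)
```

Because the held-out points are noisy too, the selected degree can vary from run to run. The validation error only estimates how well each model generalizes.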

Regularization

A model that overfits will have low accuracy on unseen data. This happens because the model is trying too hard to capture the noise in the training dataset.

Consider a simple linear regression relation:

y ≈ β0 + β1*x1 + β2*x2 + … + βp*xp

where

y represents the learned relation
β represents the coefficient estimates for the different variables or predictors (x)

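A loss function, known as the residual sum of squares (RSS), is defined as follows. Writing the predictions for all n data points in matrix form, where x_{ij} is the value of predictor j for data point i:

$$
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
  = \begin{bmatrix}
      1 & x_{11} & \cdots & x_{1p} \\
      1 & x_{21} & \cdots & x_{2p} \\
      \vdots & \vdots & & \vdots \\
      1 & x_{n1} & \cdots & x_{np}
    \end{bmatrix}
    \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}
  = X \beta
$$

With T denoting the vector of observed target values, the loss is:

$$
\text{loss} = (T - Y)^T (T - Y)
$$

The fitting procedure finds the coefficients that minimize this loss function.
The word regularize means to make things regular or acceptable.
Regularization discourages learning a more complex or flexible model, so as to reduce the risk of overfitting.
A regularized version of the loss function adds a penalty on the size of the coefficients:

$$
\text{loss} = (T - Y)^T (T - Y) + \lambda \, \beta^T \beta
$$

λ is a user-defined, non-negative parameter that controls how strongly complex models (those with large coefficients) are penalized.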
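Assuming the penalty λ times the sum of squared coefficients is added to the residual sum of squares (the usual ridge, or L2, form), the minimizer has the closed form β = inverse(X^T X + λI) X^T T. The sketch below applies it to a degree-9 polynomial on the toy data and prints how the coefficient sizes shrink as λ grows. The helper name `ridge_fit`, the seed, and the chosen λ values are assumptions for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)
x_demo = numpy.linspace(-1, 1, 80)
t_demo = 2 + x_demo + 2 * x_demo * x_demo + rng.normal(0, 0.1, 80)

# Degree-9 polynomial features: flexible enough to fit the noise
X = numpy.vander(x_demo, 10, increasing=True)

def ridge_fit(X, t, lam):
    """Closed-form minimizer of the squared error plus lam * sum(beta^2).

    Note: this penalizes every coefficient, including the intercept.
    """
    identity = numpy.eye(X.shape[1])
    return numpy.linalg.solve(X.T @ X + lam * identity, X.T @ t)

for lam in [0.0, 0.01, 1.0]:
    beta = ridge_fit(X, t_demo, lam)
    print("lambda =", lam, " sum of squared coefficients =", float(beta @ beta))
```

With λ = 0 this reduces to the unregularized least-squares fit. Larger values of λ pull the coefficients toward zero, trading a slightly worse fit on the training data for a smoother model.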
