Deep Learning with Python: A Hands-on Introduction

by
Nikhil Ketkar

2 Machine Learning Fundamentals


Binary Classification


Consider an abstract problem:
  • We have a dataset D of input-output pairs: x --> y.
  • We observe only a subset S of this dataset.
  • Our task is to generate a computational procedure that implements a function f such that f(x) = y for each (x, y) in S.
  • We can then use f to make predictions over the unseen data U, that is, the part of D not in S.
  • Performance on this task can be measured by the mean squared error over the unseen data.
If y takes only the values 1 or -1, this is called binary classification.
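A minimal sketch of this setup follows; the dataset, the 80/20 split into seen and unseen portions, and the crude threshold classifier are illustrative assumptions, not an example from the book. (With ±1 labels, the mean squared error is simply four times the misclassification rate reported here.)

import numpy

# Toy binary dataset: the true label is +1 when x > 0 and -1 otherwise,
# with roughly 10% of the labels flipped to act as noise.
x = numpy.linspace(-1, 1, 100)
y = numpy.where(x > 0, 1, -1)
flip = numpy.random.rand(100) < 0.1
y[flip] = -y[flip]

# Seen subset S (first 80 points) and unseen subset U (last 20 points).
x_seen, y_seen = x[:80], y[:80]
x_unseen, y_unseen = x[80:], y[80:]

# A very simple f: threshold halfway between the class means on S.
threshold = 0.5 * (x_seen[y_seen == 1].mean() + x_seen[y_seen == -1].mean())

def f(inputs):
    return numpy.where(inputs > threshold, 1, -1)

# Measure performance on the unseen data U.
print("Error rate on U:", numpy.mean(f(x_unseen) != y_unseen))

The classifier here is deliberately crude; the point is only that f is learned from S and evaluated on U.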

Regression


If y is a real value, this is called a regression problem.
We measure performance on this task as the root mean squared error (RMSE) over the unseen data.
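For reference, RMSE is the square root of the mean squared difference between predicted and observed values; a minimal NumPy helper (the name rmse is just illustrative):

import numpy

def rmse(predicted, observed):
    # Root mean squared error: sqrt of the average squared difference.
    return numpy.sqrt(numpy.mean((predicted - observed) ** 2))

# Predictions off by 0.1 everywhere give an RMSE of 0.1.
print(rmse(numpy.array([1.1, 2.1, 2.9]), numpy.array([1.0, 2.0, 3.0])))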


Consider a toy dataset:
  • Inputs x are 100 equidistant values between -1 and 1.
  • Observed outputs y are generated using y = 2 + x + 2 * x*x + ε.
  • ε is noise (random variation) drawn from a normal distribution with mean 0 and standard deviation 0.1.
  • The first 80 data points are treated as seen data and the remaining 20 as unseen data.
Generate and plot the dataset:

import numpy
import matplotlib.pyplot as plt

# 100 equidistant inputs and the underlying (noise-free) signal.
x = numpy.linspace(-1, 1, 100)
signal = 2 + x + 2 * x * x
# Gaussian noise with mean 0 and standard deviation 0.1.
noise = numpy.random.normal(0, 0.1, 100)
y = signal + noise

plt.plot(x, signal, 'b')
plt.plot(x, y, 'g')
plt.plot(x, noise, 'r')
plt.xlabel("x")
plt.ylabel("y")
plt.legend(["Without Noise", "With Noise", "Noise"], loc=2)
plt.title('Dataset')
plt.show()


To derive the matrix form of the prediction model:

  y = w0 + w1*x + w2*(x*x)
    = dot( [w0, w1, w2], [1, x, x*x] )

Stack the n observed outputs into a column vector Y, the coefficients into a column vector W, and the rows [1, xi, xi*xi] into the design matrix X:

  Y = [ y0, y1, ..., yn ]^T

  W = [ w0, w1, w2 ]^T

      [ 1  x0  (x0*x0) ]
      [ 1  x1  (x1*x1) ]
  X = [       ...      ]
      [ 1  xn  (xn*xn) ]

Then all observations can be written at once as

  Y = dot( X, W )

X is not square (it is an n-by-3 matrix, and non-square matrices do not have an inverse), so we cannot solve for W by inverting X directly. Multiplying both sides by transpose(X) produces a square matrix that can be inverted:

  dot( transpose(X), Y ) = dot( transpose(X), dot( X, W ) )
                         = dot( dot( transpose(X), X ), W )

Solving for W:

  W = dot( inverse( dot( transpose(X), X ) ), dot( transpose(X), Y ) )

# Example: calculate the inverse of a square matrix with numpy.linalg.inv

import numpy as np

x = np.array([[1, 2], [3, 4]])
print(x)
y = np.linalg.inv(x)
print(y)
# The product of a matrix and its inverse is the identity (up to floating-point error).
print(np.dot(x, y))

[[1 2]
 [3 4]]
[[-2.   1. ]
 [ 1.5 -0.5]]
[[1.00000000e+00 1.11022302e-16]
 [0.00000000e+00 1.00000000e+00]]
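Putting these pieces together, here is a minimal sketch of the closed-form solution for the degree-2 model, with data generated the same way as the toy dataset above (numpy.linalg.lstsq is shown as an alternative because explicitly inverting dot(transpose(X), X) can be numerically fragile):

import numpy

# Recreate the toy dataset from above.
x = numpy.linspace(-1, 1, 100)
y = 2 + x + 2 * x * x + numpy.random.normal(0, 0.1, 100)

# Design matrix with rows [1, x_i, x_i^2].
X = numpy.column_stack([numpy.ones_like(x), x, x * x])

# Normal equations: W = inverse(X^T X) X^T Y.
W = numpy.dot(numpy.linalg.inv(numpy.dot(X.T, X)), numpy.dot(X.T, y))
print(W)  # should be close to the true coefficients [2, 1, 2]

# Equivalent, numerically safer least-squares solve.
W_lstsq, _, _, _ = numpy.linalg.lstsq(X, y, rcond=None)
print(W_lstsq)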


Train polynomial models W with degrees from 0 to 9:

x_train = x[0:80]
y_train = y[0:80]



for degree in range(0, 10):
    plt.figure()
    figtitle = "Polynomial with degree=%s" % (degree)
    plt.title(figtitle)
    plt.xlabel("x")
    plt.ylabel("y")
    # Design matrix for the seen data: columns are x^0, x^1, ..., x^degree.
    x_train_power = numpy.column_stack([numpy.power(x_train, i) for i in range(0, degree + 1)])
    # Normal equations: model = inverse(X^T X) X^T Y.
    A = numpy.linalg.inv(numpy.dot(x_train_power.transpose(), x_train_power))
    B = numpy.dot(x_train_power.transpose(), y_train)
    model = numpy.dot(A, B)
    # Predictions over the full range of x (seen and unseen).
    predicted = numpy.dot(model, [numpy.power(x, i) for i in range(0, degree + 1)])
    plt.plot(x, y, 'g')
    plt.plot(x, predicted, 'r')
    plt.legend(["Actual", "Predicted"], loc="upper left")
    # Note: these are root-sum-of-squares errors (the mean is not taken),
    # computed separately on the seen (first 80) and unseen (last 20) points.
    train_rmse1 = numpy.sqrt(numpy.sum(numpy.dot(y_train - predicted[0:80], y_train - predicted[0:80])))
    test_rmse1 = numpy.sqrt(numpy.sum(numpy.dot(y[80:] - predicted[80:], y[80:] - predicted[80:])))
    print("Train RMSE (Degree =", degree, ")", train_rmse1)
    print("Test RMSE (Degree =", degree, ")", test_rmse1)
    print(model)
    plt.show()


  • Train RMSE (Degree = 0 ) 3.6448822133795784
  • Test RMSE (Degree = 0 ) 8.499796101735102
  • [2.32905563]

  • Train RMSE (Degree = 1 ) 3.5232418065459656
  • Test RMSE (Degree = 1 ) 7.499222240632106
  • [2.37426529 0.22378783]

  • Train RMSE (Degree = 2 ) 0.8223450051988356
  • Test RMSE (Degree = 2 ) 0.4911435665380988
  • [2.02625045 1.01902081 1.96820162]

  • Train RMSE (Degree = 3 ) 0.8133570004137999
  • Test RMSE (Degree = 3 ) 0.7701412446917815
  • [ 2.03830761 1.06481831 1.86510255 -0.17011346]

  • Train RMSE (Degree = 4 ) 0.8072520424521991
  • Test RMSE (Degree = 4 ) 0.6570466270147962
  • [2.04356742 0.99859357 1.75720871 0.10714783 0.34311084]

  • Train RMSE (Degree = 5 ) 0.7971522010821985
  • Test RMSE (Degree = 5 ) 3.650434931831632
  • [ 2.05962616 1.02120819 1.370635 -0.23591122 1.43750972 1.0834549 ]

  • Train RMSE (Degree = 6 ) 0.7882856453181875
  • Test RMSE (Degree = 6 ) 3.5416715812139725
  • [ 2.05843726 0.89456001 1.36952875 1.14558097 2.12845576 -1.94193724 -2.49594851]


  • Train RMSE (Degree = 7 ) 0.778397460512095
  • Test RMSE (Degree = 7 ) 14.639323705429433
  • [ 2.04332978 0.94205968 2.07830611 0.75848315 -2.90083785 -3.21085645 6.68650312 6.49330508]


  • Train RMSE (Degree = 8 ) 0.7757913639404729
  • Test RMSE (Degree = 8 ) 37.47704429743608
  • [ 2.03881617 0.86695982 2.31977351 2.30182838 -4.3104382 -11.54742806 6.08837646 19.81491111 8.24274373]


  • Train RMSE (Degree = 9 ) 0.775246943739547
  • Test RMSE (Degree = 9 ) 11.779680592275328
  • [ 2.03631828 0.90042238 2.51542746 1.58305393 -6.91584564 -8.33112949 17.83965102 18.95676124 -8.77531865 -9.35993431]

The true coefficients are [2, 1, 2].
  • The RMSE over the test (unseen) data is minimized when degree = 2:
    W = [2.02625045 1.01902081 1.96820162]
  • The RMSE over the train (seen) data is minimized when degree = 9:
    W = [ 2.03631828 0.90042238 2.51542746 1.58305393 -6.91584564 -8.33112949 17.83965102 18.95676124 -8.77531865 -9.35993431]
The data we generated used a second-order polynomial (degree = 2) plus noise, and we then approximated it with models of degree 0, 1, ..., 9.
The results show that a higher-degree model ends up fitting the noise in addition to the signal in the data.

In the real world, we don’t know the underlying mechanism by which the data is generated.
This leads us to the fundamental challenge in machine learning: does the model truly generalize?

Regularization


A model that is overfitting will have low accuracy on unseen data. This happens because the model tries too hard to capture the noise in the training dataset.

In a simple linear regression relation,

  y ≈ β0 + β1*x1 + β2*x2 + … + βp*xp

where
  • y represents the learned relation (the predicted response)
  • β0, β1, ..., βp represent the coefficient estimates for the different variables or predictors (x1, ..., xp)
A loss function, known as the residual sum of squares (RSS), is defined as follows.
Collect the observed targets into a column vector T = [ y1, y2, ..., yn ]^T, the coefficients into β = [ β0, β1, ..., βp ]^T, and the predictor rows [ 1, xi1, ..., xip ] into an n-by-(p+1) design matrix X, so that the vector of predictions is

  Y = dot( X, β )

The RSS loss is then

  loss = dot( transpose(T - Y), (T - Y) )


The fitting procedure finds the coefficients that minimize this loss function.
The word regularize means to make things regular or acceptable.
Regularization discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
A regularized version of the loss function adds a penalty on the size of the coefficients:

  loss = dot( transpose(T - Y), (T - Y) ) + λ * dot( transpose(β), β )

λ is a user-defined parameter that penalizes complex models; larger values of λ shrink the coefficients more strongly.
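A minimal sketch of how this regularized loss changes the closed-form solution, using the toy polynomial data from above; the ridge-style solution W = inverse(dot(transpose(X), X) + λI) · dot(transpose(X), Y) is the standard minimizer of this penalized loss (not code from the book), and λ = 0.1 is just an illustrative choice:

import numpy

# Toy data as before, fit with a deliberately flexible degree-9 model.
x = numpy.linspace(-1, 1, 100)
y = 2 + x + 2 * x * x + numpy.random.normal(0, 0.1, 100)
degree = 9
X = numpy.column_stack([numpy.power(x, i) for i in range(degree + 1)])

lam = 0.1  # illustrative choice; larger values shrink the coefficients more

def regularized_loss(beta):
    # RSS plus the penalty term lam * dot(transpose(beta), beta).
    residual = y - numpy.dot(X, beta)
    return numpy.dot(residual, residual) + lam * numpy.dot(beta, beta)

# Closed-form minimizer of the penalized loss:
# W = inverse(X^T X + lam * I) X^T Y
I = numpy.eye(X.shape[1])
W = numpy.dot(numpy.linalg.inv(numpy.dot(X.T, X) + lam * I), numpy.dot(X.T, y))
print(W)                    # high-order coefficients are pulled toward zero
print(regularized_loss(W))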

