ML by Stanford: Wk5
Ch'i YU Lv3

Take-Away Notes for Machine Learning by Stanford University on Coursera.

Week 5, Lecture 9

Cost Function and Back Propagation

Cost Function

Recall the cost function for regularized logistic regression:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log\big(h_\theta(x^{(i)})\big) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Here comes the cost function for a neural network:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[\, y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big) + (1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2$$

Where:

- $L$ = total number of layers in the network;
- $s_l$ = number of units (not counting the bias unit) in layer $l$;
- $K$ = number of output units/classes/categories.

In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.

In the regularization part, after the square brackets, we must account for multiple theta matrices.

The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit).

The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit).

As before with logistic regression, we square every $\Theta$ term in the regularization sum.

Note:

- the double sum simply adds up the logistic regression costs calculated for each cell in the output layer;
- the triple sum simply adds up the squares of all the individual $\Theta$s in the entire network;
- the $i$ in the triple sum does not refer to training example $i$, but to the node index, e.g. the $i$-th node in layer $l$.
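As a minimal sketch, the cost above could be computed in Octave for a 3-layer network as follows; X (m x n inputs), Y (m x K one-hot labels), Theta1, Theta2 and lambda are assumed to already exist, and all the names are illustrative rather than the lecture's code:

sigmoid = @(z) 1 ./ (1 + exp(-z));   % element-wise logistic function

m  = size(X, 1);
a1 = [ones(m, 1) X];                 % add bias units to the input layer
a2 = [ones(m, 1) sigmoid(a1 * Theta1')];
a3 = sigmoid(a2 * Theta2');          % h_Theta(x), an m x K matrix

% Double sum over the m examples and the K output units
J = -(1/m) * sum(sum(Y .* log(a3) + (1 - Y) .* log(1 - a3)));

% Regularization: triple sum over all Theta entries, skipping the bias columns
J = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + ...
                          sum(sum(Theta2(:, 2:end).^2)));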

Back Propagation Algorithm

Further notes are needed on the comparison between forward and back propagation.

Given training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$:

  1. Set $\Delta^{(l)}_{i,j} := 0$ (for all $l$, $i$, $j$)

For training example $t = 1$ to $m$:

  1. Set $a^{(1)} := x^{(t)}$;
  2. Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$;
  3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$;
  4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \;.\!*\; a^{(l)} \;.\!*\; (1 - a^{(l)})$;
  5. Accumulate $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$, or with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$.

Hence we obtain the gradient matrices:

$$D^{(l)}_{i,j} := \frac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right), \quad \text{if } j \neq 0$$

$$D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}, \quad \text{if } j = 0$$

The capital-delta matrix $\Delta$ is used as an "accumulator" to add up our values as we go along; $D$ then gives the partial derivatives, so that eventually $\frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta) = D^{(l)}_{i,j}$.
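A minimal sketch of the accumulation loop for the same hypothetical 3-layer network as in the cost sketch above (X, Y, Theta1, Theta2 and lambda assumed to exist; names and shapes are illustrative, not the lecture's exact code):

sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);

Delta1 = zeros(size(Theta1));   % capital-delta accumulators
Delta2 = zeros(size(Theta2));

for t = 1:m
  % Forward propagation for example t
  a1 = [1; X(t, :)'];                     % (n+1) x 1, with bias
  a2 = [1; sigmoid(Theta1 * a1)];         % (s2+1) x 1, with bias
  a3 = sigmoid(Theta2 * a2);              % K x 1 output

  % Output-layer error, then propagate it backwards
  d3 = a3 - Y(t, :)';                     % delta^(3)
  d2 = (Theta2' * d3) .* a2 .* (1 - a2);  % delta^(2), still has a bias row
  d2 = d2(2:end);                         % drop the bias error

  % Accumulate: Delta^(l) := Delta^(l) + delta^(l+1) * (a^(l))'
  Delta1 = Delta1 + d2 * a1';
  Delta2 = Delta2 + d3 * a2';
end

% D = unregularized gradient, then add (lambda/m) * Theta for j != 0
D1 = Delta1 / m;
D2 = Delta2 / m;
D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);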

Back Propagation Intuition

Omitted Until Further Implementation.

Chip: Actually, I don't think Andrew Ng's explanation is quite sufficient. More notes are on the way, but they won't be in this chapter.

Back Propagation in Practice

Implementation Note: Unrolling Parameters

In order to use optimization functions such as fminunc() in Octave, it is necessary to "unroll" all the elements into one long vector:

thetaVector = [Theta1(:); Theta2(:); Theta3(:)]
deltaVector = [D1(:); D2(:); D3(:)]

To get back the original matrices, use reshape():

Theta1 = reshape(thetaVector(1:110), 10, 11)    % 10 x 11 matrix
Theta2 = reshape(thetaVector(111:220), 10, 11)  % 10 x 11 matrix
Theta3 = reshape(thetaVector(221:231), 1, 11)   % 1 x 11 matrix
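Because Octave stores matrices column-major, unrolling with (:) and rebuilding with reshape() are exact inverses. A quick illustrative round-trip check:

% Round-trip check: unroll three weight matrices, then rebuild them.
Theta1 = rand(10, 11);  Theta2 = rand(10, 11);  Theta3 = rand(1, 11);
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];   % 231 x 1 vector

T1 = reshape(thetaVector(1:110),   10, 11);
T2 = reshape(thetaVector(111:220), 10, 11);
T3 = reshape(thetaVector(221:231),  1, 11);

assert(isequal(T1, Theta1) && isequal(T2, Theta2) && isequal(T3, Theta3));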

Gradient Checking

To approximate the derivative of the cost function:

$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

With multiple theta matrices, the derivative with respect to $\Theta_j$ is approximated as follows:

$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

This method is designed only to verify/validate the back propagation implementation. Once back propagation has been verified, there is no need to compute gradApprox(i) again, because it is computationally very slow.
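A minimal sketch of the two-sided estimate in Octave, assuming costFunction(theta) returns $J(\Theta)$ for an unrolled parameter vector theta (both names are placeholders, not the lecture's code):

% Two-sided numerical estimate of the gradient of J with respect to each
% component of the unrolled parameter vector theta.
epsilon = 1e-4;
gradApprox = zeros(size(theta));

for i = 1:numel(theta)
  thetaPlus  = theta;  thetaPlus(i)  = thetaPlus(i)  + epsilon;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (costFunction(thetaPlus) - costFunction(thetaMinus)) / (2 * epsilon);
end

% gradApprox should agree with the backprop gradient (deltaVector) to several
% decimal places; once it does, turn gradient checking off.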

Random Initialization

Why Not Zeros?

Omitted until further implementation.

Conclusion in a nutshell: initializing all weights to the same value makes every hidden unit compute the same function, so the gradient updates stay identical and learning cannot make progress.

Symmetry Breaking

Initialize each $\Theta^{(l)}_{i,j}$ to a random value in $[-\epsilon, \epsilon]$ to break symmetry.

(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
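A minimal sketch in Octave; the 10 x 11 / 1 x 11 shapes and the name INIT_EPSILON are illustrative:

% Each weight is drawn uniformly from [-INIT_EPSILON, INIT_EPSILON];
% rand(r, c) returns an r x c matrix of values in [0, 1].
INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;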

Putting It Together

To Train A Neural Network

  1. Randomly initialize the weights;
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$;
  3. Implement the cost function $J(\Theta)$;
  4. Implement forward propagation & back propagation to compute the partial derivatives $\frac{\partial}{\partial \Theta^{(l)}_{j,k}} J(\Theta)$;
  5. Use gradient checking to confirm/validate that back propagation works correctly (compare the backprop gradient with the numerical estimate of the gradient of $J(\Theta)$, aka gradApprox). Then disable gradient checking;
  6. Use gradient descent or an advanced optimization method with back propagation to minimize $J(\Theta)$ as a function of the parameters $\Theta$, as sketched after this list.
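As a hedged sketch of step 6, using Octave's fminunc with a hypothetical nnCostFunction wrapper that returns both the cost and the unrolled gradient (see "Unrolling Parameters" above; X, Y, lambda and the randomly initialized Theta1, Theta2 are assumed to exist):

% Unroll the randomly initialized weights into a single vector
initialTheta = [Theta1(:); Theta2(:)];

% Tell the optimizer that our cost function also returns the gradient
options = optimset('GradObj', 'on', 'MaxIter', 400);
[optTheta, cost] = fminunc(@(t) nnCostFunction(t, X, Y, lambda), ...
                           initialTheta, options);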

Application of Neural Networks

Autonomous Driving

Omitted.