Take-Away Notes for Machine Learning by Stanford University on Coursera.
Week 5, Lecture 9
Cost Function and Back Propagation
Cost Function
Recall that the cost function for Regularized Logistic Regression is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Here comes the cost function for a Neural Network:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + (1 - y_k^{(i)})\log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2$$

Where:
- $m$ = number of training examples;
- $L$ = total number of layers in the network;
- $s_l$ = number of units (not counting the bias unit) in layer $l$;
- $K$ = number of output units / classes.
In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta matrices.
The number of columns in our current theta matrix $\Theta^{(l)}$ is equal to the number of nodes in our current layer, $s_l + 1$ (including the bias unit). The number of rows in $\Theta^{(l)}$ is equal to the number of nodes in the next layer, $s_{l+1}$ (excluding the bias unit). As before with logistic regression, we square every term in the regularization sum.
Note:
- the double sum simply adds up the logistic-regression-style costs calculated for each unit in the output layer;
- the triple sum simply adds up the squares of all the individual $\Theta$ values in the entire network;
- the $i$ in the triple sum does not refer to training example $i$.
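As a concrete illustration, the cost above can be computed in a vectorized way in Octave. The following is only a rough sketch for a 3-layer network under my own assumptions: X is the m x n design matrix, y_matrix is an m x K one-hot label matrix, Theta1/Theta2 are the two weight matrices, and lambda is the regularization parameter; none of these names come from the lecture.

```octave
% Sketch: vectorized cost of a 3-layer network (input -> hidden -> output).
% Assumes X (m x n), y_matrix (m x K, one-hot), Theta1 (s2 x (n+1)),
% Theta2 (K x (s2+1)), and lambda (regularization parameter).
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);

a1 = [ones(m, 1) X];                     % add bias column
a2 = [ones(m, 1) sigmoid(a1 * Theta1')]; % hidden activations, with bias
a3 = sigmoid(a2 * Theta2');              % h_Theta(x) for all examples, m x K

% Double sum over training examples and output units
J = (-1/m) * sum(sum(y_matrix .* log(a3) + (1 - y_matrix) .* log(1 - a3)));

% Triple sum: square every Theta term except the bias columns
J = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + ...
                          sum(sum(Theta2(:, 2:end).^2)));
```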
Back Propagation Algorithm
Further notes are needed on the comparison between Back Propagation and Forward Propagation.
Given a training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$:
- Set $\Delta^{(l)}_{i,j} := 0$ (for all $l$, $i$, $j$);
- For training example $t = 1$ to $m$:
  - Set $a^{(1)} := x^{(t)}$;
  - Perform Forward Propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$;
  - Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$;
  - Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) .\!* \, a^{(l)} .\!* \, (1 - a^{(l)})$;
  - Accumulate $\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a^{(l)}_j \delta^{(l+1)}_i$, or, vectorized, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$.

Hence the new gradient matrices $D$:
- $D^{(l)}_{i,j} := \frac{1}{m}\big(\Delta^{(l)}_{i,j} + \lambda \Theta^{(l)}_{i,j}\big)$ if $j \neq 0$;
- $D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}$ if $j = 0$.

Where the capital-delta matrix $\Delta$ acts as an accumulator for these values, and $D$ gives the partial derivatives: $\frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta) = D^{(l)}_{i,j}$.
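For intuition only, here is how one pass of the loop above could look in Octave for a 3-layer network. This is a minimal sketch under my own assumptions (variable names X, Y, Theta1, Theta2, one-hot labels Y, and regularization parameter lambda), not the assignment's reference implementation.

```octave
% Minimal backprop sketch for a 3-layer network (input -> hidden -> output).
% Assumes X (m x n), Y (m x K one-hot labels), Theta1 (s2 x (n+1)),
% Theta2 (K x (s2+1)), and lambda (regularization parameter).
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for t = 1:m
  % Forward propagation for example t
  a1 = [1; X(t, :)'];                            % (n+1) x 1, with bias
  a2 = [1; sigmoid(Theta1 * a1)];                % (s2+1) x 1, with bias
  a3 = sigmoid(Theta2 * a2);                     % K x 1 = h_Theta(x^(t))

  % Back propagation of the "error" terms
  delta3 = a3 - Y(t, :)';                        % K x 1
  delta2 = (Theta2' * delta3) .* a2 .* (1 - a2);
  delta2 = delta2(2:end);                        % drop the bias term

  % Accumulate
  Delta1 = Delta1 + delta2 * a1';
  Delta2 = Delta2 + delta3 * a2';
end

% Gradients D, regularizing everything except the bias column
D1 = Delta1 / m;  D1(:, 2:end) += (lambda / m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) += (lambda / m) * Theta2(:, 2:end);
```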
Back Propagation Intuition
Omitted Until Further Implementation.
Chip: Personally, I don't think Andrew Ng's explanation is sufficient on its own. More notes are on the way, but they won't be in this chapter.
Back Propagation in Practice
Implementation Note: Unrolling Parameters
In order to use optimization functions such as fminunc() in Octave, it is necessary to "unroll" all the elements of the weight matrices into one long vector:

thetaVector = [Theta1(:); Theta2(:); Theta3(:)]
In order to get the original matrices back, use the function reshape (here assuming Theta1 is 10 x 11, so its 110 elements come first in the vector):

Theta1 = reshape(thetaVector(1:110), 10, 11)   % 10 x 11 matrix
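The gradient matrices produced by back propagation are unrolled in exactly the same way before being handed to the optimizer, and every matrix is recovered with its own reshape call. A small sketch, assuming example sizes of 10 x 11 for Theta1 and Theta2 and 1 x 11 for Theta3, with D1, D2, D3 being the gradients from back propagation:

```octave
% Unroll parameters and gradients into single column vectors
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];
deltaVector = [D1(:); D2(:); D3(:)];

% Reshape back, assuming Theta1 and Theta2 are 10 x 11 and Theta3 is 1 x 11
Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);
```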
Gradient Checking
To approximate the derivative of the cost function:

$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

and, with multiple theta matrices, for each parameter $\Theta_j$:

$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

A small value such as $\epsilon = 10^{-4}$ works well. This method is designed only to verify / validate the back propagation algorithm. Once the check passes, there is no need to keep re-computing gradApprox(i), because the numerical approximation is far too slow to use during training.
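In Octave, the check can be written as a simple loop over the unrolled parameter vector. A minimal sketch, assuming theta is the unrolled parameter vector and J is a function handle that evaluates the cost at a given parameter vector (both names are my assumptions):

```octave
% Numerical gradient approximation for each parameter, to compare against
% the gradient produced by back propagation (e.g. the unrolled deltaVector).
epsilon = 1e-4;
n = length(theta);                  % theta: unrolled parameter vector
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;   thetaPlus(i) = thetaPlus(i) + epsilon;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
% gradApprox should be approximately equal to the backprop gradient.
```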
Random Initialization
Why Not Zeros?
Omitted until further implementation.
Conclusion in a nutshell: all-same. If every weight starts at the same value (e.g. zero), every hidden unit in a layer computes the same function and receives the same gradient update, so the units remain identical after every iteration.
Symmetry Breaking
Initialize each $\Theta^{(l)}_{i,j}$ to a random value in $[-\epsilon, \epsilon]$.
(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
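In Octave, rand(i, j) returns an i x j matrix of values in [0, 1], so scaling and shifting maps them into [-INIT_EPSILON, INIT_EPSILON]. A short sketch with example matrix sizes of my choosing (10 x 11 and 1 x 11):

```octave
% Break symmetry: initialize each weight to a random value in [-INIT_EPSILON, INIT_EPSILON].
INIT_EPSILON = 0.12;   % example value; unrelated to the gradient-checking epsilon
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```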
Putting It Together
To Train A Neural Network
- Randomly initialize the Weights;
- Implement Forward Propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$;
- Implement the Cost Function $J(\Theta)$;
- Implement Forward Propagation & Backward Propagation to compute the partial derivatives $\frac{\partial}{\partial \Theta^{(l)}_{j,k}} J(\Theta)$;
- Use Gradient Checking to confirm / validate that the Backward Propagation implementation is correct (compare the backprop gradient $D$ with the numerical estimate of the gradient of $J(\Theta)$, aka gradApprox). Then disable Gradient Checking;
- Use Gradient Descent or an Advanced Optimization method with Backward Propagation to minimize $J(\Theta)$ as a function of the parameters $\Theta$.
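Wired together in Octave, the training flow might look roughly like the following. This is a hedged sketch, not the course's code: nnCostFunction (returning [J, grad] with grad already unrolled), randInitializeWeights, the layer sizes, and the training data X, y are hypothetical names and values of mine; fminunc and optimset are standard Octave functions.

```octave
% Hypothetical end-to-end training flow for a 3-layer network.
input_layer_size  = 400;   % example sizes
hidden_layer_size = 25;
num_labels        = 10;
lambda            = 1;

% 1. Randomly initialize weights (symmetry breaking), then unroll them
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];

% 2-4. nnCostFunction performs forward + back propagation and returns [J, grad]
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);

% 5. Gradient checking would be run once here on a small network, then disabled.

% 6. Minimize J(Theta) with an advanced optimizer and recover the matrices
options = optimset('GradObj', 'on', 'MaxIter', 100);
[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, input_layer_size + 1);
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, hidden_layer_size + 1);
```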
Application of Neural Networks
Autonomous Driving
Omitted.