ML by Stanford: Wk3
Ch'i YU

Take-Away Notes for Machine Learning by Stanford University on Coursera.

Week 3, Lecture 6-7

Logistic Regression

Classification and Representation

The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values.

Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.

Hypothesis Representation

$h_\theta(x) = g(\theta^T x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$ is the Sigmoid Function or Logistic Function.
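
A minimal Octave sketch of this function (the helper name sigmoid is my own and is reused in later snippets; it is vectorized so it works element-wise on vectors and matrices):

function g = sigmoid(z)
  % Logistic function, applied element-wise; maps any real input into (0, 1)
  g = 1 ./ (1 + exp(-z));
end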


For Binary Classifications:

$h_\theta(x) = P(y = 1 \mid x; \theta) = 1 - P(y = 0 \mid x; \theta)$

aka,

$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$

$h_\theta(x)$ represents the "probability that $y$ belongs to class $1$ when given $x$ and parameterized by $\theta$".

Decision Boundary

Def: The Decision Boundary is the line that separates the regions of different types/labels; it is determined by the parameters $\theta$ of the hypothesis function.

The Decision Boundary is determined solely by the parameters $\theta$ of the hypothesis function; it is not a property of the training dataset itself.
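
For example, take $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ with $\theta = [-3,\ 1,\ 1]^T$. Since $g(z) \ge 0.5$ exactly when $z \ge 0$, we predict $y = 1$ whenever $-3 + x_1 + x_2 \ge 0$, so the Decision Boundary is the straight line $x_1 + x_2 = 3$.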

Logistic Regression Model

Cost Function

It is not suitable to reuse the cost function (Least Mean Squares) from Linear Regression, because the Logistic Function makes the output wavy, with many local optima. In other words, LMS will not be a convex function when applied to Logistic Regression.

Non-convex & convex:

a non-convex cost function has multiple local optima, making it difficult to reach the global optimum;

a convex cost function has only one optimum, which is the global one.

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$

where, for the Binary Classification Problem:

$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$

$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if $y = 0$

This choice of cost behaves as follows:

$\mathrm{Cost}(h_\theta(x), y) = 0$ if $h_\theta(x) = y$

$\mathrm{Cost}(h_\theta(x), y) \to \infty$ if $y = 0$ and $h_\theta(x) \to 1$

$\mathrm{Cost}(h_\theta(x), y) \to \infty$ if $y = 1$ and $h_\theta(x) \to 0$

Note that writing the cost function in this way guarantees that J(θ) is convex for Binary Logistic Regression.

Simplified Cost Function & Gradient Descent

To compress the cost function's two conditional cases into a single case:

$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$

i.e.,

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$

And a vectorized implementation is:

$h = g(X\theta), \quad J(\theta) = \frac{1}{m}\left(-y^T \log(h) - (1 - y)^T \log(1 - h)\right)$
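
As a sketch, the vectorized cost in Octave (assuming X is the m×(n+1) design matrix, y the m×1 label vector, and the sigmoid helper from earlier; these names are my assumptions, not the course's reference code):

m = length(y);                    % number of training examples
h = sigmoid(X * theta);           % m x 1 vector of hypothesis outputs
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));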

Gradient Descent

General Form:

repeat {

$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

} until convergence

Work out the derivative part using calculus:

repeat {

$\quad \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

} until convergence

A vectorized implementation is:

$\theta := \theta - \frac{\alpha}{m} X^T \left(g(X\theta) - \vec{y}\right)$
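
A minimal Octave sketch of this vectorized update inside a loop (alpha and num_iters are assumed illustration values, and a fixed iteration budget stands in for a convergence test):

m = length(y);
alpha = 0.01;                     % learning rate (assumed for illustration)
num_iters = 1000;                 % fixed iteration budget
for iter = 1:num_iters
  theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
end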

Advanced Optimization

Conjugate Gradient, BFGS (Broyden–Fletcher–Goldfarb–Shanno algorithm) and L-BFGS (Limited-Memory BFGS) are more sophisticated but faster ways to optimize $\theta$ that can be used instead of gradient descent.

Advantages: no need to manually pick the learning rate $\alpha$; often faster than gradient descent

Disadvantages: Complex

Octave provides such advanced optimization via fminunc(). First, define a cost function that returns both the cost and its gradient:

function [jVal, gradient] = costFunction(theta)
jVal = [......]; % Code to compute J(theta)
gradient = [......]; % Code to compute derivative of J(theta)
end

Then use Octave's fminunc() optimization algorithm together with the optimset() function, which creates an object containing the options to be sent to fminunc():

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1); % For Vector of Theta
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

% Pass the cost function,
% the initial vector of theta values,
% and the "options" object created beforehand
% to "fminunc()".

Multi-class Classification

Multi-class Classification: One-Vs-All

Simply expand $y$'s range from $\{0, 1\}$ to $\{0, 1, \dots, n\}$.

Note:

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$;

To make a prediction on a (new) $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
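
A minimal Octave sketch of that prediction step (assuming all_theta is a matrix whose i-th row holds the trained parameters of classifier i; the names are my assumptions):

probs = sigmoid(X * all_theta');        % probs(i, k) = h^(k) applied to example i
[~, prediction] = max(probs, [], 2);    % for each example, pick the most probable class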

Regularization

Solving the Problem of Overfitting

Definition: The Problem of Overfitting

[Figure: "The Best Way to Explain Overfitting" meme]

Before diving further, let's understand two important terms:

Bias: Assumptions made by a model to make a function easier to learn. In practice, it is the error rate on the training data. When the error rate is high we call it High Bias, and when the error rate is low we call it Low Bias.

Variance: The error rate on the testing data is called variance. When the error rate is high we call it High Variance, and when the error rate is low we call it Low Variance.

Overfit vs. Optimal vs. Underfit:

|                          | Overfit       | Optimal | Underfit |
|--------------------------|---------------|---------|----------|
| Bias in train_dataset    | Extremely Low | Low     | High     |
| Variance in test_dataset | High          | Low     | High     |

Addressing Overfitting:

1. Reduce the Number of Features
   - Manually select which features to keep;
   - Model Selection Algorithm (later in the course).
2. Regularization
   - Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$;
   - Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Cost Function

If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

e.g. Say we wanted to make the following function more quadratic (i.e. eliminate the influence of the higher-order terms $\theta_3 x^3$ and $\theta_4 x^4$):

$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify the cost function:

$\min_\theta\ \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2$

We could thus also regularize all of the parameters in a single summation as:

$\min_\theta\ \frac{1}{2m} \left[ \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

where $\lambda$, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
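
A minimal Octave sketch of this regularized cost for Linear Regression (theta(1) corresponds to $\theta_0$ because Octave indexes from 1; variable names are my assumptions):

m = length(y);
h = X * theta;                                        % linear hypothesis
reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);    % penalty term, skipping theta_0
J = (1 / (2 * m)) * sum((h - y) .^ 2) + reg;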

If $\lambda = 0$ or is too small: the regularization is meaningless (overfitting is not addressed);

if $\lambda$ is too large: it could lead to underfitting / high bias.

Regularized Linear Regression

$X^T X$ is non-invertible if $m < n$, and may be non-invertible if $m = n$.

Gradient Descent

To modify our gradient descent function, separate out $\theta_0$ from the rest of the parameters, since we do not penalize $\theta_0$:

repeat {

$\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$

$\quad \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right], \quad j \in \{1, 2, \dots, n\}$

} until convergence

where the term $\frac{\lambda}{m} \theta_j$ performs the regularization;

Furthermore, with some manipulation the update rule/algorithm can also be represented as:

$\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \frac{\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

where the term $1 - \alpha \frac{\lambda}{m}$ is always less than 1, as $\alpha$, $\lambda$ and $m$ are all positive.
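
A minimal Octave sketch of one such update (same naming assumptions as before; theta(1) is $\theta_0$ and receives no penalty):

grad = (1 / m) * X' * (X * theta - y);    % unregularized gradient for linear regression
reg = (lambda / m) * theta;
reg(1) = 0;                               % do not penalize theta_0
theta = theta - alpha * (grad + reg);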

Normal Equation

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

$\theta = \left( X^T X + \lambda \cdot L \right)^{-1} X^T y$

where $L$ is a $(n+1) \times (n+1)$ matrix with 0 at the top left, 1's down the rest of the diagonal, and 0's everywhere else.

Intuitively, $L$ is the identity matrix (though we are not including $x_0$), multiplied by a single real number $\lambda$.

Recall that if $m < n$, then $X^T X$ is non-invertible (and if $m = n$ it may be non-invertible). However, when we add the term $\lambda \cdot L$, $X^T X + \lambda \cdot L$ becomes invertible.
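
A minimal Octave sketch of the regularized Normal Equation (assuming X already contains the column of ones for the intercept):

k = size(X, 2);                             % k = n + 1, including the intercept column
L = eye(k);
L(1, 1) = 0;                                % zero at the top left, 1's down the rest of the diagonal
theta = (X' * X + lambda * L) \ (X' * y);   % backslash solve instead of an explicit inverse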

Regularized Logistic Regression

Cost Function

Regularize the equation by adding a term to the end:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

where $\theta_0$ is explicitly excluded, since the second sum runs $j$ from 1 to $n$, skipping 0.

therefore the gradient descent update becomes:

repeat {

$\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$

$\quad \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right], \quad j \in \{1, 2, \dots, n\}$

} until convergence

which is identical in form to Regularized Linear Regression, except that here $h_\theta(x) = g(\theta^T x)$.
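
Putting the pieces together, a minimal Octave sketch of a regularized logistic cost function in the [jVal, gradient] shape that fminunc() expects (the name costFunctionReg and the extra arguments are my assumptions, not the course's reference implementation):

function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);              % exclude theta_0
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) + reg;
  gradient = (1 / m) * X' * (h - y) + (lambda / m) * [0; theta(2:end)];
end

Since fminunc() calls the cost function with theta only, the extra arguments can be bound with an anonymous function: fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options).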