Take-Away Notes for Machine Learning by Stanford University on Coursera.
Week 3, Lecture 6-7
Logistic Regression
Classification and Representation
A classification problem is just like a regression problem, except that the values we want to predict take on only a small number of discrete values.
Given $x^{(i)}$, the corresponding $y^{(i)}$ is called the label for the training example.
Hypothesis Representation
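The Sigmoid Function, or Logistic Function, $g$ maps any real number into the interval $(0, 1)$:
$h_\theta(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$
For Binary Classification:
$h_\theta(x) = P(y = 1 \mid x; \theta) = 1 - P(y = 0 \mid x; \theta)$
aka, $h_\theta(x)$ represents the "probability that $y = 1$, given $x$, parameterized by $\theta$".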
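The hypothesis above can be sketched in a few lines of plain Python (chosen here only so the example runs without Octave). The names `sigmoid` and `hypothesis` are illustrative, and `x` is assumed to already include the bias feature $x_0 = 1$.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), split by sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to start with x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Illustrative values only: theta^T x = 0 gives a probability of exactly 0.5.
print(hypothesis([0.0, 1.0], [1.0, 0.0]))
```

Because the output always lies between 0 and 1, it can be read as the estimated probability that $y = 1$.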
Decision Boundary
Def: The Decision Boundary is the line that separates the regions where the hypothesis predicts different labels ($y = 0$ versus $y = 1$). It is determined by the parameters $\theta$.
The Decision Boundary is a property of the hypothesis function and its parameters, not of the training set: the training data is used to fit $\theta$, but once $\theta$ is fixed, so is the boundary.
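A minimal sketch of how the boundary follows from the parameters alone, assuming the usual rule of predicting $y = 1$ when $h_\theta(x) \ge 0.5$, which is equivalent to $\theta^T x \ge 0$. The parameter values are hand-picked for illustration, not fitted to any data.

```python
def predict(theta, x):
    """Predict 1 when theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# Hypothetical parameters: theta = [-3, 1, 1] puts the boundary on the line x1 + x2 = 3.
theta = [-3.0, 1.0, 1.0]
below = predict(theta, [1.0, 1.0, 1.0])  # x1 + x2 = 2, below the line
above = predict(theta, [1.0, 2.0, 2.0])  # x1 + x2 = 4, above the line
print(below, above)
```

Note that no training data appears anywhere in `predict`: only `theta` decides on which side of the line a point falls.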
Logistic Regression Model
Cost Function
The squared-error (Least Mean Squares) cost function used for Linear Regression is not suitable here: because the Logistic Function is nonlinear, plugging it into the squared-error cost produces a wavy surface with many local optima. In other words, the LMS cost is not a convex function when applied to Logistic Regression.
Non-convex vs. convex:
A non-convex cost function can have multiple local optima, which makes it difficult to reach the global optimum;
a convex cost function has no such traps: every local optimum is also a global optimum.
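Instead, the cost function for Logistic Regression is:
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$
where, for the Binary Classification Problem ($y = 0$ and $y = 1$):
$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$
$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if $y = 0$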
Note that writing the cost function in this way guarantees that J(θ) is convex for Binary Logistic Regression.
Simplified Cost Function & Gradient Descent
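To compress the cost function's two conditional cases into a single case:
$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$
i.e.,
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$
And a vectorized implementation is:
$h = g(X\theta)$
$J(\theta) = \frac{1}{m}\left(-y^T\log(h) - (1 - y)^T\log(1 - h)\right)$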
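As a rough illustration, the single-case cost can be computed with a plain Python loop. The function name `logistic_cost` and the toy data are assumptions made for this sketch, and each row of `X` is assumed to start with $x_0 = 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / m

# Toy data (made up for illustration): one feature plus the bias column.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

# With theta = 0 every prediction is 0.5, so the cost is log(2).
print(logistic_cost([0.0, 0.0], X, y))
```

Parameters that separate the two classes better give a lower cost, which is what the optimizer exploits.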
Gradient Descent
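General Form:
repeat {
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
} until convergence
Work out the derivative part using calculus:
repeat {
$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$
} until convergence
(all $\theta_j$ are updated simultaneously)
A vectorized implementation is:
$\theta := \theta - \frac{\alpha}{m}X^T(g(X\theta) - y)$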
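The update rule can be sketched in plain Python as follows. The learning rate, the fixed iteration count (used here in place of a real convergence check), and the toy data are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
        for j, v in enumerate(xi):
            grad[j] += err * v / m
    return grad

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        # Build a new list so that every theta_j is updated simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data (made up): negative x values are class 0, positive ones are class 1.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
predictions = [1 if sum(t * v for t, v in zip(theta, xi)) >= 0 else 0 for xi in X]
print(theta, predictions)
```

The gradient is computed in full before any parameter changes, which is what "simultaneous update" means in practice.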
Advanced Optimization
Conjugate Gradient, BFGS (Broyden–Fletcher–Goldfarb–Shanno algorithm) and L-BFGS (Limited-Memory BFGS) are more sophisticated, and often faster, ways to optimize $\theta$ that can be used instead of gradient descent.
Advantages: No need to manually pick the learning rate $\alpha$; often faster than gradient descent.
Disadvantages: More complex.
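Octave provides access to these advanced optimizers through fminunc(). First, write a single function that returns both the cost and its gradient for a given theta: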
    function [jVal, gradient] = costFunction(theta)
Then call Octave's fminunc() optimization routine, using the optimset() function to create a structure containing the options that are passed to fminunc().
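    options = optimset('GradObj', 'on', 'MaxIter', 100);
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);  % initialTheta: starting guess, assumed to be defined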
Multi-class Classification
Multi-class Classification: One-Vs-All
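Simply expand the range of $y$ from $\{0, 1\}$ to $\{0, 1, \dots, n\}$, and split the problem into $n + 1$ binary classification problems, one per class.
Note:
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$ for $i = 0, 1, \dots, n$
To train: fit a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
To make a prediction on a (new) $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.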
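The prediction step of One-Vs-All can be sketched in plain Python. The three parameter vectors below are hand-picked for illustration (they are not trained values), and `predict_one_vs_all` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(all_theta, x):
    """Return the class index i whose classifier h^(i)(x) is largest."""
    probs = [sigmoid(sum(t * v for t, v in zip(theta, x))) for theta in all_theta]
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical parameters, one row per class; x is [x0 = 1, x1, x2].
all_theta = [
    [2.0, -1.0, -1.0],   # class 0: scores high when x1 and x2 are both small
    [-2.0, 2.0, -1.0],   # class 1: scores high when x1 is large
    [-2.0, -1.0, 2.0],   # class 2: scores high when x2 is large
]

print(predict_one_vs_all(all_theta, [1.0, 3.0, 0.0]))
```

Each row of `all_theta` plays the role of one binary classifier, and the final answer is simply the most confident of them.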
Regularization
Solving the Problem of Overfitting
Definition: The Problem of Overfitting
Before diving further, let's understand two important terms:
Bias: Assumptions made by a model to make a function easier to learn. In practice it shows up as the error rate on the training data. When this error rate is high, we call it High Bias; when it is low, we call it Low Bias.
Variance: Loosely, the error rate on the testing data. When this error rate is high, we call it High Variance; when it is low, we call it Low Variance.
| | Overfit | Optimal | Underfit |
|---|---|---|---|
| Bias in train_dataset | Extremely Low | Low | High |
| Variance in test_dataset | High | Low | High |
Addressing Overfitting:
1. Reduce Number of Features
- Manually select which features to keep;
- Use a Model Selection Algorithm (later in the course).
2. Regularization
- Keep all the features, but reduce the magnitude / values of the parameters $\theta_j$.
Cost Function
If our hypothesis function overfits, we can reduce the weight that some of the terms in the function carry by increasing their cost.
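e.g. Say we wanted to make the following function more quadratic (i.e. eliminate the influence of $\theta_3 x^3$ and $\theta_4 x^4$):
$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify the cost function:
$\min_\theta \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$
We could also regularize all of the theta parameters in a single summation:
$\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$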
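where $\lambda$ is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
If $\lambda$ is too large, it may smooth out the function too much and cause underfitting.
If $\lambda = 0$ or is too small, the penalty has little effect and the hypothesis can still overfit.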
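A minimal sketch of the regularized cost for linear regression in plain Python, assuming the convention that $\theta_0$ is excluded from the penalty. The function name and toy data are made up for illustration.

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/(2m)) * [sum (h - y)^2 + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    squared_error = 0.0
    for xi, yi in zip(X, y):
        h = sum(t * v for t, v in zip(theta, xi))  # linear hypothesis theta^T x
        squared_error += (h - yi) ** 2
    penalty = lam * sum(t * t for t in theta[1:])  # theta_0 is not penalized
    return (squared_error + penalty) / (2 * m)

# Toy data (made up): y = 2 * x, with a bias column of ones.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 2.0]  # fits the toy data exactly

print(regularized_cost(theta, X, y, 0.0))  # no penalty: the fit is perfect
print(regularized_cost(theta, X, y, 3.0))  # same fit, but theta_1 now costs something
```

Even a perfect fit has a nonzero cost once `lam` is positive, which is how regularization pushes the optimizer toward smaller parameters.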
Regularized Linear Regression
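$X^T X$ is non-invertible if $m < n$, and may be non-invertible if $m = n$ (where $m$ is the number of training examples and $n$ the number of features).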
Gradient Descent
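To modify our gradient descent function, separate out $\theta_0$ from the rest of the parameters, because we do not want to penalize $\theta_0$:
repeat {
$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}$
$\theta_j := \theta_j - \alpha\left[\left(\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\right) + \frac{\lambda}{m}\theta_j\right]$ for $j \in \{1, 2, \dots, n\}$
}
where the term $\frac{\lambda}{m}\theta_j$ performs the regularization.
Furthermore, with some manipulation the update rule/algorithm can also be represented as:
$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$
where the term $1 - \alpha\frac{\lambda}{m}$ is always less than 1 (for $\alpha, \lambda > 0$), so each update first shrinks $\theta_j$ slightly and then applies the usual unregularized step.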
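The two ways of writing the regularized update should give the same result. The sketch below checks this for one step of regularized linear regression in plain Python. The function names and the toy numbers are illustrative assumptions.

```python
def _gradients(theta, X, y):
    """Unregularized gradient (1/m) * sum_i (h(x_i) - y_i) * x_ij for each j."""
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, xi)) - yi for xi, yi in zip(X, y)]
    return [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(len(theta))]

def step_penalty_form(theta, X, y, alpha, lam):
    """theta_j := theta_j - alpha * (grad_j + (lam/m) * theta_j) for j >= 1."""
    m = len(y)
    grads = _gradients(theta, X, y)
    return [t - alpha * (g + (lam / m * t if j > 0 else 0.0))
            for j, (t, g) in enumerate(zip(theta, grads))]

def step_shrink_form(theta, X, y, alpha, lam):
    """theta_j := theta_j * (1 - alpha*lam/m) - alpha * grad_j for j >= 1."""
    m = len(y)
    grads = _gradients(theta, X, y)
    return [t * (1.0 - alpha * lam / m if j > 0 else 1.0) - alpha * g
            for j, (t, g) in enumerate(zip(theta, grads))]

# Toy data (made up for illustration).
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
a = step_penalty_form([0.0, 1.0], X, y, alpha=0.1, lam=3.0)
b = step_shrink_form([0.0, 1.0], X, y, alpha=0.1, lam=3.0)
print(a, b)
```

In both forms $\theta_0$ (index 0) is left out of the penalty, matching the separate update for $\theta_0$.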
Normal Equation
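To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
$\theta = (X^T X + \lambda \cdot L)^{-1} X^T y$
where $L$ is an $(n+1) \times (n+1)$ matrix with 0 at the top left, 1's down the rest of the diagonal, and 0's everywhere else:
$L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$
Intuitively, this is the identity matrix (though we are not including $x_0$), multiplied by a single real number $\lambda$.
Recall that if $m < n$, then $X^T X$ is non-invertible. However, when we add the term $\lambda \cdot L$ (with $\lambda > 0$), $X^T X + \lambda \cdot L$ becomes invertible.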
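To make the invertibility remark concrete, here is a rough plain-Python sketch of the regularized normal equation. It solves the linear system with a small hand-written Gaussian elimination rather than an explicit inverse. Both helper names and the toy data (with fewer examples than columns) are assumptions for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def regularized_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, with L = diag(0, 1, ..., 1)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    for j in range(1, k):  # add lam on the diagonal, skipping the bias entry
        A[j][j] += lam
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(A, b)

# Toy data (made up): 2 examples but 3 columns, so X^T X alone is singular.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0]]
y = [1.0, 2.0]
theta = regularized_normal_equation(X, y, lam=1.0)
print(theta)
```

With `lam=0.0` the same call raises `ValueError` on this data, because $X^T X$ has no inverse. With `lam=1.0` the system has a unique solution.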
Regularized Logistic Regression
Cost Function
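Regularize the equation by adding a term to the end:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
where the second sum, $\sum_{j=1}^{n}\theta_j^2$, explicitly excludes the bias term $\theta_0$ (the index runs from $j = 1$, skipping $j = 0$).
Therefore, gradient descent again updates $\theta_0$ separately from the other parameters:
repeat {
$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}$
$\theta_j := \theta_j - \alpha\left[\left(\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\right) + \frac{\lambda}{m}\theta_j\right]$ for $j \in \{1, 2, \dots, n\}$
}
This has the same form as the update for regularized linear regression, except that here $h_\theta(x) = g(\theta^T x)$.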
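A minimal plain-Python sketch of the regularized logistic cost, assuming the $\frac{\lambda}{2m}$ penalty over $\theta_1, \dots, \theta_n$ with $\theta_0$ left out. The function name and the two-point toy data set are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / (2m)) * sum_{j>=1} theta_j^2."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])  # skips theta_0
    return -total / m + penalty

# Toy data (made up): two examples, one feature plus the bias column.
X = [[1.0, -1.0], [1.0, 1.0]]
y = [0, 1]

plain = regularized_logistic_cost([0.0, 1.0], X, y, lam=0.0)
penalized = regularized_logistic_cost([0.0, 1.0], X, y, lam=4.0)
print(plain, penalized)
```

The two values differ only by the penalty term, and changing $\theta_0$ alone never changes that term.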