Logistic Regression

Suppose we had a collection of $u \times v$ images, that is, each has $u$ rows of pixels, each row containing $v$ pixels. Suppose we want to build a binary classifier model that takes such an image and maps it to one of two classes: {1: cat, 0: not cat}.

First, we have to collect the pixel RGB values into a column vector. We do this by reading the image matrix left to right, top to bottom, say R then G then B, so we get an input vector $x$ of dimension $n_x = 3uv$.
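As a concrete illustration, here is a minimal numpy sketch of this flattening, assuming the image is stored as a $(u, v, 3)$ array with channels ordered R, G, B (the random array simply stands in for real pixel data):

```python
import numpy as np

# Hypothetical u x v RGB image; the random integers stand in for real pixel data.
u, v = 64, 64
image = np.random.randint(0, 256, size=(u, v, 3))

# Read each channel left to right, top to bottom, and stack R, then G, then B
# into a single column vector of dimension n_x = 3uv.
x = np.concatenate([image[:, :, 0].ravel(),   # all R values
                    image[:, :, 1].ravel(),   # all G values
                    image[:, :, 2].ravel()]   # all B values
                   ).reshape(-1, 1)
print(x.shape)  # (12288, 1), i.e. (3 * u * v, 1)
```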

Our training data will be pairs $(x^i, y^i)$ for $1 \le i \le m$, where $y^i = 1$ if $x^i$ is a cat, and $y^i = 0$ otherwise.
So we will store both $x^i$ and $y^i$ as columns of matrices. $$X =\begin{bmatrix}
\vdots & \vdots &  & \vdots \\
x^1 & x^2 & \dots & x^m \\
\vdots & \vdots &  & \vdots
\end{bmatrix}$$

Notice that each $x^i$ is stored as a column vector, and each $y^i$ is also a (1-dimensional) column vector.

$$Y = \begin{bmatrix} y^1 & y^2 & \dots & y^m \end{bmatrix}$$
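A small numpy sketch of this layout (the example count, dimensions, and labels below are made up for illustration):

```python
import numpy as np

# Hypothetical training set: m flattened images, each a column of length n_x.
m, n_x = 5, 12288
examples = [np.random.rand(n_x, 1) for _ in range(m)]   # stand-ins for real x^i vectors
labels = [1, 0, 1, 1, 0]                                 # stand-ins for real y^i labels

X = np.hstack(examples)              # each x^i becomes a column of X, shape (n_x, m)
Y = np.array(labels).reshape(1, m)   # each y^i becomes a column of the 1 x m matrix Y
print(X.shape, Y.shape)              # (12288, 5) (1, 5)
```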

A model for binary classification could take a weight vector $w$ and do a linear regression $\hat{y}^i = w^T x^i + b$. Notice that $w, x \in \mathbb{R}^{n_x}$ (and that we will use hat notation to denote model outputs).

But the issue then is that we want $\hat{y}^i$ to capture/approximate the probability that $x^i$ is a cat, i.e. we want $\hat{y}^i = \Pr(y^i = 1 \mid x^i)$. So we want the output to be bounded in $[0,1]$.

Here is where we use the sigmoid function $$\sigma(z) = \frac{1}{1+e^{-z}}$$
Our model is then $$ \hat{y}^i = \sigma(w^Tx^i + b) $$
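A minimal sketch of this forward pass in numpy, assuming the $(n_x, m)$ data matrix $X$ from above (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    # Plain sigmoid; maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    # w: (n_x, 1) weight column, b: scalar bias, X: (n_x, m) data matrix.
    # Returns y_hat of shape (1, m): one estimated P(y = 1 | x) per example.
    return sigmoid(w.T @ X + b)
```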
Now we need a loss function $\mathscr{L}(\hat{y}^i, y^i) \in \mathbb{R}^+ \cup \{0\}$ such that the following properties hold: $\mathscr{L}$ decreases as $|\hat{y}^i - y^i|$ decreases, and it would be nice if $\mathscr{L}$ were convex as a function of the weights and bias. That way, no matter where we start, our gradient descent will find the global minimum.

We claim that $\mathscr{L} = (\hat{y}^i - y^i)^2$ is not a good choice: of course $\mathscr{L}$ decreases as $\hat{y}^i$ gets closer to $y^i$, but $\mathscr{L}$ is not convex as a function of the weights or bias.

It is sufficient to show that $\mathscr{L}$ is not convex in the weights, as that would be bad enough. Fix $x^i, y^i, b$; then $$ \mathscr L(w) = \bigg({\dfrac{1}{1+e^{-(w^Tx^i + b)}}- y^i}\bigg)^2$$

So if we fix all the weights other than $w_j$ and parametrize it by $w_j = t$, then $$\mathscr L(t) = \bigg(\frac{1}{1+ e^{-(x^i_{j}t + C)}} - y^i\bigg)^2$$
Essentially, we have $$\mathscr L(t) =\bigg(\frac{1}{1+ ae^{-bt }} - c\bigg)^2 $$
and we could of course evaluate the second derivative of $\mathscr L$, or we could look at the graph (this is for $c = 1$):
![[Support/Figures/Pasted image 20240812193336.png]]
Notice that the left half is concave, with the curve lying above the secant line between any two points on that part.
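To back up the picture, here is a small sympy check of the second derivative for the $a = b = 1$, $c = 1$ case shown in the plot; it comes out negative on the left (concave) and positive on the right (convex), so the squared-error loss is not convex in $t$:

```python
import sympy as sp

t = sp.symbols('t', real=True)
# Squared-error loss restricted to one weight, with a = b = 1 and c = 1.
L = (1 / (1 + sp.exp(-t)) - 1) ** 2
L2 = sp.diff(L, t, 2)                 # second derivative in t

print(L2.subs(t, -3).evalf())         # negative -> concave on the left half
print(L2.subs(t, 2).evalf())          # positive -> convex on the right half
```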

A better loss function:
$$L(\hat{y}^i, y^i) = -\big(y^i \log \hat{y}^i + (1-y^i)\log(1-\hat{y}^i)\big)$$
For a fixed $y^i$ and $x^i$, it is pretty clear that $L$ depends on the weight vector $w$ and bias $b$.
In particular, for ease, let $\Delta = w^T x^i + b$ and $y^i = c$. Doing some algebra using $$ \hat{y}^i = \frac{1}{1+ e^{-\Delta}} $$ (so that $\log \hat{y}^i = \Delta - \log(e^{\Delta}+1)$ and $\log(1-\hat{y}^i) = -\log(e^{\Delta}+1)$) we get: $$L(\Delta) = \log(e^{\Delta}+1) - c\Delta $$
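A quick numerical sanity check of this simplification (assuming numpy; the random draws just supply arbitrary values of $\Delta$ and of the label $c$):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    delta = rng.normal()                       # an arbitrary value of w^T x^i + b
    c = rng.integers(0, 2)                     # an arbitrary label y^i in {0, 1}
    y_hat = 1.0 / (1.0 + np.exp(-delta))
    cross_entropy = -(c * np.log(y_hat) + (1 - c) * np.log(1 - y_hat))
    simplified = np.log(np.exp(delta) + 1) - c * delta
    assert np.isclose(cross_entropy, simplified)
```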
Finally, for any weight $w_j$ in the weight vector, parametrizing $w_j = t$ and fixing all other weights, we have $\Delta(t) = tx^i_j + b + D$, where $D$ is everything in the dot product of $w$ and $x^i$ that isn't $w_j = t$ times $x^i_j$.
Essentially, $\Delta(t) = At + B$, so $$L(t) = \log(e^{At + B} + 1) - c(At + B)$$
Hence, $$L'(t) = \frac{{Ae^Be^{At}}}{e^{At + B}+1} - cA$$
That is, $$L'(t) = A\bigg(\frac{{e^{At + B}}}{1 + e^{At + B}} - c\bigg)$$

Finally, we see that $L$ is convex with respect to every single weight:

$$L''(t) = \frac{A^2 e^{At+B}}{\big(1+e^{At+B}\big)^2} > 0$$
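A hedged sympy sketch that recomputes this second derivative symbolically and checks it against the expression above (which is positive whenever $A \neq 0$):

```python
import sympy as sp

t, A, B, c = sp.symbols('t A B c', real=True)
L = sp.log(sp.exp(A*t + B) + 1) - c*(A*t + B)       # the loss as a function of one weight
L2 = sp.diff(L, t, 2)                               # second derivative in t

expected = A**2 * sp.exp(A*t + B) / (1 + sp.exp(A*t + B))**2
print(sp.simplify(L2 - expected))                   # expected to print 0
```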

Notice that we had set $w_j = t$ to be the moving parameter and fixed everything else. We could set any weight as the moving parameter $t$ and still obtain $L''(t) > 0$. Hence, the above inequality essentially says that $\frac{\partial^2 L}{\partial w_j^2} > 0$ for each $1 \le j \le n_x$. Hence, very informally, no matter which axis we perturb along (imagine the loss function $L: \mathbb{R}^{n_x} \to \mathbb{R}^+ \cup \{0\}$ being plotted in $(n_x + 1)$-dimensional space), the rate of change of the function is increasing. This is especially nice because the whole point of gradient descent is to initialize the weights as a point $w \in \mathbb{R}^{n_x}$; then the following algorithm: $$\text{for} \ j = 1 \ \text{to} \ n_{x} : \quad w_{j} \leftarrow w_{j} - \alpha\frac{\partial L}{\partial w_{j}} $$ always finds the (one and only) well, because $L$ is strictly convex with respect to each axis/weight $w_j$.
In general, the "well" idea can be seen in the graph of logistic regression with just two weights as the parameters.
![[Support/Figures/Pasted image 20240812223757.png]

Here, just think of $x, y$ as the weights and $z$ as the loss function.
![[Support/Figures/Pasted image 20240812224340.png]]
If we're restricted to $x > 0, y > 0$, we might start at any large $x, y$ and descend, noticing that as $x$ and $y$ decrease, the slope becomes less and less negative, becoming zero where the red and blue lines cross.

In practice we want to minimize the average loss of the model over the entire training set.
The cost function does this:

$$J = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^i, y^i)$$

Indeed, the new algorithm for gradient descent then becomes:

$$\text{for} \ j = 1 \ \text{to} \ n_{x} : \quad w_{j} \leftarrow w_{j} - \alpha\frac{\partial J}{\partial w_{j}}$$

where $\alpha$ is the learning rate. It's pretty self-explanatory how to calculate the derivative of this "cost" function with respect to a weight: we simply average the derivative of the loss function with respect to that weight over the training examples, $$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial L(\hat{y}^i, y^i)}{\partial w_j}$$
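Putting the pieces together, here is a hedged numpy sketch of this gradient descent on the cost, using the per-weight derivative $\frac{\partial L}{\partial w_j} = x^i_j(\hat{y}^i - y^i)$ that follows from the expression for $L'(t)$ above (function name and defaults are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.01, iterations=1000):
    # X: (n_x, m) data matrix, Y: (1, m) labels, alpha: learning rate.
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(iterations):
        y_hat = sigmoid(w.T @ X + b)        # forward pass, shape (1, m)
        # dL/dw_j for one example is x^i_j * (y_hat^i - y^i), so dJ/dw
        # averages that over all m examples (and similarly for b).
        dw = X @ (y_hat - Y).T / m          # shape (n_x, 1)
        db = np.sum(y_hat - Y) / m
        w -= alpha * dw
        b -= alpha * db
    return w, b
```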

For the logistic regression model, let's figure out the backpropagation:

$$L(\hat{y}^i, y^i) = -\big(y^i \log \hat{y}^i + (1-y^i)\log(1-\hat{y}^i)\big)$$
where $\hat{y}^i = \frac{1}{1+e^{-z}}$ and $z = w^T x^i + b$.

Let's break this down:

$$\frac{\partial L}{\partial \hat{y}^i}\cdot\frac{\partial \hat{y}^i}{\partial z}\cdot\frac{\partial z}{\partial w_j} = \frac{\partial L}{\partial w_j} \qquad\qquad \frac{\partial L}{\partial \hat{y}^i}\cdot\frac{\partial \hat{y}^i}{\partial z}\cdot\frac{\partial z}{\partial b} = \frac{\partial L}{\partial b}$$

One pattern we notice here is that we always take the derivative of an intermediate variable with respect to the immediately incoming input variable.
So we will just use $t$ as a placeholder for the immediate incoming input variable (IIIV).

In this way (with $t = \hat{y}^i$ and $c = y^i$), $$L = -\big(c\log(t) + (1-c)\log(1-t)\big)$$

$$L'(t) = -\frac{c}{t} + \frac{1-c}{1-t} = \frac{t-c}{t(1-t)}$$
$$\hat{y}^i(t) = \frac{1}{1+e^{-t}} \qquad\qquad \hat{y}^{i\,\prime}(t) = \frac{e^{t}}{(e^{t}+1)^2}$$
$$z(t) = tx^i_{j} + C \quad (\text{here } w_{j} = t) \qquad\qquad z'(t) = x^i_{j}$$
(and for the bias, parametrizing $b = t$, we get $z'(t) = 1$.)
from math import exp

def dL(t, c):
    # dL/dy_hat, where t = y_hat (the IIIV) and c = y^i
    return (t - c) / (t * (1 - t))

def dy_hat(t):
    # dy_hat/dz, where t = z: the derivative of the sigmoid
    return exp(t) / ((exp(t) + 1) ** 2)

def dz_dw(t, x_i_j):
    # dz/dw_j, where t = w_j: z is linear in w_j with slope x^i_j
    return x_i_j

def dz_db(t):
    # dz/db, where t = b: z is linear in b with slope 1
    return 1
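Reusing the helpers above, a hedged sketch of how the chain-rule products combine into the per-example gradients (the function name below is mine; the product simplifies to $(\hat{y}^i - y^i)x^i_j$ for the weights and $\hat{y}^i - y^i$ for the bias):

```python
from math import exp

def grad_for_example(w, b, x, y):
    # w, x: lists of length n_x; b: scalar bias; y: label in {0, 1}.
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    y_hat = 1 / (1 + exp(-z))
    common = dL(y_hat, y) * dy_hat(z)     # dL/dy_hat * dy_hat/dz, shared by every parameter
    dw = [common * dz_dw(w_j, x_j) for w_j, x_j in zip(w, x)]   # dL/dw_j for each j
    db = common * dz_db(b)                                      # dL/db
    return dw, db
```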