DDL - 0 - Linear Regression

In a regression problem, given an input vector $\mathbf{x}$ and its corresponding output $y$, we have to design a model $y = f(\mathbf{x}, \theta)$ for some parameters $\theta$. The standing assumption of linear regression is that $y$ is measured noisily as a function of $\mathbf{x}$: $y = \mathbf{w}^T\mathbf{x} + b + \epsilon$, where $\epsilon$ is drawn from a random variable with a Gaussian (or normal) distribution with zero mean.
Equivalently, $\epsilon = y - (\mathbf{w}^T\mathbf{x} + b)$. Letting $\hat{y} = \mathbf{w}^T\mathbf{x} + b$, we have that $y = \hat{y} + \epsilon$.
So the moment we see an input $\mathbf{x}$, through the deterministic affine function we immediately also (implicitly) see $\hat{y}$. Thus, the probability (density) of seeing a particular $y$ given the input $\mathbf{x}$ is the same as the probability of getting the noise $\epsilon = y - \hat{y}$.
That is, $$ p(y|\mathbf{x}) = p(\epsilon)$$

Therefore, since $\epsilon$ is drawn from a Gaussian with zero mean, $$ p(y|\mathbf{x}) = \frac{1}{\sqrt{ 2\pi\sigma^2 }}e^{-\frac{1}{2\sigma^2}\left(y-(\mathbf{w}^T\mathbf{x} + b)\right)^2}$$
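To make this concrete, here is a minimal NumPy sketch (my own, not from the text) that simulates the additive-Gaussian-noise model and evaluates the density above; `w_true`, `b_true`, and `sigma` are just assumed values for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w_true = rng.normal(size=d)      # assumed "true" weights for the simulation
b_true = 0.5
sigma = 0.1                      # noise standard deviation

x = rng.normal(size=d)
eps = rng.normal(scale=sigma)    # eps ~ N(0, sigma^2)
y = w_true @ x + b_true + eps    # noisy observation y = w^T x + b + eps

# p(y | x) = N(y; w^T x + b, sigma^2), exactly the expression above
mean = w_true @ x + b_true
p = np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(p)
```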

So if we are given $m$ input vectors sampled independently, collected as a matrix $X$, then the probability of getting a vector of outputs $\mathbf{y}$ given the input matrix $X$ is just the product of the independent probabilities of seeing $y^{(i)}$ given $\mathbf{x}^{(i)}$:
$$P(\mathbf{y}|X) = \prod_{i=1}^{m} p(y^{(i)}|\mathbf{x}^{(i)}).$$
We want to maximize this expression (the likelihood of seeing the output vector $\mathbf{y}$). Equivalently, we can minimize the negative log likelihood, $$-\log P(\mathbf{y}|X) = \sum_{i=1}^{m}\left[\frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} - b\right)^2\right],$$ and ignoring the constants involving $\pi, \sigma$, which do not depend on the weights, we want to find: $$\mathbf{w}^*, b^* = \arg\min_{\mathbf{w},b} \ \sum_{i=1}^{m}\left(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} - b\right)^2.$$
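As a sanity check of this equivalence, here is a small sketch (again my own, assuming NumPy) for a one-dimensional problem with $b = 0$: the negative log likelihood and the plain sum of squared errors are evaluated over a grid of candidate $w$ values and bottom out at the same point, since they differ only by an additive constant and a positive scale.

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma = 200, 0.3
x = rng.normal(size=m)
y = 2.0 * x + sigma * rng.normal(size=m)   # true w = 2.0, b = 0 for simplicity

w_grid = np.linspace(0.0, 4.0, 401)
sse = np.array([np.sum((y - w * x) ** 2) for w in w_grid])
nll = np.array([
    0.5 * m * np.log(2 * np.pi * sigma ** 2) + np.sum((y - w * x) ** 2) / (2 * sigma ** 2)
    for w in w_grid
])

# Both curves are minimized at the same grid point.
assert np.argmin(sse) == np.argmin(nll)
print(w_grid[np.argmin(sse)])
```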

The sentence below is taken from Dive into Deep Learning:

It follows that minimizing the mean squared error is equivalent to the maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.

Let us do the following: let $\mathbf{x}^{(i)} \in \mathbb{R}^d$, and define $\mathbf{v}^{(i)} = (\mathbf{x}^{(i)}, 1) \in \mathbb{R}^{d+1}$, where each $\mathbf{v}^{(i)}$ is a column vector obtained by concatenating a $1$ after each $\mathbf{x}^{(i)}$.
Then let the $(d+1) \times m$ matrix whose columns are the $\mathbf{v}^{(i)}$ be denoted $V$, and let $\theta = (\mathbf{w}, b) \in \mathbb{R}^{d+1}$.
Then we can write $\hat{\mathbf{y}} = V^T\theta$.
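A quick sketch of this bookkeeping (assuming NumPy, and assuming $X$ is stored with the $\mathbf{x}^{(i)}$ as its columns, which the text does not fix): stack a row of ones under $X$ to get $V$, append $b$ to $\mathbf{w}$ to get $\theta$, and check that $V^T\theta$ matches $X^T\mathbf{w} + b$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 5
X = rng.normal(size=(d, m))            # columns are the x^(i)
w = rng.normal(size=d)
b = 0.7

V = np.vstack([X, np.ones((1, m))])    # (d+1) x m, columns are v^(i) = (x^(i), 1)
theta = np.append(w, b)                # theta = (w, b) in R^{d+1}

y_hat = V.T @ theta                    # \hat{y} = V^T theta
assert np.allclose(y_hat, X.T @ w + b)
```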
Thus the goal is to minimize the squared norm of the error vector, i.e. we want $$\theta^* = \arg \min_{\theta} f(\theta), \quad f(\theta) = \lVert V^T\theta -\mathbf{y}\rVert_{2}^{2}. $$ And doing some calculus, we find that the gradient vector of $f$ with respect to $\theta$ is given as: $$\partial_{\theta} f(\theta) = 2V(V^T\theta - \mathbf{y})$$
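One way to convince yourself of this gradient formula is a finite-difference check; here is a small sketch (my own, assuming NumPy) comparing the analytic gradient $2V(V^T\theta - \mathbf{y})$ against centered differences at a random $\theta$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 3, 10
V = rng.normal(size=(d + 1, m))
y = rng.normal(size=m)
theta = rng.normal(size=d + 1)

f = lambda t: np.sum((V.T @ t - y) ** 2)          # f(theta) = ||V^T theta - y||_2^2
grad_analytic = 2 * V @ (V.T @ theta - y)

eps = 1e-6
grad_numeric = np.array([
    (f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
    for e in np.eye(d + 1)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```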

I apologize for not making $\theta$ bold in the formulas above; it is indeed a vector. Anyway, we want the gradient to be the zero vector.
Setting $2V(V^T\theta^* - \mathbf{y}) = \mathbf{0}$ gives $VV^T\theta^* = V\mathbf{y}$, and therefore $$ \theta^* =(VV^T)^{-1}(V\mathbf{y})$$
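Here is a sketch of that closed-form solution (my own, assuming NumPy and that $VV^T$ is indeed invertible; `np.linalg.solve` is used rather than forming the inverse explicitly), checked against `np.linalg.lstsq`, which solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 3, 50
X = rng.normal(size=(d, m))
w_true, b_true = rng.normal(size=d), -0.3
y = X.T @ w_true + b_true + 0.05 * rng.normal(size=m)

V = np.vstack([X, np.ones((1, m))])
theta_star = np.linalg.solve(V @ V.T, V @ y)           # (V V^T)^{-1} (V y)
theta_lstsq, *_ = np.linalg.lstsq(V.T, y, rcond=None)  # same problem, solved stably
assert np.allclose(theta_star, theta_lstsq)
print(theta_star)  # roughly recovers (w_true, b_true)
```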
So the matrix $VV^T$ has to be invertible for this analytic solution to exist, which means the rows of $V$ must be linearly independent; either way, it is a restrictive condition. In practice, some form of gradient descent is used instead. There is seldom any issue with deep networks minimizing the loss on the training data, but generalization is often a bigger problem.
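To close, a minimal gradient-descent sketch on the same kind of toy data (my own; the step size `lr` and iteration count are just values that happen to work here), using the gradient $2V(V^T\theta - \mathbf{y})$ from above instead of the analytic solution.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 3, 200
X = rng.normal(size=(d, m))
w_true, b_true = rng.normal(size=d), 1.0
y = X.T @ w_true + b_true + 0.1 * rng.normal(size=m)
V = np.vstack([X, np.ones((1, m))])

theta = np.zeros(d + 1)
lr = 1e-3                          # step size, small enough for this toy problem
for _ in range(5000):
    grad = 2 * V @ (V.T @ theta - y)   # gradient of ||V^T theta - y||_2^2
    theta -= lr * grad

print(theta)  # close to (w_true, b_true)
```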