In a regression problem, given an input vector $\mathbf{x} \in \mathbb{R}^d$ and its corresponding output $y \in \mathbb{R}$, we have to design a model $\hat{y} = \mathbf{w}^T\mathbf{x} + b$ for some parameters $\mathbf{w} \in \mathbb{R}^d$ and $b \in \mathbb{R}$. The standing assumption of linear regression is that $y$ is measured noisily as a function of $\mathbf{x}$, that is $y = \mathbf{w}^T\mathbf{x} + b + \epsilon$, where $\epsilon$ is drawn from a random variable with a Gaussian (or normal) distribution with zero mean.
Equivalently, $\epsilon = y - (\mathbf{w}^T\mathbf{x} + b)$. Allowing $\epsilon \sim \mathcal{N}(0, \sigma^2)$, we have that $y \mid \mathbf{x} \sim \mathcal{N}(\mathbf{w}^T\mathbf{x} + b, \sigma^2)$.
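As a concrete sketch of this data-generating assumption, here is a minimal NumPy snippet; the dimension, the "true" parameters, and the noise scale are made-up values chosen purely for illustration (later snippets reuse these arrays).

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 200                          # feature dimension and sample count (arbitrary choices)
w_true = np.array([2.0, -1.0, 0.5])    # hypothetical "true" weights
b_true = 4.0                           # hypothetical "true" bias
sigma = 0.1                            # standard deviation of the gaussian noise

X = rng.normal(size=(n, d))            # each row is one input vector
eps = rng.normal(0.0, sigma, size=n)   # zero-mean gaussian noise
y = X @ w_true + b_true + eps          # noisy measurements of the affine function
```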
So the moment we see an input $\mathbf{x}$, through the deterministic affine function, we immediately also (implicitly) see $\mathbf{w}^T\mathbf{x} + b$. Thus, the probability (density) of seeing a particular $y$ given the input $\mathbf{x}$ is the same as the probability of getting the noise $\epsilon = y - (\mathbf{w}^T\mathbf{x} + b)$.
That is, $$ p(y|\mathbf{x}) = p(\epsilon)$$
Therefore, since $\epsilon$ is taken from a Gaussian with zero mean and variance $\sigma^2$, $$ p(y|\mathbf{x}) = \frac{1}{\sqrt{ 2\pi\sigma^2 }}e^{-\frac{1}{2\sigma^2}(y-(\mathbf{w}^T\mathbf{x} + b))^2}$$
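As a small illustration, the density above can be written directly as a function (reusing NumPy from the snippet above; the parameters passed in are whatever candidate $\mathbf{w}$, $b$, and $\sigma$ we are currently entertaining):

```python
def p_y_given_x(y_i, x_i, w, b, sigma):
    """Gaussian density of y_i, centred at the affine prediction w^T x_i + b."""
    mu = x_i @ w + b
    return np.exp(-(y_i - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# e.g. p_y_given_x(y[0], X[0], w_true, b_true, sigma)
```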
So if we are given $n$ input vectors $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}$ sampled independently and collected as a matrix $\mathbf{X}$, then the probability of getting a vector of outputs $\mathbf{y}$ given the input matrix $\mathbf{X}$ is just the product of the independent probabilities of seeing each $y^{(i)}$ given $\mathbf{x}^{(i)}$: $$P(\mathbf{y}|\mathbf{X}) = \prod_{i=1}^{n} p\left(y^{(i)}|\mathbf{x}^{(i)}\right)$$
We want to maximize this expression (the likelihood of seeing the output vector $\mathbf{y}$ given $\mathbf{X}$). Equivalently, we can minimize the negative log likelihood, and ignoring the constants which do not depend on the weights, we want to minimize $$\sum_{i=1}^{n} \left(y^{(i)} - \left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right)\right)^2$$
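To make the "ignoring constants" step concrete, here is a sketch (reusing the arrays from the first snippet) of the negative log likelihood next to the plain sum of squared errors; they differ only by an additive constant and a positive factor, neither of which depends on $\mathbf{w}$ or $b$, so they have the same minimizer.

```python
def neg_log_likelihood(X, y, w, b, sigma):
    """-log P(y | X) under the additive gaussian noise model."""
    mu = X @ w + b
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2))

def sum_squared_errors(X, y, w, b):
    return np.sum((y - (X @ w + b)) ** 2)

# neg_log_likelihood = n/2 * log(2*pi*sigma^2) + sum_squared_errors / (2*sigma^2),
# so both are minimized by the same w and b.
```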
The sentence below was taken from Dive into Deep Learning:
It follows that minimizing the mean squared error is equivalent to the maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
Let us do the following: let $\theta = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$, then allow $\mathbf{v}^{(i)} = \begin{bmatrix} \mathbf{x}^{(i)} \\ 1 \end{bmatrix}$, where each $\mathbf{v}^{(i)}$ is a column vector obtained by concatenating a $1$ after each $\mathbf{x}^{(i)}$.
Then let the matrix whose columns are $\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(n)}$ be denoted as $V$, and let $\mathbf{y}$ be the column vector of outputs $y^{(1)}, \dots, y^{(n)}$.
Then we can write $$\sum_{i=1}^{n} \left(y^{(i)} - \left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right)\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - \theta^T\mathbf{v}^{(i)}\right)^2 = \lVert V^T\theta - \mathbf{y} \rVert_{2}^2$$
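A quick numerical check of this reparametrization, reusing the arrays and helpers from the earlier snippets ($\theta$ stacks $\mathbf{w}$ and $b$, and each column of $V$ is an input with a trailing $1$):

```python
theta_true = np.concatenate([w_true, [b_true]])   # theta = [w; b]
V = np.vstack([X.T, np.ones((1, n))])             # columns are the inputs with a trailing 1

# V^T theta reproduces the affine predictions, so the two loss expressions agree.
assert np.allclose(V.T @ theta_true, X @ w_true + b_true)
assert np.isclose(np.sum((V.T @ theta_true - y) ** 2),
                  sum_squared_errors(X, y, w_true, b_true))
```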
Thus the goal is to minimize the squared norm of the error vector, i.e. we want $$\theta^* = \arg \min_{\theta} f(\theta), \qquad f(\theta) = \lVert V^T\theta -\mathbf{y}\rVert_{2}^2 $$ And doing some calculus, we find that the gradient vector of $f$ with respect to $\theta$ is given as: $$\partial_{\theta} f(\theta) = 2V(V^T\theta - \mathbf{y})$$
I apologize for not making $\theta$ bold; it is indeed a vector. Anyway, we want the gradient to be the zero vector.
Setting $2V(V^T\theta - \mathbf{y}) = \mathbf{0}$ gives the normal equations $VV^T\theta = V\mathbf{y}$. Therefore, we have $$ \theta^* =(VV^T)^{-1}V\mathbf{y}$$
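Continuing the running sketch, the closed-form solution can be computed directly; using np.linalg.solve on the normal equations rather than forming the inverse explicitly is the numerically preferable route when $VV^T$ is invertible.

```python
theta_star = np.linalg.solve(V @ V.T, V @ y)   # solves (V V^T) theta = V y

print(theta_star)                              # close to [w_true; b_true] for small sigma
grad = 2 * V @ (V.T @ theta_star - y)
print(np.linalg.norm(grad))                    # the gradient at theta* is numerically ~0
```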
So the matrix $VV^T$ has to be invertible for our analytic solution, which means the rows of $V$ must be linearly independent (i.e. $V$ must have full row rank); that is far too restrictive in general. In practice, some form of gradient descent is used. There is seldom any issue with deep networks minimizing loss on the training data, but generalization is often a bigger problem.
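For completeness, here is a sketch of minimizing the same loss with plain full-batch gradient descent, using the gradient $2V(V^T\theta - \mathbf{y})$ derived above; the learning rate and iteration count are arbitrary illustrative choices, not tuned values.

```python
theta = np.zeros(d + 1)
lr = 1e-3
for _ in range(5000):
    theta -= lr * 2 * V @ (V.T @ theta - y)    # step along the negative gradient

print(np.allclose(theta, theta_star, atol=1e-3))   # agrees with the analytic solution
```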