Inference

Support/Figures/Pasted image 20250120120918.png
So we have some signal $S$ going through a box which does something "deterministic", turning $S$ into $f(S, a)$, where $a$ is some collection of parameters, and then some sort of noise $W$ is applied to get the final output. So we can say $X = g(f(S, a), W)$, where $f$ is a process that we actually know; but instead of getting the result of $f$ directly, some noise $W$ is applied by $g$ (which we don't know).
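
As a tiny sketch of this setup (the affine $f$, the additive form of $g$, and all numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, a):
    # a deterministic process we know; here a toy affine map with parameters a = (a0, a1)
    return a[0] + a[1] * s

s = 2.0                   # the signal S
a = (1.0, 0.5)            # some collection of parameters
w = rng.normal(0.0, 0.1)  # the noise W
x = f(s, a) + w           # one common (assumed) form of g: additive noise, X = f(S, a) + W
print(x)
```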

Then there are two types of inference problems:

Support/Figures/Pasted image 20250121053310.png

So when we have a finite set of possibilities for the unknown, we have to make a decision with the least probability of error (hypothesis testing).
If the unknown lies on a continuum, we aim at a small estimation error (estimation).

Let us talk about estimating the mass of the electron. Of course, the mass of the electron $\theta$ is not a random thing; it is a constant.
Imagine we have a noisy measuring apparatus that takes in the mass of the electron and gives out a measurement modelled as a random variable $X$.

In classical statistics, the model of the measuring apparatus is a probability distribution of $X$, which is of course affected by the mass of the electron $\theta$ and some noise. So in classical statistics, we think of the measuring apparatus as simply putting a distribution over the noisy measurement of the mass of the electron.

The Bayesian philosophy is that even though the mass of an electron is constant, whatever we DON'T KNOW, we should put a distribution on. So $p_\Theta$ is some "prior" distribution on $\theta$; if we don't know anything, we might say all masses of the electron in a certain range are equally likely, or we can put a more sophisticated prior distribution $p_\Theta$ on $\theta$, based on previous work.

The model of our measurement box is a distribution of the random variable $X$, given the random variable $\Theta$, which has distribution $p_\Theta(\cdot)$.

The "new and better" random variable that we need to put a distribution on is actually $\Theta \mid X$ (the posterior distribution): we had some "prior" belief on the distribution of $\Theta$, and our measuring box gives us the distribution of the output data $X$ conditioned on $\Theta$. We hope that our prior belief on $\Theta$ improves to a better belief after conditioning on the observed data $X$, via Bayes' rule: $$p_{\Theta|X}(\theta \mid x) = \frac{p_\Theta(\theta)\, p_{X|\Theta}(x \mid \theta)}{p_X(x)}$$
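
As a minimal sketch of this update, under invented assumptions: a uniform prior over a made-up range (arbitrary units), and an apparatus that adds Gaussian noise with a known standard deviation. Discretizing $\theta$ onto a grid makes Bayes' rule a few lines:

```python
import numpy as np

# grid of candidate values for the unknown theta (arbitrary units; made-up range)
theta = np.linspace(0.0, 2.0, 1001)
prior = np.ones_like(theta)         # uniform prior p_Theta
prior /= prior.sum()

x = 1.2                             # one noisy measurement
sigma = 0.3                         # assumed noise std of the apparatus

# likelihood p_{X|Theta}(x | theta): Gaussian noise around theta
likelihood = np.exp(-(x - theta) ** 2 / (2 * sigma ** 2))

# Bayes' rule: posterior proportional to prior times likelihood, then normalize
posterior = prior * likelihood
posterior /= posterior.sum()

print(theta[np.argmax(posterior)])  # the most believable theta after seeing x
```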

Bayesian hypothesis testing and estimation

Support/Figures/Pasted image 20250121055300.png

Support/Figures/Pasted image 20250121055449.png

So in this picture, $Z_t$ is the height of a bird at time $t$. We believe it to be parabolic, so we want to estimate the coefficients. We have some prior beliefs on these coefficients (as random variables), and we make measurements at $n$ time steps, modelled by the random variable $X_t$. Unfortunately, the measurements are noisy, so we think that $X_t$ is equal to $Z_t$ plus some pure noise random variable $W_t$. If we have a reasonable idea of what distribution to put on $W_t$, and hence can infer the distribution of $X_t$, we have all the data we need to use Bayes' rule to update our belief on the priors $\Theta_0, \Theta_1, \Theta_2$ to the posterior belief (after seeing some data), which has the distribution shown in the above picture. A code sketch of this update follows the figures below.
Support/Figures/Pasted image 20250121061434.png
Support/Figures/Pasted image 20250121061517.png
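
Here is a minimal sketch of that update, assuming Gaussian priors on the coefficients and Gaussian noise $W_t$ (the conjugate linear-Gaussian case, where the posterior has a closed form); all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# true (hidden) coefficients of Z_t = th0 + th1*t + th2*t^2
theta_true = np.array([1.0, 4.0, -9.8 / 2])
t = np.linspace(0.0, 0.8, 20)
Phi = np.column_stack([np.ones_like(t), t, t ** 2])      # design matrix: rows (1, t, t^2)
sigma_w = 0.1                                            # assumed noise std of W_t
x = Phi @ theta_true + rng.normal(0.0, sigma_w, t.size)  # noisy measurements X_t

# Gaussian prior on (Th0, Th1, Th2): mean mu0, covariance S0
mu0 = np.zeros(3)
S0 = np.eye(3) * 10.0

# conjugate posterior: S_post = (S0^-1 + Phi^T Phi / s^2)^-1,
#                      mu_post = S_post (S0^-1 mu0 + Phi^T x / s^2)
S0_inv = np.linalg.inv(S0)
S_post = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma_w ** 2)
mu_post = S_post @ (S0_inv @ mu0 + Phi.T @ x / sigma_w ** 2)
print(mu_post)  # posterior means of the coefficients, pulled toward theta_true
```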

Least mean square estimation

Support/Figures/Pasted image 20250121063750.png

In the above picture, we have no information about $\Theta$ beyond its distribution,
and we want to estimate it with a single real number $c$.
We can do this by minimizing $E[(\Theta - c)^2]$, which, using calculus, gives $c = E[\Theta]$. So if our goal is to minimize the expected squared error (or mean squared error) between a random variable and an estimate, we best set the estimate to the mean of the random variable.
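
The calculus step, spelled out: $$E[(\Theta - c)^2] = E[\Theta^2] - 2cE[\Theta] + c^2, \qquad \frac{d}{dc}E[(\Theta - c)^2] = -2E[\Theta] + 2c = 0 \implies c = E[\Theta]$$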

Now, let us introduce some extra information, and say that a measuring device that noisily measures $\Theta$ outputs the random variable $X$.
So then, in the usual Bayesian way, we want an estimate $g(X)$, which is a random variable, such that whenever we see that $X = x$, we think $g(x)$ is a good estimate for $\Theta$.

Basically, what we are saying is that given the data $X$, whenever $X = x$, we want to minimize the posterior error $E[(\Theta - g(x))^2 \mid X = x]$. The function $g$ is called an estimator because, based on the random variable $X$ and whatever value we observe from it, it gives an estimate of $\Theta \mid X = x$.

Now, the question is, could there be an estimator of the posterior $\Theta \mid X$ that has the least mean squared error, no matter what value $X$ collapses to?

Well, let us use the "arbitrary choice of $x$" technique to show that there is a global best estimator.

Pick an arbitrary value $x$ that $X$ collapses to. In this new (and fixed) conditional universe, we want to minimize $T = E[(\Theta - c)^2 \mid X = x]$. Expanding it out: $T = E[\Theta^2 \mid X = x] + c^2 - 2cE[\Theta \mid X = x]$. Again taking the derivative of $T$ with respect to $c$ and setting it equal to zero, we get that the best $c$ is actually just $c = E[\Theta \mid X = x]$.

So no matter what $x$ we pick, for that $x$, the least mean squared estimate for $\Theta \mid X = x$ is just $E[\Theta \mid X = x]$.

Which means that the best estimator $g(X)$ of the unknown $\Theta$, given data $X$, is actually the random variable $E[\Theta \mid X]$.

Support/Figures/Pasted image 20250121071205.png
Support/Figures/Pasted image 20250121071245.png
Support/Figures/Pasted image 20250121071507.png

So we have concluded what the best estimator (or bestimator) is in the Bayesian setting: we have some unknown $\theta$, modelled by the random variable $\Theta$ with some prior distribution $f_\Theta(\cdot)$, and a noisy measurement process modelled by $f_{X|\Theta}$, whose outputs are modelled by the random variable $X$; we update our prior distribution on $\Theta$ to a better posterior distribution on $\Theta \mid X$ using Bayes' rule.

So the best estimator $g(X)$ in this setting, which minimizes the expected squared error between $\Theta$ and $g(X)$ in the universe conditioned on the random variable $X$, is $g(X) = E[\Theta \mid X]$.

And furthermore, it is not only true that $E[\Theta \mid X]$ beats any other $g(X)$ (for LMS error) in the universe conditioned on $X$; it beats any estimator $g(X)$ unconditionally as well.

That is, for observed data modelled by $X$, the estimate $E[\Theta \mid X]$ of $\Theta$ is, in the conditional universe, a better estimate than any other function $g(X)$ of the observed data.

We write this estimator $E[\Theta \mid X]$ as $\hat{\Theta}$.

So what is the error of this estimator? Well, it's $\tilde{\Theta} = \hat{\Theta} - \Theta$. What is the expected value of this error?
Support/Figures/Pasted image 20250121081117.png

The above idea comes from the fact that $\hat{\Theta} = g(X)$, hence $E[\hat{\Theta} \mid X] = \hat{\Theta}$ (once we are given the random variable $X$, $\hat{\Theta}$ is determined, hence its expected value is itself), and also, by definition, $\hat{\Theta} = E[\Theta \mid X]$.

So the expected error between the bestimator and the unknown random variable, conditioned on any data, is the zero random variable.

That is, $E[\tilde{\Theta} \mid X] = 0$ (the zero random variable). Notice that this expectation is actually a function of the random variable $X$; since no matter what $x$ you pick this function takes it to $0$, the law of iterated expectations gives the unconditioned expected value of the error: $E[\tilde{\Theta}] = 0$ (the number $0$).

Now, notice that $E[\tilde{\Theta}h(X) \mid X] = h(X)\,E[\tilde{\Theta} \mid X] = 0$ (pulling out what is known). Hence, no matter what the function $h$ is, the expected value $E[\tilde{\Theta}h(X)] = 0$ (the number zero).
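
A quick Monte Carlo sanity check of both facts, in a toy model where $\Theta$ and $W$ are independent standard normals and $X = \Theta + W$, so that the conditional expectation has the closed form $E[\Theta \mid X] = X/2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

theta = rng.normal(0.0, 1.0, n)          # Theta ~ N(0, 1)
w = rng.normal(0.0, 1.0, n)              # W ~ N(0, 1), independent noise
x = theta + w                            # X = Theta + W

theta_hat = x / 2                        # E[Theta | X] = X/2 in this model
theta_tilde = theta_hat - theta          # the error random variable

print(theta_tilde.mean())                # ~0: E[error] = 0
print((theta_tilde * np.sin(x)).mean())  # ~0: E[error * h(X)] = 0 for any h
```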

Now for linear LMS, we assume our estimator has an affine relationship with the data variable $X$, i.e. $\hat{\Theta}_L = aX + b$.
Support/Figures/Pasted image 20250121085407.png

There is a nice intuitive idea here. Of course, given a prior on $\Theta$, we start with the mean (or expectation) of this prior.
Then, given the data $X$, we look at the covariance of $\Theta$ and $X$.
If the data we collected has zero covariance with $\Theta$, then the data we collected is useless; hence the mean of the prior is still the best we can do.

If the covariance of $\Theta$ and $X$ is positive, they tend to be bigger than their means together, or smaller than their means together. In this case, if some observation $x$ is bigger than the mean $E[X]$, then the $\theta$ that came in was probably bigger than its own mean, so we add something to $E[\Theta]$. And the other way around works too: whenever $x$ is smaller than its mean, since the covariance is positive, we take a little away from $E[\Theta]$.

If the covariance of $\Theta$ and $X$ is negative, an analogous argument can be made.

So we are making corrections to our prior based on the covariance of the data and the prior.
The error of this linear estimator increases when the prior itself is quite uncertain, that is, when $\sigma_\Theta^2$ is large; and the error gets smaller when the correlation $\rho$ between $X$ and $\Theta$ has a large magnitude. In fact, the mean squared error comes out to $(1 - \rho^2)\sigma_\Theta^2$.
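
A sanity check of both the estimator $\hat{\Theta}_L = E[\Theta] + \frac{\mathrm{Cov}(\Theta, X)}{\mathrm{Var}(X)}(X - E[X])$ and its error, estimated from samples in the same kind of toy additive-noise model (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
theta = rng.normal(5.0, 2.0, n)      # prior: Theta ~ N(5, 4)
x = theta + rng.normal(0.0, 1.0, n)  # noisy measurement X = Theta + W

cov = np.cov(theta, x)[0, 1]
a = cov / x.var()                    # correction weight Cov(Theta, X) / Var(X)
theta_hat = theta.mean() + a * (x - x.mean())

rho = cov / (theta.std() * x.std())
print(((theta_hat - theta) ** 2).mean())  # empirical mean squared error
print((1 - rho ** 2) * theta.var())       # (1 - rho^2) * sigma_Theta^2: should match
```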
Support/Figures/Pasted image 20250121090641.png

Support/Figures/Pasted image 20250121091211.png

Classical inference

Support/Figures/Pasted image 20250121092701.png
Basically, here we don't treat the input as a random variable. Rather, we might think about the input as a collection of parameters that models our noisy measurements: $p_X(x; \theta)$.
That is, we have different models or distributions for our measurement $X$, and all of these together are parametrized by the values of $\theta$.

So how do we get a good estimator then? We use maximum likelihood. That is, for an estimation problem in the classical setting, we simply pick $\hat{\theta} = \arg\max_\theta\, p_X(x; \theta)$. If we make a single measurement, we want to pick the parameters that make this measurement most probable; if we make a vector of measurements ($x$ is a vector), then we want the parameters that make this vector of measurements as probable as possible.

Compare this with Bayesian MAP inference: $\theta_{MAP} = \arg\max_\theta\, p_{\Theta|X}(\theta \mid x)$. Using Bayes' rule, $\theta_{MAP} = \arg\max_\theta\, \frac{p_{X|\Theta}(x \mid \theta)\, p_\Theta(\theta)}{p_X(x)}$. Now, the denominator doesn't contain $\theta$, and moreover, if we assume an "uninformative prior" where every $\theta$ is equally likely, then $p_\Theta(\theta)$ is a constant. So under these conditions,
$\theta_{MAP} = \arg\max_\theta\, p_{X|\Theta}(x \mid \theta)$. Since the distribution on $\Theta$ is uniform, the conditional distribution plays exactly the role of a probability model $q_X(x; \theta)$ parametrized by $\theta$, giving probabilities that $X = x$. Why is it so? Well, because the distribution of $\Theta$ is already fixed and decided to be uniform, two different $\theta_1, \theta_2$ are equally likely; so each $\theta_i$ specifies a probability distribution on $X$ that has nothing to do with how likely $\theta_i$ is, because all of them are equally likely.

Hence, in the case of an uninformative (uniform) prior distribution, MAP inference and MLE (maximum likelihood estimation) are equivalent.

So we can write $\theta_{ML} = \arg\max_\theta\, p_X(x \mid \theta)$. In particular, if $X$ models a vector of independent observations (a joint distribution of $X_1, X_2, \ldots, X_n$), then
$$\theta_{ML} = \arg\max_\theta \prod_{i=1}^n p_{X_i}(x_i \mid \theta)$$
Let us say we want to find the derivative of $T = \prod_{i=1}^n p_{X_i}(x_i \mid \theta)$ with respect to $\theta$. By the product rule, we need to take the derivative of each $p_{X_i}(x_i \mid \theta)$ and multiply it with all the other terms in the product, do this for each $i = 1, \ldots, n$, and add them all up. So we are doing about $O(n^2)$ multiplications and additions, as the expansion below shows.
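
Explicitly, the product rule gives $$\frac{dT}{d\theta} = \sum_{i=1}^n \left(\frac{d}{d\theta}p_{X_i}(x_i \mid \theta)\right)\prod_{j \neq i} p_{X_j}(x_j \mid \theta)$$ where each of the $n$ terms involves a product of $n-1$ factors.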
Since $x \mapsto -\log(x)$ is strictly decreasing, instead of maximizing $p_X(x \mid \theta)$ we can equivalently minimize $Q = -\log(p_X(x \mid \theta))$, whose derivative with respect to $\theta$ is much easier to calculate. That is, $$\theta_{ML} = \arg\min_\theta \sum_{i=1}^n -\log(p_{X_{i}}(x_{i} \mid \theta))$$

And $$\frac{dQ}{d\theta} = -\sum_{i=1}^n \frac{1}{p_{X_{i}}(x_{i} \mid \theta)} \cdot \frac{d}{d\theta}p_{X_{i}}(x_{i} \mid \theta),$$ which requires only a linear number of terms.

The negative log probabilities are also tied to information theory, where the entropy of a random variable is the expectation of the negative log probability of that variable: $H[X] = E[-\log p_X(X)]$.

So now an interesting problem. Suppose we draw $x_1, x_2, \ldots, x_n$ from a Gaussian distribution with variance 1. What is the MLE of the mean?
Support/Figures/Pasted image 20250121112849.png
So that's a Gaussian, and if the variance is 1, then $\sigma = 1$ as well, so $$f(x) = ce^{-{(x-\mu)^2}/2}$$

So $X$ is the random variable representing the joint, independent variables $X_1, X_2, \ldots, X_n$, where each $X_i$ has the same Gaussian $f(x)$ as its distribution. We want to calculate the probability of getting $x = (x_1, x_2, \ldots, x_n)$,
so $$p_X(x|\mu) = \prod_{i=1}^np_{X_{i}}(x_{i}) = c^n\prod_{i=1}^ne^{-{(x_{i}-\mu)^2/2}}$$
Hence it is sufficient to maximize $$T(\mu) = \prod_{i=1}^ne^{-(x_{i}-\mu)^2/2}$$
Equivalently, we can minimize (by taking $-\log$ on both sides) $$ Q(\mu) = \sum_{i=1}^n(x_{i}-\mu)^2 = \sum_{i=1}^nx_{i}^2 +n\mu^2 - 2\mu\sum_{i=1}^nx_{i}$$ Setting $\frac{dQ}{d\mu} = 2n\mu - 2\sum_{i=1}^n x_i = 0$ and moving things around, we find the obvious $$ \mu_{ML} = \frac{\sum_{i=1}^nx_{i}}{n}$$
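
A quick numerical check, on synthetic data, that minimizing $Q(\mu)$ over a grid lands on the sample mean:

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(3.0, 1.0, 500)     # samples from a Gaussian with mu = 3, variance 1

mus = np.linspace(0.0, 6.0, 6001)  # grid of candidate means
# Q(mu) = sum_i (x_i - mu)^2, the negative log-likelihood up to constants
Q = ((xs[:, None] - mus[None, :]) ** 2).sum(axis=0)

print(mus[np.argmin(Q)])           # the minimizer of Q ...
print(xs.mean())                   # ... matches the sample mean
```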

We can say that this $\mu_{ML}$ is actually a random variable that depends on the random variables $X_1, X_2, \ldots, X_n$ (which in our case are the same random variable $X$ sampled over and over).

$$\Upsilon_{ML} = \frac{\sum_{i=1}^n X_i}{n}$$

So this is a random variable that depicts the estimator of the true mean $\mu$. What is the expected value of this random variable? That is, on average, when we do this sampling process many times and get many different $\mu_{ML}$'s, what is the average $\mu_{ML}$ we would see?

$$E[\Upsilon_{ML}] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \mu$$

Well, since in this case each $X_i$ has the same Gaussian distribution with mean $\mu$, it turns out that the average $\mu_{ML}$ is actually the correct mean $\mu$ which we are trying to estimate.

In other words, the MLE estimator of $\mu$, the random variable $\Upsilon_{ML}$, is expected to be the true mean $\mu$ itself. This means that our estimator is unbiased: on average, it does not favour something larger or smaller than the unknown it's trying to estimate.
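
A quick simulation of the unbiasedness claim: repeat the whole sampling-and-estimating experiment many times and average the resulting estimates (toy numbers):

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true, n, trials = 3.0, 50, 20_000

# each row is one run of the experiment: n samples, one ML estimate (the sample mean)
estimates = rng.normal(mu_true, 1.0, (trials, n)).mean(axis=1)
print(estimates.mean())  # ~ mu_true: the estimator is unbiased
```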

So then, we might consider an estimator that we get by applying MLE to some noisy measurement $X$ whose distribution is parametrized by the unknown $\theta$ we want to estimate.
Call this estimator random variable $\hat{\Theta}_n(x; \theta)$, where the estimator depends on the vector $x = (x_1, x_2, \ldots, x_n)$ we independently sampled (each from $X$, which comes with a model of our noisy measurements) when the true input parameter was $\theta$.

Then, it is nice if the following properties hold for all $\theta$: