8 - Probability - Conditional expectation and variance
Imagine we have a stick depicting the interval $[0, \ell]$.
Suppose $Y$ is a random variable on $[0, \ell]$, meaning $f_Y(y)$ is the probability density of picking the point $y$.
Suppose $X$ is another random variable, which is determined by first sampling a point $y$ from $[0, \ell]$ according to $f_Y$, and then sampling a point from the restricted random variable $X \mid Y = y$, which samples in the range $[0, y]$ with a distribution $f_{X \mid Y}(x \mid y)$.
So the random variable $X$ depends on the random variable $Y$ itself.
Pick a particular $y$.
Then, the expected value of $X$, given the point $y$, is written $$E[X \mid Y = y] = \int x \, f_{X \mid Y}(x \mid y) \, dx$$
Hence, the integral is some function of $y$, so we can say $E[X \mid Y = y] = g(y)$.
However, before sampling a particular $y$, we can say that $E[X \mid Y] = g(Y)$. That is, the expected value of $X$ given a random variable $Y$ is itself a random variable, a function of $Y$. This makes sense: only after picking some $y$ is the distribution of picking $x$ fixed, so the realization of $Y$ affects picking $X$ (remember that $X$ is the overall act of picking both, one after the other).
Then, $$E[E[X \mid Y]] = E[g(Y)] = \int g(y) \, f_Y(y) \, dy$$
Using $g$'s definition, $$E[E[X \mid Y]] = \int \left( \int x \, f_{X \mid Y}(x \mid y) \, dx \right) f_Y(y) \, dy$$
There is a scent of the total expectation theorem in integral form here. We are splitting the random variable $X$ into disjoint portions (one for each $y$) and getting the conditional expected value of $X$ in the inner integrand, and in the outer integral we are "adding up" all the disjoint portions, weighted by $f_Y(y)$. Hence, this is equivalent to simply evaluating the expected value of the random variable $X$ itself (where $X$ depicts the two-fold act described above).
Hence, $E[E[X \mid Y]] = E[X]$. (law of iterated expectation)
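As a quick sanity check of the law of iterated expectation, here is a small Monte Carlo sketch. It assumes, for concreteness, that $Y$ is uniform on $[0, 1]$ and $X \mid Y = y$ is uniform on $[0, y]$ (one possible choice of distributions for the stick, not the only one), so that $g(y) = y/2$ and $E[X] = E[Y]/2 = 1/4$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two-step act: first Y ~ Uniform(0, 1), then X | Y = y ~ Uniform(0, y)
y = rng.uniform(0.0, 1.0, size=n)
x = y * rng.uniform(0.0, 1.0, size=n)  # scaling a Uniform(0, 1) by y gives Uniform(0, y)

# Here g(y) = E[X | Y = y] = y / 2, so E[E[X | Y]] = E[Y / 2] = 1/4
print(np.mean(x))      # direct estimate of E[X], ~0.25
print(np.mean(y / 2))  # estimate of E[E[X | Y]], ~0.25
```

Both estimates agree with each other and with the analytic value $1/4$, as the theorem predicts.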
The above idea strictly applies when the random variable $X \mid Y = y$ is completely determined when you pick a particular $y$ from $Y$.
Notice that $$var(X \mid Y) = E[X^2 \mid Y] - (E[X \mid Y])^2$$
Again, the variance of $X$ given $Y$ is itself a random variable: whenever $y$ is realized from $Y$, only then is the distribution of $X$ completely determined ($X$ is always a two-step process, if you will).
Hence $$E[var(X \mid Y)] = E[E[X^2 \mid Y]] - E[(E[X \mid Y])^2]$$
Therefore, applying the law of iterated expectation to the first term, $$E[var(X \mid Y)] = E[X^2] - E[(E[X \mid Y])^2]$$
Since $E[X \mid Y]$ is also a random variable, $$var(E[X \mid Y]) = E[(E[X \mid Y])^2] - (E[E[X \mid Y]])^2 = E[(E[X \mid Y])^2] - (E[X])^2$$
Adding both, the $E[(E[X \mid Y])^2]$ terms cancel, and we get $$E[var(X \mid Y)] + var(E[X \mid Y]) = E[X^2] - (E[X])^2 = var(X)$$
This is called the total variance theorem (law of total variance).
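Continuing the same uniform stick sketch (again an assumed choice of distributions, $Y \sim U[0,1]$ and $X \mid Y = y \sim U[0, y]$): here $var(X \mid Y) = Y^2/12$, so $E[var(X \mid Y)] = 1/36$, and $var(E[X \mid Y]) = var(Y/2) = 1/48$. Their sum, $7/144 \approx 0.0486$, should match $var(X)$; a Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

y = rng.uniform(0.0, 1.0, size=n)      # Y ~ Uniform(0, 1)
x = y * rng.uniform(0.0, 1.0, size=n)  # X | Y = y ~ Uniform(0, y)

within = np.mean(y**2 / 12)  # E[var(X|Y)], since var(X|Y) = Y^2 / 12 here
between = np.var(y / 2)      # var(E[X|Y]), since E[X|Y] = Y / 2 here
total = np.var(x)            # var(X) estimated directly

print(within + between, total)  # both ~ 7/144 ≈ 0.0486
```

The within-group and between-group pieces add up to the directly estimated variance, which is the total variance theorem in action.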
Let us try to get some physical interpretation of both theorems: iterated expectation and total variance.
So imagine you have a class taking machine learning. The students are divided into the CS major group, the math major group, and the physics major group. So let the random variable $Y$ take the values $1, 2, 3$ for CS, math, and physics majors respectively, and let $X$ be the random variable depicting the quiz scores of this class.
Then the expected quiz score $E[X]$ can be computed using the groups. That is, given the CS majors, we have an expected score $E[X \mid Y = 1]$; for the math majors, we have an expected score $E[X \mid Y = 2]$; and for the physics majors, an expected score $E[X \mid Y = 3]$.
We can model these 3 conditional scores as a random variable $E[X \mid Y]$, which is a function of $Y$ itself: the expected score depends on which group in $Y$ you pick. Then the expected score itself is $$E[X] = E[E[X \mid Y]] = \sum_y P(Y = y) \, E[X \mid Y = y]$$ This is just the total expectation theorem: we break $X$ into disjoint groups; if we know the expected value of each group, we can weight it by the probability of picking that group and add it all up.
Hence $E[X] = E[E[X \mid Y]]$.
Now, what is the variance of the scores of the class, $var(X)$? $E[X \mid Y]$ is the random variable that models the expected score given each group, so $var(E[X \mid Y])$ depicts the variation between groups of the expected score.
And $E[var(X \mid Y)]$ models the expected variance inside each group. So, intuitively, we have the total variance theorem: the total variance of scores is the variation between groups plus the average variation within each group.
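To make the class example concrete, here is a small numerical sketch; all the group probabilities, means, and variances below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical data: P(Y = y), E[X|Y = y], var(X|Y = y) for CS, math, physics
p = np.array([0.5, 0.3, 0.2])           # probability of picking each group
mean_g = np.array([70.0, 80.0, 90.0])   # expected score per group
var_g = np.array([100.0, 64.0, 81.0])   # score variance within each group

# Law of iterated expectation: E[X] = sum_y P(Y = y) * E[X|Y = y]
overall_mean = np.sum(p * mean_g)

# Total variance theorem: var(X) = E[var(X|Y)] + var(E[X|Y])
within = np.sum(p * var_g)                       # E[var(X|Y)]: average within-group variance
between = np.sum(p * (mean_g - overall_mean)**2) # var(E[X|Y]): variation between group means
overall_var = within + between

print(overall_mean)  # 77.0
print(overall_var)   # ≈ 146.4
```

Note that both theorems only need the per-group summaries: no individual quiz scores are required to recover the class-wide mean and variance.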