8- probability - Conditional expectation and variance

Imagine we have a stick representing the interval $I=[0,L]$.
Suppose $Y$ is a random variable on $I$, meaning $f_Y(y)$ is the probability density of picking the point $y \in I$.
Suppose $X$ is another random variable, determined by first sampling a point $y$ from $Y$ according to $f_Y(y)$, and then sampling a point $x$ from the range $[0,y]$ according to the conditional density $f_{X|Y}(x|y)$.
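As a concrete sketch of this two-step process (assuming, purely for illustration, that both $Y$ and $X|Y=y$ are uniform, and taking $L=1$; the function name `sample_xy` is just a placeholder):

```python
import random

def sample_xy(L=1.0):
    """Two-step sampling: first y from Y (assumed uniform on [0, L]),
    then x from X|Y=y (assumed uniform on [0, y])."""
    y = random.uniform(0, L)   # step 1: realize Y
    x = random.uniform(0, y)   # step 2: realize X given Y = y
    return x, y

x, y = sample_xy()
print(f"picked y = {y:.3f}, then x = {x:.3f} from [0, {y:.3f}]")
```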

So the random variable $X$ depends on the random variable $Y$ itself.
Pick a particular $y$.
Then, the expected value of $X$, given the point $y$, is written $$E[X|Y=y] = \int_0^y x\, f_{X|Y}(x|y)\, dx$$

Hence, the integral is some function of $y$, so we can say $E[X|Y=y]=g(y)$.
However, before sampling a particular $y$, we can say that $E[X|Y] = g(Y)$. That is, the expected value of $X$ given the random variable $Y$ is itself a random variable, a function of $Y$. This makes sense: for a fixed way of picking $x$ (after picking some $y$), the distribution over which $y$ gets picked affects which $x$ gets picked (remember that $X$ is the overall act of picking both, one after the other).
Then, $$E[E[X|Y]] = E[g(Y)] = \int_0^L g(y)\, f_Y(y)\, dy$$
Using $g(y)$'s definition, $$E[E[X|Y]] = \int_0^L \left( \int_0^y x\, f_{X|Y}(x|y)\, dx \right) f_Y(y)\, dy$$
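As a sanity check, suppose (just for illustration) that both densities are uniform, $f_Y(y)=\frac{1}{L}$ and $f_{X|Y}(x|y)=\frac{1}{y}$. Then the inner integral gives $g(y)=\int_0^y \frac{x}{y}\, dx = \frac{y}{2}$, and the outer integral gives $$E[E[X|Y]] = \int_0^L \frac{y}{2}\,\frac{1}{L}\, dy = \frac{L}{4}$$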

There is a scent of the total expectation theorem in integral form here. We are splitting the random variable $X$ into disjoint portions (one for each $y$) and getting the conditional expected value of $X$ in the inner integrand, and in the outer integral we are "adding up" all the disjoint portions. Hence, this is equivalent to simply evaluating the expected value of the random variable $X$ itself (where $X$ depicts the two-fold act described above).

Hence, $E[E[X|Y]] = E[X]$. (law of iterated expectation)
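A quick Monte Carlo check of this, under the same uniform assumptions as above (the sample size $N$ and $L=1$ are arbitrary choices); both estimates should come out near $L/4 = 0.25$:

```python
import random

L = 1.0
N = 200_000

xs, gys = [], []
for _ in range(N):
    y = random.uniform(0, L)          # realize Y
    xs.append(random.uniform(0, y))   # realize X given Y = y
    gys.append(y / 2)                 # g(y) = E[X | Y = y] in the uniform case

print("E[X]      ~", sum(xs) / N)     # both should be close to L/4 = 0.25
print("E[E[X|Y]] ~", sum(gys) / N)
```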

The above idea strictly applies when the conditional distribution of $X$ is completely determined once you pick a particular $y$ from $Y$.

Notice that $$\mathrm{var}(X|Y) = E[X^2|Y] - E[X|Y]^2$$
Again, the variance of $X|Y$ is itself a random variable, since only when $y$ is realized from $Y$ is the distribution of $X$ completely determined ($X$ is always a two-step process, if you will).
Hence $$E[\mathrm{var}(X|Y)] = E[E[X^2|Y]] - E[E[X|Y]^2]$$
Therefore $$E[\mathrm{var}(X|Y)] = E[X^2] - E[E[X|Y]^2]$$
Since $E[X|Y]$ is also a random variable,
$$\mathrm{var}(E[X|Y]) = E[E[X|Y]^2] - E[E[X|Y]]^2 = E[E[X|Y]^2] - E[X]^2$$
Adding both, we get $$E[\mathrm{var}(X|Y)] + \mathrm{var}(E[X|Y]) = E[X^2] - E[X]^2 = \mathrm{var}(X)$$
This is called the total variance theorem.
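Here is a sketch that verifies the decomposition numerically for the uniform stick example; the closed forms $E[X|Y]=Y/2$ and $\mathrm{var}(X|Y)=Y^2/12$ hold only under that uniform assumption:

```python
import random
import statistics as st

L = 1.0
N = 200_000

ys = [random.uniform(0, L) for _ in range(N)]
xs = [random.uniform(0, y) for y in ys]

# Under the uniform assumption: E[X|Y] = Y/2 and var(X|Y) = Y^2 / 12.
cond_means = [y / 2 for y in ys]
cond_vars = [y * y / 12 for y in ys]

lhs = st.pvariance(xs)                                 # var(X), estimated directly
rhs = st.mean(cond_vars) + st.pvariance(cond_means)    # E[var(X|Y)] + var(E[X|Y])
print("var(X)                    ~", lhs)
print("E[var(X|Y)] + var(E[X|Y]) ~", rhs)
```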
Let us try to get a physical interpretation of both theorems: iterated expectation and total variance.
So imagine you have a class taking machine learning. The students are divided into the CS major group, the math major group, and the physics major group. So let the random variable $Y$ take the values 0, 1, 2 for CS, math, and physics majors respectively, and let $X$ be the random variable depicting quiz scores of this class.
Then the expected quiz score $E[X]$ can be computed using the groups. That is, for the CS majors we have an expected score $E[X|Y=0]$, for the math majors we have an expected score $E[X|Y=1]$, and for the physics majors an expected score $E[X|Y=2]$.
We can model these 3 conditional scores as a random variable $E[X|Y]$, which is a function of $Y$ itself: the expected score $E[X|Y]$ depends on which group in $Y$ you pick. Then the expected score itself is $$E[X] = P(Y=0)\,E[X|Y=0] + P(Y=1)\,E[X|Y=1] + P(Y=2)\,E[X|Y=2]$$ This is just the total expectation theorem: we break into disjoint groups, and if we know the expected value of each group, we can weight it by the probability of picking that group and add it all up.
Hence $E[X] = E[E[X|Y]]$.
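To make the weighting concrete, here is a tiny sketch with made-up group probabilities and per-group means (the numbers are hypothetical, not from the text):

```python
# Hypothetical numbers, just to make the weighting concrete.
p = {0: 0.5, 1: 0.3, 2: 0.2}           # P(Y=y) for CS, math, physics (assumed)
cond_mean = {0: 7.0, 1: 8.0, 2: 6.5}   # E[X|Y=y] for each group (assumed)

# Total expectation: weight each group's mean by the probability of that group.
EX = sum(p[y] * cond_mean[y] for y in p)
print("E[X] =", EX)   # 0.5*7.0 + 0.3*8.0 + 0.2*6.5 = 7.2
```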

Now, what is the variance of the scores of the class, $\mathrm{var}(X)$? $E[X|Y]$ is the random variable that models the expected score given each group, so $\mathrm{var}(E[X|Y])$ depicts the variation between groups of the expected score,
and $E[\mathrm{var}(X|Y)]$ models the expected variance inside each group. So intuitively, we have the total variance theorem.
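And a matching sketch of the variance decomposition for the same hypothetical class numbers, splitting $\mathrm{var}(X)$ into the within-group and between-group pieces:

```python
# Same hypothetical class numbers as above, plus an assumed variance per group.
p = {0: 0.5, 1: 0.3, 2: 0.2}           # P(Y=y)
cond_mean = {0: 7.0, 1: 8.0, 2: 6.5}   # E[X|Y=y]
cond_var = {0: 1.0, 1: 0.5, 2: 2.0}    # var(X|Y=y)

EX = sum(p[y] * cond_mean[y] for y in p)                    # E[X] = E[E[X|Y]]
within = sum(p[y] * cond_var[y] for y in p)                 # E[var(X|Y)]
between = sum(p[y] * (cond_mean[y] - EX) ** 2 for y in p)   # var(E[X|Y])
print("var(X) =", within + between)   # within-group + between-group = 1.36
```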