5 - Probability - Discrete RV

Sample space, probability measure

The sample space is the set of all possible outcomes of an experiment, such that exactly one of them occurs whenever we do the experiment (the outcomes are mutually exclusive and the space of outcomes is exhaustive); for example, the outcome of a coin toss can't be both "H" and "T". Sample spaces are usually denoted $\Omega$.
Certain subsets of $\Omega$ are called "measurable", and to all such subsets we can assign a probability, so if $\Sigma$ is the collection of measurable subsets, then a probability measure is a function $p:\Sigma \to [0,1]$. $\Sigma$ is called a sigma algebra on $\Omega$ and is stable under countable union, complement, and countable intersection; moreover $\Omega \in \Sigma$.
There are three axioms on $p$:
$p(\Omega) = 1$
$p(A) \geq 0$
if $A \cap B = \emptyset$, then $p(A \cup B) = p(A) + p(B)$.
if $\Omega$ is discrete and finite, the uniform probability measure of $A \subseteq \Omega$ is $|A|/|\Omega|$.
if $\Omega$ is continuous, then the uniform probability measure of $A \subseteq \Omega$ is its area (or $n$-dimensional volume) as a fraction of that of $\Omega$.
Note: Zero probability does not mean impossible. Points in continuous spaces have zero probability, yet the outcome of an experiment can be a single point. Similarly, probability 1 does not mean the event occurs all the time. For example, the probability of getting any point on the uniform unit square that isn't the origin is 1 (the origin has zero probability), yet of course we can still pick the origin.
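As a minimal sketch of the finite uniform measure (the six-sided die and the particular events below are my own illustration, not from these notes):

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def uniform_p(event, sample_space):
    """Uniform probability measure on a finite sample space: |A| / |Omega|."""
    assert event <= sample_space, "an event must be a subset of the sample space"
    return Fraction(len(event), len(sample_space))

evens = {2, 4, 6}
print(uniform_p(evens, omega))   # 1/2
print(uniform_p(omega, omega))   # p(Omega) = 1
print(uniform_p(set(), omega))   # p(empty set) = 0
# Additivity for disjoint events: p({1,2} U {5}) = p({1,2}) + p({5})
print(uniform_p({1, 2} | {5}, omega) == uniform_p({1, 2}, omega) + uniform_p({5}, omega))  # True
```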

Support/Figures/Pasted image 20250115020236.png

Conditional probability

If we get some partial information that event $B$ has occurred, then the probability that event $A$ occurs given $B$ is: $p(A|B) = \frac{p(A \cap B)}{p(B)}$. This is essentially shifting the universe to $B$. The axioms of a probability measure still hold in this universe; for example, $p(A \cup B|C) = p(A|C) + p(B|C)$ as long as $A, B$ are disjoint.
Suppose we partition $\Omega$ into events $A_1, A_2, \dots, A_m$, and $B$ is also an event. Then, $p(B) = \sum_{i=1}^m p(B \cap A_i) = \sum_{i=1}^m p(A_i)\, p(B|A_i)$.
If we want to calculate $p(A_j|B) = \frac{p(A_j \cap B)}{p(B)}$, then we can write $p(A_j \cap B) = p(B \cap A_j) = p(B|A_j)\, p(A_j)$.
Hence, $p(A_j|B) = \frac{p(B|A_j)\, p(A_j)}{\sum_{i=1}^m p(A_i)\, p(B|A_i)}$.
This is known as Bayes' rule.
An alternative statement, keeping the denominator as $p(B)$ without expanding over a partition, is $p(A|B) = \frac{p(B|A)\, p(A)}{p(B)}$.
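A small numeric sketch of the total-probability and Bayes' rule formulas above; the two-event partition and the specific numbers are invented for illustration:

```python
# Partition of Omega into A_1, ..., A_m with priors p(A_i),
# plus likelihoods p(B | A_i).  The numbers are illustrative only.
priors = {"A1": 0.7, "A2": 0.3}
likelihoods = {"A1": 0.1, "A2": 0.8}  # p(B | A_i)

# Total probability: p(B) = sum_i p(A_i) p(B | A_i)
p_B = sum(priors[a] * likelihoods[a] for a in priors)

# Bayes' rule: p(A_j | B) = p(B | A_j) p(A_j) / p(B)
posteriors = {a: likelihoods[a] * priors[a] / p_B for a in priors}

print(p_B)                        # 0.31
print(posteriors)                 # A1 ~ 0.226, A2 ~ 0.774
print(sum(posteriors.values()))   # ~ 1, the posteriors form a distribution
```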
Support/Figures/Pasted image 20250115021451.png

If the occurrence of event $A$ produces no information about the occurrence of event $B$ (and vice versa), these events are independent: $p(A \cap B) = p(A)\, p(B)$.
Support/Figures/Pasted image 20250115022017.png
In a conditional universe, if, given the occurrence of $C$, the occurrence of $A$ produces no information about the occurrence of $B$, then $p(A \cap B|C) = p(A|C)\, p(B|C)$ (this is called conditional independence).
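A tiny check of the independence rule $p(A \cap B) = p(A)\, p(B)$ on two fair dice; the particular events are my own example:

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (first die, second die), uniform measure.
omega = set(product(range(1, 7), repeat=2))

def p(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] % 2 == 0}   # first die is even
B = {w for w in omega if w[1] == 6}       # second die shows a 6

# Knowing whether A occurred tells us nothing about B, and the product rule holds.
print(p(A & B) == p(A) * p(B))   # True: 1/12 == 1/2 * 1/6
```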
Support/Figures/Pasted image 20250115022557.png
Support/Figures/Pasted image 20250115022839.png
Support/Figures/Pasted image 20250115023220.png
Suppose we had a game where $\Omega$ is some set of outcomes and $X:\Omega \to \mathbb{R}^+$ is the payoff of each outcome, with a payoff of $x$ dollars occurring with probability $p_X(x)$. What is the expected payoff of this game?
If we play this game $N$ times, we are paid $x$ dollars roughly $p_X(x)N$ times, for each payout $x$. As $N$ gets very large, for each $i = 1$ to $n$, we are paid $x_i$ about $p_X(x_i)N$ times. So in $N$ games, if $N$ is very large, our total payout is $\sum_{i=1}^n p_X(x_i)\, x_i\, N$; averaging it out, the expected value is $E(X) = \sum_{i=1}^n p_X(x_i)\, x_i$.
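A sketch contrasting the long-run average over many plays with the weighted sum $\sum_i p_X(x_i)\, x_i$; the payoff distribution here is invented:

```python
import random

# An invented payoff distribution: payout -> probability.
pmf = {1: 0.5, 10: 0.3, 100: 0.2}

# Expected value as the probability-weighted sum of payouts.
expected = sum(x * p for x, p in pmf.items())

# Long-run average over N plays of the game.
random.seed(0)
N = 200_000
values, weights = zip(*pmf.items())
average = sum(random.choices(values, weights, k=N)) / N

print(expected)   # 23.5
print(average)    # close to 23.5 for large N
```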

Support/Figures/Pasted image 20250115024528.png
Now, imagine that whenever outcome $\omega \in \Omega$ occurs, instead of getting a payout of $X(\omega)$, we get a payout of $g(X(\omega))$. This is completely fine! To calculate the expectation of the new random variable $g(X)$, we know that whenever $(X = x_i)$, the payout is instead $g(x_i)$.
Hence, letting $Y = g(X)$, we have the following picture:
Support/Figures/Pasted image 20250115025153.png
Support/Figures/Pasted image 20250115025349.png
Support/Figures/Pasted image 20250115025455.png

The variance tells you the expected squared distance of a random variable from its mean/expected value: $\mathrm{var}(X) = E[(X - E[X])^2]$. The standard deviation is the square root of the variance.
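A minimal sketch of computing $E[g(X)]$ directly from the probability mass, and of variance as $E[(X - E[X])^2]$; the pmf is made up:

```python
import math

# Invented pmf for a discrete random variable X.
pmf = {-1: 0.2, 0: 0.5, 3: 0.3}

def expectation(g, pmf):
    """E[g(X)] = sum over x of g(x) * p_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x, pmf)                    # E[X]
variance = expectation(lambda x: (x - mean) ** 2, pmf)  # E[(X - E[X])^2]
std = math.sqrt(variance)

print(mean)       # 0.7
print(variance)   # 2.41, same as E[X^2] - E[X]^2
print(std)        # ~1.55
```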

The conditional expectation of a random variable $X$ given an event $A$ is, roughly, the average value we measure of $X$ given that event $A$ has already occurred. Again, if $A$ has already occurred, then the probability that we get a payout of $x$ is equal to $p_{X|A}(x)$, where $p_{X|A}$ is the new probability mass function from the values of $X$ to $[0,1]$ given by $p_{X|A}(x) = p(\{\omega \in \Omega : X(\omega) = x\} \mid A)$.
Hence $E[X|A] := \sum_x x\, p_{X|A}(x)$.

Imagine partitioning the sample space into $A, A^c$.
For any $x$ that is a possible value of the random variable $X$,
we have that $p_X(x) = p(A)\, p_{X|A}(x) + p(A^c)\, p_{X|A^c}(x)$.
This is because, in a mutually exclusive manner, either event $A$ occurs, and we enter the context where the probability mass of $X$ is conditioned on $A$, or $A^c$ occurs, entering the probability mass of $X$ conditioned on $A^c$. Hence, we can write that $E[X] = p(A)E[X|A] + p(A^c)E[X|A^c]$.
In general, $$E(X) = \sum_{i=1}^np(A_{i})E[X|A_{i}]$$ where $A_1, \dots, A_n$ is a partition of the sample space.
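A sketch of the total expectation theorem on a made-up two-event partition $\{A, A^c\}$; the conditional pmfs and $p(A)$ are arbitrary choices:

```python
# Invented conditional pmfs of X given each piece of the partition {A, A^c}.
p_A = 0.4
pmf_given_A  = {0: 0.5, 2: 0.5}    # p_{X|A}
pmf_given_Ac = {1: 0.25, 5: 0.75}  # p_{X|A^c}

def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

# E[X] = p(A) E[X|A] + p(A^c) E[X|A^c]
e_total = p_A * expectation(pmf_given_A) + (1 - p_A) * expectation(pmf_given_Ac)

# Same answer from the unconditional pmf p_X(x) = p(A) p_{X|A}(x) + p(A^c) p_{X|A^c}(x).
pmf_X = {}
for x, p in pmf_given_A.items():
    pmf_X[x] = pmf_X.get(x, 0) + p_A * p
for x, p in pmf_given_Ac.items():
    pmf_X[x] = pmf_X.get(x, 0) + (1 - p_A) * p

print(e_total)             # 2.8
print(expectation(pmf_X))  # 2.8
```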

Let $X$ be a random variable indicating the number of independent coin tosses after which we see the first "Heads".
Here the space of outcomes is the set of sequences of $n$ coin flips, for any $n \in \mathbb{N}$.
Now, we can partition this space into two events.
Let $A$ be the event that the first flip was a tail, and $A^c$ the event that the first flip was a head.
Then $E[X] = p(A)\, E[X|A] + p(A^c)\, E[X|A^c]$.
But $A^c$ is the event that the first flip is "Heads"; therefore, $E[X|A^c] = 1$. And $A$ is the event that we flip a tail on the first flip. Since we have already flipped a tail, and subsequent coin tosses are independent, we have effectively come back to the same probability mass on $X$, but we have spent one extra flip. Hence $E[X|A] = E[X+1]$. Using linearity of expectation, and letting $p$ be the probability of heads, we get $E[X] = (1-p)(E[X]+1) + p$. Hence $E[X] = 1/p$.
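A simulation sketch of the first-heads argument: with heads probability $p$, the empirical average number of flips should sit near $1/p$ (the value of $p$ and the trial count are arbitrary):

```python
import random

def flips_until_first_heads(p, rng):
    """Count independent coin tosses until the first heads appears."""
    count = 1
    while rng.random() >= p:   # tails with probability 1 - p, so keep flipping
        count += 1
    return count

rng = random.Random(0)
p = 0.25
trials = 100_000
mean = sum(flips_until_first_heads(p, rng) for _ in range(trials)) / trials

print(mean)   # close to 1 / p = 4
```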

Joint distributions:

If $X, Y$ are random variables with mass functions $p_X, p_Y$, the joint distribution is $p_{X,Y}(x,y) := P(X = x \text{ and } Y = y)$. We still have that $\sum_y \sum_x p_{X,Y}(x,y) = 1$.
To extract the probability mass $p_X$, we have $p_X(x) = \sum_y p_{X,Y}(x,y)$. This is called the marginal distribution of $X$, where we are marginalizing out $Y$.

Now, the conditional distribution is $p_{X|Y}(x|y) := P(X=x \mid Y=y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}$. Here, we imagine fixing an output of the random variable $Y$ as $y$; for each such choice, $p_{X|Y}(x|y)$ has the same "shape" as the "slice" of the joint distribution at that particular $y$, but is re-scaled by the marginal of $Y$ at that $y$, to ensure that the probabilities add up to 1.
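A sketch of marginalizing and conditioning a small joint pmf stored as a dictionary keyed by $(x, y)$; the table entries are invented:

```python
from collections import defaultdict

# An invented joint pmf p_{X,Y}(x, y); the entries sum to 1.
joint = {(0, 0): 0.10, (0, 1): 0.30,
         (1, 0): 0.25, (1, 1): 0.35}

# Marginal of X: p_X(x) = sum over y of p_{X,Y}(x, y)
p_X = defaultdict(float)
for (x, y), pr in joint.items():
    p_X[x] += pr

# Marginal of Y, needed to rescale when conditioning on Y = y.
p_Y = defaultdict(float)
for (x, y), pr in joint.items():
    p_Y[y] += pr

# Conditional p_{X|Y}(x | y): the y-slice of the joint, rescaled by p_Y(y).
y = 1
p_X_given_y = {x: joint[(x, y)] / p_Y[y] for x in p_X}

print(dict(p_X))                  # {0: 0.4, 1: 0.6}
print(p_X_given_y)                # {0: ~0.462, 1: ~0.538}
print(sum(p_X_given_y.values()))  # ~ 1, the slice is rescaled to a distribution
```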

Just generalizing, suppose $\langle X_i \rangle$ is a list of random variables which take on tuple values $\langle x_i \rangle$ for $i = 1 \dots n$.

Using our intuition, if $\langle X_i \rangle$ takes values $\langle x_i \rangle$,
then first $X_1$ takes value $x_1$; then, conditioned on that, $X_2$ takes value $x_2$; then, conditioned on both of those, $X_3$ takes value $x_3$, and so on.

Hence $$p_{X_{1},X_{2},\dots X_{n}}(x_{1},x_{2},\dots x_{n}) = \prod_{i=1}^n p_{X_{i}|X_{1}, X_{2}, \dots X_{i-1}}(x_{i}|x_{1},x_{2},\dots x_{i-1})$$
where the $i$-th variable is conditioned on all the previous ones.
Now, if it is true that these $n$ random variables are pairwise, triple-wise, and so on independent (mutually independent), that is, any subset of these random variables taking some values gives no information about any other random variable, then for any subset of indices $a_1, \dots, a_k$:

$$p_{X_{a_1}|X_{a_2},\dots,X_{a_k}}(x_{a_1}|x_{a_2},\dots,x_{a_k}) = p_{X_{a_1}}(x_{a_1})$$

So if $n$ random variables are independent, then using the conditional expression just above together with the chain rule further above, we get:

$$p_{X_1,X_2,\dots,X_n}(x_1,x_2,\dots,x_n) = \prod_{i=1}^n p_{X_i}(x_i)$$

That is, if the random variables $X_1, X_2, \dots, X_n$ are independent, their joint probability mass is the product of each marginal probability mass.

If $X_1, X_2, \dots, X_n$ are random variables, with the tuple $(x_1, x_2, \dots, x_n)$ giving a payout of $g(x_1, x_2, \dots, x_n)$, then the expected value is: $$E[g(X_{1},X_{2},\dots,X_{n})] = \sum_{x_{1}\in X_{1}} \sum_{x_{2} \in X_{2}} \dots \sum_{x_{n} \in X_{n}} g(x_{1},x_{2},\dots x_{n})\, p_{X_{1},X_{2},\dots X_{n}}(x_{1},x_{2},\dots, x_{n})$$

Here again, we can show linearity. Let $g$ be the map $(x_1, x_2, \dots, x_n) \mapsto a_1x_1 + a_2x_2 + \dots + a_nx_n$. Just using our eyes and mind and the distributive law, and factoring things out, we have:

$$E\left[\sum_{j=1}^n a_j X_j\right] = \sum_{j=1}^n a_j E[X_j]$$

So the expectation of a linear combination of random variables is the linear combination of the expectations of the random variables :)
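A quick numeric check of linearity on two random variables defined on the same sample space; note the two variables below are not independent, and linearity holds regardless (the setup is my own toy example):

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips (0 = tails, 1 = heads), uniform measure.
omega = list(product([0, 1], repeat=2))
prob = Fraction(1, len(omega))

# Two dependent random variables on the same space.
X = lambda w: w[0] + w[1]   # number of heads
Y = lambda w: w[0] * w[1]   # 1 only if both flips are heads

def E(Z):
    return sum(prob * Z(w) for w in omega)

a1, a2 = 3, -2
print(E(lambda w: a1 * X(w) + a2 * Y(w)))   # 5/2
print(a1 * E(X) + a2 * E(Y))                # 5/2, even though X and Y are dependent
```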

Now if $g$ is the product function $(x_1, x_2, \dots, x_n) \mapsto \prod_{i=1}^n x_i$, and moreover the $n$ random variables are independent, then:

$$E\left[\prod_{j=1}^n X_j\right] = \sum_{x_1 \in X_1} \dots \sum_{x_n \in X_n} \left(\prod_{j=1}^n x_j\, p_{X_j}(x_j)\right)$$

In the innermost product, the first $n-1$ terms are independent of the last summation, which sums only over the possible values of the random variable $X_n$; hence, we can factor them out:

$$E\left[\prod_{j=1}^n X_j\right] = \sum_{x_1 \in X_1} \dots \sum_{x_{n-1} \in X_{n-1}} \left(\prod_{j=1}^{n-1} x_j\, p_{X_j}(x_j) \sum_{x_n \in X_n} x_n\, p_{X_n}(x_n)\right)$$

equivalently

$$E\left[\prod_{j=1}^n X_j\right] = \sum_{x_1 \in X_1} \dots \sum_{x_{n-1} \in X_{n-1}} \left(\prod_{j=1}^{n-1} x_j\, p_{X_j}(x_j)\right) E[X_n]$$

Continuing this process, we get

$$E\left[\prod_{j=1}^n X_j\right] = \prod_{j=1}^n E[X_j]$$
whenever the $n$ random variables are independent.
Moreover, if $X_1, X_2, \dots, X_n$ are independent, then $g_1(X_1), g_2(X_2), \dots, g_n(X_n)$ are also independent. I mean, no matter what value $X_j$ takes, it gives me no new information about $X_k$, hence I get no new information about $g_k(X_k)$; therefore $g_j(X_j)$ cannot give new information about $g_k(X_k)$ either (this is a very informal view, we will get to information theory later). Hence,
$$E\left[\prod_{j=1}^n g_j(X_j)\right] = \prod_{j=1}^n E[g_j(X_j)]$$
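A sketch checking $E[\prod_j X_j] = \prod_j E[X_j]$ for two independent random variables given by invented pmfs; independence enters through the factorized joint pmf:

```python
from fractions import Fraction
from itertools import product

# Two invented random variables, assumed independent, given by their pmfs.
pmf_X = {1: Fraction(1, 2), 2: Fraction(1, 2)}
pmf_Y = {0: Fraction(1, 5), 3: Fraction(4, 5)}

def E(pmf):
    return sum(x * p for x, p in pmf.items())

# Under independence the joint pmf factors: p_{X,Y}(x, y) = p_X(x) p_Y(y),
# so E[XY] is a double sum of x * y * p_X(x) * p_Y(y).
e_product = sum(x * y * px * py
                for (x, px), (y, py) in product(pmf_X.items(), pmf_Y.items()))

print(e_product)             # 18/5
print(E(pmf_X) * E(pmf_Y))   # 18/5 as well
```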

Notice that for variance, $\mathrm{var}(X+Y) = E[(X+Y-E[X+Y])^2] = E[(X+Y)^2] - (E[X+Y])^2$. Expanding, we get

$$\mathrm{var}(X+Y) = E[X^2] + E[Y^2] + 2E[XY] - \left(E[X]^2 + E[Y]^2 + 2E[X]E[Y]\right)$$

Now, if $X, Y$ are independent, $E[X]E[Y] = E[XY]$. Therefore,

$$\mathrm{var}(X+Y) = E[X^2] - E[X]^2 + E[Y^2] - E[Y]^2 = \mathrm{var}(X) + \mathrm{var}(Y)$$

Indeed, by induction, if $X_i$ for $i = 1 \dots n$ are all independent of each other, $\mathrm{var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{var}(X_i)$.
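A simulation sketch of variance additivity for independent random variables; the choice of a fair die plus a Bernoulli(0.3) indicator is arbitrary:

```python
import random
import statistics

rng = random.Random(0)
N = 200_000

# Two independent samples: a fair six-sided die and a Bernoulli(0.3) indicator.
xs = [rng.randint(1, 6) for _ in range(N)]
ys = [1 if rng.random() < 0.3 else 0 for _ in range(N)]

var_of_sum = statistics.pvariance([x + y for x, y in zip(xs, ys)])
sum_of_vars = statistics.pvariance(xs) + statistics.pvariance(ys)

print(var_of_sum)    # ~ 35/12 + 0.21 ~ 3.13
print(sum_of_vars)   # nearly the same value
```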

The binomial distribution is the distribution of the random variable $X$ which counts the number of successes in $n$ independent trials, where the probability of success is $p$ and of failure is $(1-p)$. We can use the indicator variable trick.
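A sketch of the indicator-variable trick for the binomial: writing $X = \sum_{i=1}^n I_i$ with $E[I_i] = p$ and using linearity gives $E[X] = np$, which a simulation reproduces (the values of $n$ and $p$ below are arbitrary):

```python
import random

rng = random.Random(0)
n, p = 20, 0.3

def binomial_sample():
    # X is the sum of n indicator variables, one per independent trial.
    return sum(1 if rng.random() < p else 0 for _ in range(n))

trials = 50_000
mean = sum(binomial_sample() for _ in range(trials)) / trials
print(mean)   # close to n * p = 6.0
```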
Support/Figures/Pasted image 20250115222017.png
The good thing about expected value is that it doesn't care about independence.
Support/Figures/Pasted image 20250115222359.png
In the above problem, the $X_i$, $i = 1 \dots n$, are not independent. If I tell you that each $X_i$ for $i = 1 \dots n-1$ is 1, that means the last remaining hat is the hat of the $n$-th person, so it does change the probability of the $n$-th person finding their hat, from $1/n$ given no information to $1$ given this piece of information.
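A simulation sketch of the hat problem discussed above: with indicators $X_i$ = "person $i$ gets their own hat", linearity gives $E[\sum_i X_i] = n \cdot \frac{1}{n} = 1$ even though the $X_i$ are dependent (the value of $n$ is arbitrary):

```python
import random

rng = random.Random(0)
n = 10            # number of people / hats
trials = 100_000

def matches():
    hats = list(range(n))
    rng.shuffle(hats)   # hand the hats back uniformly at random
    return sum(1 for person, hat in enumerate(hats) if person == hat)

average = sum(matches() for _ in range(trials)) / trials
print(average)    # close to 1, regardless of n
```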
Support/Figures/Pasted image 20250115222735.png