Naive Bayes classifier, GDA

GDA (Gaussian Discriminant Analysis)

Now if you have a DISCRIMINATIVE model, you learn: what is the class like, given the features? That is, given an input $x$ (with some underlying distribution), what is the distribution over the classes, conditioned on this input?
So a discriminative model learns $p(Y=y \mid X=x)$, or $p(y \mid x)$. On the other hand, a GENERATIVE model learns: what are the features like, given a class? So we are learning $p(x \mid y)$.

Let us make a tumor classifier using GDA:
The standing assumption is that, conditioned on the tumor being malignant, the distribution of the features is Gaussian, and conditioned on the tumor being benign, the features are also Gaussian: $x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma_0)$. If $Z$ is a multivariate Gaussian random vector with mean $\mu$, then $\Sigma_Z = E[(Z-\mu)(Z-\mu)^T] = E[ZZ^T] - E[Z]E[Z]^T$. Note that if a random variable is vector valued, so is its expected value, so it makes sense to take transposes and so on.
![[Support/Figures/Pasted image 20250204092720.png#invert]]

Since $x \in \mathbb{R}^n$, we have $\mu_0, \mu_1 \in \mathbb{R}^n$ and $\Sigma_0, \Sigma_1 \in \mathbb{R}^{n \times n}$, where these $\Sigma$'s are called covariance matrices. Covariance matrices are positive semi-definite. (A matrix $M$ is positive semi-definite if $M$ is symmetric and $x^T M x \geq 0$ for all $x \neq 0$.)
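For reference (the notes never write it out, but this is the standard form), the multivariate Gaussian density we will plug in below is

$$f(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Big),$$

where $|\Sigma|$ is the determinant of $\Sigma$ (assuming $\Sigma$ is invertible, i.e. positive definite).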
Writing the multivariate Gaussian densities as $f_{X|Y=1}(x;\mu_1,\Sigma_1)$ and $f_{X|Y=0}(x;\mu_0,\Sigma_0)$ (for the malignant and benign feature distributions respectively), and given a training set $(x_i, y_i)_{i=1}^{m}$, we write the joint likelihood as a function of the chosen parameters for both Gaussians, plus a parameter $\phi$ for the distribution of $Y$:

$$L(\mu_0,\mu_1,\Sigma_0,\Sigma_1,\phi)=\prod_{i=1}^{m} p(x_i,y_i;\mu_0,\mu_1,\Sigma_0,\Sigma_1,\phi)=\prod_{i=1}^{m} p(x_i \mid y_i)\,p(y_i)$$

Where:

$$p(x_i \mid y_i)=\begin{cases} f_{X|Y=1}(x_i;\mu_1,\Sigma_1) & y_i=1 \\ f_{X|Y=0}(x_i;\mu_0,\Sigma_0) & y_i=0 \end{cases}$$

and

$$p(y_i)=\phi^{y_i}(1-\phi)^{1-y_i}$$

As is the beautiful case with Gaussians and maximum likelihood, the best estimate for $\mu_0$ is the arithmetic mean of all the feature vectors of benign examples, and the best estimate for $\mu_1$ is the same idea for the malignant ones. The best estimate for $\phi$ is the fraction of examples that are malignant. The best estimates for the covariance matrices are obtained by taking our estimated means, subtracting them from each feature vector of that class, multiplying this difference by its own transpose, and averaging. I will leave the details out.
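For completeness, here is a sketch of those details: the standard maximum-likelihood estimates (written with the indicator function $1\{\cdot\}$, and assuming a separate covariance per class as above) are

$$\phi=\frac{1}{m}\sum_{i=1}^{m} 1\{y_i=1\},\qquad \mu_k=\frac{\sum_{i=1}^{m} 1\{y_i=k\}\,x_i}{\sum_{i=1}^{m} 1\{y_i=k\}},\qquad \Sigma_k=\frac{\sum_{i=1}^{m} 1\{y_i=k\}\,(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_{i=1}^{m} 1\{y_i=k\}},\quad k\in\{0,1\}.$$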

Naive Bayes

Imagine a dataset of $m$ handwritten digit images, each $n \times n$ pixels. For simplicity's sake, assume each pixel is either 0 (fully black) or 1 (fully white). Flatten the digit images into $m$ feature vectors of dimension $d=n^2$ each. Call the input features $x$ and their class labels $y$. We want to learn $p(x \mid y)$ yet again.
The brute-force way of learning this distribution is to look at all possible feature vectors (there are $2^d$ of them) and, for each of them, get the probability under each class; that is, for each feature vector $x$, learn $p(x \mid y)$ for each $y$. So in total we need $10 \cdot 2^{d}$ parameters. One might hope to do better by factoring with the chain rule of probability:

$$p(x_1,x_2,\ldots,x_d \mid y)=p(x_1\mid y)\,p(x_2\mid x_1,y)\cdots p(x_d\mid x_{d-1},\ldots,x_1,y)$$

Does this help? Now, instead of learning the distribution for each feature vector and each label, we have to learn the distribution of the first component for each label, the second component conditioned on the first, and so on. How does this pan out? There are two choices for $x_1$, and for each of the 10 classes we need a probability for each choice (so 20 numbers); then for each of the twenty pairs $(x_1, y)$ we need to learn the probabilities of both choices of $x_2$, and so we go $20 + 40 + 80 + \ldots$ We are still multiplying by two over and over, so this is still just as bad.
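As a quick sanity check on that count (using the same convention as above, one number per value of the new component and per conditioning context):

$$\sum_{j=1}^{d} 10\cdot 2^{j} = 10\,(2^{d+1}-2),$$

which is still $\Theta(2^d)$, the same order as the brute-force table.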

But now we commit a cardinal sin and assume that the components of the feature vector are all conditionally independent given $y$.

Therefore, $p(x\mid y)=\prod_{i=1}^{d} p(x_i\mid y)$. Hence, we can store a matrix $P_{xy}[i][j]\approx p(x_i=1\mid y=j)$ (here the $\approx$ sign simply means that the entry is an estimate).
Then give $p(x_i=0\mid y=k)$ as $1-P_{xy}[i][k]$ whenever needed.

Using our fancy indicator-style notation, $p(x_i\mid y)=\left(P_{xy}[i][y]\right)^{x_i}\left(1-P_{xy}[i][y]\right)^{1-x_i}$.

In addition, we can estimate $p(y)$ by just taking the fraction of occurrences of each class.

To get $P_{xy}[i][k]$, simply count the number of examples whose label is $k$ and whose $i$-th pixel is white ($x_i = 1$), and divide by the total number of examples whose label is $k$. Do this for every pixel–label pair.
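A minimal numpy sketch of this counting, assuming the images are already flattened into a binary matrix `X` of shape `(m, d)` with labels `y` in `{0, ..., 9}` (the names `estimate_naive_bayes`, `P_xy`, and `priors` are mine, not from the notes):

```python
import numpy as np

def estimate_naive_bayes(X, y, num_classes=10):
    """Estimate P_xy[i][k] ~ p(x_i = 1 | y = k) and priors[k] ~ p(y = k) by counting."""
    m, d = X.shape
    P_xy = np.zeros((d, num_classes))
    priors = np.zeros(num_classes)
    for k in range(num_classes):
        X_k = X[y == k]                  # all examples with label k
        priors[k] = len(X_k) / m         # fraction of examples of class k
        P_xy[:, k] = X_k.mean(axis=0)    # fraction of those with pixel i white
    return P_xy, priors
```

In practice one would also smooth these counts so no estimated probability is exactly 0 or 1, but that is beyond what these notes cover.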

The idea with both of these generative models is that $p(y\mid x)$ can be computed at the end using Bayes' rule.
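Concretely (this is not spelled out above, but it is just Bayes' rule with the learned pieces plugged in):

$$p(y\mid x)=\frac{p(x\mid y)\,p(y)}{\sum_{y'} p(x\mid y')\,p(y')},\qquad \hat{y}=\arg\max_{y}\; p(x\mid y)\,p(y),$$

since the denominator does not depend on $y$.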