Now if you have a DISCRIMINATIVE MODEL, you learn: what is the class like, given the features? Right? So that means: given an input (with some underlying distribution), what is the distribution over the classes, conditioned on this input?
So a discriminative model learns $p(y \mid x)$ (in the binary case, $p(y = 1 \mid x)$). On the other hand, a GENERATIVE model learns: what are the features like, given a class? So we are learning $p(x \mid y)$.
Let us make a tumor classifier using GDA:
The standing assumption is that, conditioned on the tumor being malignant, the distribution of the features is Gaussian, and conditioned on the tumor being benign, the features are also Gaussian: $x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma_1)$, $x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma_0)$. If $z$ is a multivariable Gaussian rv, then $z \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = E[z]$ and $\Sigma = E\!\left[(z - \mu)(z - \mu)^T\right]$. Note that if a rv is vector valued, so is its expected value, so it makes sense to take transposes and so on.
![[Support/Figures/Pasted image 20250204092720.png#invert]]
But since $x \in \mathbb{R}^n$, we have $\mu_0, \mu_1 \in \mathbb{R}^n$ and $\Sigma_0, \Sigma_1 \in \mathbb{R}^{n \times n}$, where these sigmas are called covariance matrices. These matrices are positive semi-definite. (A matrix $A$ is positive semi-definite if $A$ is symmetric and $v^T A v \geq 0$ for all $v$.)
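That condition is easy to sanity-check numerically. Here is a small numpy sketch (the helper name `is_psd` is my own), using the fact that, for a symmetric matrix, $v^T A v \geq 0$ for all $v$ is equivalent to all eigenvalues being non-negative:

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """A is PSD iff it is symmetric and all its eigenvalues are >= 0
    (equivalent to v^T A v >= 0 for every v)."""
    return np.allclose(A, A.T) and np.all(np.linalg.eigvalsh(A) >= -tol)

# Any matrix built as an average of outer products d d^T (a sample covariance) is PSD:
X = np.random.randn(50, 3)
D = X - X.mean(axis=0)
Sigma = D.T @ D / len(D)
print(is_psd(Sigma))  # True
```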
Writing the multivariate Gaussian densities $p(x \mid y = 1; \mu_1, \Sigma_1)$ and $p(x \mid y = 0; \mu_0, \Sigma_0)$ (for the malignant and benign feature-distributions resp.), given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, we write the joint likelihood as a function of the chosen parameters for both Gaussians, and a parameter $\phi$ for the distribution of $y$; we have
$$\mathcal{L}(\phi, \mu_0, \mu_1, \Sigma_0, \Sigma_1) = \prod_{i=1}^{m} p\!\left(x^{(i)}, y^{(i)}\right) = \prod_{i=1}^{m} p\!\left(x^{(i)} \mid y^{(i)}\right) p\!\left(y^{(i)}\right)$$
Where:
$$p\!\left(x^{(i)} \mid y^{(i)} = k\right) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}\left(x^{(i)} - \mu_k\right)^T \Sigma_k^{-1} \left(x^{(i)} - \mu_k\right)\right), \qquad k \in \{0, 1\},$$
and
$$p\!\left(y^{(i)}\right) = \phi^{y^{(i)}} (1 - \phi)^{1 - y^{(i)}}$$
As is the beautiful case with Gaussians and maximizing likelihoods, the best estimate for $\mu_0$ is the arithmetic mean of all the feature vectors which are benign; the best estimate for $\mu_1$ is the same idea with the malignant ones. The best estimate for $\phi$ is the fraction of examples that are malignant. The best estimates for the covariance matrices are obtained by taking our estimated means, subtracting them from each feature vector of that class, multiplying this difference with its own transpose, and then averaging it out over the class. I will leave the details out.
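A minimal numpy sketch of these maximum-likelihood estimates, just to make the recipe above concrete (the names `X`, `y`, `fit_gda` are mine, and I assume labels are coded $1$ = malignant, $0$ = benign as above):

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates for GDA with class-specific covariances.

    X : (m, n) array of feature vectors
    y : (m,) array of labels, 1 = malignant, 0 = benign
    """
    phi = y.mean()                       # fraction of malignant examples
    mu0 = X[y == 0].mean(axis=0)         # mean of the benign feature vectors
    mu1 = X[y == 1].mean(axis=0)         # mean of the malignant feature vectors

    # Covariances: average (x - mu)(x - mu)^T over each class separately.
    D0 = X[y == 0] - mu0
    D1 = X[y == 1] - mu1
    Sigma0 = D0.T @ D0 / len(D0)
    Sigma1 = D1.T @ D1 / len(D1)
    return phi, mu0, mu1, Sigma0, Sigma1
```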
Naive Bayes
Imagine a dataset of handwritten digit images. For simplicity's sake, assume each pixel is either $0$ (fully black) or $1$ (fully white). Flatten the digit images into feature vectors of dimension $n$ each (e.g. $n = 784$ for $28 \times 28$ images). Call the input features $x \in \{0,1\}^n$ and their class labels $y \in \{0, 1, \dots, 9\}$. We want to learn $p(x \mid y)$ yet again.
The brute-force way of learning this distribution is to look at all possible feature vectors (there are $2^n$ of them) and, for each of them, get the probability for each class; that is, for each feature vector $x$, learn $p(x \mid y = k)$ for each $k$. So in total we need $10 \cdot (2^n - 1)$ parameters.
Another idea is to use the chain rule:
$$p(x \mid y) = p(x_1 \mid y)\, p(x_2 \mid x_1, y) \cdots p(x_n \mid x_1, \dots, x_{n-1}, y).$$
Does this help? Now instead of learning the distribution for each feature vector and each label, we have to learn the distribution of the first component for each label, the second component conditioned on the first (and the label), and so on. How does this pan out? There are two choices for $x_1$ and we need to learn a conditional probability for each of the 10 labels (so $2 \times 10 = 20$ numbers); then for each of the twenty pairs of $(x_1, y)$ we need to learn the probabilities of both choices of $x_2$, so we go to $20 \cdot 2$, then $20 \cdot 2 \cdot 2$, and so on: we are still multiplying by two over and over. This is still just as bad.
But now we commit a cardinal sin and assume that the components of the feature vector are all conditionally independent given $y$.
Therefore, $p(x \mid y) = \prod_{j=1}^{n} p(x_j \mid y)$. Hence, we can store a matrix $\phi \in \mathbb{R}^{n \times 10}$ with entries $\phi_{j,k} \approx p(x_j = 1 \mid y = k)$. (Here the $\approx$ sign simply means estimation.)
Then give $p(x_j = 0 \mid y = k)$ as $1 - \phi_{j,k}$ whenever needed.
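To see what this sin buys us, here is a tiny sketch comparing parameter counts; the concrete numbers $n = 784$ and $K = 10$ (i.e. $28 \times 28$ binary images, ten digit classes) are my assumption for illustration:

```python
# Rough parameter counts for modelling p(x | y) over n binary pixels and K classes.
n, K = 784, 10                 # assumed: 28x28 binary images, 10 digit classes

full_joint = K * (2**n - 1)    # a full table over {0,1}^n for each class
naive_bayes = K * n            # one phi_{j,k} per (pixel, class) pair
                               # (plus K - 1 numbers for the class prior p(y = k))

print("full joint :", full_joint)   # astronomically large (a 200+ digit number)
print("naive Bayes:", naive_bayes)  # 7840
```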
Using our fancy indicator-style function,
$$\phi_{j,k} = \frac{\sum_{i=1}^{m} \mathbf{1}\{x_j^{(i)} = 1 \text{ and } y^{(i)} = k\}}{\sum_{i=1}^{m} \mathbf{1}\{y^{(i)} = k\}}.$$
In addition, we can also estimate $p(y = k)$ by just taking the fraction of occurrences of each class.
To get $\phi_{j,k}$, simply count the number of examples where the $j$th pixel is white ($x_j^{(i)} = 1$) among those whose label is $k$, and divide it by the total number of examples whose label is $k$. Do this for all $j$ and $k$.
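A minimal numpy sketch of this counting procedure (the names `X`, `y`, `fit_naive_bayes` are mine; it assumes every class shows up at least once in the training set and does no smoothing):

```python
import numpy as np

def fit_naive_bayes(X, y, K=10):
    """Estimate phi[j, k] ~ p(x_j = 1 | y = k) and the prior p(y = k) by counting.

    X : (m, n) array of binary pixel values (0 or 1)
    y : (m,) array of labels in {0, ..., K - 1}
    """
    m, n = X.shape
    phi = np.zeros((n, K))
    prior = np.zeros(K)
    for k in range(K):
        Xk = X[y == k]               # examples whose label is k
        phi[:, k] = Xk.mean(axis=0)  # fraction of those with pixel j equal to 1
        prior[k] = len(Xk) / m       # fraction of all examples with label k
    return phi, prior
```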
The idea with both of these generative models is that $p(y \mid x)$ can be computed at the end using Bayes' rule.
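Spelled out, for a new feature vector $x$ we compute, for each class $k$,
$$p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\sum_{l} p(x \mid y = l)\, p(y = l)},$$
and predict the $k$ with the largest numerator, since the denominator is the same for every class.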