Naive Bayes classifier, GDA

GDA (Gaussian Discriminant Analysis)

Now if you have a DISCRIMINATIVE model, you learn: what is the class like, given the features? That is, given an input $x$ (with some underlying distribution), what is the distribution over the classes, conditioned on this input?
So a discriminative model learns $p(Y=y \mid X=x)$, or $p(y \mid x)$. On the other hand, a GENERATIVE model learns: what are the features like, given a class? So we are learning $p(x \mid y)$.

Let us make a tumor classifier using GDA:
The standing assumption is that, conditioned on the tumor being malignant, the distribution of the features is Gaussian, and conditioned on the tumor being benign, the features are also Gaussian: $x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma_0)$. If $Z$ is a multivariate Gaussian random vector with mean $\mu$, then $\Sigma_Z = E[(Z-\mu)(Z-\mu)^T] = E[ZZ^T] - E[Z]E[Z]^T$. Note that if a random variable is vector valued, so is its expected value, so it makes sense to take transposes and so on.
![[Support/Figures/Pasted image 20250204092720.png#invert]]

Since $x \in \mathbb{R}^n$, we have $\mu_0, \mu_1 \in \mathbb{R}^n$ and $\Sigma_0, \Sigma_1 \in \mathbb{R}^{n \times n}$, where these $\Sigma$'s are called covariance matrices. Covariance matrices are positive semi-definite. (A matrix $M$ is positive semi-definite if $M$ is symmetric and $x^T M x \geq 0$ for all $x \neq 0$.)
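For reference (the notes never write it out, but this is the standard form), the multivariate Gaussian density we will plug in below is

$$f(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Big),$$

where $|\Sigma|$ is the determinant of $\Sigma$ (assuming $\Sigma$ is invertible, i.e. positive definite).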
Writing the multivariate Gaussian densities as $f_{X|Y=1}(x;\mu_1,\Sigma_1)$ and $f_{X|Y=0}(x;\mu_0,\Sigma_0)$ (for the malignant and benign feature distributions respectively), and given a training set $(x_i, y_i)_{i=1}^{m}$, we write the joint likelihood as a function of the chosen parameters for both Gaussians, plus a parameter $\phi$ for the distribution of $Y$:

$$L(\mu_0,\mu_1,\Sigma_0,\Sigma_1,\phi)=\prod_{i=1}^{m} p(x_i,y_i;\mu_0,\mu_1,\Sigma_0,\Sigma_1,\phi)=\prod_{i=1}^{m} p(x_i \mid y_i)\,p(y_i)$$

Where:

$$p(x_i \mid y_i)=\begin{cases} f_{X|Y=1}(x_i;\mu_1,\Sigma_1) & y_i=1 \\ f_{X|Y=0}(x_i;\mu_0,\Sigma_0) & y_i=0 \end{cases}$$

and

$$p(y_i)=\phi^{y_i}(1-\phi)^{1-y_i}$$

As is the beautiful case with Gaussians and maximum likelihood, the best estimate for $\mu_0$ is the arithmetic mean of all the feature vectors of benign examples, and the best estimate for $\mu_1$ is the same idea for the malignant ones. The best estimate for $\phi$ is the fraction of examples that are malignant. The best estimates for the covariance matrices are obtained by taking our estimated means, subtracting them from each feature vector of that class, multiplying this difference by its own transpose, and averaging. I will leave the details out.
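For completeness, here is a sketch of those details: the standard maximum-likelihood estimates (written with the indicator function $1\{\cdot\}$, and assuming a separate covariance per class as above) are

$$\phi=\frac{1}{m}\sum_{i=1}^{m} 1\{y_i=1\},\qquad \mu_k=\frac{\sum_{i=1}^{m} 1\{y_i=k\}\,x_i}{\sum_{i=1}^{m} 1\{y_i=k\}},\qquad \Sigma_k=\frac{\sum_{i=1}^{m} 1\{y_i=k\}\,(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_{i=1}^{m} 1\{y_i=k\}},\quad k\in\{0,1\}.$$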

Naive Bayes

Imagine a dataset of $m$ handwritten digit images, each $n \times n$ pixels. For simplicity's sake, assume each pixel is either 0 (fully black) or 1 (fully white). Flatten the digit images into $m$ feature vectors of dimension $d=n^2$ each. Call the input features $x$ and their class labels $y$. We want to learn $p(x \mid y)$ yet again.
The brute-force way of learning this distribution is to look at all possible feature vectors (there are $2^d$ of them) and, for each of them, get the probability under each class; that is, for each feature vector $x$, learn $p(x \mid y)$ for each $y$. So in total we need $10 \cdot 2^{d}$ parameters. One might hope to do better by factoring with the chain rule of probability:

$$p(x_1,x_2,\ldots,x_d \mid y)=p(x_1\mid y)\,p(x_2\mid x_1,y)\cdots p(x_d\mid x_{d-1},\ldots,x_1,y)$$

Does this help? Now, instead of learning the distribution for each feature vector and each label, we have to learn the distribution of the first component for each label, the second component conditioned on the first, and so on. How does this pan out? There are two choices for $x_1$, and for each of the 10 classes we need a probability for each choice (so 20 numbers); then for each of the twenty pairs $(x_1, y)$ we need to learn the probabilities of both choices of $x_2$, and so we go $20 + 40 + 80 + \ldots$ We are still multiplying by two over and over, so this is still just as bad.
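As a quick sanity check on that count (using the same convention as above, one number per value of the new component and per conditioning context):

$$\sum_{j=1}^{d} 10\cdot 2^{j} = 10\,(2^{d+1}-2),$$

which is still $\Theta(2^d)$, the same order as the brute-force table.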

But now we commit a cardinal sin and assume that the components of the feature vector are all conditionally independent given $y$.

Therefore, $p(x\mid y)=\prod_{i=1}^{d} p(x_i\mid y)$. Hence, we can store a matrix $P_{xy}[i][j]\approx p(x_i=1\mid y=j)$ (here the $\approx$ sign simply means that the entry is an estimate).
Then give $p(x_i=0\mid y=k)$ as $1-P_{xy}[i][k]$ whenever needed.

Using our fancy indicator-style notation, $p(x_i\mid y)=\left(P_{xy}[i][y]\right)^{x_i}\left(1-P_{xy}[i][y]\right)^{1-x_i}$.

In addition, we can estimate $p(y)$ by just taking the fraction of occurrences of each class.

To get $P_{xy}[i][k]$, simply count the number of examples whose label is $k$ and whose $i$-th pixel is white ($x_i = 1$), and divide by the total number of examples whose label is $k$. Do this for every pixel–label pair.
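A minimal numpy sketch of this counting, assuming the images are already flattened into a binary matrix `X` of shape `(m, d)` with labels `y` in `{0, ..., 9}` (the names `estimate_naive_bayes`, `P_xy`, and `priors` are mine, not from the notes):

```python
import numpy as np

def estimate_naive_bayes(X, y, num_classes=10):
    """Estimate P_xy[i][k] ~ p(x_i = 1 | y = k) and priors[k] ~ p(y = k) by counting."""
    m, d = X.shape
    P_xy = np.zeros((d, num_classes))
    priors = np.zeros(num_classes)
    for k in range(num_classes):
        X_k = X[y == k]                  # all examples with label k
        priors[k] = len(X_k) / m         # fraction of examples of class k
        P_xy[:, k] = X_k.mean(axis=0)    # fraction of those with pixel i white
    return P_xy, priors
```

In practice one would also smooth these counts so no estimated probability is exactly 0 or 1, but that is beyond what these notes cover.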

The idea with both of these generative models is that $p(y\mid x)$ can be computed at the end using Bayes' rule.
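Concretely (this is not spelled out above, but it is just Bayes' rule with the learned pieces plugged in):

$$p(y\mid x)=\frac{p(x\mid y)\,p(y)}{\sum_{y'} p(x\mid y')\,p(y')},\qquad \hat{y}=\arg\max_{y}\; p(x\mid y)\,p(y),$$

since the denominator does not depend on $y$.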