The idea of Gaussian discriminant analysis is that we use the Bayes theorem to compute the posterior probability \(P(Y=c|X=x)\). We assume the underlying distribution of \(X\) for each class is a normal distribution. According to Bayes theorem, we can calculate the posterior probability as follow:
\[P(Y=c|X=x)=\frac{P(X=x|Y=c)P(Y=c)}{\sum_{i=1}^{C}P(X=x|Y=i)P(Y=i)}\]
We first consider that there are only one independent variable \(X\). The assumption of normal distribution means that \[P(X=x|Y=c)=\frac{1}{\sqrt{2\pi}\sigma}\exp{(\frac{-(x-\mu_c)^2}{2\sigma_c^2})}\]
For each class \(c\), we need to estimate the mean, \(\mu_c\), variance, \(\sigma_c^2\), and prior probability, \(\pi_c=P(Y=c)\). Given \(x\), we will choose the class \(c\) that maximize the posterior probability \(P(Y=c|X=x)\) or equivalently \(P(X=x|Y=c)P(Y=c)\) because the denominator of posterior probabilities are the same for all classes. We define \(P(X=x|Y=c)\) as our discrimiant function and since \(\log\) is a monotonically increasing function, it’s equivalent to maximize \(\delta_c(X)=\log{P(X=x|Y=c)P(Y=c)}\).where \(s(\gamma) =\frac{1}{1 + e^{-\gamma}}\) is the logistic function or sigmoid function.
In 2-class case, the decision boundary can be written as \(wx + \alpha=0\) (a linear classifier). For multi-class case, we just choose class \(C\) that maximizes the \(\delta_C=\frac{\mu_c}{\sigma^2}x - \frac{\mu_C^2}{2\sigma^2} + \log{\pi_C}\).
Special Case of LDA
A special cass in 2-class: \(\pi_C = \pi_D\). The decision boundary is \[\delta_C(x) - \delta_D(x) = \frac{\mu_C-\mu_D}{\sigma^2}x - \frac{\mu_C^2-\mu_D^2}{2\sigma^2} = 0\] \[\Rightarrow (\mu_C-\mu_D)x - (\mu_C-\mu_D)\frac{\mu_C+\mu_D}{2} = 0\] As shown in Figure , we simulated data from two normal distribution with mean equal to 1.25 and -1.25 and variance equal to 1. The dashed line is the real decision boundary and the solid line in the left figure is the decision boundary of LDA.
If we don’t have the assumption of the same variance across classes, it becomes QDA. As you can see in the discriminant function, \[\delta_C=-\frac{(x-\mu_C)^2}{2\sigma_C^2}-\log\sigma_C + \log \pi_C\] which is quadratic in \(x\). For multi-class case, the decision rule is the same as the one in LDA. We choose class \(C\) that maximizes the discriminant function.
Recall that \(X\sim N(\mu, \Sigma)\) and the density function is
\[p(x)=\frac{1}{\sqrt{2\pi}^d\sqrt{|\Sigma|}}\exp{(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu))}\]
For multivariate case, the idea is the same as before. The LDA assume that the covariance matrix is the same for all classes while, in QDA, the covariance matrix may be different for each class. We can plug in the density of multivariate normal distribution and find the discriminant function \(\delta_C(X)=\log{P(X=x|Y=C)P(Y=C)}\) for both cases.
Now, we turn to the question that how should we estimate the parameters for each parameter in normal distribution as well as the prior probability? We use maximum likelihood estimate (MLE)! We are not going to prove the MLE for mean and covariance matrix of each class. Instead, we will just summarize them below:
LDA \[\delta_C(X)=\mu_C^T \Sigma^{-1}x - \frac{1}{2}\mu_C^T \Sigma^{-1}\mu_C + \log\pi_C\]
QDA \[\delta_C(X)=-\frac{1}{2}(x-\mu_C)^T\Sigma_C^{-1}(x-\mu_C) - \frac{1}{2}\log{|\Sigma_C|} + \log\pi_C\]
where \(n_c\) is the number of observation in class \(C\) and \(n\) is the total number of observation. For multi-class case, we will choose the class with the largest discriminant function.
library(ISLR)
library(car)
library(MASS)
Default$default <- as.numeric(as.character(
recode(Default$default, "'Yes'='1'; 'No'='0'")))
# LDA
LDA <- lda(default ~ balance + student, data=Default)
# Fitted value
Prediction <- predict(LDA, Default)$class
t <- table(Predict=Prediction, True=Default$default)
# addmargins: comput all margin of the table
# ftable: make the table format nicer
ftable(addmargins(t))
True 0 1 Sum
Predict
0 9644 252 9896
1 23 81 104
Sum 9667 333 10000
# QDA
QDA <- qda(default ~ balance + student, data=Default)
Prediction <- predict(QDA, Default)$class
t2 <- table(Predict=Prediction, True=Default$default)
ftable(addmargins(t2))
True 0 1 Sum
Predict
0 9637 244 9881
1 30 89 119
Sum 9667 333 10000
threshold <- 0.2
Prediction_new <- (predict(LDA, Default)$posterior[, 2] > threshold)*1
t_new <- table(Predict=Prediction_new, True=Default$default)
ftable(addmargins(t_new))
True 0 1 Sum
Predict
0 9432 138 9570
1 235 195 430
Sum 9667 333 10000
There are several measure that can help us to determine the performance of our model or classifier.
Accuracy: \(\dfrac{TP+TN}{TP+FP+TN+FN}\)
Specificity: \(\dfrac{TN}{TN+FP}\)
True Positive Rate (Sensitivity, Recall): \(\dfrac{TP}{TP+FN}\)
False Positive Rate (Type I error, 1 - Specificity): \(\dfrac{FP}{TN+FP}\)
Positive Predicted Value (Precision, 1 - False Discovery Rate): \(\dfrac{TP}{TP+FP}\)
Negative Predicted Value: \(\dfrac{TN}{TN+FN}\)
library(caret)
# Confusion Matrix
confusionMatrix(t, positive="1")
Confusion Matrix and Statistics
True
Predict 0 1
0 9644 252
1 23 81
Accuracy : 0.9725
95% CI : (0.9691, 0.9756)
No Information Rate : 0.9667
P-Value [Acc > NIR] : 0.0004973
Kappa : 0.3606
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.2432
Specificity : 0.9976
Pos Pred Value : 0.7788
Neg Pred Value : 0.9745
Prevalence : 0.0333
Detection Rate : 0.0081
Detection Prevalence : 0.0104
Balanced Accuracy : 0.6204
'Positive' Class : 1
library(AUC)
# ROC and AUC of the classifier
plot(roc(predict(LDA, Default)$posterior[, 2], as.factor(Default$default)),
col="blue")
auc_value <- auc(roc(predict(LDA, Default)$posterior[, 2],
as.factor(Default$default)))
text(0.4, 0.6, paste("AUC = ", round(auc_value, 4)))