Short Term Course on Machine Learning for Practitioners

Ankita Mandal and Aparajita Khan

Indian Statistical Institute

E-mail: amandal@isical.ac.in, aparajitak_r@isical.ac.in

November 19, 2019

Supervised Feature Selection¶

Hypothesis Testing based feature selection using $t$ -Test¶

Two classes:

\begin{aligned} Class ω_{1} with mean μ_{1} variance σ_{1}^{2} \\ Class ω_{2} with mean μ_{2} variance σ_{2}^{2} \end{aligned}

$\begin{align} \text{Class} \space \space \boldsymbol{\omega_1} \space \space \text{with mean} \space \space \mu_1 \space \space \text{variance} \space \space \sigma_1^2 \\ \text{Class} \space \space \boldsymbol{\omega_2} \space \space \text{with mean} \space \space \mu_2 \space \space \text{variance} \space \space \sigma_2^2 \end{align}$

Assumption: $\sigma_1^2 = \sigma_2^2 = \sigma^2$

Hypothesis design for feature selection in classification problem:

\begin{aligned} H_{1} (Difference between class means is non-zero): & μ_{1} - μ_{2} \neq 0 \\ H_{0} (Difference between class means is zero): & μ_{1} - μ_{2} = 0 \end{aligned}

$\begin{align} & H_1 \mbox{ (Difference between class means is non-zero): }& \mu_1 - \mu_2 \neq 0 \\ & H_0 \mbox{ (Difference between class means is zero): }& \mu_1 - \mu_2 = 0 \\ \end{align}$

Feature with higher difference between class means has greater class separability, hence better feature.

N=200
w1<-cbind(rnorm(N,mean=5,sd=1),rnorm(N,mean=4,sd=1))
w2<-cbind(rnorm(N,mean=15,sd=1),rnorm(N,mean=6,sd=1))
X<-rbind(w1,w2)
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N))
colvec = c("coral3","darkseagreen3")[true]
pchs= c(22,24)[true]
plot(X,col="black",bg=colvec,pch=pchs,xlab="Feature1",ylab="Feature2",main="Scatter plot of Data")

 Glimpse of Dataset X: 
         [,1]     [,2]
[1,] 3.789171 4.237898
[2,] 6.264211 4.287199
[3,] 5.703193 3.444035
[4,] 4.090515 2.175635
[5,] 6.070596 5.547517

 Dimension of Dataset: 	 Samples: 400 	 Features: 2

Test Statistic¶

\begin{matrix} (1) & q = \frac{(\bar{x} - \bar{y}) - (μ_{1} - μ_{2})}{s \sqrt{\frac{2}{N}}} \end{matrix}

$\begin{equation} q = \frac{(\bar{x}-\bar{y})-(\mu_1 - \mu_2)}{s \sqrt{\frac{2}{N}}} \tag{1} \end{equation}$

where

\begin{aligned} \bar{x} = \frac{1}{N} \sum_{i = 1}^{N} x_{i}, x_{i} are samples values of the feature in class ω_{1} \\ \bar{y} = \frac{1}{N} \sum_{i = 1}^{N} y_{i}, y_{i} are samples values of the feature in class ω_{2} \\ s^{2} = \frac{1}{2 N - 2} (\sum_{i = 1}^{N} (x_{i} - \bar{x})^{2} + \sum_{i = 1}^{N} (y_{i} - \bar{y})^{2}) class-wise sample variance. \end{aligned}

$\begin{align} & \bar{x}= \frac{1}{N} \sum\limits_{i=1}^{N} x_i, \hspace{1cm} x_i \mbox{ are samples values of the feature in class } \boldsymbol{\omega_1} \\ & \bar{y}= \frac{1}{N} \sum\limits_{i=1}^{N} y_i, \hspace{1cm} y_i \mbox{ are samples values of the feature in class } \boldsymbol{\omega_2} \\ & s^2= \frac{1}{2N-2} \left( \sum\limits_{i=1}^{N} (x_i - \bar{x})^2 + \sum\limits_{i=1}^{N} (y_i - \bar{y})^2\right) \hspace{1cm} \mbox{class-wise sample variance.} \end{align}$

According to null-hypothesis $H_0$ , $\mu_1 - \mu_2=0$

$q$ follows $t$ -distributution with $2N-2$ degrees of freedom.

For Feature 1:¶

f=X[,1]
fw1=f[1:N]
fw2=f[(N+1):(2*N)]
xbar=mean(fw1)
ybar=mean(fw2)
s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
q=(xbar-ybar)/(s*sqrt(2/N))
cat("\n xbar=",xbar," ybar=",ybar)
cat("\n\n Test statistic q= ",q)
cat("\n q follows t- distribution with ",2*N-2," degrees of freedom")

 xbar= 4.949148  ybar= 15.03573

 Test statistic q=  -7.19591
 q follows t- distribution with  398  degrees of freedom

Compute $p$ value

If p-value < 0.05: Feature has signficant difference between the class means

If p-value $\geq$ 0.05: Difference between class means along the feature is not significant

p.val=pt(q,2*N-2)
cat("\n p-value for hypothesis test on Feature 1= ",p.val)
alpha=0.05
if(p.val< alpha){
    cat("\n Feature is relevant")  
}else     
    cat("\n Feature  is not relevant, lacks class separability")

 p-value for hypothesis test on Feature 1=  1.5546e-12
 Feature is relevant

For Feature 2:¶

f=X[,2]
fw1=f[1:N]
fw2=f[(N+1):(2*N)]
xbar=mean(fw1)
ybar=mean(fw2)
s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
q=(xbar-ybar)/(s*sqrt(2/N))
cat("\n xbar=",xbar," ybar=",ybar)
cat("\n\n Test statistic q= ",q)
p.val=pt(q,2*N-2)
cat("\n p-value for hypothesis test on Feature 2= ",p.val)
if(p.val< alpha){
    cat("\n Feature is relevant")  
}else     
    cat("\n Feature  is not relevant, lacks class separability")

 xbar= 4.016379  ybar= 6.02883

 Test statistic q=  -1.330049
 p-value for hypothesis test on Feature 2=  0.09213189
 Feature  is not relevant, lacks class separability

Selection of relevant features from multiple features¶

N=200
w1<-cbind(rnorm(N,mean=4,sd=1),rnorm(N,mean=7,sd=1),rnorm(N,mean=10,sd=1),rnorm(N,mean=-3.5,sd=1))
w2<-cbind(rnorm(N,mean=7,sd=1),rnorm(N,mean=16,sd=1),rnorm(N,mean=12,sd=1),rnorm(N,mean=2,sd=1))
X<-rbind(w1,w2)
colnames(X) <- c("Feature1","Feature2","Feature3","Feature4")
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N))
colvec = c("deepskyblue3","orange2")[true]
pchs= c(22,24)[true]
pairs(X, col=colvec, pch=pchs)

 Glimpse of Dataset X: 
     Feature1 Feature2  Feature3  Feature4
[1,] 3.174616 7.783940  9.648941 -3.437333
[2,] 2.675435 7.345537  8.887468 -4.900301
[3,] 4.798304 6.754174 10.188016 -4.987679
[4,] 3.822017 5.577670 10.504906 -3.293340
[5,] 3.950374 7.028291  9.211185 -2.662460

 Dimension of Dataset: 	 Samples: 400 	 Features: 4

d=ncol(X)
q.val=rep(0,d)
for(i in 1:d)
{
    f=X[,i]
    fw1=f[1:N]
    fw2=f[(N+1):(2*N)]
    xbar=mean(fw1)
    ybar=mean(fw2)
    s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
    q=(xbar-ybar)/(s*sqrt(2/N))
    p.val=pt(q,2*N-2)
    cat("\n\n Test statistic q for Feature ",i," = ",q,"p-value=",p.val)
    q.val[i]=q
}
ord=sort(abs(q.val),decreasing=TRUE,index.return=TRUE)$ix
cat("\n\n Ordering of features based on t-Test=",ord)

 Test statistic q for Feature  1  =  -2.250143 p-value= 0.0124927

 Test statistic q for Feature  2  =  -6.566627 p-value= 8.052884e-11

 Test statistic q for Feature  3  =  -1.322181 p-value= 0.09343364

 Test statistic q for Feature  4  =  -3.71284 p-value= 0.0001170868

 Ordering of features based on t-Test= 2 4 1 3

Feature selection using Fisher Discriminant Ratio¶

Multi-class case: ¶

Classes = $M$ ¶

\begin{matrix} (2) & F D R = \sum_{i = 1}^{M} \sum_{j \neq i}^{M} \frac{(μ_{i} - μ_{j})^{2}}{σ_{i}^{2} + σ_{j}^{2}} \end{matrix}

$\begin{equation} FDR =\sum\limits_{i=1}^M \sum\limits_{j \neq i}^M \frac{ (\mu_i -\mu_j)^2}{\sigma_i^2+\sigma_j^2} \tag{2} \end{equation}$

cat("IRIS dataset\n")
print(iris[1:5,])
X<-iris[,-5]
class=as.numeric(iris$Species)
cat("\n Samples: ",dim(X)[1],"\t Features: ",dim(X)[2],"\t Classes: 3:- ",paste0(levels(iris$Species),collapse=", "))
colvec = c("coral3","darkseagreen3","darkgoldenrod2")[class]
pchs= c(22,23,24)[class]
pairs(X, col=colvec, pch=pchs)

IRIS dataset
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

 Samples:  150 	 Features:  4 	 Classes: 3:-  setosa, versicolor, virginica

nclass=length(unique(class))
FDR=rep(0,ncol(X))
for(d in 1:ncol(X))
{
    FDRd=0
    f=X[,d]
     for(i in 1:nclass)
     {
         for(j in 1:nclass)
         {
            if(i!=j)
            {
                fi=f[which(class==i)]
                fj=f[which(class==j)]
                FDRd=FDRd+((mean(fi)-mean(fj))^2)/(var(fi)+var(fj))
            }
            
         } 
     }
    cat("\n FDR for Feature ",d," = ",FDRd)
    FDR[d]=FDRd
}
ord=sort(abs(FDR),decreasing=TRUE,index.return=TRUE)$ix
cat("\n\n Ordering of features based on FDR: \n",paste0(colnames(X)[ord],collapse=", "))

 FDR for Feature  1  =  15.16455
 FDR for Feature  2  =  5.651219
 FDR for Feature  3  =  168.686
 FDR for Feature  4  =  128.2398

 Ordering of features based on FDR: 
 Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

Feature selection using Scatter Matrices¶

Generate Dataset ¶

N=250
w1<-cbind(rnorm(N,mean=5,sd=1.5),rnorm(N,mean=4,sd=1.2))
w2<-cbind(rnorm(N,mean=9,sd=1.3),rnorm(N,mean=8,sd=1.8))
w3<-cbind(rnorm(N,mean=15,sd=1),rnorm(N,mean=1,sd=1))
X<-rbind(w1,w2,w3)
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N),rep(3,N))
colvec = c("lightpink2","turquoise2","darkolivegreen")[true]
pchs= c(22,24,21)[true]
plot(X,col="black",bg=colvec,pch=pchs,xlab="Feature1",ylab="Feature2",main="Scatter plot of Data")

 Glimpse of Dataset X: 
         [,1]     [,2]
[1,] 3.596704 2.816212
[2,] 5.376808 2.965242
[3,] 5.921716 5.442881
[4,] 4.947277 4.982256
[5,] 4.056940 5.642649

 Dimension of Dataset: 	 Samples: 750 	 Features: 2

$M$ classes: $\space \space \boldsymbol{\omega_1}, \space \boldsymbol{\omega_2}, \space ...\space, \space \boldsymbol{\omega_M}$

Class-wise prior probability:

P (ω_{i}) ≃ \frac{N_{i}}{N}

$\begin{equation} P(\boldsymbol{\omega_i}) \simeq \frac{N_i}{N} \end{equation}$

N_{i}

$\space N_i$ number of samples in class

ω_{i}

$\boldsymbol{\omega_i}$ out of a total of

N

$N$ samples.

Class specific covariance matrix:

Σ_{i} = E [(x - μ_{i}) (x - μ_{i})^{T}]

$\begin{equation} \Sigma_i=E[(x-\mu_i)(x-\mu_i)^T] \end{equation}$

Global mean vector:

μ_{0} = \sum_{i = 1}^{M} P (ω_{i}) μ_{i}

$\begin{equation} \mu_0= \sum \limits_{i=1}^M P(\boldsymbol{\omega_i}) \mu_i \end{equation}$

N=nrow(X)
nclass=length(unique(class))
mu0<-colMeans(X)
S<-list()
Mu<-list()
P<-list()
for(i in 1:nclass)
{
    wi=X[which(class==i),]
    Ni=nrow(wi)
    P[[i]]<-Ni/N
    Mu[[i]]<-colMeans(wi)
    S[[i]]<-cov(wi)
    cat("\n\n Class ",i,": ")
    cat("\n Prior P: ",P[[i]])
    cat("\n Mean Mu: ",Mu[[i]])
    cat("\n Covariance Matrix: \n")
    print(S[[i]])
}

 Class  1 : 
 Prior P:  0.06666667
 Mean Mu:  5.114443 4.004758
 Covariance Matrix: 
           [,1]       [,2]
[1,]  1.9929670 -0.4663106
[2,] -0.4663106  1.7277439


 Class  2 : 
 Prior P:  0.06666667
 Mean Mu:  5.12744 3.734254
 Covariance Matrix: 
          [,1]      [,2]
[1,] 2.0348658 0.1327746
[2,] 0.1327746 1.4389839


 Class  3 : 
 Prior P:  0.06666667
 Mean Mu:  5.376476 3.849957
 Covariance Matrix: 
            [,1]        [,2]
[1,]  2.79877673 -0.06810733
[2,] -0.06810733  1.31435905

Within-class scatter matrix ¶

\begin{matrix} (3) & S_{w} = \sum_{i = 1}^{M} P (ω_{i}) Σ_{i} \end{matrix}

$\begin{equation} S_w =\sum\limits_{i=1}^M P(\boldsymbol{\omega_i}) \Sigma_i \tag{3} \end{equation}$

Between-class scatter matrix ¶

\begin{matrix} (4) & S_{b} = \sum_{i = 1}^{M} P (ω_{i}) (μ_{i} - μ_{0}) (μ_{i} - μ_{0})^{T} \end{matrix}

$\begin{equation} S_b=\sum \limits_{i=1}^M P(\boldsymbol{\omega_i}) (\mu_i-\mu_0)(\mu_i-\mu_0)^T \tag{4} \end{equation}$

Mixture scatter matrix ¶

\begin{matrix} (5) & S_{m} = E [(x - μ_{0}) (x - μ_{0})^{T}] = S_{w} + S_{b} \end{matrix}

$\begin{equation} S_m=E[(x-\mu_0)(x-\mu_0)^T]=S_w+S_b \tag{5} \end{equation}$

Sw=0
Sb=0
for(i in 1:nclass)
{
    mui0=Mu[[i]]-mu0
    Sw=Sw+ P[[i]]*S[[i]]
    Sb=Sb+ P[[i]]*outer(mui0,mui0)
}
Sm=Sw+Sb
cat("\n\n Within-class scatter matrix: \n")
print(Sw)
cat("\n\n Between-class scatter matrix: \n")
print(Sb)
cat("\n\n Mixture scatter matrix: \n")
print(Sm)

 Within-class scatter matrix: 
            [,1]        [,2]
[1,]  0.45510730 -0.02677622
[2,] -0.02677622  0.29873912


 Between-class scatter matrix: 
          [,1]      [,2]
[1,] 4.0280189 0.3458786
[2,] 0.3458786 0.0322358


 Mixture scatter matrix: 
          [,1]      [,2]
[1,] 4.4831262 0.3191024
[2,] 0.3191024 0.3309749

Class separability measures¶

\begin{matrix} (5) & J_{1} = \frac{t r a c e {S_{m}}}{t r a c e {S_{w}}} \end{matrix}

$\begin{equation} J_1=\frac{trace\{S_m\}}{trace\{S_w\}} \tag{5} \end{equation}$

\begin{matrix} (6) & J_{2} = \frac{| S_{m} |}{| S_{w} |} = | S_{w}^{- 1} S_{m} | \end{matrix}

$\begin{equation} J_2=\frac{\lvert S_m \rvert}{\lvert S_w \rvert}=\lvert S_w^{-1} S_m \rvert \tag{6} \end{equation}$

\begin{matrix} (7) & J_{3} = t r a c e {S_{w}^{- 1} S_{m}} \end{matrix}

$\begin{equation} J_3=trace\{ S_w^{-1} S_m \} \tag{7} \end{equation}$

$\lvert A \rvert$ denotes determinant of matrix $A$ .

J1=sum(diag(Sm))/sum(diag(Sw))
J2=det(solve(Sw)%*%Sm)
J3=sum(diag(solve(Sw)%*%Sm))

cat("\n J1= ",J1,"\n J2= ",J2,"\n J3= ",J3)

 J1=  6.38605 
 J2=  10.21859 
 J3=  11.14306