Short Term Course on Machine Learning for Practitioners

Ankita Mandal and Aparajita Khan

Indian Statistical Institute

E-mail: amandal@isical.ac.in, aparajitak_r@isical.ac.in

November 19, 2019

Supervised Feature SelectionΒΆ

Hypothesis Testing based feature selection using t-TestΒΆ

Two classes:

Class  Ο‰1  with mean  ΞΌ1  variance  Οƒ12Class  Ο‰2  with mean  ΞΌ2  variance  Οƒ22

Assumption: Οƒ12=Οƒ22=Οƒ2

Hypothesis design for feature selection in classification problem:

H1 (Difference between class means is non-zero): ΞΌ1βˆ’ΞΌ2β‰ 0H0 (Difference between class means is zero): ΞΌ1βˆ’ΞΌ2=0

Feature with higher difference between class means has greater class separability, hence better feature.

In [1]:
N=200
w1<-cbind(rnorm(N,mean=5,sd=1),rnorm(N,mean=4,sd=1))
w2<-cbind(rnorm(N,mean=15,sd=1),rnorm(N,mean=6,sd=1))
X<-rbind(w1,w2)
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N))
colvec = c("coral3","darkseagreen3")[true]
pchs= c(22,24)[true]
plot(X,col="black",bg=colvec,pch=pchs,xlab="Feature1",ylab="Feature2",main="Scatter plot of Data")
 Glimpse of Dataset X: 
         [,1]     [,2]
[1,] 3.789171 4.237898
[2,] 6.264211 4.287199
[3,] 5.703193 3.444035
[4,] 4.090515 2.175635
[5,] 6.070596 5.547517

 Dimension of Dataset: 	 Samples: 400 	 Features: 2

Test StatisticΒΆ

(1)q=(xΒ―βˆ’yΒ―)βˆ’(ΞΌ1βˆ’ΞΌ2)s2N

where

xΒ―=1Nβˆ‘i=1Nxi,xi are samples values of the feature in class Ο‰1yΒ―=1Nβˆ‘i=1Nyi,yi are samples values of the feature in class Ο‰2s2=12Nβˆ’2(βˆ‘i=1N(xiβˆ’xΒ―)2+βˆ‘i=1N(yiβˆ’yΒ―)2)class-wise sample variance.

According to null-hypothesis H0, ΞΌ1βˆ’ΞΌ2=0

q follows t-distributution with 2Nβˆ’2 degrees of freedom.

For Feature 1:ΒΆ

In [2]:
f=X[,1]
fw1=f[1:N]
fw2=f[(N+1):(2*N)]
xbar=mean(fw1)
ybar=mean(fw2)
s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
q=(xbar-ybar)/(s*sqrt(2/N))
cat("\n xbar=",xbar," ybar=",ybar)
cat("\n\n Test statistic q= ",q)
cat("\n q follows t- distribution with ",2*N-2," degrees of freedom")
 xbar= 4.949148  ybar= 15.03573

 Test statistic q=  -7.19591
 q follows t- distribution with  398  degrees of freedom

Compute p value

If p-value < 0.05: Feature has signficant difference between the class means

If p-value β‰₯ 0.05: Difference between class means along the feature is not significant

In [3]:
p.val=pt(q,2*N-2)
cat("\n p-value for hypothesis test on Feature 1= ",p.val)
alpha=0.05
if(p.val< alpha){
    cat("\n Feature is relevant")  
}else     
    cat("\n Feature  is not relevant, lacks class separability")
 p-value for hypothesis test on Feature 1=  1.5546e-12
 Feature is relevant

For Feature 2:ΒΆ

In [4]:
f=X[,2]
fw1=f[1:N]
fw2=f[(N+1):(2*N)]
xbar=mean(fw1)
ybar=mean(fw2)
s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
q=(xbar-ybar)/(s*sqrt(2/N))
cat("\n xbar=",xbar," ybar=",ybar)
cat("\n\n Test statistic q= ",q)
p.val=pt(q,2*N-2)
cat("\n p-value for hypothesis test on Feature 2= ",p.val)
if(p.val< alpha){
    cat("\n Feature is relevant")  
}else     
    cat("\n Feature  is not relevant, lacks class separability")
 xbar= 4.016379  ybar= 6.02883

 Test statistic q=  -1.330049
 p-value for hypothesis test on Feature 2=  0.09213189
 Feature  is not relevant, lacks class separability

Selection of relevant features from multiple featuresΒΆ

In [5]:
N=200
w1<-cbind(rnorm(N,mean=4,sd=1),rnorm(N,mean=7,sd=1),rnorm(N,mean=10,sd=1),rnorm(N,mean=-3.5,sd=1))
w2<-cbind(rnorm(N,mean=7,sd=1),rnorm(N,mean=16,sd=1),rnorm(N,mean=12,sd=1),rnorm(N,mean=2,sd=1))
X<-rbind(w1,w2)
colnames(X) <- c("Feature1","Feature2","Feature3","Feature4")
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N))
colvec = c("deepskyblue3","orange2")[true]
pchs= c(22,24)[true]
pairs(X, col=colvec, pch=pchs)
 Glimpse of Dataset X: 
     Feature1 Feature2  Feature3  Feature4
[1,] 3.174616 7.783940  9.648941 -3.437333
[2,] 2.675435 7.345537  8.887468 -4.900301
[3,] 4.798304 6.754174 10.188016 -4.987679
[4,] 3.822017 5.577670 10.504906 -3.293340
[5,] 3.950374 7.028291  9.211185 -2.662460

 Dimension of Dataset: 	 Samples: 400 	 Features: 4
In [6]:
d=ncol(X)
q.val=rep(0,d)
for(i in 1:d)
{
    f=X[,i]
    fw1=f[1:N]
    fw2=f[(N+1):(2*N)]
    xbar=mean(fw1)
    ybar=mean(fw2)
    s=sqrt(sum((fw1-xbar)^2)+sum((fw2-ybar)^2)/(2*N-2))
    q=(xbar-ybar)/(s*sqrt(2/N))
    p.val=pt(q,2*N-2)
    cat("\n\n Test statistic q for Feature ",i," = ",q,"p-value=",p.val)
    q.val[i]=q
}
ord=sort(abs(q.val),decreasing=TRUE,index.return=TRUE)$ix
cat("\n\n Ordering of features based on t-Test=",ord)
 Test statistic q for Feature  1  =  -2.250143 p-value= 0.0124927

 Test statistic q for Feature  2  =  -6.566627 p-value= 8.052884e-11

 Test statistic q for Feature  3  =  -1.322181 p-value= 0.09343364

 Test statistic q for Feature  4  =  -3.71284 p-value= 0.0001170868

 Ordering of features based on t-Test= 2 4 1 3

Feature selection using Fisher Discriminant RatioΒΆ

Multi-class case: ΒΆ

Classes = MΒΆ

(2)FDR=βˆ‘i=1Mβˆ‘jβ‰ iM(ΞΌiβˆ’ΞΌj)2Οƒi2+Οƒj2
In [7]:
cat("IRIS dataset\n")
print(iris[1:5,])
X<-iris[,-5]
class=as.numeric(iris$Species)
cat("\n Samples: ",dim(X)[1],"\t Features: ",dim(X)[2],"\t Classes: 3:- ",paste0(levels(iris$Species),collapse=", "))
colvec = c("coral3","darkseagreen3","darkgoldenrod2")[class]
pchs= c(22,23,24)[class]
pairs(X, col=colvec, pch=pchs)
IRIS dataset
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

 Samples:  150 	 Features:  4 	 Classes: 3:-  setosa, versicolor, virginica
In [8]:
nclass=length(unique(class))
FDR=rep(0,ncol(X))
for(d in 1:ncol(X))
{
    FDRd=0
    f=X[,d]
     for(i in 1:nclass)
     {
         for(j in 1:nclass)
         {
            if(i!=j)
            {
                fi=f[which(class==i)]
                fj=f[which(class==j)]
                FDRd=FDRd+((mean(fi)-mean(fj))^2)/(var(fi)+var(fj))
            }
            
         } 
     }
    cat("\n FDR for Feature ",d," = ",FDRd)
    FDR[d]=FDRd
}
ord=sort(abs(FDR),decreasing=TRUE,index.return=TRUE)$ix
cat("\n\n Ordering of features based on FDR: \n",paste0(colnames(X)[ord],collapse=", "))
 FDR for Feature  1  =  15.16455
 FDR for Feature  2  =  5.651219
 FDR for Feature  3  =  168.686
 FDR for Feature  4  =  128.2398

 Ordering of features based on FDR: 
 Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

Feature selection using Scatter MatricesΒΆ

Generate Dataset ΒΆ

In [9]:
N=250
w1<-cbind(rnorm(N,mean=5,sd=1.5),rnorm(N,mean=4,sd=1.2))
w2<-cbind(rnorm(N,mean=9,sd=1.3),rnorm(N,mean=8,sd=1.8))
w3<-cbind(rnorm(N,mean=15,sd=1),rnorm(N,mean=1,sd=1))
X<-rbind(w1,w2,w3)
cat(" Glimpse of Dataset X: \n")
print(X[1:5,])
cat("\n Dimension of Dataset: \t Samples:",dim(X)[1],"\t Features:",dim(X)[2])
true=c(rep(1,N),rep(2,N),rep(3,N))
colvec = c("lightpink2","turquoise2","darkolivegreen")[true]
pchs= c(22,24,21)[true]
plot(X,col="black",bg=colvec,pch=pchs,xlab="Feature1",ylab="Feature2",main="Scatter plot of Data")
 Glimpse of Dataset X: 
         [,1]     [,2]
[1,] 3.596704 2.816212
[2,] 5.376808 2.965242
[3,] 5.921716 5.442881
[4,] 4.947277 4.982256
[5,] 4.056940 5.642649

 Dimension of Dataset: 	 Samples: 750 	 Features: 2

M classes:   Ο‰1, Ο‰2, ... , Ο‰M

Class-wise prior probability:

P(Ο‰i)≃NiN
 Ni number of samples in class Ο‰i out of a total of N samples.

Class specific covariance matrix:

Ξ£i=E[(xβˆ’ΞΌi)(xβˆ’ΞΌi)T]

Global mean vector:

ΞΌ0=βˆ‘i=1MP(Ο‰i)ΞΌi

In [10]:
N=nrow(X)
nclass=length(unique(class))
mu0<-colMeans(X)
S<-list()
Mu<-list()
P<-list()
for(i in 1:nclass)
{
    wi=X[which(class==i),]
    Ni=nrow(wi)
    P[[i]]<-Ni/N
    Mu[[i]]<-colMeans(wi)
    S[[i]]<-cov(wi)
    cat("\n\n Class ",i,": ")
    cat("\n Prior P: ",P[[i]])
    cat("\n Mean Mu: ",Mu[[i]])
    cat("\n Covariance Matrix: \n")
    print(S[[i]])
}
 Class  1 : 
 Prior P:  0.06666667
 Mean Mu:  5.114443 4.004758
 Covariance Matrix: 
           [,1]       [,2]
[1,]  1.9929670 -0.4663106
[2,] -0.4663106  1.7277439


 Class  2 : 
 Prior P:  0.06666667
 Mean Mu:  5.12744 3.734254
 Covariance Matrix: 
          [,1]      [,2]
[1,] 2.0348658 0.1327746
[2,] 0.1327746 1.4389839


 Class  3 : 
 Prior P:  0.06666667
 Mean Mu:  5.376476 3.849957
 Covariance Matrix: 
            [,1]        [,2]
[1,]  2.79877673 -0.06810733
[2,] -0.06810733  1.31435905

Within-class scatter matrix ΒΆ

(3)Sw=βˆ‘i=1MP(Ο‰i)Ξ£i

Between-class scatter matrix ΒΆ

(4)Sb=βˆ‘i=1MP(Ο‰i)(ΞΌiβˆ’ΞΌ0)(ΞΌiβˆ’ΞΌ0)T

Mixture scatter matrix ΒΆ

(5)Sm=E[(xβˆ’ΞΌ0)(xβˆ’ΞΌ0)T]=Sw+Sb
In [11]:
Sw=0
Sb=0
for(i in 1:nclass)
{
    mui0=Mu[[i]]-mu0
    Sw=Sw+ P[[i]]*S[[i]]
    Sb=Sb+ P[[i]]*outer(mui0,mui0)
}
Sm=Sw+Sb
cat("\n\n Within-class scatter matrix: \n")
print(Sw)
cat("\n\n Between-class scatter matrix: \n")
print(Sb)
cat("\n\n Mixture scatter matrix: \n")
print(Sm)
 Within-class scatter matrix: 
            [,1]        [,2]
[1,]  0.45510730 -0.02677622
[2,] -0.02677622  0.29873912


 Between-class scatter matrix: 
          [,1]      [,2]
[1,] 4.0280189 0.3458786
[2,] 0.3458786 0.0322358


 Mixture scatter matrix: 
          [,1]      [,2]
[1,] 4.4831262 0.3191024
[2,] 0.3191024 0.3309749

Class separability measuresΒΆ

(5)J1=trace{Sm}trace{Sw}
(6)J2=|Sm||Sw|=|Swβˆ’1Sm|
(7)J3=trace{Swβˆ’1Sm}

|A| denotes determinant of matrix A.

In [12]:
J1=sum(diag(Sm))/sum(diag(Sw))
J2=det(solve(Sw)%*%Sm)
J3=sum(diag(solve(Sw)%*%Sm))

cat("\n J1= ",J1,"\n J2= ",J2,"\n J3= ",J3)
 J1=  6.38605 
 J2=  10.21859 
 J3=  11.14306