METU EE 583 Lecture Notes by A.Aydin ALATAN © 2014
EE 583
PATTERN RECOGNITION
Statistical Pattern Recognition
Bayes Decision Theory
Supervised Learning
Linear Discriminant Functions
Unsupervised Learning
Supervised Learning
Supervised Learning == Training
Parametric approaches
Maximum likelihood estimation
Bayesian parameter estimation
Non-parametric approaches
Direct pdf (multi-D histogram) estimation
Parzen window pdf estimation
kn-nearest neighbor pdf estimation
Nearest-neighbor rule
Parametric Approaches
“Curse of dimensionality” : We need lots of
training data to determine the completely
unknown statistics for multi-D problems
A rule of thumb : “use at least 10 times as many
training samples per class as the number of
features (i.e. D)”
Hence, with some a priori information, it is
possible to estimate the parameters of a
known distribution form by using fewer
samples
Maximum Likelihood Estimation (1/4)
Assume c sets of samples, drawn according to
p(x | ωj), which has a known parametric form.
e.g. the pdf is known to be Gaussian; the mean & variance values are unknown
Let Θj be the unknown deterministic parameter set of the pdf for class j:

p(x | ωj) = p(x | ωj, Θj)   (the notation shows the dependence on Θj)
Aim : Use the information provided by the observed
samples to estimate the unknown parameter
Since the sample sets of different classes are independent,
there are c separate estimation problems
Maximum Likelihood Estimation (2/4)
For an arbitrary class, let an observed sample set, X,
contain n samples, X={x1,…,xn}.
Assume the samples are independently drawn from
their density, p(x | Θ).
The likelihood of the observed sample set, X :

p(X | Θ) = ∏_{k=1}^{n} p(x_k | Θ)
Find the value of the parameter that maximizes p(X | Θ).
To find this maximizing parameter, differentiate the
likelihood and equate it to zero.
Maximum Likelihood Estimation (3/4)
Find the value of the unknown parameter that maximizes p(X | Θ).
For different Θ, the observed samples give different p(X | Θ) values
through the densities p(x_k | Θ).
The argument of the maximum of this product is the ML estimate.
Taking log p(X | Θ) does not change the argument of this maximum.
[Figure: p(X|Θ) and log p(X|Θ) plotted versus Θ, together with candidate
densities p(x|Θ1) and p(x|Θ2); both curves peak at the same Θ.]
Maximum Likelihood Estimation (4/4)
Better to work with logarithm for analytical purposes.
l(Θ) = log p(X | Θ) = ∑_{k=1}^{n} log p(x_k | Θ)

Note: Taking the logarithm does not affect finding the maxima.
Differentiate l(Θ) and equate it to zero:

∇_Θ l(Θ) = ∑_{k=1}^{n} ∇_Θ log p(x_k | Θ) = 0
ML Estimate of Univariate Normal :
Assume mean θ1 and variance θ2 are unknown for a Gaussian pdf:

log p(x_k | Θ) = −(1/2) log{2π θ2} − (x_k − θ1)² / (2 θ2)

Differentiate wrt θ1 and θ2 :

∇_Θ log p(x_k | Θ) = [ (x_k − θ1)/θ2 ;  −1/(2θ2) + (x_k − θ1)²/(2 θ2²) ]

Maximum likelihood estimates of the parameters :

∑_{k=1}^{n} (1/θ2)(x_k − θ1) = 0  ⇒  θ̂1 = (1/n) ∑_{k=1}^{n} x_k

−∑_{k=1}^{n} 1/(2θ2) + ∑_{k=1}^{n} (x_k − θ1)²/(2θ2²) = 0  ⇒  θ̂2 = (1/n) ∑_{k=1}^{n} (x_k − θ̂1)²

i.e. the ML estimates of the mean and the variance.
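The closed-form estimates above are easy to check numerically. A minimal sketch (the sample size and the true parameters are made up for illustration):

```python
import numpy as np

# Hypothetical sample set drawn from a univariate Gaussian
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

# ML estimates derived above: sample mean and sample variance
theta1_hat = x.mean()                        # (1/n) * sum(x_k)
theta2_hat = ((x - theta1_hat) ** 2).mean()  # (1/n) * sum((x_k - theta1_hat)^2)
```

Note that θ̂2 divides by n rather than n−1, so it is a (slightly) biased estimate of the variance, as mentioned later in the ML vs. Bayesian comparison.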
ML Estimate of Multivariate Normal :
Assume only the mean vector is unknown :

log p(x_k | µ) = −(1/2) log{(2π)^d |Σ|} − (1/2)(x_k − µ)^t Σ⁻¹ (x_k − µ)

Differentiate :

∇_µ log p(x_k | µ) = Σ⁻¹ (x_k − µ)

Maximum likelihood estimate of the unknown mean vector :

∑_{k=1}^{n} Σ⁻¹ (x_k − µ̂) = 0  ⇒  µ̂ = (1/n) ∑_{k=1}^{n} x_k
MLE of mean is the arithmetic average of vector samples
Bayesian Parameter Estimation (1/3)
Can we incorporate a priori knowledge about the unknown
parameters into the formulation?
Remember, Bayesian minimum error rate classifier maximizes p(ωi|x)
Make the role of the observed sample set, X, explicit :

P(ω_i | x, X) = p(x | ω_i, X) P(ω_i | X) / ∑_{j=1}^{c} p(x | ω_j, X) P(ω_j | X)

Assume a priori probabilities are known :

P(ω_i | X) = P(ω_i)

Assume the sample sets of the classes are independent, so that
p(x | ω_i, X) = p(x | ω_i, X_i) : there are c separate problems,
each of the form p(x | X).
Bayesian Parameter Estimation (2/3)
P(ω_i | x, X) = p(x | ω_i, X) P(ω_i) / ∑_{j=1}^{c} p(x | ω_j, X) P(ω_j)

Main aim is to compute p(x | X) :

p(x | X) = ∫ p(x, Θ | X) dΘ = ∫ p(x | Θ) p(Θ | X) dΘ

where the form of p(x | Θ) is known and p(Θ | X) is to be determined.
Samples are drawn independently according to p(x | Θ),
whose parametric form is known.
Bayesian approach assumes that the unknown
parameter is a random variable with a known density p ( Θ )
Bayesian Parameter Estimation (3/3)
If p(Θ | X) is sharply peaked at some value Θ̂, we obtain

p(x | X) = ∫ p(x | Θ) p(Θ | X) dΘ ≈ p(x | Θ̂)

[Figure: broad prior p(Θ) and much narrower posterior p(Θ | X), peaked at Θ̂.]
If we are not sure about the value (i.e. no sharp peak),
the result is the average over possible values of Θ
How to determine p(Θ|X) ?
For various densities, different analytical results exist
Bayesian Parameter Estimation
Univariate Normal Distribution (1/3)
A univariate normal distribution with unknown µ
p(x | µ) ~ N(µ, σ²)
A priori information about µ is expressed by the density
p(µ) ~ N(µ0, σ0²)
Observing the sample set, D, p(µ|D) becomes
p(µ | D) = p(D | µ) p(µ) / ∫ p(D | µ) p(µ) dµ = α ∏_{k=1}^{n} p(x_k | µ) p(µ)

p(µ | D) = α ∏_{k=1}^{n} [ 1/(√(2π) σ) e^{−(1/2)((x_k − µ)/σ)²} ] · [ 1/(√(2π) σ0) e^{−(1/2)((µ − µ0)/σ0)²} ]
Bayesian Parameter Estimation
Univariate Normal Distribution (2/3)
p(µ | D) = α′ exp{ −(1/2) [ ∑_{k=1}^{n} ((x_k − µ)/σ)² + ((µ − µ0)/σ0)² ] }

  = α″ exp{ −(1/2) [ (n/σ² + 1/σ0²) µ² − 2 ( (1/σ²) ∑_{k=1}^{n} x_k + µ0/σ0² ) µ ] }

⇒ p(µ | D) ~ N(µn, σn²) , where, with sample mean mn = (1/n) ∑_{k=1}^{n} x_k ,

µn = ( n σ0² / (n σ0² + σ²) ) mn + ( σ² / (n σ0² + σ²) ) µ0 ,  σn² = σ0² σ² / (n σ0² + σ²)

Increasing the number of samples makes the peak of p(µ|D) sharper;
as n → ∞, p(µ|D) approaches a Dirac delta centered at mn (Bayesian learning).
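The posterior update formulas above can be sketched directly. A minimal example (the data, the known variance σ², and the prior parameters are hypothetical):

```python
import numpy as np

def posterior_mean_params(x, sigma2, mu0, sigma02):
    """Posterior N(mu_n, sigma_n^2) of the unknown Gaussian mean,
    given known data variance sigma2 and prior N(mu0, sigma02)."""
    n = len(x)
    m_n = np.mean(x)                 # sample mean m_n
    denom = n * sigma02 + sigma2
    mu_n = (n * sigma02 / denom) * m_n + (sigma2 / denom) * mu0
    sigma_n2 = sigma02 * sigma2 / denom
    return mu_n, sigma_n2

# Hypothetical numbers: 50 samples with known variance 1, broad prior
rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=50)
mu_n, sigma_n2 = posterior_mean_params(x, sigma2=1.0, mu0=0.0, sigma02=4.0)
```

The posterior mean lies between the prior mean µ0 and the sample mean, and the posterior variance σn² shrinks toward zero as n grows, matching the "sharper peak" behavior above.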
Bayesian Parameter Estimation
Univariate Normal Distribution (3/3)
After determining p(µ|D), p(x|D) is obtained by
p( x | D) = ∫ p( x | µ ) p(µ | D)dµ
⇒ p(x | D) = ∫ [ 1/(√(2π) σ) e^{−(1/2)((x − µ)/σ)²} ] [ 1/(√(2π) σn) e^{−(1/2)((µ − µn)/σn)²} ] dµ

⇒ p(x | D) = 1/(√(2π) σ σn) · e^{ −(1/2)(x − µn)²/(σ² + σn²) } · f(σ, σn)

⇒ p(x | D) ~ N(µn, σ² + σn²)
Compared to the case of exactly known µ, i.e. p(x|µ), the density p(x|D) has
additional uncertainty (variance σn²) due to the lack of exact knowledge of µ.
General Bayesian Learning
In summary :
• The form of the density, p(x|Θ), is assumed to be
known, but the value of parameter, Θ, is unknown
• Our initial knowledge about the parameter, Θ, is
assumed to be contained in a known a priori density,
p(Θ).
• The rest of our knowledge about the parameter,
Θ, is contained in n samples, drawn according to the
unknown probability p(x|Θ)
Comparison : ML vs. Bayesian
ML avoids many assumptions and is
analytically easier to solve, although some
of its estimates can be biased
Bayesian parameter estimation permits
including a priori information about the
unknown, but the analytical derivations are
cumbersome.
For ordinary cases, both approaches give
similar results with sufficient sample data
Non-Parametric Approaches
Parametric approaches require
Knowing the form of the density
Finding the parameter of the density
In many cases,
The form is not known
The form does not let you find a unique
solution (e.g. multi-modal densities)
Non-Parametric Approaches
The solution is to use non-parametric
approaches which do not assume a
form
There are 2 main directions :
Estimating densities non-parametrically
Direct estimation of density
Parzen window
k-NN estimation
Nearest Neighbor Rules
Non-Parametric Approaches
Density Estimation (1/3)
Probability P of a vector x falling into region R :
P = ∫_R p(x′) dx′
n samples of x are independently drawn according to p(x).
Probability that k of these n independent samples fall into R (binomial):

P_k = (n choose k) P^k (1 − P)^{n−k} , with E[k] = nP, var(k) = nP(1 − P)

Since the binomial distribution peaks very sharply around its
expected value, the number of observed samples (k_obs) in R should
be approximately equal to k_obs ≈ E[k] = nP.
Note that the probability P can be estimated via P ≈ k_obs / n, but we
need the density, p(x).
Non-Parametric Approaches
Density Estimation (2/3)
Assume p(x) is almost constant in R : ∫_R p(x′) dx′ ≈ p(x) V ,
where V is the volume of R.
Hence, one obtains the obvious result by combining the previous relations :

p(x) ≈ (k_obs / n) / V

There are two approximations (≈) in the previous relations.
• If k (or n) goes to infinity, or V goes to zero, those approximations
converge to the exact values.
For finite n, fixing V and k independently of n yields problems :
• If V → 0, then most regions contain no samples and p(x) ≈ 0 (useless)
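The relation p(x) ≈ (k_obs/n)/V can be checked numerically. A 1-D sketch (the region, sample size, and underlying density are made up for illustration; in 1-D the "volume" V is the interval length h):

```python
import numpy as np

# Estimate a density at a point via p(x) ~= (k/n) / V with a fixed region R
rng = np.random.default_rng(2)
n = 100_000
samples = rng.normal(0.0, 1.0, size=n)     # hypothetical 1-D data

x0, h = 0.0, 0.1                           # region R = [x0 - h/2, x0 + h/2]
k = np.sum(np.abs(samples - x0) <= h / 2)  # samples falling in R
p_hat = (k / n) / h                        # V = h in one dimension

# true standard-normal density at 0 is 1/sqrt(2*pi) ~= 0.3989
```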
Non-Parametric Approaches
Density Estimation (3/3)
Form a sequence of regions, Rn, centered at x for n samples, and define

pn(x) ≡ (kn / n) / Vn

Three conditions under which pn(x) converges to p(x) :

(1) lim_{n→∞} Vn = 0
(2) lim_{n→∞} kn = ∞
(3) lim_{n→∞} kn / n = 0

Examples that achieve these conditions :
• Parzen : the initial volume V0 is shrunk as Vn = V0 / √n
• k-NN : Rn is grown until it contains kn = √n samples
Non-Parametric Approaches
Parzen Windows (1/2)
Assume region Rn is a d-dimensional hypercube with
the length of an edge as hn
The number of samples falling in Rn can be
obtained analytically by using the window function :
Φ(u) = 1 if |u_j| ≤ 1/2 , j = 1,…,d ; 0 otherwise

For a hypercube (centered at x) with volume Vn = hn^d, the number of
samples and the estimate for the density are obtained as :

kn = ∑_{i=1}^{n} Φ( (x − x_i) / hn )  and  pn(x) = (1/n) ∑_{i=1}^{n} (1/Vn) Φ( (x − x_i) / hn )
Non-Parametric Approaches
Parzen Windows (2/2)
The window function can be generalized for better interpolation of the
density : each sample contributes to the estimate based on its distance to x.
• If hn is very large, then pn(x) is a superposition of slowly
changing functions : an “out-of-focus” estimate
• If hn is very small, then the window function approaches a Dirac delta
and the estimate is a sum of sharp pulses
With an unlimited number of samples, pn(x) converges to the unknown
density for any value of hn.
With a limited number of samples, the best option is to seek an
acceptable compromise.
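The hypercube Parzen estimate from the previous slides can be sketched as follows (the helper name `parzen_estimate`, the data, and the choice of hn are illustrative):

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Hypercube Parzen estimate p_n(x) = (1/n) * sum_i (1/V) * phi((x - x_i)/h),
    with phi the unit hypercube window and V = h**d."""
    samples = np.atleast_2d(samples)           # shape (n, d)
    x = np.atleast_1d(x)
    n, d = samples.shape
    u = (x - samples) / h
    inside = np.all(np.abs(u) <= 0.5, axis=1)  # phi = 1 inside the cube
    return inside.sum() / (n * h**d)

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=(50_000, 1))  # hypothetical 1-D sample set
p0 = parzen_estimate([0.0], data, h=0.2)       # standard normal: p(0) ~= 0.3989
```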
Non-Parametric Approaches
Example : Parzen Windows (1/2)
hn = h1 / √n , normal density; window function :

Φ(u) = (1/√(2π)) e^{−u²/2}
Non-Parametric Approaches
Example : Parzen Windows (2/2)
hn = h1 / √n , bi-modal density; window function :

Φ(u) = (1/√(2π)) e^{−u²/2}
Non-Parametric Approaches
kn-Nearest Neighbor
Parzen window approach depends on the initial
selection of the cell volume, V.
One remedy is to choose the cell volume as a
function of the data, rather than as an arbitrary
function of the number of samples.
In order to estimate p(x) from n samples, center a cell around x and
grow it until it captures the kn nearest samples (kn is a function of n).
The resulting estimate :

pn(x) = (kn / n) / Vn

Necessary conditions for convergence :

lim_{n→∞} kn = ∞  and  lim_{n→∞} kn / n = 0   (e.g. kn = √n)
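A minimal 1-D sketch of the kn-NN density estimate with kn = √n (the helper name `knn_density` and the data are illustrative; in 1-D, Vn is the length of the interval around x that contains the kn nearest samples):

```python
import numpy as np

def knn_density(x, samples, k):
    """k-NN estimate p_n(x) = (k/n) / V_n for 1-D data, where V_n is the
    length of the smallest interval around x containing the k nearest samples."""
    n = len(samples)
    r = np.sort(np.abs(samples - x))[k - 1]  # distance to the k-th neighbor
    V = 2 * r                                # 1-D "volume": interval of radius r
    return (k / n) / V

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=50_000)
k = int(np.sqrt(len(data)))                  # k_n = sqrt(n), as in the conditions
p0 = knn_density(0.0, data, k)               # standard normal: p(0) ~= 0.3989
```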
Non-Parametric Approaches
Example : kn-Nearest Neighbor
Non-Parametric Approaches
Parzen vs kn-Nearest Neighbor
Both methods do converge, but it is very difficult
to make meaningful statements about their finite-sample behaviour
Non-Parametric Approaches
Classification Rule
All 3 methods (direct, Parzen, kn-NN) can be used to
obtain a posteriori probabilities for n-sample data
At each cell, k samples in total; ki samples for class i :

pn(x, ωi) = (ki / n) / Vn

Pn(ωi | x) = pn(x, ωi) / ∑_{j=1}^{c} pn(x, ωj) = ki / k
Cell size selection can be achieved by using either
Parzen window or kn-NN approach
With an arbitrarily large number of samples, the unknown
probabilities can be estimated with optimum performance
Non-Parametric Approaches
Nearest Neighbor Rule (1/3)
All 3 methods (direct, Parzen, kn-NN) can be used to
obtain a posteriori probabilities by using n-sample data
so that this density is utilized for Bayes Decision Rule
A radical approach is to use the nearest neighbor out
of the sample data to classify the unknown test data
(Nearest Neighbor Rule [NN-R])
While the Bayes rule (minimum error rate) is optimal
when choosing between different classes, NN-R is
suboptimal
Non-Parametric Approaches
Nearest Neighbor Rule (2/3)
Assume that there is an unlimited number of labeled
“prototypes” for each class.
If the test point x is nearest to one of these
prototypes, x′, then
p(ωi|x) ≈ p(ωi|x′) for all i.
Obviously, if x′ is labeled ωm, then p(ωm|x′) > p(ωj|x′) for
all j ≠ m, and
one should expect p(ωm|x) > p(ωj|x) for all j ≠ m.
For unlimited samples, the error rate for NN-R is less
than twice the error rate of Bayes decision rule
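The rule itself is a one-liner. A minimal sketch (the helper name `nn_classify` and the two well-separated toy classes are made up for illustration):

```python
import numpy as np

def nn_classify(x, prototypes, labels):
    """Nearest Neighbor Rule: assign x the label of the closest prototype."""
    d = np.linalg.norm(prototypes - x, axis=1)  # Euclidean distances to prototypes
    return labels[np.argmin(d)]

# Two hypothetical, well-separated classes in 2-D
protos = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
pred = nn_classify(np.array([0.1, 0.1]), protos, labels)  # nearest prototypes are class 0
```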
Non-Parametric Approaches
Nearest Neighbor Rule (3/3)
The NN-rule partitions the feature space into cells,
each consisting of all points closer to a given training point
than to any other training point (a Voronoi tessellation)
Non-Parametric Approaches
k-Nearest Neighbor Rule
A straightforward extension of the Nearest Neighbor rule
is to use k neighbors instead of only one.
The classification is achieved by voting among the k neighbors
(k is usually selected as odd to avoid ties).
Selecting k requires a compromise :
• If k is too high, some of these k neighbors may
have different posterior probabilities, for finite n
• If k is too low, the estimation may not be reliable
The optimal behavior is obtained as both k and n
approach infinity.
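The voting rule above can be sketched as follows (the helper name `knn_classify` and the two toy Gaussian clusters are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=5):
    """k-NN rule: majority vote among the k nearest prototypes (odd k avoids ties)."""
    d = np.linalg.norm(prototypes - x, axis=1)
    nearest = labels[np.argsort(d)[:k]]          # labels of the k closest samples
    return Counter(nearest).most_common(1)[0][0]

# Two hypothetical Gaussian clusters, 20 samples each
protos = np.vstack([np.random.default_rng(5).normal(0, 1, (20, 2)),
                    np.random.default_rng(6).normal(4, 1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
pred = knn_classify(np.array([4.0, 4.0]), protos, labels, k=5)
```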
Dimension Reduction
In supervised learning, excessive dimensionality of
features should be decreased. The main approaches are
• Principal Component Analysis
• Unsupervised
• Fisher’s Linear Discriminant
• Supervised (data with class info is required)
Principal Component Analysis
Assume there are n vectors in d dimensions: {x1, …, xn}
These vectors are represented by their projections
onto a line, with direction e, passing through their sample mean, m :

x = m + a e

For a fixed line, the optimal a coefficients that
minimize the distance between the points and the line :

min_{a1,…,an} J(a1, …, an, e) = min_{a1,…,an} ∑_{k=1}^{n} ||(m + ak e) − xk||²

⇒ J(·) = ∑_{k=1}^{n} ak² ||e||² − 2 ∑_{k=1}^{n} ak e^t (xk − m) + ∑_{k=1}^{n} ||xk − m||²

∂J(·)/∂ak = 0  ⇒  ak = e^t (xk − m)
Principal Component Analysis
Assume the a coefficients are obtained; the same cost
function, J(·), is minimized wrt the line direction, e :

min_e J(a1, …, an, e) = min_e ∑_{k=1}^{n} ||(m + ak e) − xk||² , where ak = e^t (xk − m)

Define the scatter matrix, S, (similar to covariance) as

S ≡ ∑_{k=1}^{n} (xk − m)(xk − m)^t

⇒ J(·) = ∑_{k=1}^{n} ak² ||e||² − 2 ∑_{k=1}^{n} ak e^t (xk − m) + ∑_{k=1}^{n} ||xk − m||²   (with ||e|| = 1)

  = −∑_{k=1}^{n} ( e^t (xk − m) )² + ∑_{k=1}^{n} ||xk − m||²

  = −e^t S e + ∑_{k=1}^{n} ||xk − m||²

⇒ min_e J(·) = max_e e^t S e
Principal Component Analysis
min_e J(·) = max_e e^t S e

The maximum of e^t S e must be obtained under the constraint ||e|| = 1.

Lagrange mult. : u ≡ e^t S e + λ(1 − e^t e)  ⇒  ∂u/∂e = 0  ⇒  2 S e − 2 λ e = 0

The solution is the eigenvector e of S corresponding
to its largest eigenvalue.
The result can be generalized to a d′-dimensional projection
by minimizing the following relation :

J_{d′} = ∑_{k=1}^{n} || ( m + ∑_{i=1}^{d′} a_{ki} e_i ) − xk ||²

where x = m + ∑_{i=1}^{d′} a_i e_i , such that the e_i are the eigenvectors
of S with the d′ largest eigenvalues.
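The d′-dimensional projection above can be sketched with a plain eigendecomposition of the scatter matrix (the helper name `pca` and the toy data are illustrative):

```python
import numpy as np

def pca(X, d_prime):
    """Project rows of X (n x d) onto the d' eigenvectors of the scatter
    matrix S with the largest eigenvalues."""
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc                       # scatter matrix S = sum (x_k - m)(x_k - m)^t
    evals, evecs = np.linalg.eigh(S)    # eigh returns ascending eigenvalues
    E = evecs[:, ::-1][:, :d_prime]     # top-d' eigenvectors e_1, ..., e_d'
    a = Xc @ E                          # coefficients a_ki = e_i^t (x_k - m)
    return a, E, m

rng = np.random.default_rng(7)
# hypothetical data: most of the variance lies along the first axis
X = rng.normal(0, 1, (200, 3)) * np.array([5.0, 1.0, 0.1])
a, E, m = pca(X, 1)                     # 1-D projection
```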
Principal Component Analysis
Remember the n vectors in d dimensions: X = [x1, …, xn] (columns mean-subtracted)
Note the difficulty in calculating S if d >> n (S is d×d) :

S ≡ ∑_{k=1}^{n} (xk − m)(xk − m)^t = X X^t

Instead of solving Se = λe, i.e. X X^t e = λ e, try solving

X^t X f = λ f

Multiplying by X from the left :

X X^t (X f) = λ (X f)  ⇔  X X^t e = λ e  ⇒  e = X f

Note that X X^t is d×d, whereas X^t X is n×n.
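This d vs. n trick is easy to verify numerically; a minimal sketch with made-up dimensions (d >> n, e.g. images stacked as long vectors):

```python
import numpy as np

rng = np.random.default_rng(8)
d, n = 1000, 20                       # d >> n: the d x d problem is expensive
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)    # mean-subtract so that S = X X^t

# Solve the small n x n problem X^t X f = lambda f ...
evals, F = np.linalg.eigh(X.T @ X)
f = F[:, -1]                          # eigenvector of the largest eigenvalue

# ... then e = X f is an eigenvector of the big d x d matrix S = X X^t
e = X @ f
residual = np.linalg.norm((X @ X.T) @ e - evals[-1] * e) / np.linalg.norm(e)
```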
Fisher’s Linear Discriminant (1/8)
Fisher's approach aims to project d-dimensional
data onto a line (1-D), which is defined by a direction w.
The projected data is expected to be well separated
between the two classes after such a dimension
reduction.
Fisher’s Linear Discriminant (2/8)
Feature vector projections : y_i = w^t x_i , i = 1, …, n
Measures for separation based on w :
Difference between projection means
Variance of within-class projected data
Choose the projection (w) in order to maximize J :

J(·) = (m̃1 − m̃2)² / (s1² + s2²)

where m̃i : projection mean for class i
si² = ∑_{y∈Yi} (y − m̃i)² : scatter
Fisher’s Linear Discriminant (3/8)
Relation between sample & projection means :

mi = (1/ni) ∑_{x∈Xi} x  ⇒  m̃i = (1/ni) ∑_{y∈Yi} y = (1/ni) ∑_{x∈Xi} w^t x = w^t mi

Define scatter matrices Si :

Si = ∑_{x∈Xi} (x − mi)(x − mi)^t  and  SW = S1 + S2

Note that si and Si are related as

si² = ∑_{y∈Yi} (y − m̃i)² = ∑_{x∈Xi} (w^t x − w^t mi)²
    = ∑_{x∈Xi} w^t (x − mi)(x − mi)^t w = w^t Si w
Fisher’s Linear Discriminant (4/8)
Similarly, the relation between m̃1 and m̃2 becomes

(m̃1 − m̃2)² = (w^t m1 − w^t m2)² = w^t (m1 − m2)(m1 − m2)^t w = w^t SB w
(note that SB has rank 1)

The initial criterion function J(·) = (m̃1 − m̃2)² / (s1² + s2²)
can then be written as

J(w) = (w^t SB w) / (w^t SW w)

A w vector that maximizes J must satisfy SB w = λ SW w
(see distributed notes for its proof).
If SW is non-singular, then SW⁻¹ SB w = λ w ;
since SB w is always in the direction of (m1 − m2),

w = SW⁻¹ (m1 − m2)
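The closed-form solution w = SW⁻¹(m1 − m2) can be sketched directly (the two toy Gaussian classes below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
# Two hypothetical 2-D classes: separated along the first axis,
# but with large within-class scatter along the second axis
X1 = rng.normal([0, 0], [1.0, 3.0], (100, 2))
X2 = rng.normal([3, 0], [1.0, 3.0], (100, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)          # per-class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2                          # within-class scatter

w = np.linalg.solve(Sw, m1 - m2)      # w = Sw^{-1} (m1 - m2)
y1, y2 = X1 @ w, X2 @ w               # 1-D projections of the two classes
```

Fisher's direction ends up nearly parallel to the first axis here, because SW⁻¹ discounts the noisy second axis.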
Fisher’s Linear Discriminant (5/8)
For a 2-class problem, d-dimensional data is projected onto a line.
As an extension to the c-class problem, it is possible to project
the data onto (c−1) dimensions, instead of a line.
For (c−1) dimensions :

yi = wi^t x , i = 1, …, c−1  ⇒  y = W^t x

Define new scatter matrices in the d-dimensional space :

SW = ∑_{i=1}^{c} Si

ST = ∑_{x∈Whole} (x − m)(x − m)^t
   = ∑_{i=1}^{c} ∑_{x∈Di} (x − mi + mi − m)(x − mi + mi − m)^t
   = SW + SB

where SB = ∑_{i=1}^{c} ni (mi − m)(mi − m)^t
(note that SB has rank at most c−1)
Fisher’s Linear Discriminant (6/8)
Remember the samples are projected by y = W^t x.
The resulting projected mean vectors in (c−1) dimensions :

m̃i = (1/ni) ∑_{y∈Yi} y , m̃ = (1/n) ∑_{i=1}^{c} ni m̃i

⇒ m̃i = (1/ni) ∑_{x∈Xi} W^t x = W^t mi , m̃ = (1/n) ∑_{i=1}^{c} ni W^t mi = W^t m

The scatter matrices in (c−1) dimensions can be defined as

S̃W = ∑_{i=1}^{c} ∑_{y∈Yi} (y − m̃i)(y − m̃i)^t , S̃B = ∑_{i=1}^{c} ni (m̃i − m̃)(m̃i − m̃)^t
Fisher’s Linear Discriminant (7/8)
The scatter matrices in the projected space are

S̃W = ∑_{i=1}^{c} ∑_{y∈Yi} (y − m̃i)(y − m̃i)^t , S̃B = ∑_{i=1}^{c} ni (m̃i − m̃)(m̃i − m̃)^t

They are related to the original scatter matrices by

S̃W = ∑_{i=1}^{c} ∑_{x∈Xi} (W^t x − W^t mi)(W^t x − W^t mi)^t = W^t SW W

S̃B = ∑_{i=1}^{c} ni (W^t mi − W^t m)(W^t mi − W^t m)^t = W^t SB W
Fisher’s Linear Discriminant (8/8)
The relations between the scatter matrices are obtained as

S̃W = W^t SW W , S̃B = W^t SB W

For better discrimination in the projected space :

min |S̃W| & max |S̃B|   (|·| : determinant)

⇒ J(·) = |S̃B| / |S̃W|  ⇒  J(W) = |W^t SB W| / |W^t SW W|

Note that the determinant is the product of the scatter along the principal directions.
Solution for J(W) : the columns of the optimal W are the
generalized eigenvectors that correspond to the
(c−1) largest eigenvalues of

SB wi = λi SW wi
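The generalized eigenproblem above can be sketched by solving the equivalent ordinary problem SW⁻¹SB w = λw (the helper name `multiclass_lda`, the toy classes, and the assumption that SW is non-singular are all illustrative):

```python
import numpy as np

def multiclass_lda(X_list):
    """Fisher projection for c classes: columns of W are the generalized
    eigenvectors of S_B w = lambda S_W w with the (c-1) largest eigenvalues."""
    c = len(X_list)
    m = np.vstack(X_list).mean(axis=0)     # overall mean
    d = m.size
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for Xi in X_list:
        mi = Xi.mean(axis=0)
        Sw += (Xi - mi).T @ (Xi - mi)      # within-class scatter
        Sb += len(Xi) * np.outer(mi - m, mi - m)  # between-class scatter
    # reduce to an ordinary eigenproblem S_W^{-1} S_B w = lambda w
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[: c - 1]]   # W : d x (c-1)

rng = np.random.default_rng(10)
# three hypothetical Gaussian classes in 3-D
classes = [rng.normal(mu, 1.0, (50, 3))
           for mu in ([0, 0, 0], [4, 0, 0], [0, 4, 0])]
W = multiclass_lda(classes)
```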