0:01
hello this is Daniel Povey and today
0:03
we're asking him what's the difference
0:04
between i-vectors and x-vectors
0:08
okay so
0:10
i-vectors and x-vectors are both
0:12
concepts from uh speaker recognition
0:14
meaning like speaker identification so
0:19
it's basically a fixed dimensional
0:20
vector of let's say dimension 256 or
0:23
512 or something like that
0:25
the the
0:27
it's supposed to represent the
0:28
information about the speaker
0:31
uh
0:32
but the uh
0:34
the special thing the original thing
0:35
about i-vectors was
0:37
that you extract an i-vector from like
0:40
just a recording
0:42
and it
0:43
it contains information about both the
0:45
speaker and the uh
0:47
the kind of recording conditions
0:49
and then you use other methods to
0:51
separate the to separate those two
0:53
sources of variation like PLDA and stuff
0:56
but for Kaldi purposes
0:58
we mostly use i-vectors
1:01
for uh
1:02
a very basic form of speaker adaptation
1:04
so that when we train a neural network
1:06
we input the i-vector as a kind of
1:09
extra input to the neural network and it
1:12
helps it to adapt and actually
1:15
for the most part
1:18
it has a similar effect to just like
1:19
mean normalization or something like
1:21
that because it can use the i-vector
1:23
to figure out
1:25
you know what's roughly the mean of the
1:27
uh the input feature so actually in the
1:29
end i kind of regretted
1:32
putting the i-vector stuff
1:34
in because you can get most of the
1:36
improvement just from uh
1:39
giving it the mean of the features up
1:41
till the present point
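
[Editor's sketch] The idea just described, giving the network the mean of the features seen so far as an extra input, can be illustrated roughly as follows. This is a minimal sketch in plain Python, not Kaldi code: the function names `running_mean` and `append_running_mean` and the toy feature values are hypothetical, and real systems would work on actual acoustic feature matrices.

```python
def running_mean(frames):
    """For each frame t, return the per-dimension mean of frames[0..t]."""
    if not frames:
        return []
    totals = [0.0] * len(frames[0])
    means = []
    for t, frame in enumerate(frames, start=1):
        # Accumulate the sum so far, then divide by the number of frames seen.
        totals = [s + x for s, x in zip(totals, frame)]
        means.append([s / t for s in totals])
    return means


def append_running_mean(frames):
    """Concatenate each frame with the mean of the features up to that frame."""
    return [list(f) + m for f, m in zip(frames, running_mean(frames))]


# Toy 2-dimensional "features" for a 3-frame recording (made-up numbers).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
augmented = append_running_mean(frames)
print(augmented[-1])  # [5.0, 6.0, 3.0, 4.0]
```

Each augmented frame carries the original features plus the causal mean, so a network could in principle learn something like mean normalization from it without a separate i-vector extractor.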
1:42
so anyway so that's what i-vectors are
1:44
now x-vectors is a kind of
1:47
a neural net version of i-vectors where
1:50
uh
1:51
you basically train a neural net to
1:52
discriminate between speakers
1:55
and inside the neural net it has
1:58
there's some kind of embedding layer
1:59
that's just before the classifier
2:02
and you call that the x-vector so
2:04
you can extract basically it's a way of
2:06
extracting a fixed dimensional feature
2:08
from an utterance now the thing with
2:10
both i-vectors and x-vectors is that
2:12
to train the
2:14
the classifier effectively to train the
2:17
system that extracts the i-vector or
2:19
the x-vector you need a very huge
2:21
amount of data so for i-vectors
2:24
ideally
2:25
you want like a thousand hours or
2:27
something
2:28
if it's for speaker identification
2:30
purposes and for x-vectors like ideally
2:33
you want something like 10,000 hours
2:35
which is a bit ridiculous now for speech
2:37
recognition
2:39
it's not as critical so it's fine if you
2:41
have just like 10 hours or 100 hours
2:44
because we're not really using it for
2:45
speaker identification we're just using
2:47
it for a basic
2:49
form of adaptation so
2:51
it's not so critical
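
[Editor's sketch] To make "a fixed dimensional feature from an utterance" concrete, here is a toy illustration of how an embedding layer can produce a vector of the same size for recordings of different lengths. It is only a sketch under stated assumptions: pooling the mean and standard deviation over frames is taken from the common x-vector design and is not described in the recording, the weights are hypothetical and untrained, and the frame-level layers and the speaker classifier used during training are left out.

```python
import math


def stats_pool(frames):
    """Collapse a variable-length list of frame vectors into [means..., stds...]."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation per dimension (an assumption of this sketch).
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def embed(pooled, weights, bias):
    """Affine embedding layer: one output value per row of weights."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical, untrained parameters: 4 pooled inputs -> 3-dimensional embedding.
WEIGHTS = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
BIAS = [0.0, 0.0, 0.0]

short_utt = [[1.0, 2.0], [3.0, 4.0]]                         # 2 frames
long_utt = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 frames

x_short = embed(stats_pool(short_utt), WEIGHTS, BIAS)
x_long = embed(stats_pool(long_utt), WEIGHTS, BIAS)
print(len(x_short), len(x_long))  # 3 3
```

In a trained system the embedding would be read out from the layer just before the speaker classifier, as described in the answer above; here the point is only that the output dimension does not depend on the number of frames.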
2:53
okay so does Kaldi use x-vectors at
2:55
all
2:57
uh well
2:59
there are uh
3:00
speaker recognition recipes in Kaldi
3:02
like if you look at SRE16
3:05
things like that
3:07
uh though that's not for speech
3:09
recognition though because
3:11
uh there's no advantage of x-vectors
3:14
over i-vectors for uh
3:16
for its application to speech
3:17
recognition we're just using it like i
3:19
said for basic adaptation
3:21
and we don't really need all of that uh
3:23
discriminating power of x-vectors so
3:25
answer is we're using it only
3:28
for speaker recognition
3:30
okay thank you
3:32
thank you