June 7, 2022
*These captions were created for Dan Povey’s interview with us: https://youtu.be/2ldYQM6jqP8
Q: i-vectors vs x-vectors
- i-vectors and x-vectors are both concepts from speaker recognition / speaker identification
- so it's basically a fixed-dimensional vector, of, let's say, dimension 256 or 512, or something like that.
- it is supposed to represent information about the speaker
- the original idea of i-vectors was that you extract an i-vector from just a recording
- it contains information about both the speaker and the kind of recording conditions
- then you use other methods, like PLDA and stuff, to separate those two sources of variation.
- But for Kaldi purposes we mostly use i-vectors for a very basic form of speaker adaptation
- when we train a neural network, we feed the i-vector in as a kind of extra input, and it helps the network adapt
- actually, for the most part it just has a similar effect to mean normalization
- I regretted putting the i-vector stuff in, because you can get most of the improvement just from giving the network the mean of the features up to the present point
- x-vectors are a kind of neural-net version of i-vectors, where you basically train a neural net to discriminate between speakers.
- inside that neural net there is some kind of embedding layer, just before the classifier; you call its output the x-vector, and that is what you extract.
- basically it's a way of extracting a fixed-dimensional feature from an utterance.
- now, the thing with both i-vectors and x-vectors is that to train the system that extracts them effectively, you need a huge amount of data
- for speaker identification purposes you ideally want something like 1,000 hours for i-vectors and 10,000 hours for x-vectors; for speech recognition it is not as critical, and 10 or 100 hours can be enough.
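The basic adaptation described above (appending an i-vector, or simply the mean of the features up to the present frame, as an extra network input) can be sketched as follows. This is a toy illustration with made-up dimensions and random data, not Kaldi code:

```python
import numpy as np

# Toy illustration (not Kaldi code): adapt a frame-by-frame acoustic
# model input by appending an adaptation vector to each frame's features.
# Here the adaptation vector is the running mean of the features up to
# the present frame, which the interview notes gives most of the gain.

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 40))           # 300 frames of 40-dim features (fake data)

# Cumulative (online) mean of the features up to each frame.
counts = np.arange(1, len(feats) + 1)[:, None]
running_mean = np.cumsum(feats, axis=0) / counts

# The network input at frame t is [feats[t] ; running_mean[t]]:
# 40 acoustic dims plus a 40-dim adaptation vector (an i-vector of
# dimension 256 or 512 would be appended in the same way).
nnet_input = np.concatenate([feats, running_mean], axis=1)
print(nnet_input.shape)                      # (300, 80)
```

An i-vector changes slowly across an utterance, so the running mean is a cheap stand-in that likewise summarizes the speaker and channel seen so far.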
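The x-vector extraction described above (an embedding layer just before the speaker classifier, pooled over time so any-length utterances give a fixed-dimensional vector) can be sketched like this. The layer sizes, ReLU activations, and mean+std statistics pooling are assumptions for illustration, not the exact Kaldi recipe, and the weights are random rather than trained:

```python
import numpy as np

# Minimal numpy sketch (assumed architecture, not the Kaldi x-vector
# code): a frame-level layer, statistics pooling over time, then an
# embedding layer. The embedding activations are the "x-vector"; in
# training, a speaker classifier sits on top, but at extraction time
# the classifier is discarded.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Random weights stand in for a trained network.
W_frame = rng.normal(scale=0.1, size=(40, 128))    # frame-level layer
W_embed = rng.normal(scale=0.1, size=(256, 512))   # embedding layer

def extract_xvector(feats):
    """feats: (num_frames, 40) -> fixed-dimensional (512,) embedding."""
    h = relu(feats @ W_frame)                      # (T, 128) frame-level activations
    pooled = np.concatenate([h.mean(0), h.std(0)]) # (256,) statistics pooling over time
    return relu(pooled @ W_embed)                  # (512,) the "x-vector"

# Utterances of different lengths map to the same fixed dimension.
x1 = extract_xvector(rng.normal(size=(200, 40)))
x2 = extract_xvector(rng.normal(size=(950, 40)))
print(x1.shape, x2.shape)                          # (512,) (512,)
```

The pooling step is what turns a variable-length sequence of frames into the fixed-dimensional feature the transcript mentions.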
Q: Does Kaldi have x-vectors?
- there are speaker recognition recipes in Kaldi; look at SRE16 and things like that
- that's not for speech recognition, though, because x-vectors have no advantage over i-vectors when applied to speech recognition.
- like I said, we are just using them for basic adaptation, and we don't really need all of that discriminating power of x-vectors
- so the answer is: we only use x-vectors for speaker recognition
References:
- Interview video: https://youtu.be/2ldYQM6jqP8
- X-Vectors: Robust DNN Embeddings for Speaker Recognition: https://ieeexplore.ieee.org/abstract/document/8461375
- OnlineIvectorFeature class reference: https://kaldi-asr.org/doc/classkaldi_1_1OnlineIvectorFeature.html#af7c4234c6b1d5d807dbb4292cf36b98c
- GBO notes: i-vectors and x-vectors (the blog explains both vectors very well): https://desh2608.github.io/2022-04-07-gbo-ivectors/