June 7, 2022
*These captions were created for Dan Povey’s interview with us: https://youtu.be/2ldYQM6jqP8
Q: i-vectors vs x-vectors
- i-vectors and x-vectors are both concepts from speaker recognition / speaker identification
- so it's basically a fixed-dimensional vector, of, let's say, dimension 256 or 512, or something like that.
- it is supposed to represent information about the speaker
- the original idea of i-vectors was that you extract an i-vector from just a recording
- it contains information about both the speaker and the kind of recording conditions
- then you use other methods, like PLDA and stuff, to separate those two sources of variation.
- But for Kaldi purposes we mostly use i-vectors for a very basic form of speaker adaptation
- when we train a neural network, we feed the i-vector in as a kind of extra input, and it helps the network adapt
- actually, for the most part it just has a similar effect to mean normalization
- I regretted putting the i-vector stuff in, because you can get most of the improvement just from giving the network the mean of the features up to the present point
- x-vectors are a kind of neural-net version of i-vectors, where you basically train a neural net to discriminate between speakers.
- inside that neural net there is some kind of embedding layer, just before the classifier; you call its output the x-vector, and that is what you extract.
- basically it's a way of extracting a fixed-dimensional feature from an utterance.
- now, the thing with both i-vectors and x-vectors is that to train the system that extracts them effectively, you need a huge amount of data
- for speaker identification purposes you ideally want something like 1,000 hours for i-vectors and 10,000 hours for x-vectors; for speech recognition it is not as critical, and 10 or 100 hours can be enough.
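The basic adaptation described above (appending an i-vector, or simply the mean of the features up to the present frame, as an extra network input) can be sketched as follows. This is a toy illustration with made-up dimensions and random data, not Kaldi code:

```python
import numpy as np

# Toy illustration (not Kaldi code): adapt a frame-by-frame acoustic
# model input by appending an adaptation vector to each frame's features.
# Here the adaptation vector is the running mean of the features up to
# the present frame, which the interview notes gives most of the gain.

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 40))           # 300 frames of 40-dim features (fake data)

# Cumulative (online) mean of the features up to each frame.
counts = np.arange(1, len(feats) + 1)[:, None]
running_mean = np.cumsum(feats, axis=0) / counts

# The network input at frame t is [feats[t] ; running_mean[t]]:
# 40 acoustic dims plus a 40-dim adaptation vector (an i-vector of
# dimension 256 or 512 would be appended in the same way).
nnet_input = np.concatenate([feats, running_mean], axis=1)
print(nnet_input.shape)                      # (300, 80)
```

An i-vector changes slowly across an utterance, so the running mean is a cheap stand-in that likewise summarizes the speaker and channel seen so far.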
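The x-vector extraction described above (an embedding layer just before the speaker classifier, pooled over time so any-length utterances give a fixed-dimensional vector) can be sketched like this. The layer sizes, ReLU activations, and mean+std statistics pooling are assumptions for illustration, not the exact Kaldi recipe, and the weights are random rather than trained:

```python
import numpy as np

# Minimal numpy sketch (assumed architecture, not the Kaldi x-vector
# code): a frame-level layer, statistics pooling over time, then an
# embedding layer. The embedding activations are the "x-vector"; in
# training, a speaker classifier sits on top, but at extraction time
# the classifier is discarded.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Random weights stand in for a trained network.
W_frame = rng.normal(scale=0.1, size=(40, 128))    # frame-level layer
W_embed = rng.normal(scale=0.1, size=(256, 512))   # embedding layer

def extract_xvector(feats):
    """feats: (num_frames, 40) -> fixed-dimensional (512,) embedding."""
    h = relu(feats @ W_frame)                      # (T, 128) frame-level activations
    pooled = np.concatenate([h.mean(0), h.std(0)]) # (256,) statistics pooling over time
    return relu(pooled @ W_embed)                  # (512,) the "x-vector"

# Utterances of different lengths map to the same fixed dimension.
x1 = extract_xvector(rng.normal(size=(200, 40)))
x2 = extract_xvector(rng.normal(size=(950, 40)))
print(x1.shape, x2.shape)                          # (512,) (512,)
```

The pooling step is what turns a variable-length sequence of frames into the fixed-dimensional feature the transcript mentions.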
Q: Does Kaldi have x-vectors?
- there are speaker recognition recipes in Kaldi; look at SRE16 and things like that
- that's not for speech recognition, though, because x-vectors have no advantage over i-vectors when applied to speech recognition.
- like I said, we are just using them for basic adaptation, and we don't really need all of that discriminating power of x-vectors
- so the answer is: we only use x-vectors for speaker recognition
References:
- Interview video: https://youtu.be/2ldYQM6jqP8
- X-Vectors: Robust DNN Embeddings for Speaker Recognition: https://ieeexplore.ieee.org/abstract/document/8461375
- OnlineIvectorFeature class reference: https://kaldi-asr.org/doc/classkaldi_1_1OnlineIvectorFeature.html#af7c4234c6b1d5d807dbb4292cf36b98c
- GBO notes: i-vectors and x-vectors (the blog explains both vectors very well): https://desh2608.github.io/2022-04-07-gbo-ivectors/