#3 X-Vectors vs I-Vectors

June 7, 2022

*These captions were created for Dan Povey’s interview with us: https://youtu.be/2ldYQM6jqP8

Q: i-vectors vs x-vectors

  • i-vectors and x-vectors are both concepts from speaker recognition / speaker identification
  • so it's basically a fixed-dimensional vector, say of dimension 256 or 512 or something like that
  • it's supposed to represent information about the speaker
  • the original thing about i-vectors was that you extract an i-vector from a whole recording
  • it contains information about both the speaker and the recording conditions
  • then you use other methods, like PLDA, to separate those two sources of variation
  • but for Kaldi purposes we mostly use i-vectors for a very basic form of speaker adaptation
  • when we train a neural network we feed in the i-vector as a kind of extra input, and it helps the network adapt
  • actually, for the most part it just has a similar effect to mean normalization
  • I regretted putting the i-vector stuff in, because you can get most of the improvement just from giving the network the mean of the features up to the present point
  • x-vectors are a kind of neural-net version of i-vectors, where you basically train a neural net to discriminate between speakers
  • inside the neural net there is some kind of embedding layer, just before the classifier; you call its output the x-vector, so you can extract it
  • basically it's a way of extracting a fixed-dimensional feature from an utterance
  • now the thing with both i-vectors and x-vectors is that to train the system that extracts them effectively, you need a very large amount of data
  • for speaker identification purposes you ideally want about 1,000 hours for i-vectors and about 10,000 hours for x-vectors; for speech recognition it is not as critical, and 10 or 100 hours can be enough
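The two adaptation ideas above can be sketched in a few lines of numpy. This is an illustrative sketch only, not Kaldi's actual implementation; the feature and i-vector dimensions are hypothetical.

```python
import numpy as np

def append_ivector(frames, ivector):
    # Tile the fixed-dimensional speaker vector across all frames and
    # append it to each frame's features, so the network receives it
    # as extra input alongside the acoustic features.
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

def running_mean(frames):
    # Mean of the features up to and including the current frame --
    # the cheap alternative mentioned above that recovers most of the
    # same improvement as the i-vector input.
    counts = np.arange(1, frames.shape[0] + 1)[:, None]
    return np.cumsum(frames, axis=0) / counts

feats = np.random.randn(100, 40)   # 100 frames of hypothetical 40-dim features
ivec = np.random.randn(256)        # hypothetical 256-dim i-vector
print(append_ivector(feats, ivec).shape)   # (100, 296)
print(running_mean(feats).shape)           # (100, 40)
```

Either way, the network gets a per-utterance (or per-frame) summary of the recording that lets it normalize for the speaker and channel.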

Q: Does Kaldi have x-vectors?

  • there are speaker recognition recipes in Kaldi; look at SRE16 and things like that
  • those are not for speech recognition, though, because x-vectors have no advantage over i-vectors when applied to speech recognition
  • like I said, we are just using them for basic adaptation, and we don't really need all of that discriminating power of x-vectors
  • so the answer is: we only use x-vectors for speaker recognition
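The key trick that lets an x-vector network produce one fixed-dimensional vector per utterance, regardless of utterance length, is a statistics-pooling layer over time. A minimal numpy sketch, with hypothetical dimensions (this shows only the pooling step; a real x-vector network has frame-level layers before it and an embedding layer plus classifier after it):

```python
import numpy as np

def stats_pooling(frame_activations):
    # Pool variable-length frame-level activations into one fixed-size
    # vector by concatenating the per-dimension mean and standard
    # deviation over time, as in x-vector-style networks (sketch only).
    mean = frame_activations.mean(axis=0)
    std = frame_activations.std(axis=0)
    return np.concatenate([mean, std])

short_utt = np.random.randn(50, 512)   # 50 frames of hypothetical activations
long_utt = np.random.randn(900, 512)   # 900 frames
# Both utterances map to the same 1024-dim vector; the layer just
# before the classifier then produces what is called the x-vector.
print(stats_pooling(short_utt).shape, stats_pooling(long_utt).shape)  # (1024,) (1024,)
```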

References:

YouTube video:

https://youtu.be/Ud3gjCL6HmY

X-Vectors: Robust DNN Embeddings for Speaker Recognition:

https://ieeexplore.ieee.org/abstract/document/8461375

OnlineIvectorFeature Class Reference

https://kaldi-asr.org/doc/classkaldi_1_1OnlineIvectorFeature.html#af7c4234c6b1d5d807dbb4292cf36b98c

GBO notes: i-vectors and x-vectors:

This blog post explains both i-vectors and x-vectors very well.

https://desh2608.github.io/2022-04-07-gbo-ivectors/

#3 i-vectors YouTube auto transcript