Next-gen Kaldi: what is it?

This blog post was created from Daniel Povey’s talk at the BAAI Conference, available on YouTube.

Next-gen Kaldi describes a cluster of three open source projects: Lhotse, K2 and Icefall. These are just the main ones; we have a bunch of smaller ones that implement extra features for them. Lhotse is all about data preparation for speech recognition. In speech recognition, data preparation is much more complicated than in, say, computer vision, because the utterances are all different lengths and there are all kinds of issues about how to put them into batches. We've made a whole repository for that to keep it separate from everything else. K2 is a collection of core algorithms in C++, things like lattices and finite state transducers, and Icefall is the recipes. Right now most of our effort is going into Icefall.
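
As a rough illustration of the batching problem described above, here is a minimal Python sketch that groups utterances of similar length so that little padding is wasted, and then pads a batch to a common length. The function names `bucket_by_duration` and `pad_batch`, the `max_batch_seconds` budget and the example durations are all hypothetical; this is not Lhotse's API, only a sketch of the general idea.

```python
from typing import Dict, List, Tuple


def bucket_by_duration(
    durations: Dict[str, float], max_batch_seconds: float
) -> List[List[str]]:
    """Group utterance IDs so each batch's padded duration fits a budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    # Sorting puts utterances of similar length next to each other,
    # which keeps the amount of padding per batch small.
    for utt_id in sorted(durations, key=lambda k: durations[k]):
        longest = durations[utt_id]  # ascending order: newest item is longest
        # Padded cost = number of utterances * longest utterance in the batch.
        if current and (len(current) + 1) * longest > max_batch_seconds:
            batches.append(current)
            current = []
        current.append(utt_id)
    if current:
        batches.append(current)
    return batches


def pad_batch(
    features: List[List[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[int]]:
    """Pad every sequence to the longest one and keep the true lengths."""
    lengths = [len(f) for f in features]
    max_len = max(lengths)
    padded = [f + [pad_value] * (max_len - len(f)) for f in features]
    return padded, lengths


# Hypothetical utterance durations in seconds.
durations = {"utt1": 2.0, "utt2": 9.5, "utt3": 2.5, "utt4": 10.0, "utt5": 3.0}
batches = bucket_by_duration(durations, max_batch_seconds=20.0)
print(batches)
```

The true lengths are returned alongside the padded batch because a model usually needs them later to ignore the padded frames.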

Icefall is developing very rapidly, so it's not super stable. We change the recipes every week or two, but we've structured it in such a way that we can make rapid progress. Later on we'll clean it up a little bit.

Current recipes

Right now the recipes in Icefall mostly work with RNN-T, although technically it's not an RNN: we don't really use LSTMs right now, we mostly use transformers. So technically it's a transducer, but we use the name RNN-T because that's what it's called.
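
To make the word "transducer" a bit more concrete, the sketch below computes the log-likelihood of a label sequence with the textbook transducer forward recursion over a grid of T frames by U+1 label positions. This is the plain, unpruned recursion in pure Python, not the fast RNN-T training code in Icefall or K2. The function name and the input layout (`log_blank[t][u]` and `log_label[t][u]`) are assumptions made for this example; in a real model these values would come from the joint network, whatever encoder (LSTM or transformer) sits underneath.

```python
import math
from typing import List


def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(
    log_blank: List[List[float]], log_label: List[List[float]]
) -> float:
    """
    log_blank[t][u]: log-prob of blank at frame t after u labels, T x (U+1).
    log_label[t][u]: log-prob of the (u+1)-th reference label at frame t
                     after u labels, T x U.
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank, which advances one frame
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u]
                )
            if u > 0:  # arrive by emitting a label, staying on the same frame
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t][u - 1] + log_label[t][u - 1]
                )
    # A final blank on the last frame terminates every alignment.
    return alpha[T - 1][U] + log_blank[T - 1][U]


# Toy case: 2 frames, 1 label, every probability equal to 0.5.
h = math.log(0.5)
ll = transducer_log_likelihood([[h, h], [h, h]], [[h], [h]])
print(math.exp(ll))
```

In the toy case there are two alignments, each with one label and two blanks, so the total probability is 2 * 0.5^3 = 0.25.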

Icefall doesn't have any Kaldi dependency. It just depends on PyTorch, although it does also depend on the other two repositories, K2 and Lhotse.

Right now we are putting a lot of effort into fast RNN-T training. We're focused on things that are practical for building real speech recognition systems. We're also working on another project called Sherpa, which lets you build a server that does real-time recognition, but this is at a very early stage.

Next-gen Kaldi vs. other toolkits

The most popular Python-based toolkits for speech recognition right now are probably SpeechBrain, ESPNet and WeNet, although there are others. Compared to SpeechBrain and ESPNet, we're much more focused on just speech recognition. Those projects have quite a wide range of interests, including TTS and speech separation. Right now we're focusing only on ASR, and our main goal is to be something practical that companies can use to build real speech recognition systems. I don't want to say that this is ready. It is not 100% ready yet, because we have not resolved all the issues around real-time recognition, but we're working very hard on it right now. We're different from WeNet because we don't have any Kaldi dependency; we're trying to build everything from scratch, although it's based on PyTorch, so it's not totally from scratch. We're taking inspiration from these other projects too, and we've used code from ESPNet and from WeNet.

So the basic vision of Next-gen Kaldi is to have low-latency speech recognition that is accurate and has high throughput, so you can do server inference very efficiently. This is something that the other big toolkits, SpeechBrain and ESPNet, are not focused on at all, as far as I can tell.

What is the vision?

We want to be able to use this in all of Xiaomi's products, as well as it being open source. There are certain requirements: apart from high throughput and low latency, we also need to support contextualization. For example, if you have names of contacts or a database of song titles, you need to be able to put that into the system somehow, and also to specialize it to particular products. Suppose you have an air conditioner or something else that has certain commands. This is something that so-called end-to-end speech recognition is not very good at. People can get super impressive numbers on LibriSpeech with RNN-T or with transformer transducers, but when you try to actually build a product with these things you realize that there are actually a lot of problems.
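
One simple way to picture contextualization is to give a score bonus to hypotheses that contain a phrase from a user-specific list, such as contact names. The sketch below does this as a rescoring pass over an n-best list. It is only an illustration under stated assumptions: the function `rescore_with_context`, the `bonus` value, and the example hypotheses and scores are all made up, and it does not describe how Next-gen Kaldi implements contextualization. Practical systems often apply this kind of bias inside the beam search rather than afterwards.

```python
from typing import List, Tuple


def rescore_with_context(
    nbest: List[Tuple[str, float]], phrases: List[str], bonus: float = 2.0
) -> List[Tuple[str, float]]:
    """Add a fixed log-score bonus for each context phrase a hypothesis contains."""
    rescored = []
    for text, score in nbest:
        lowered = text.lower()
        hits = sum(1 for phrase in phrases if phrase.lower() in lowered)
        rescored.append((text, score + bonus * hits))
    # Highest score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical n-best list: (hypothesis text, log-score from the recognizer).
nbest = [("call john smith", -3.9), ("call jon smyth", -4.1)]
contacts = ["Jon Smyth"]  # hypothetical contact list for this user

best_without, _ = max(nbest, key=lambda item: item[1])
best_with, _ = rescore_with_context(nbest, contacts)[0]
print(best_without, "->", best_with)
```

Without the contact list the more common spelling wins; with it, the hypothesis matching the user's contact moves to the top.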

I'm very committed to getting something working. This is taking a lot longer than expected, but at some point we're really going to have something working, and I hope that's going to be by the end of the year.

Why the delay?

We were hoping that we would have already released something about a year ago, but I've never been very good at estimating timelines. I didn't take into account how long it would take to build up a team and to learn the things needed, because I was always coding in C++ and CUDA and was never a very big Python programmer. I had to learn a lot of things. We've also changed direction quite a lot: originally the plan was to keep most of the Kaldi code but use PyTorch for the neural net stuff. In the end I decided that it was going to be too much code and too much legacy, so it was not a good plan.

The next plan was K2, which we built with the intention of focusing mostly on finite-state-automata things like lattice-free MMI and CTC, but we couldn't quite get the Word Error Rate results that we were hoping for from that. K2 is still very useful, though. For a start, you can use finite-state automata for a lot of things; lattices, for example, are a data structure that you're going to need for any serious speech recognition system.
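
For readers who have not met lattices before, the sketch below treats a lattice as a small acyclic graph whose arcs carry a word and a log-score, and finds the best-scoring path with dynamic programming. The `Arc` layout, the `best_path` function and the example words and scores are hypothetical and are not K2's data structures; the sketch also assumes states are numbered in topological order, with state 0 as the start and the highest-numbered state as the final state.

```python
import math
from typing import List, Optional, Tuple

# (source state, destination state, word, log-score)
Arc = Tuple[int, int, str, float]


def best_path(arcs: List[Arc], num_states: int) -> Tuple[List[str], float]:
    """Return the highest-scoring word sequence from state 0 to the last state."""
    best = [-math.inf] * num_states
    back: List[Optional[Arc]] = [None] * num_states
    best[0] = 0.0
    # With topologically numbered states (src < dst), visiting arcs in order
    # of source state means best[src] is final before it is used.
    for arc in sorted(arcs, key=lambda a: a[0]):
        src, dst, _, score = arc
        if best[src] + score > best[dst]:
            best[dst] = best[src] + score
            back[dst] = arc
    # Follow back-pointers from the final state to recover the words.
    words: List[str] = []
    state = num_states - 1
    while back[state] is not None:
        src, _, word, _ = back[state]
        words.append(word)
        state = src
    return words[::-1], best[num_states - 1]


# Hypothetical lattice with two competing paths.
arcs = [
    (0, 1, "recognize", -1.0),
    (0, 2, "wreck", -0.8),
    (1, 3, "speech", -0.5),
    (2, 3, "peach", -1.5),
]
words, score = best_path(arcs, num_states=4)
print(words, score)
```

Keeping the whole graph rather than a single best string is what makes later steps such as rescoring possible, since the alternatives are still available.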

Secondly, the way we built K2, it's based on primitives called ragged tensors, which are a very useful data structure for implementing algorithms with irregular structure on the GPU. For fast RNN-T, for example, we used some utilities from K2, and we also use K2 in RNN-T decoding. There are just some very useful primitives there. So even though we've switched focus a little from lattice-free MMI to RNN-T, K2 is still very useful.
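
The idea of a ragged tensor can be sketched in a few lines of Python: rows of different lengths are stored in one flat list of values, plus an array of offsets saying where each row starts and ends. The `Ragged` class below and its method names are assumptions made for illustration and are not K2's API; a GPU implementation would operate on the flat arrays in parallel rather than with Python loops.

```python
from typing import List


class Ragged:
    """A list of variable-length rows stored as flat values plus row offsets."""

    def __init__(self, rows: List[List[float]]) -> None:
        self.values: List[float] = [v for row in rows for v in row]
        # row_splits[i] is the offset in `values` where row i starts;
        # the last entry is the total number of values.
        self.row_splits: List[int] = [0]
        for row in rows:
            self.row_splits.append(self.row_splits[-1] + len(row))

    def num_rows(self) -> int:
        return len(self.row_splits) - 1

    def row(self, i: int) -> List[float]:
        return self.values[self.row_splits[i]:self.row_splits[i + 1]]

    def row_ids(self) -> List[int]:
        """For every value, the index of the row it belongs to."""
        ids: List[int] = []
        for i in range(self.num_rows()):
            ids.extend([i] * (self.row_splits[i + 1] - self.row_splits[i]))
        return ids

    def sum_per_row(self) -> List[float]:
        return [sum(self.row(i)) for i in range(self.num_rows())]


r = Ragged([[1.0, 2.0], [], [3.0, 4.0, 5.0]])
print(r.values, r.row_splits, r.row_ids(), r.sum_per_row())
```

Because everything lives in flat arrays, no padding is needed for the empty or short rows, which is the property that makes this layout attractive for irregular structures such as lattices.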