What is BPE and lang_bpe

Watch on youtube: What is BPE and lang_bpe_500?

9/2/22 Interview with Daniel Povey

What is BPE?

BPE stands for Byte Pair Encoding but usually when we use that word in an ASR context we're talking about word pieces. Byte Pair Encoding is just a specific algorithm that is quite simple and becoming a popular way to break words into little pieces: one, two, or three letters. These days we want to have open vocabulary systems, we don’t want to be limited to a fixed vocabulary and that's especially important in languages like German that have a very open vocabulary.

We may not want to use words, but often we don't know how words are pronounced. We don't have a lexicon: mapping from phones to words. We make a dictionary kind of on the surface level of the characters by breaking each word up into a sequence of little pieces. There's a deterministic algorithm so that given a word you always know how to break it up into pieces.

Did Next-gen Kaldi switch from LM to BPE?

We're not getting rid of LMs necessarily. Traditionally we used to have this HCLG.fst, the traditional graph-based decoding. We had a lexicon that mapped phones to words and we had a language model on words. These days there's a lot more choices because some of these end-to-end models can decode quite well without the language model, but you can still combine a BPE with a language model.

So you have a choice to either build a language model on BPE units or build a language model on words that would require a fixed vocabulary. So it's kind of a waste because you're not getting the benefits of an open vocabulary then. You could build a language model on words and then build a lexicon that just maps words to the sequence of BPE pieces. You can still use a language model, the thing that we're getting rid of is the lexicon [the dictionary] because we don't need this word-to-phone mapping anymore.

You said benefits of an open vocabulary: what are those?

The benefit is that you're able to recognize words that you never saw in training, but of course the downside of it is that you could easily get things wrong because it's generally hard to recognize unseen words. For instance if I tell you some word you've never heard before you might have a hard time knowing how to spell it. It's kind of the same thing with ASR systems that they might get these things wrong. But these days the accuracy of the models is getting better and better so that we're actually starting to see benefits from having an open vocabulary system.

You sometimes would see a small benefit back when I was starting out back around the year you know 2000, when people tried these open vocabulary systems. It would always make the word error rate worse because any time you would recognize an unseen word it would almost always be wrong, but these days that's not the case anymore.

There are many types of BPE like 500 and 1000. What are the pros and cons and the differences between those?

So we may be talking here about our Icefall recipes that if we look at your screen there it's called lang_bpe_500, lang_bpe_1000, lang_bpe_2000 prepared by our data preparation script by default. We run the BPE algorithm with different numbers of target pieces. We tuned our systems and we tried different numbers but I think in the end we found that 500 worked the best for us. This is for English.

Mostly we just use the 500 but it is kind of depends because someone recently pointed out to me or told me that in a German system they were working on, they sometimes were getting some warnings from Icefall about transcripts that were too long relative to the number of frames. So they were running into a limitation of the system in certain configurations that can only output one symbol per frame. So these long complicated German words had so many letters in them because they were speaking so fast, requiring more than 25 symbols per second. They were running into that limitation so if that happens you would definitely want to increase the BPE size from 500 to 1000 or 5000 because then you're gonna have to output fewer of them per frame.

Why did you say that 500 was the best option for BEP size?

Well we just tuned it for some of our english systems and what we found is that we were getting better WERs from the 500 vocabulary size BPE.

For low resource languages would you pick a certain one?

I think it mostly depends on properties of the language but generally I'd want to pick a small number for a low resource language because if we don't have a lot of training data we're not going to be seeing many instances of any given one right? We don't want any super low count BPE pieces.

Could you walk us through these different documents in lang_bpe_500?

So now firstly, I want to mention that this lang_phone, this is kind of a Kaldi style language model directory. We're doing Kaldi style graphs and now not all of our recipes actually use this. We support various decoding modes in our RNN-T systems. In some decoding modes we don't use any graph or any language model at all and in some decoding modes we do. It doesn't always even help that we do use a language model but because we're trying to develop systems that interact better with language models we still keep this stuff around.

Let's look at the bpe.model first i think that's just a binary file, click on the bpe.model. That's just some binary file. HLG.pt also binary. HLG.pt that's a compiled graph in k2 format.

Let’s go to lexicon_disambig.text.

This looks like just a BPE lexicon. The underscore on the BPE piece means it's the start of the word. Again this is not something that's really necessary to decode with always, because if you have BPEs you don't have to be limited to a vocabulary but if you want to build language models that are based on words you have to do it this way.

You can click on lexicon.txt:

The only difference here is that you don't have those pound sign. One things at the end of the sentences those are to disambiguate in case two different things had the same decomposition into BPE pieces in fact I'm a little bit surprised that we have these pound sign ones in lexicon_disambig.text because the BPE decomposition is deterministic both ways you can't, you shouldn't be able to have two different words mapping to the same BPE.

Can we click again on lexicon_disambig.text? Okay let's look at this so ABBREVIATION versus ABBREVIATIONS.

ABBREVIATION ▁A B B RE V I ATION #1
ABBREVIATIONS ▁A B B RE V I ATION S

Okay I remember now. So these pound signs, one thing that they're not just for where you have two things with the same pronunciation like homophones, they're also for where one word is a prefix of another. So here “ABBREVIATION” is a word but you can also add an s and and to keep the thing determinable easily we need these pound signs for disambiguation.

Let's go to P.arpa

So I think the reason that we called this P.arpa the output format is a format for language models for n-gram language models. This is probably like a trigram language model I would guess. It must be at least bigram because I think the third field on those lines is the back off probability which we wouldn't have if it was a unigram.

What do these numbers mean?

Well I think those are the log of the backup probability the one on the right, the one on the left is actually the language model probability. These I think are uni-gram probabilities so meaning if you don't know the history or if you have a rare word that you've just seen this is the probability and they seem to be sorted by probability. So we've got the more probable word pieces at the top anyway the one on the left is the log of the language probability the one on the right means -these days nobody likes to use- log 10 but that was back. This is a very old format and at that time people still sometimes would use it. I guess it was invented by engineers or something so anyway this is a language model on BPE pieces like i said, there's many different coding methods that we use and then this one is for what you want for an open vocabulary. A simple language model on BPE pieces don't always get great results. Language models don't tend to work so well if the units are too small.

Let's go to P.fst.txt:

P.fst.txt. is created from P.arpa. It's an fst format of the language model. There's a way you could turn language models into finite state transducers. It's all described in the Springer handbook paper by Mehryar Mohri, chapter on fsts. I think the first and second fields are state ids, the next two are symbol numbers in `tokens.txt` and the last one is a log, a negative of a log probability and it's in log base `e` so these are a little bit larger than the ones in P.arpa but there is actually one-to-one correspondence between these things and the ones in P.arpa.

Let's go to tokens.txt.

So these are the BPE pieces that the algorithm created. The first three ones are like special ones that we told it to add. So those wouldn't have been created by the algorithm but they're just put in there by us. They're: blank, start of sentence slash end of sentence, and meaning unknown.

<blk> 0
<sos/eos> 1
<unk> 2

I think we don't always use this <unk> thing. We may have that there for historical reasons.

Let's go to transcript_tokens.txt

This is the training transcripts that have been turned into BPE pieces. So if you look at these you can see that each underscore corresponds to the start of a word. The reason this exists here is that we created this when we wanted to build a language model. We don't actually need this at this point. We could have deleted it but we kept it around so you can see.

Let's go to transcript_words.txt

This is going to be the same thing but in whole words instead of BPE pieces. Everything is a bit old-fashioned because of LibriSpeech. These free audio books they're all like public domain stuff so because of universe copyright law being super long lasting they have to be more than 80 years old or sometimes more than like 100 years old so that's why it's really old-fashioned.

Let's go to words.txt.

Let's go to words.txt. This is just a list of all the words that are present in LibriSpeech training transcripts and each one has a number. These numbers will be present in L.pt and the lexicon. But that won't be visible by humans. I think we haven't written to disc the text for a matter of fact. So that's done.

There used to be HCLG.fst and now we're not seeing the C anymore, where did the C go?

Good question so the C is for context. So back when I was starting out in fact you know for decades we had these context dependent phones. So the idea is that the English language has let's say 40 or 30 phones. Sounds like [s], [k], [u:], [ɑ:] but they're sometimes pronounced differently depending on what the previous and the next phone are. So we used to have this concept called triphones which is a phone together with the left and right context and then we would cluster the triphones. That helped that whole process to decode that we created this thing called C that maps from phones to context dependent phones, but these days when people build neural net based systems they often don't use this context dependency because it's not so helpful.

Did we replace that with P?

No P isn't really a replacement for C. P is just a language model on BPE pieces in this case. This is when you don't want to build a whole HCLG.fst, you just want to have one thing that's just a language model directly on the BPE pieces. I'm not saying that this is necessarily the best way to decode, it's just one thing that we were investigating as one mode that we can decode in.

References:

Speech Recognition with weighted finite-state transducers: https://cs.nyu.edu/~mohri/pub/hbka.pdf
What is HCLG.fst?: https://nadirapovey.blogspot.com/2021/12/what-is-hclgfst.html
K2: Icefall project: https://github.com/k2-fsa/icefall
Decoding graph (HCLG.fst) construction in Kaldi: https://kaldi-asr.org/doc/graph.html