Next-gen Kaldi: training and decoding on the LibriSpeech dataset.

[Done for educational purposes, not for best results.]

Training

The original training commands from RESULTS.md:

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./pruned_transducer_stateless5/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless5/exp-M \
  --max-duration 300 \
  --use-fp16 0 \
  --num-encoder-layers 18 \
  --dim-feedforward 1024 \
  --nhead 4 \
  --encoder-dim 256 \
  --decoder-dim 512 \
  --joiner-dim 512

I only have two NVIDIA GeForce RTX 3080 (10 GB GDDR6X) cards on my Ubuntu machine, so my training script differs slightly from the original one:

./pruned_transducer_stateless5/train.py \
  --world-size 2 \
  --num-epochs 10 \
  --start-epoch 1 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless5/exp-M \
  --max-duration 300 \
  --use-fp16 0 \
  --num-encoder-layers 18 \
  --dim-feedforward 1024 \
  --nhead 4 \
  --encoder-dim 256 \
  --decoder-dim 512 \
  --joiner-dim 512
  • I don’t need export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7": when this variable is not set, the script uses all GPUs on the machine, which in my case means both cards.
  • --world-size changed from 8 to 2, since I only have 2 GPUs.
  • --num-epochs changed from 30 to 10, since I am training for educational purposes only. The results for the full 30 epochs are given in the RESULTS.md file.
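Since --world-size should match the number of GPUs, it can also be derived automatically. A minimal sketch, assuming nvidia-smi -L prints one line per GPU (the fallback to 1 is my own illustrative choice, not part of the recipe):

```shell
# Derive a --world-size value from the number of visible GPUs.
# Falls back to 1 when nvidia-smi is unavailable (assumption for the sketch).
if command -v nvidia-smi >/dev/null 2>&1; then
  NUM_GPUS=$(nvidia-smi -L | wc -l)
else
  NUM_GPUS=1
fi
echo "suggested --world-size: ${NUM_GPUS}"
```

On my machine this prints 2, matching the --world-size 2 used above.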

Decoding

  • Original decoding script:
for method in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless5/decode.py \
    --epoch 30 \
    --avg 10 \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method $method \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
done

  • Adjusted decoding script for my 10-epoch run (--epoch and --avg changed accordingly):
for method in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless5/decode.py \
    --epoch 10 \
    --avg 2 \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method $method \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
done

Results

Results for 10 epochs of training on 2 GPUs. Training took 3 days on the two NVIDIA GeForce RTX 3080 (10 GB GDDR6X) cards. It is recommended to try different values for the --avg option: the default script averages the last 10 epochs, but since I trained for only 10 epochs, averaging the last 2 epochs worked best in my case. Modified beam search gave the best results for my model.
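Trying several --avg values is just a small sweep over the decode command. A dry-run sketch that only prints the commands; swap echo for the real invocation with the full set of flags shown in the decoding section above:

```shell
# Dry-run sketch: one decode invocation per candidate --avg value.
# The flag values mirror my adjusted decode command above.
for avg in 1 2 3 4 5; do
  echo "./pruned_transducer_stateless5/decode.py --epoch 10 --avg $avg --use-averaged-model True"
done
```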

  • Fast Beam Search

| epochs | avg | max-duration | WER test_clean | WER test_other |
|--------|-----|--------------|----------------|----------------|
| 9      | 5   | 100          | 3.74           | 8.96           |
| 10     | 5   | 600          | 3.57           | 8.47           |
| 10     | 4   | 600          | 3.52           | 8.33           |
| 10     | 2   | 600          | 3.39           | 8.11           |
| 10     | 1   | 600          | 3.37           | 8.15           |
  • Greedy Search

| epochs | avg | max-duration | WER test_clean | WER test_other |
|--------|-----|--------------|----------------|----------------|
| 9      | 5   | 100          | 3.82           | 9.09           |
| 10     | 5   | 600          | 3.61           | 8.67           |
| 10     | 4   | 600          | 3.53           | 8.50           |
| 10     | 2   | 600          | 3.43           | 8.33           |
| 10     | 1   | 600          | 3.43           | 8.26           |
  • Modified Beam Search

| epochs | avg | max-duration | WER test_clean | WER test_other |
|--------|-----|--------------|----------------|----------------|
| 9      | 5   | 100          | 3.72           | 8.97           |
| 10     | 5   | 600          | 3.55           | 8.53           |
| 10     | 4   | 600          | 3.46           | 8.35           |
| 10     | 2   | 600          | 3.35           | 8.11           |
| 10     | 1   | 600          | 3.34           | 8.14           |
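The best configuration can be picked from such a table with standard shell tools. A sketch using the modified beam search numbers above (the rows are inlined from that table; fields are epochs, avg, max-duration, WER test_clean, WER test_other):

```shell
# Sort the modified beam search rows numerically by the test_clean WER
# (field 4) and keep the best one.
printf '%s\n' \
  "9 5 100 3.72 8.97" \
  "10 5 600 3.55 8.53" \
  "10 4 600 3.46 8.35" \
  "10 2 600 3.35 8.11" \
  "10 1 600 3.34 8.14" \
| sort -k4,4n | head -n 1
# → 10 1 600 3.34 8.14
```

Sorting on field 5 instead (sort -k5,5n) picks the best test_other row, which is the avg=2 configuration.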

If the nvidia-smi command does not detect your GPU cards, try the commands below: check which NVIDIA driver packages are installed and reinstall the driver if necessary.

nvidia-smi                            # should list the GPUs
dpkg -l | grep -i nvidia | less       # inspect installed NVIDIA packages
sudo apt-get install nvidia-dkms-510  # reinstall the DKMS driver package
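If the driver still is not picked up after reinstalling, checking whether the kernel module is actually loaded can narrow things down. A sketch for an apt-based Ubuntu system (the fallback messages are illustrative, not real tool output):

```shell
# Is the nvidia kernel module loaded?
lsmod | grep -i nvidia || echo "nvidia kernel module not loaded"
# Are any nvidia packages installed at all?
dpkg -l 2>/dev/null | grep -i nvidia || echo "no nvidia packages installed"
```

A reboot is usually needed after installing the DKMS package so the module gets loaded.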