Next-gen Kaldi: training and decoding for the LibriSpeech dataset.

[Done for educational purposes, not for best results.]

All scripts were taken from RESULTS.md: https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md

Training

The original training commands from RESULTS.md:

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./pruned_transducer_stateless5/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless5/exp-M \
  --max-duration 300 \
  --use-fp16 0 \
  --num-encoder-layers 18 \
  --dim-feedforward 1024 \
  --nhead 4 \
  --encoder-dim 256 \
  --decoder-dim 512 \
  --joiner-dim 512

I only have two NVIDIA GeForce RTX 3080 - 10GB GDDR6X (VR-Ready) cards on my Ubuntu machine, so my training command is slightly different from the original one:

./pruned_transducer_stateless5/train.py \
  --world-size 2 \
  --num-epochs 10 \
  --start-epoch 1 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless5/exp-M \
  --max-duration 300 \
  --use-fp16 0 \
  --num-encoder-layers 18 \
  --dim-feedforward 1024 \
  --nhead 4 \
  --encoder-dim 256 \
  --decoder-dim 512 \
  --joiner-dim 512
  • I don’t need export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7". When this variable is not set, all GPUs on the machine are visible to the script, so in my case it uses both cards.
  • --world-size changed from 8 to 2, as I only have 2 GPUs.
  • --num-epochs changed from 30 to 10, as I am doing this training for educational purposes. The results for 30 epochs are given in RESULTS.md.
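
To avoid hard-coding --world-size, the GPU count can be derived from the environment. The sketch below is only an illustration: count_visible_gpus is a hypothetical helper name, not part of icefall, and it assumes nvidia-smi is on the PATH when CUDA_VISIBLE_DEVICES is not set.

```shell
# Hypothetical helper (not part of icefall): print how many GPUs a
# training process would see.
count_visible_gpus() {
  if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
    # Variable unset: every GPU is visible. `nvidia-smi -L` prints one
    # line per GPU, so counting lines gives the number of cards.
    nvidia-smi -L | wc -l
  else
    # Variable set: count the non-empty comma-separated entries.
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
  fi
}

# Possible usage (assumption: the count is passed straight to train.py):
#   ./pruned_transducer_stateless5/train.py \
#     --world-size "$(count_visible_gpus)" \
#     ...
```

On my machine, with the variable unset, this would be expected to report both cards.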

Decoding

  • Original decoding script:
for method in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless5/decode.py \
    --epoch 30 \
    --avg 10 \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method $method \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
done
  • Adjusted decoding script for my model, trained on 2 GPUs for only 10 epochs:
for method in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless5/decode.py \
    --epoch 10 \
    --avg 2 \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method $method \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
done

Results

Results for 10 epochs of training on 2 GPUs. Training took me 3 days on two NVIDIA GeForce RTX 3080 - 10GB GDDR6X (VR-Ready) cards. I recommend trying different values for --avg. The original script averages the last 10 epochs. As I only trained my model for 10 epochs, averaging the last 2 epochs looked best in my case, with --avg 1 giving nearly identical WERs. Modified beam search gave the best results for my model.
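
Trying several --avg values can be scripted. The following is a minimal sketch built from the adjusted decoding command; DRY_RUN, decode_once and sweep_avg are names introduced here for illustration only. With DRY_RUN=1 (the default in this sketch) the commands are printed instead of executed, so the sweep can be checked before it occupies the GPU.

```shell
# Sketch: sweep --avg for a model trained for 10 epochs.
# DRY_RUN, decode_once and sweep_avg are hypothetical names.
DRY_RUN="${DRY_RUN:-1}"

decode_once() {
  decode_avg="$1"
  decode_method="$2"
  # Build the command as positional parameters so quoting is preserved.
  set -- ./pruned_transducer_stateless5/decode.py \
    --epoch 10 \
    --avg "$decode_avg" \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method "$decode_method" \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"   # only show what would be run
  else
    "$@"        # actually run decode.py
  fi
}

sweep_avg() {
  # The --avg values are the ones I tried; adjust as needed.
  for avg in 1 2 4 5; do
    for method in greedy_search modified_beam_search fast_beam_search; do
      decode_once "$avg" "$method"
    done
  done
}

sweep_avg
```

Setting DRY_RUN=0 before running the script would execute the decoding runs one after another.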

  • Fast Beam Search
epoch  avg  max-duration  WER test_clean (%)  WER test_other (%)
9      5    100           3.74                8.96
10     5    600           3.57                8.47
10     4    600           3.52                8.33
10     2    600           3.39                8.11
10     1    600           3.37                8.15
  • Greedy Search
epoch  avg  max-duration  WER test_clean (%)  WER test_other (%)
9      5    100           3.82                9.09
10     5    600           3.61                8.67
10     4    600           3.53                8.50
10     2    600           3.43                8.33
10     1    600           3.43                8.26
  • Modified Beam Search
epoch  avg  max-duration  WER test_clean (%)  WER test_other (%)
9      5    100           3.72                8.97
10     5    600           3.55                8.53
10     4    600           3.46                8.35
10     2    600           3.35                8.11
10     1    600           3.34                8.14
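
Picking the best setting per decoding method can also be done mechanically. The sketch below is an illustration only: best_by_test_other is a hypothetical helper, and the input is the numbers from the tables above retyped as one row per run (method, epoch, avg, max-duration, WER test_clean, WER test_other). It keeps, for each method, the row with the lowest test_other WER.

```shell
# Hypothetical helper: for each method (column 1), keep the row with the
# lowest WER on test_other (column 6).
best_by_test_other() {
  awk '
    !($1 in best) || ($6 + 0) < best[$1] { best[$1] = $6 + 0; row[$1] = $0 }
    END { for (m in row) print row[m] }
  ' | sort
}

best_by_test_other <<'EOF'
fast_beam_search 9 5 100 3.74 8.96
fast_beam_search 10 5 600 3.57 8.47
fast_beam_search 10 4 600 3.52 8.33
fast_beam_search 10 2 600 3.39 8.11
fast_beam_search 10 1 600 3.37 8.15
greedy_search 9 5 100 3.82 9.09
greedy_search 10 5 600 3.61 8.67
greedy_search 10 4 600 3.53 8.50
greedy_search 10 2 600 3.43 8.33
greedy_search 10 1 600 3.43 8.26
modified_beam_search 9 5 100 3.72 8.97
modified_beam_search 10 5 600 3.55 8.53
modified_beam_search 10 4 600 3.46 8.35
modified_beam_search 10 2 600 3.35 8.11
modified_beam_search 10 1 600 3.34 8.14
EOF
```

On these numbers it selects --avg 2 for both beam search methods and --avg 1 for greedy search. Ranking by test_clean instead would only need column 5 in place of column 6.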

If nvidia-smi does not detect the GPU cards, try the commands below: check which NVIDIA driver packages are installed and try installing the driver again.

nvidia-smi
dpkg -l | grep -i nvidia | less
sudo apt-get install nvidia-dkms-510