[Done for educational purposes, not for best results.]
All scripts were taken from RESULTS.md: https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md
Training
The original training commands from RESULTS.md:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./pruned_transducer_stateless5/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless5/exp-M \
  --max-duration 300 \
  --use-fp16 0 \
  --num-encoder-layers 18 \
  --dim-feedforward 1024 \
  --nhead 4 \
  --encoder-dim 256 \
  --decoder-dim 512 \
  --joiner-dim 512
```
I only have two NVIDIA GeForce RTX 3080 (10 GB GDDR6X, VR-Ready) cards in my Ubuntu machine, so my training command differs slightly from the original one:
```bash
./pruned_transducer_stateless5/train.py \
  --world-size 2 \
  --num-epochs 10 \
  --start-epoch 1 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless5/exp-M \
  --max-duration 300 \
  --use-fp16 0 \
  --num-encoder-layers 18 \
  --dim-feedforward 1024 \
  --nhead 4 \
  --encoder-dim 256 \
  --decoder-dim 512 \
  --joiner-dim 512
```
- I don't need `export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"`. When this variable is not set, the script uses all GPUs on the machine, which in my case means both cards (see the sketch after this list for setting it explicitly).
- `--world-size 2`: changed from 8 to 2 because I only have 2 GPUs.
- `--num-epochs 10`: changed from 30 to 10 because I am doing this training for educational purposes. The results for 30 epochs are given in the RESULTS.md file.
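If you ever need to hide some GPUs from the script instead of using all of them, `CUDA_VISIBLE_DEVICES` can be set explicitly. Below is a minimal sketch; the GPU ids are an assumption and should be taken from `nvidia-smi` on your machine:

```bash
# Sketch: expose only the first two cards to the training script.
# The ids "0,1" are an assumption; check `nvidia-smi` for the actual ids.
export CUDA_VISIBLE_DEVICES="0,1"
# Then launch ./pruned_transducer_stateless5/train.py exactly as above,
# keeping --world-size equal to the number of visible GPUs (2 here).
```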
Decoding
- Original decoding script:
```bash
for method in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless5/decode.py \
    --epoch 30 \
    --avg 10 \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method $method \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
done
```
- Decoding script adjusted for my model, which was trained for only 10 epochs (so I average the last 2 checkpoints instead of 10):
```bash
for method in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless5/decode.py \
    --epoch 10 \
    --avg 2 \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method $method \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
done
```
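Before decoding, it can be worth checking that the checkpoints to be averaged are actually present in the experiment directory. A minimal sketch, assuming icefall's usual `epoch-N.pt` checkpoint naming:

```bash
# Sketch: list the saved epoch checkpoints (assumes icefall's epoch-N.pt naming).
# For --epoch 10 with --avg 2, the checkpoints around epoch 10 need to be present.
ls -lh pruned_transducer_stateless5/exp-M/epoch-*.pt
```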
Results
Results for 10 epochs of training on 2 GPUs. Training took 3 days on the two NVIDIA GeForce RTX 3080 (10 GB GDDR6X, VR-Ready) cards. It is recommended to try different values for the `--avg` option; the default script averages the last 10 epochs. Since I only trained for 10 epochs, averaging the last 2 epochs turned out to be the best in my case, and modified beam search gave the best results for my model.
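To find a good value, the sweep over `--avg` can be scripted. This is only a sketch based on the flags used above (epoch 10, modified beam search, the exp-M directory); adjust it to your own run:

```bash
# Sketch: decode with several --avg values to pick the best checkpoint average.
# Reuses the model configuration and flags from the decoding command above.
for avg in 1 2 4 5; do
  ./pruned_transducer_stateless5/decode.py \
    --epoch 10 \
    --avg $avg \
    --exp-dir ./pruned_transducer_stateless5/exp-M \
    --max-duration 600 \
    --decoding-method modified_beam_search \
    --max-sym-per-frame 1 \
    --num-encoder-layers 18 \
    --dim-feedforward 1024 \
    --nhead 4 \
    --encoder-dim 256 \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --use-averaged-model True
done
```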
- Fast Beam Search

| epoch | avg | max-duration | test-clean WER (%) | test-other WER (%) |
|-------|-----|--------------|--------------------|--------------------|
| 9 | 5 | 100 | 3.74 | 8.96 |
| 10 | 5 | 600 | 3.57 | 8.47 |
| 10 | 4 | 600 | 3.52 | 8.33 |
| 10 | 2 | 600 | 3.39 | 8.11 |
| 10 | 1 | 600 | 3.37 | 8.15 |
- Greedy Search

| epoch | avg | max-duration | test-clean WER (%) | test-other WER (%) |
|-------|-----|--------------|--------------------|--------------------|
| 9 | 5 | 100 | 3.82 | 9.09 |
| 10 | 5 | 600 | 3.61 | 8.67 |
| 10 | 4 | 600 | 3.53 | 8.50 |
| 10 | 2 | 600 | 3.43 | 8.33 |
| 10 | 1 | 600 | 3.43 | 8.26 |
- Modified Beam Search

| epoch | avg | max-duration | test-clean WER (%) | test-other WER (%) |
|-------|-----|--------------|--------------------|--------------------|
| 9 | 5 | 100 | 3.72 | 8.97 |
| 10 | 5 | 600 | 3.55 | 8.53 |
| 10 | 4 | 600 | 3.46 | 8.35 |
| 10 | 2 | 600 | 3.35 | 8.11 |
| 10 | 1 | 600 | 3.34 | 8.14 |
Try the commands below if `nvidia-smi` does not detect your GPU cards: check which NVIDIA driver packages are installed and reinstall the driver if necessary (the DKMS package version, 510 here, depends on your driver series).

```bash
nvidia-smi
dpkg -l | grep -i nvidia | less
sudo apt-get install nvidia-dkms-510
```
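Once the driver is working, it also helps to confirm that the Python environment used for icefall can see both cards. A minimal sanity-check sketch:

```bash
# Sketch: confirm that PyTorch in the training environment sees the GPUs.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# On my machine this should print "True 2" (one count per RTX 3080).
```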