- Step 0: Make a copy of run.sh
- Step 1: Install flac
- Step 2: Change cmd.sh
- Step 3: Needed if only one GPU is used
- Step 4: Create a screen session to run the script.
- Step 5: Run run_edited.sh
LibriSpeech training needs at least one GPU and about 400 GB of storage space. The tutorial below was created on an AWS instance.
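Before starting, it is worth confirming that the instance can actually see a GPU and has enough free disk space (a quick check, assuming the NVIDIA driver is already installed):
# free disk space on the volume that will hold the data
df -h
# the GPU should show up here if the driver is installed correctly
nvidia-smi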
Step 0: Make a copy of run.sh
Normally I copy run.sh to a new file that I can edit, but you can edit the original run.sh if you prefer.
cp run.sh run_edited.sh
We need to edit the data path to point at the folder where all the downloaded data will go.
nano run_edited.sh
Edit the new version and add a few lines after line 7. Pick any data folder location you like; I chose the one below.
mkdir -p nadira/data
data=nadira/data
Line 17 was data=/export/a15/vpanayotov/data. If you don't have that folder, you can create it, or simply point data at a folder of your own as shown above.
Top few lines of run_edited.sh:
#!/usr/bin/env bash
# Set this to somewhere where you want to put your data, or where
# someone else has already put it. You'll want to change this
# if you're not on the CLSP grid.
# data=/export/a15/vpanayotov/data
mkdir -p nadira/data # <--- added
data=nadira/data # <--- added
# base url for downloads.
data_url=www.openslr.org/resources/12
lm_url=www.openslr.org/resources/11
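If you prefer not to edit the file by hand, a small sed sketch can apply the same change (this assumes the stock data=/export/a15/vpanayotov/data line is still present in run_edited.sh):
# comment out the original data line and add our own data folder, as in the hand edit above
sed -i 's|^data=/export/a15/vpanayotov/data|# data=/export/a15/vpanayotov/data\nmkdir -p nadira/data\ndata=nadira/data|' run_edited.sh
# confirm the change took effect
grep -n "nadira/data" run_edited.sh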
Step 1: Install flac
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5$ sudo apt install flac
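A quick way to confirm flac is installed and on the PATH (flac is used during data preparation to decode the downloaded .flac audio):
flac --version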
Step 2: Change cmd.sh
Comment out the last three lines and add the three lines you see below. Since AWS instances don't come with the qsub (GridEngine) package, we can't use the queue.pl commands.
# you can change cmd.sh depending on what type of queue you are using.
# If you have no queueing system and want to run on a local machine, you
# can change all instances 'queue.pl' to run.pl (but be careful and run
# commands one by one: most recipes will exhaust the memory on your
# machine). queue.pl works with GridEngine (qsub). slurm.pl works
# with slurm. Different queues are configured differently, with different
# queue names and different ways of specifying things like memory;
# to account for these differences you can create and edit the file
# conf/queue.conf to match your queue's configuration. Search for
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.
# commented out the 3 lines below as we don't have qsub on AWS
# export train_cmd="queue.pl --mem 2G"
# export decode_cmd="queue.pl --mem 4G"
# export mkgraph_cmd="queue.pl --mem 8G"
# my instance has 8 vCPUs
export train_cmd="run.pl --max-jobs-run 8"
export decode_cmd="run.pl --max-jobs-run 8"
export mkgraph_cmd="run.pl --max-jobs-run 8"
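A reasonable value for --max-jobs-run is roughly the number of vCPUs on the instance. You can check the core count and available RAM like this (run.pl, unlike queue.pl, does not enforce any memory limit, so keep the warning above in mind):
# number of CPU cores visible to the system
nproc
# total and available memory
free -h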
Step 3: Set exclusive GPU mode and --use-gpu=wait (needed if only one GPU is used)
We need to put the GPU into exclusive compute mode with nvidia-smi; otherwise the training script will assume it can run several jobs on the GPU at once and will run out of GPU memory. My AWS instance has only one GPU.
sudo nvidia-smi -c 3
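You can verify that the compute mode actually changed (note that this setting typically does not survive a reboot, so re-run nvidia-smi -c 3 if the instance restarts):
# should report Exclusive_Process after the command above
nvidia-smi --query-gpu=name,compute_mode --format=csv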
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5$ pico local/chain/run_tdnn.sh
Add --use-gpu=wait \ to the steps/nnet3/chain/train.py call, as shown below:
steps/nnet3/chain/train.py --stage $train_stage \
--cmd "$decode_cmd" \
--use-gpu=wait \
--feat.online-ivector-dir $train_ivector_dir \
--feat.cmvn-opts "--norm-means=false --norm-vars=false" \
--chain.xent-regularize $xent_regularize \
--chain.leaky-hmm-coefficient 0.1 \
--chain.l2-regularize 0.0 \
--chain.apply-deriv-weights false \
--chain.lm-opts="--num-extra-lm-states=2000" \
--egs.dir "$common_egs_dir" \
--egs.stage $get_egs_stage \
--egs.opts "--frames-overlap-per-eg 0 --constrained false" \
--egs.chunk-width $frames_per_eg \
--trainer.dropout-schedule $dropout_schedule \
--trainer.add-option="--optimization.memory-compression-level=2" \
--trainer.num-chunk-per-minibatch 64 \
--trainer.frames-per-iter 2500000 \
--trainer.num-epochs 4 \
--trainer.optimization.num-jobs-initial 3 \
--trainer.optimization.num-jobs-final 16 \
--trainer.optimization.initial-effective-lrate 0.00015 \
--trainer.optimization.final-effective-lrate 0.000015 \
--trainer.max-param-change 2.0 \
--cleanup.remove-egs $remove_egs \
--feat-dir $train_data_dir \
--tree-dir $tree_dir \
--lat-dir $lat_dir \
--dir $dir || exit 1;
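After saving the edit, a quick grep confirms the option is in place:
# the train.py call should now include --use-gpu=wait
grep -n "use-gpu" local/chain/run_tdnn.sh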
If Step 3 is skipped when only one GPU is available, we will get the error below:
ERROR (nnet3-chain-train[5.5.997~1-054af]:AllocateNewRegion():cu-allocator.cc:491) Failed to allocate a memory region of 12582912 bytes. Possibly this is due to sharing the GPU. Try switching the GPUs to exclusive mode (nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like steps/nnet3/chain/train.py. Memory info: free:10M, used:15099M, total:15109M, free/total:0.000711461 CUDA error: 'out of memory'
Step 4: Create a screen session to run the script.
screen -S libri
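Detach from the session with Ctrl-a d so training keeps running after you log out, and reattach later with:
# reattach to the training session
screen -r libri
# list existing sessions if you forget the name
screen -ls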
Step 5: Run run_edited.sh
./run_edited.sh > output.txt
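A slight variation that also captures error output and lets you watch progress from another terminal (or another screen window):
# run inside the screen session; capture both stdout and stderr
./run_edited.sh > output.txt 2>&1
# follow the log from elsewhere
tail -f output.txt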
The following results can be observed after about 10 days of training. My instance has 8 vCPUs and 1 GPU.
Training is complete. The final neural-net model can be found in exp/chain_cleaned/tdnn_1d_sp/final.mdl
and the decoding graph in exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/HCLG.fst
2022-01-09 02:15:39,486 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 496/520 Jobs: 15 Epoch: 3.67/4.0 (91.8% complete) lr: 0.000272
2022-01-09 02:50:23,893 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 497/520 Jobs: 15 Epoch: 3.68/4.0 (92.1% complete) lr: 0.000270
2022-01-09 03:25:41,148 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 498/520 Jobs: 15 Epoch: 3.69/4.0 (92.4% complete) lr: 0.000268
2022-01-09 04:01:32,203 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 499/520 Jobs: 15 Epoch: 3.71/4.0 (92.7% complete) lr: 0.000266
2022-01-09 04:37:38,180 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 500/520 Jobs: 15 Epoch: 3.72/4.0 (93.0% complete) lr: 0.000264
2022-01-09 05:14:40,809 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 501/520 Jobs: 16 Epoch: 3.73/4.0 (93.3% complete) lr: 0.000280
2022-01-09 05:53:18,670 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 502/520 Jobs: 16 Epoch: 3.74/4.0 (93.6% complete) lr: 0.000278
2022-01-09 06:31:05,279 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 503/520 Jobs: 16 Epoch: 3.76/4.0 (93.9% complete) lr: 0.000276
2022-01-09 07:08:46,082 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 504/520 Jobs: 16 Epoch: 3.77/4.0 (94.2% complete) lr: 0.000274
2022-01-09 07:47:07,220 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 505/520 Jobs: 16 Epoch: 3.78/4.0 (94.6% complete) lr: 0.000272
2022-01-09 08:23:57,390 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 506/520 Jobs: 16 Epoch: 3.80/4.0 (94.9% complete) lr: 0.000270
2022-01-09 09:02:17,142 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 507/520 Jobs: 16 Epoch: 3.81/4.0 (95.2% complete) lr: 0.000268
2022-01-09 09:40:52,534 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 508/520 Jobs: 16 Epoch: 3.82/4.0 (95.5% complete) lr: 0.000266
2022-01-09 10:18:47,231 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 509/520 Jobs: 16 Epoch: 3.83/4.0 (95.9% complete) lr: 0.000264
2022-01-09 10:57:05,727 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 510/520 Jobs: 16 Epoch: 3.85/4.0 (96.2% complete) lr: 0.000262
2022-01-09 11:35:18,583 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 511/520 Jobs: 16 Epoch: 3.86/4.0 (96.5% complete) lr: 0.000260
2022-01-09 12:12:56,573 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 512/520 Jobs: 16 Epoch: 3.87/4.0 (96.8% complete) lr: 0.000258
2022-01-09 12:51:06,920 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 513/520 Jobs: 16 Epoch: 3.89/4.0 (97.2% complete) lr: 0.000256
2022-01-09 13:29:19,736 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 514/520 Jobs: 16 Epoch: 3.90/4.0 (97.5% complete) lr: 0.000254
2022-01-09 14:07:52,940 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 515/520 Jobs: 16 Epoch: 3.91/4.0 (97.8% complete) lr: 0.000252
2022-01-09 14:46:16,641 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 516/520 Jobs: 16 Epoch: 3.92/4.0 (98.1% complete) lr: 0.000251
2022-01-09 15:24:13,838 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 517/520 Jobs: 16 Epoch: 3.94/4.0 (98.4% complete) lr: 0.000249
2022-01-09 16:01:52,900 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 518/520 Jobs: 16 Epoch: 3.95/4.0 (98.8% complete) lr: 0.000247
2022-01-09 16:40:28,511 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 519/520 Jobs: 16 Epoch: 3.96/4.0 (99.1% complete) lr: 0.000245
2022-01-09 17:18:47,249 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 520/520 Jobs: 16 Epoch: 3.98/4.0 (99.4% complete) lr: 0.000240
2022-01-09 17:58:31,315 [steps/nnet3/chain/train.py:585 - train - INFO ] Doing final combination to produce final.mdl
2022-01-09 17:58:31,315 [steps/libs/nnet3/train/chain_objf/acoustic_model.py:571 - combine_models - INFO ] Combining set([512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511]) models.
2022-01-09 17:59:36,259 [steps/nnet3/chain/train.py:614 - train - INFO ] Cleaning up the experiment directory exp/chain_cleaned/tdnn_1d_sp
tree-info exp/chain_cleaned/tdnn_1d_sp/tree
tree-info exp/chain_cleaned/tdnn_1d_sp/tree
fstcomposecontext --context-size=2 --central-position=1 --read-disambig-syms=data/lang_test_tgsmall/phones/disambig.int --write-disambig-syms=data/lang_test_tgsmall/tmp/disambig_ilabels_2_1.int data/lang_test_tgsmall/tmp/ilabels_2_1.2242 data/lang_test_tgsmall/tmp/LG.fst
fstisstochastic data/lang_test_tgsmall/tmp/CLG_2_1.fst
make-h-transducer --disambig-syms-out=exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/disambig_tid.int --transition-scale=1.0 data/lang_test_tgsmall/tmp/ilabels_2_1 exp/chain_cleaned/tdnn_1d_sp/tree exp/chain_cleaned/tdnn_1d_sp/final.mdl
fstrmsymbols exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/disambig_tid.int
fstminimizeencoded
fstrmepslocal
fsttablecompose exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/Ha.fst 'fstrmsymbols --remove-arcs=true --apply-to-output=true data/lang_test_tgsmall/oov.int data/lang_test_tgsmall/tmp/CLG_2_1.fst|'
fstdeterminizestar --use-log=true
fstrmsymbols --remove-arcs=true --apply-to-output=true data/lang_test_tgsmall/oov.int data/lang_test_tgsmall/tmp/CLG_2_1.fst
fstisstochastic exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/HCLGa.fst
add-self-loops --self-loop-scale=1.0 --reorder=true exp/chain_cleaned/tdnn_1d_sp/final.mdl exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/HCLGa.fst
fstisstochastic exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/HCLG.fst
fstrmsymbols data/lang_test_tgmed/phones/disambig.int
fstdeterminizestar
fstrmsymbols data/lang_test_tgmed/phones/disambig.int
fstdeterminizestar
fstdeterminizestar
fstrmsymbols data/lang_test_tgmed/phones/disambig.int
fstrmsymbols data/lang_test_tgmed/phones/disambig.int
fstdeterminizestar
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5$
graph folder:
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5/exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall$ ls
HCLG.fst disambig_tid.int num_pdfs phones phones.txt words.txt
ubuntu@ip-172-31-6-144:~/kaldi/egs/librispeech/s5/exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall$
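As a final sanity check (a sketch, assuming the Kaldi and OpenFst binaries are on your PATH, e.g. after sourcing path.sh in the s5 directory), you can inspect the trained model and graph:
. ./path.sh
# summary of the final chain model (components, parameter counts, ...)
nnet3-am-info exp/chain_cleaned/tdnn_1d_sp/final.mdl | head
# basic statistics of the decoding graph
fstinfo exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/HCLG.fst | head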
Comments:
https://github.com/npovey/speech/discussions/categories/announcements