To prepare the LibriSpeech dataset for the icefall recipe, run the prepare.sh script before training the model. The script works through a series of numbered stages (downloads, manifests, fbank features, language directories, and decoding graphs), all labeled in the log below. The egs/librispeech/ASR recipe directory contains the following:
np@np-INTEL:/mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR$ ls
conformer_ctc prepare.sh pruned_transducer_stateless6 transducer
conformer_mmi pruned_stateless_emformer_rnnt2 README.md transducer_lstm
conv_emformer_transducer_stateless pruned_transducer_stateless RESULTS-100hours.md transducer_stateless
distillation_with_hubert.sh pruned_transducer_stateless2 RESULTS.md transducer_stateless2
local pruned_transducer_stateless3 shared transducer_stateless_multi_datasets
medium_librispeech.sh pruned_transducer_stateless4 streaming_conformer_ctc
prepare_giga_speech.sh pruned_transducer_stateless5 tdnn_lstm_ctc
Running prepare.sh produces the following output:
np@np-INTEL:/mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR$ ./prepare.sh
2022-07-10 00:48:42 (prepare.sh:60:main) dl_dir: /mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR/download
2022-07-10 00:48:42 (prepare.sh:63:main) Stage -1: Download LM
2022-07-10 00:48:42,820 INFO [download_lm.py:97] out_dir: /mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR/download/lm
Downloading /mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR/download/lm/3-gram.pruned.1e-7.arpa.gz: 32.5MB [00:04, 7.83MB/s]
Downloading /mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR/download/lm/4-gram.arpa.gz: 1.26GB [02:04, 10.9MB/s]
Downloading /mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR/download/lm/librispeech-vocab.txt: 1.66MB [00:02, 826kB/s]
Downloading /mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR/download/lm/librispeech-lexicon.txt: 5.37MB [00:02, 2.32MB/s]
Downloading /mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR/download/lm/librispeech-lm-norm.txt.gz: 1.40GB [02:48, 8.96MB/s]
Downloading LibriSpeech LM files: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [05:45<00:00, 69.11s/it]
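Stage -1 only fetches the language-model artifacts into dl_dir/lm. A minimal sketch for sanity-checking the downloads from Python (the paths come from the log above; run it from the ASR directory after the stage finishes):

import os

lm_dir = "download/lm"  # matches the dl_dir/lm path in the log above
# List the downloaded LM files with their sizes
for name in sorted(os.listdir(lm_dir)):
    path = os.path.join(lm_dir, name)
    print(f"{name}: {os.path.getsize(path) / 1e6:.1f} MB")
# Peek at the first entries of the pronunciation lexicon
with open(os.path.join(lm_dir, "librispeech-lexicon.txt")) as f:
    for _ in range(3):
        print(f.readline().rstrip())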
2022-07-10 00:54:28 (prepare.sh:72:main) Stage 0: Download data
Downloading LibriSpeech parts: 0%| | 0/7 [00:00<?, ?it/s]2022-07-10 00:54:29,162 INFO [librispeech.py:67] Processing split: dev-clean
Downloading dev-clean.tar.gz: 322MB [00:29, 11.6MB/s]
Downloading LibriSpeech parts: 14%|████████████▋ | 1/7 [00:30<03:02, 30.37s/it]2022-07-10 00:54:59,528 INFO [librispeech.py:67] Processing split: dev-other
Downloading dev-other.tar.gz: 300MB [00:19, 16.0MB/s]
Downloading LibriSpeech parts: 29%|█████████████████████████▍ | 2/7 [00:50<02:02, 24.47s/it]2022-07-10 00:55:19,877 INFO [librispeech.py:67] Processing split: test-clean
Downloading test-clean.tar.gz: 331MB [00:35, 9.64MB/s]
Downloading LibriSpeech parts: 43%|██████████████████████████████████████▏ | 3/7 [01:27<02:01, 30.26s/it]2022-07-10 00:55:57,015 INFO [librispeech.py:67] Processing split: test-other
Downloading test-other.tar.gz: 314MB [00:19, 16.7MB/s]
Downloading LibriSpeech parts: 57%|██████████████████████████████████████████████████▊ | 4/7 [01:48<01:19, 26.58s/it]2022-07-10 00:56:17,970 INFO [librispeech.py:67] Processing split: train-clean-100
Downloading train-clean-100.tar.gz: 5.95GB [06:00, 17.7MB/s]
Downloading LibriSpeech parts: 71%|██████████████████████████████████████████████████████████████▊ | 5/7 [08:07<05:07, 153.70s/it]2022-07-10 01:02:37,055 INFO [librispeech.py:67] Processing split: train-clean-360
Downloading train-clean-360.tar.gz: 21.5GB [22:24, 17.1MB/s]
Downloading LibriSpeech parts: 86%|███████████████████████████████████████████████████████████████████████████▍ | 6/7 [32:34<10:00, 600.06s/it]2022-07-10 01:27:03,584 INFO [librispeech.py:67] Processing split: train-other-500
Downloading train-other-500.tar.gz: 28.5GB [48:32, 10.5MB/s]
Downloading LibriSpeech parts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 7/7 [1:24:01<00:00, 720.22s/it]
Downloading musan.tar.gz: 10.3GB [14:34, 12.7MB/s]
2022-07-10 02:34:20 (prepare.sh:94:main) Stage 1: Prepare LibriSpeech manifest
Dataset parts: 0%| | 0/7 [00:00<?, ?it/s]2022-07-10 02:34:20,692 INFO [librispeech.py:161] Processing LibriSpeech subset: test-other
Dataset parts: 14%|███████████████████▎ | 1/7 [00:10<01:04, 10.73s/it]2022-07-10 02:34:31,418 INFO [librispeech.py:161] Processing LibriSpeech subset: dev-other
Dataset parts: 29%|██████████████████████████████████████▌ | 2/7 [00:18<00:46, 9.21s/it]2022-07-10 02:34:39,574 INFO [librispeech.py:161] Processing LibriSpeech subset: train-other-500
Dataset parts: 43%|█████████████████████████████████████████████████████████▊ | 3/7 [03:43<06:33, 98.37s/it]2022-07-10 02:38:04,033 INFO [librispeech.py:161] Processing LibriSpeech subset: train-clean-360
Dataset parts: 57%|████████████████████████████████████████████████████████████████████████████▌ | 4/7 [09:07<09:22, 187.36s/it]2022-07-10 02:43:27,810 INFO [librispeech.py:161] Processing LibriSpeech subset: dev-clean
Dataset parts: 71%|███████████████████████████████████████████████████████████████████████████████████████████████▋ | 5/7 [09:12<04:03, 121.76s/it]2022-07-10 02:43:33,256 INFO [librispeech.py:161] Processing LibriSpeech subset: test-clean
Dataset parts: 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 6/7 [09:17<01:22, 82.18s/it]2022-07-10 02:43:38,604 INFO [librispeech.py:161] Processing LibriSpeech subset: train-clean-100
Dataset parts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [10:44<00:00, 92.04s/it]
2022-07-10 02:45:05 (prepare.sh:105:main) Stage 2: Prepare musan manifest
2022-07-10 02:45:06,001 WARNING [qa.py:115] There are 15 recordings that do not have any corresponding supervisions in the SupervisionSet.
2022-07-10 02:45:06 (prepare.sh:116:main) Stage 3: Compute fbank for librispeech
2022-07-10 02:45:16,626 INFO [compute_fbank_librispeech.py:77] Processing dev-clean
Extracting and storing features (chunks progress): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:04<00:00, 3.71it/s]
2022-07-10 02:45:21,218 INFO [compute_fbank_librispeech.py:77] Processing dev-other
Extracting and storing features (chunks progress): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:04<00:00, 3.73it/s]
2022-07-10 02:45:25,660 INFO [compute_fbank_librispeech.py:77] Processing test-clean
Extracting and storing features (chunks progress): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:04<00:00, 3.73it/s]
2022-07-10 02:45:30,035 INFO [compute_fbank_librispeech.py:77] Processing test-other
Extracting and storing features (chunks progress): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:03<00:00, 3.81it/s]
2022-07-10 02:45:34,355 INFO [compute_fbank_librispeech.py:77] Processing train-clean-100
Extracting and storing features (chunks progress): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [04:18<00:00, 17.24s/it]
2022-07-10 02:50:00,903 INFO [compute_fbank_librispeech.py:77] Processing train-clean-360
Extracting and storing features (chunks progress): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [20:39<00:00, 82.62s/it]
2022-07-10 03:11:07,929 INFO [compute_fbank_librispeech.py:77] Processing train-other-500
Extracting and storing features (chunks progress): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [35:42<00:00, 142.81s/it]
2022-07-10 03:47:32 (prepare.sh:124:main) Validating data/fbank for LibriSpeech
2022-07-10 03:47:33,046 INFO [validate_manifest.py:76] Validating data/fbank/librispeech_cuts_train-clean-100.jsonl.gz
2022-07-10 03:47:35,960 INFO [validate_manifest.py:76] Validating data/fbank/librispeech_cuts_train-clean-360.jsonl.gz
2022-07-10 03:47:50,841 INFO [validate_manifest.py:76] Validating data/fbank/librispeech_cuts_train-other-500.jsonl.gz
2022-07-10 03:48:10,677 INFO [validate_manifest.py:76] Validating data/fbank/librispeech_cuts_test-clean.jsonl.gz
2022-07-10 03:48:11,832 INFO [validate_manifest.py:76] Validating data/fbank/librispeech_cuts_test-other.jsonl.gz
2022-07-10 03:48:12,921 INFO [validate_manifest.py:76] Validating data/fbank/librispeech_cuts_dev-clean.jsonl.gz
2022-07-10 03:48:13,820 INFO [validate_manifest.py:76] Validating data/fbank/librispeech_cuts_dev-other.jsonl.gz
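The files validated above are lhotse cut manifests with the computed fbank features attached. A minimal sketch for inspecting one of them, assuming lhotse is importable in the same environment prepare.sh used:

from lhotse import CutSet

# Load one of the manifests validated above
cuts = CutSet.from_file("data/fbank/librispeech_cuts_dev-clean.jsonl.gz")
cut = next(iter(cuts))
print(cut.id, cut.duration)
feats = cut.load_features()  # (num_frames, num_mel_bins) fbank matrix
print(feats.shape)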
2022-07-10 03:48:14 (prepare.sh:143:main) Stage 4: Compute fbank for musan
2022-07-10 03:48:15,653 INFO [compute_fbank_musan.py:76] Extracting features for Musan
/home/np/anaconda3/lib/python3.8/site-packages/lhotse/lazy.py:357: UserWarning: A lambda was passed to LazyFilter: it may prevent you from forking this process. If you experience issues
with num_workers > 0 in torch.utils.data.DataLoader, try passing a regular function instead.
warnings.warn(
Extracting and storing features (chunks progress): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [04:22<00:00, 17.47s/it]
2022-07-10 03:52:40 (prepare.sh:152:main) Stage 5: Prepare phone based lang
2022-07-10 03:52:57 (prepare.sh:160:main) Stage 6: Prepare BPE based lang
2022-07-10 03:52:57 (prepare.sh:170:main) Generate data for BPE training
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: data/lang_bpe_5000/transcript_words.txt
input_format:
model_prefix: data/lang_bpe_5000/unigram_5000
model_type: UNIGRAM
vocab_size: 5000
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 100000000
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
user_defined_symbols: <blk>
user_defined_symbols: <sos/eos>
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 2
bos_id: -1
eos_id: -1
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: data/lang_bpe_5000/transcript_words.txt
trainer_interface.cc(385) LOG(INFO) Loaded all 281241 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <blk>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <sos/eos>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=50063453
trainer_interface.cc(487) LOG(INFO) Alphabet size=28
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 281241 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 217434 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 281241
trainer_interface.cc(537) LOG(INFO) Done! 89114
unigram_model_trainer.cc(489) LOG(INFO) Using 89114 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=78253 obj=9.17714 num_tokens=156548 num_tokens/piece=2.00054
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=56364 obj=7.27758 num_tokens=156323 num_tokens/piece=2.77345
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=42272 obj=7.21531 num_tokens=167392 num_tokens/piece=3.95988
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=42255 obj=7.20874 num_tokens=167370 num_tokens/piece=3.96095
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=31690 obj=7.24601 num_tokens=188741 num_tokens/piece=5.95585
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=31690 obj=7.23769 num_tokens=188708 num_tokens/piece=5.95481
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=23767 obj=7.3081 num_tokens=210750 num_tokens/piece=8.86734
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=23767 obj=7.29391 num_tokens=210740 num_tokens/piece=8.86692
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=17825 obj=7.39974 num_tokens=230364 num_tokens/piece=12.9236
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=17825 obj=7.37937 num_tokens=230349 num_tokens/piece=12.9228
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=13368 obj=7.52086 num_tokens=248609 num_tokens/piece=18.5973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=13368 obj=7.49488 num_tokens=248623 num_tokens/piece=18.5984
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=10026 obj=7.66953 num_tokens=265666 num_tokens/piece=26.4977
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=10026 obj=7.63806 num_tokens=265674 num_tokens/piece=26.4985
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=7519 obj=7.84159 num_tokens=281887 num_tokens/piece=37.49
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=7519 obj=7.80538 num_tokens=281942 num_tokens/piece=37.4973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=5639 obj=8.04293 num_tokens=296474 num_tokens/piece=52.5756
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=5639 obj=8.00099 num_tokens=296503 num_tokens/piece=52.5808
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=5500 obj=8.02083 num_tokens=297765 num_tokens/piece=54.1391
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=5500 obj=8.01652 num_tokens=297790 num_tokens/piece=54.1436
trainer_interface.cc(615) LOG(INFO) Saving model: data/lang_bpe_5000/unigram_5000.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: data/lang_bpe_5000/unigram_5000.vocab
2022-07-10 03:54:12 (prepare.sh:191:main) Validating data/lang_bpe_5000/lexicon.txt
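Each pass of this stage trains a SentencePiece unigram model on transcript_words.txt, with <blk> and <sos/eos> registered as user-defined symbols, and saves the model next to its vocab file. A minimal sketch for loading the 5000-piece model saved above and tokenizing a transcript (LibriSpeech transcripts are uppercase):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("data/lang_bpe_5000/unigram_5000.model")
print(sp.vocab_size())                          # 5000
print(sp.encode("HELLO WORLD", out_type=str))   # subword pieces
print(sp.encode("HELLO WORLD"))                 # corresponding piece ids

The same loop then repeats for the 2000, 1000, and 500 vocab sizes below.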
2022-07-10 03:54:13 (prepare.sh:170:main) Generate data for BPE training
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: data/lang_bpe_2000/transcript_words.txt
input_format:
model_prefix: data/lang_bpe_2000/unigram_2000
model_type: UNIGRAM
vocab_size: 2000
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 100000000
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
user_defined_symbols: <blk>
user_defined_symbols: <sos/eos>
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 2
bos_id: -1
eos_id: -1
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: data/lang_bpe_2000/transcript_words.txt
trainer_interface.cc(385) LOG(INFO) Loaded all 281241 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <blk>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <sos/eos>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=50063453
trainer_interface.cc(487) LOG(INFO) Alphabet size=28
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 281241 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 217434 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 281241
trainer_interface.cc(537) LOG(INFO) Done! 89114
unigram_model_trainer.cc(489) LOG(INFO) Using 89114 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=78253 obj=9.17714 num_tokens=156548 num_tokens/piece=2.00054
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=56364 obj=7.27758 num_tokens=156323 num_tokens/piece=2.77345
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=42272 obj=7.21531 num_tokens=167392 num_tokens/piece=3.95988
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=42255 obj=7.20874 num_tokens=167370 num_tokens/piece=3.96095
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=31690 obj=7.24601 num_tokens=188741 num_tokens/piece=5.95585
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=31690 obj=7.23769 num_tokens=188708 num_tokens/piece=5.95481
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=23767 obj=7.3081 num_tokens=210750 num_tokens/piece=8.86734
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=23767 obj=7.29391 num_tokens=210740 num_tokens/piece=8.86692
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=17825 obj=7.39974 num_tokens=230364 num_tokens/piece=12.9236
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=17825 obj=7.37937 num_tokens=230349 num_tokens/piece=12.9228
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=13368 obj=7.52086 num_tokens=248609 num_tokens/piece=18.5973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=13368 obj=7.49488 num_tokens=248623 num_tokens/piece=18.5984
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=10026 obj=7.66953 num_tokens=265666 num_tokens/piece=26.4977
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=10026 obj=7.63806 num_tokens=265674 num_tokens/piece=26.4985
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=7519 obj=7.84159 num_tokens=281887 num_tokens/piece=37.49
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=7519 obj=7.80538 num_tokens=281942 num_tokens/piece=37.4973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=5639 obj=8.04293 num_tokens=296474 num_tokens/piece=52.5756
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=5639 obj=8.00099 num_tokens=296503 num_tokens/piece=52.5808
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=4229 obj=8.27122 num_tokens=310374 num_tokens/piece=73.3918
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=4229 obj=8.22217 num_tokens=310460 num_tokens/piece=73.4122
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=3171 obj=8.52656 num_tokens=323992 num_tokens/piece=102.173
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=3171 obj=8.46915 num_tokens=324023 num_tokens/piece=102.183
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2378 obj=8.81005 num_tokens=338454 num_tokens/piece=142.327
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2378 obj=8.74857 num_tokens=338467 num_tokens/piece=142.333
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2200 obj=8.83579 num_tokens=343377 num_tokens/piece=156.08
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2200 obj=8.81876 num_tokens=343409 num_tokens/piece=156.095
trainer_interface.cc(615) LOG(INFO) Saving model: data/lang_bpe_2000/unigram_2000.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: data/lang_bpe_2000/unigram_2000.vocab
2022-07-10 03:54:43 (prepare.sh:191:main) Validating data/lang_bpe_2000/lexicon.txt
2022-07-10 03:54:44 (prepare.sh:170:main) Generate data for BPE training
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: data/lang_bpe_1000/transcript_words.txt
input_format:
model_prefix: data/lang_bpe_1000/unigram_1000
model_type: UNIGRAM
vocab_size: 1000
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 100000000
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
user_defined_symbols: <blk>
user_defined_symbols: <sos/eos>
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 2
bos_id: -1
eos_id: -1
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: data/lang_bpe_1000/transcript_words.txt
trainer_interface.cc(385) LOG(INFO) Loaded all 281241 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <blk>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <sos/eos>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=50063453
trainer_interface.cc(487) LOG(INFO) Alphabet size=28
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 281241 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 217434 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 281241
trainer_interface.cc(537) LOG(INFO) Done! 89114
unigram_model_trainer.cc(489) LOG(INFO) Using 89114 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=78253 obj=9.17714 num_tokens=156548 num_tokens/piece=2.00054
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=56364 obj=7.27758 num_tokens=156323 num_tokens/piece=2.77345
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=42272 obj=7.21531 num_tokens=167392 num_tokens/piece=3.95988
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=42255 obj=7.20874 num_tokens=167370 num_tokens/piece=3.96095
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=31690 obj=7.24601 num_tokens=188741 num_tokens/piece=5.95585
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=31690 obj=7.23769 num_tokens=188708 num_tokens/piece=5.95481
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=23767 obj=7.3081 num_tokens=210750 num_tokens/piece=8.86734
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=23767 obj=7.29391 num_tokens=210740 num_tokens/piece=8.86692
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=17825 obj=7.39974 num_tokens=230364 num_tokens/piece=12.9236
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=17825 obj=7.37937 num_tokens=230349 num_tokens/piece=12.9228
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=13368 obj=7.52086 num_tokens=248609 num_tokens/piece=18.5973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=13368 obj=7.49488 num_tokens=248623 num_tokens/piece=18.5984
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=10026 obj=7.66953 num_tokens=265666 num_tokens/piece=26.4977
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=10026 obj=7.63806 num_tokens=265674 num_tokens/piece=26.4985
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=7519 obj=7.84159 num_tokens=281887 num_tokens/piece=37.49
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=7519 obj=7.80538 num_tokens=281942 num_tokens/piece=37.4973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=5639 obj=8.04293 num_tokens=296474 num_tokens/piece=52.5756
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=5639 obj=8.00099 num_tokens=296503 num_tokens/piece=52.5808
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=4229 obj=8.27122 num_tokens=310374 num_tokens/piece=73.3918
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=4229 obj=8.22217 num_tokens=310460 num_tokens/piece=73.4122
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=3171 obj=8.52656 num_tokens=323992 num_tokens/piece=102.173
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=3171 obj=8.46915 num_tokens=324023 num_tokens/piece=102.183
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2378 obj=8.81005 num_tokens=338454 num_tokens/piece=142.327
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2378 obj=8.74857 num_tokens=338467 num_tokens/piece=142.333
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=1783 obj=9.10926 num_tokens=353154 num_tokens/piece=198.067
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=1783 obj=9.04018 num_tokens=353178 num_tokens/piece=198.081
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=1337 obj=9.42776 num_tokens=367628 num_tokens/piece=274.965
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=1337 obj=9.35333 num_tokens=367718 num_tokens/piece=275.032
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=1100 obj=9.62249 num_tokens=379075 num_tokens/piece=344.614
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=1100 obj=9.56795 num_tokens=379135 num_tokens/piece=344.668
trainer_interface.cc(615) LOG(INFO) Saving model: data/lang_bpe_1000/unigram_1000.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: data/lang_bpe_1000/unigram_1000.vocab
2022-07-10 03:55:18 (prepare.sh:191:main) Validating data/lang_bpe_1000/lexicon.txt
2022-07-10 03:55:19 (prepare.sh:170:main) Generate data for BPE training
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: data/lang_bpe_500/transcript_words.txt
input_format:
model_prefix: data/lang_bpe_500/unigram_500
model_type: UNIGRAM
vocab_size: 500
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 100000000
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
user_defined_symbols: <blk>
user_defined_symbols: <sos/eos>
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 2
bos_id: -1
eos_id: -1
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: data/lang_bpe_500/transcript_words.txt
trainer_interface.cc(385) LOG(INFO) Loaded all 281241 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <blk>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <sos/eos>
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=50063453
trainer_interface.cc(487) LOG(INFO) Alphabet size=28
trainer_interface.cc(488) LOG(INFO) Final character coverage=1
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 281241 sentences.
unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(194) LOG(INFO) Initialized 217434 seed sentencepieces
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 281241
trainer_interface.cc(537) LOG(INFO) Done! 89114
unigram_model_trainer.cc(489) LOG(INFO) Using 89114 sentences for EM training
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=78253 obj=9.17714 num_tokens=156548 num_tokens/piece=2.00054
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=56364 obj=7.27758 num_tokens=156323 num_tokens/piece=2.77345
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=42272 obj=7.21531 num_tokens=167392 num_tokens/piece=3.95988
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=42255 obj=7.20874 num_tokens=167370 num_tokens/piece=3.96095
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=31690 obj=7.24601 num_tokens=188741 num_tokens/piece=5.95585
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=31690 obj=7.23769 num_tokens=188708 num_tokens/piece=5.95481
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=23767 obj=7.3081 num_tokens=210750 num_tokens/piece=8.86734
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=23767 obj=7.29391 num_tokens=210740 num_tokens/piece=8.86692
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=17825 obj=7.39974 num_tokens=230364 num_tokens/piece=12.9236
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=17825 obj=7.37937 num_tokens=230349 num_tokens/piece=12.9228
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=13368 obj=7.52086 num_tokens=248609 num_tokens/piece=18.5973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=13368 obj=7.49488 num_tokens=248623 num_tokens/piece=18.5984
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=10026 obj=7.66953 num_tokens=265666 num_tokens/piece=26.4977
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=10026 obj=7.63806 num_tokens=265674 num_tokens/piece=26.4985
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=7519 obj=7.84159 num_tokens=281887 num_tokens/piece=37.49
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=7519 obj=7.80538 num_tokens=281942 num_tokens/piece=37.4973
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=5639 obj=8.04293 num_tokens=296474 num_tokens/piece=52.5756
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=5639 obj=8.00099 num_tokens=296503 num_tokens/piece=52.5808
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=4229 obj=8.27122 num_tokens=310374 num_tokens/piece=73.3918
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=4229 obj=8.22217 num_tokens=310460 num_tokens/piece=73.4122
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=3171 obj=8.52656 num_tokens=323992 num_tokens/piece=102.173
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=3171 obj=8.46915 num_tokens=324023 num_tokens/piece=102.183
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=2378 obj=8.81005 num_tokens=338454 num_tokens/piece=142.327
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=2378 obj=8.74857 num_tokens=338467 num_tokens/piece=142.333
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=1783 obj=9.10926 num_tokens=353154 num_tokens/piece=198.067
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=1783 obj=9.04018 num_tokens=353178 num_tokens/piece=198.081
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=1337 obj=9.42776 num_tokens=367628 num_tokens/piece=274.965
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=1337 obj=9.35333 num_tokens=367718 num_tokens/piece=275.032
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=1002 obj=9.76223 num_tokens=383134 num_tokens/piece=382.369
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=1002 obj=9.6788 num_tokens=383153 num_tokens/piece=382.388
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=751 obj=10.1153 num_tokens=396353 num_tokens/piece=527.767
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=751 obj=10.0206 num_tokens=396423 num_tokens/piece=527.86
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=563 obj=10.4863 num_tokens=413680 num_tokens/piece=734.778
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=563 obj=10.3718 num_tokens=413707 num_tokens/piece=734.826
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=550 obj=10.4027 num_tokens=416637 num_tokens/piece=757.522
unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=550 obj=10.3889 num_tokens=416637 num_tokens/piece=757.522
trainer_interface.cc(615) LOG(INFO) Saving model: data/lang_bpe_500/unigram_500.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: data/lang_bpe_500/unigram_500.vocab
2022-07-10 03:55:54 (prepare.sh:191:main) Validating data/lang_bpe_500/lexicon.txt
2022-07-10 03:55:56 (prepare.sh:200:main) Stage 7: Prepare bigram P
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):79
[I] Reading \data\ section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \1-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \2-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):79
[I] Reading \data\ section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \1-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \2-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):79
[I] Reading \data\ section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \1-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \2-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):79
[I] Reading \data\ section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \1-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \2-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \3-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):79
[I] Reading \data\ section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \1-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \2-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \3-grams: section.
/tmp/pip-install-j6dhjg47/kaldilm_74148b2738ab40148b530837cf874916/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \4-grams: section.
2022-07-10 04:02:57 (prepare.sh:256:main) Stage 9: Compile HLG
2022-07-10 04:02:58,942 INFO [compile_hlg.py:145] Processing data/lang_phone
2022-07-10 04:02:59,101 INFO [lexicon.py:179] Converting L.pt to Linv.pt
2022-07-10 04:02:59,515 INFO [compile_hlg.py:64] Building ctc_topo. max_token_id: 71
2022-07-10 04:02:59,721 INFO [compile_hlg.py:73] Loading G_3_gram.fst.txt
2022-07-10 04:03:06,004 INFO [compile_hlg.py:84] Intersecting L and G
2022-07-10 04:03:24,953 INFO [compile_hlg.py:86] LG shape: (19672550, None)
2022-07-10 04:03:24,953 INFO [compile_hlg.py:88] Connecting LG
2022-07-10 04:03:24,953 INFO [compile_hlg.py:90] LG shape after k2.connect: (19672550, None)
2022-07-10 04:03:24,953 INFO [compile_hlg.py:92] <class 'torch.Tensor'>
2022-07-10 04:03:24,953 INFO [compile_hlg.py:93] Determinizing LG
2022-07-10 04:03:54,207 INFO [compile_hlg.py:96] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:03:54,207 INFO [compile_hlg.py:98] Connecting LG after k2.determinize
2022-07-10 04:03:54,207 INFO [compile_hlg.py:101] Removing disambiguation symbols on LG
2022-07-10 04:04:10,625 INFO [compile_hlg.py:112] LG shape after k2.remove_epsilon: (15949565, None)
2022-07-10 04:04:15,864 INFO [compile_hlg.py:117] Arc sorting LG
2022-07-10 04:04:15,864 INFO [compile_hlg.py:120] Composing H and LG
2022-07-10 04:04:46,049 INFO [compile_hlg.py:127] Connecting LG
2022-07-10 04:04:46,049 INFO [compile_hlg.py:130] Arc sorting LG
2022-07-10 04:04:50,172 INFO [compile_hlg.py:132] HLG.shape: (22644773, None)
2022-07-10 04:04:50,207 INFO [compile_hlg.py:148] Saving HLG.pt to data/lang_phone
2022-07-10 04:04:53,757 INFO [compile_hlg.py:145] Processing data/lang_bpe_5000
2022-07-10 04:04:53,964 INFO [lexicon.py:179] Converting L.pt to Linv.pt
2022-07-10 04:04:54,170 INFO [compile_hlg.py:64] Building ctc_topo. max_token_id: 4999
2022-07-10 04:04:54,740 INFO [compile_hlg.py:69] Loading pre-compiled G_3_gram
2022-07-10 04:04:55,074 INFO [compile_hlg.py:84] Intersecting L and G
2022-07-10 04:05:14,249 INFO [compile_hlg.py:96] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:05:14,249 INFO [compile_hlg.py:98] Connecting LG after k2.determinize
2022-07-10 04:05:14,249 INFO [compile_hlg.py:101] Removing disambiguation symbols on LG
2022-07-10 04:05:24,868 INFO [compile_hlg.py:112] LG shape after k2.remove_epsilon: (4115271, None)
2022-07-10 04:05:26,657 INFO [compile_hlg.py:117] Arc sorting LG
2022-07-10 04:05:26,657 INFO [compile_hlg.py:120] Composing H and LG
2022-07-10 04:05:51,130 INFO [compile_hlg.py:127] Connecting LG
2022-07-10 04:05:51,130 INFO [compile_hlg.py:130] Arc sorting LG
2022-07-10 04:05:57,205 INFO [compile_hlg.py:132] HLG.shape: (3642022, None)
2022-07-10 04:05:57,266 INFO [compile_hlg.py:148] Saving HLG.pt to data/lang_bpe_5000
2022-07-10 04:06:01,492 INFO [compile_hlg.py:145] Processing data/lang_bpe_2000
2022-07-10 04:06:01,694 INFO [lexicon.py:179] Converting L.pt to Linv.pt
2022-07-10 04:06:01,926 INFO [compile_hlg.py:64] Building ctc_topo. max_token_id: 1999
2022-07-10 04:06:02,121 INFO [compile_hlg.py:69] Loading pre-compiled G_3_gram
2022-07-10 04:06:02,479 INFO [compile_hlg.py:84] Intersecting L and G
2022-07-10 04:06:10,426 INFO [compile_hlg.py:86] LG shape: (7240389, None)
2022-07-10 04:06:10,426 INFO [compile_hlg.py:88] Connecting LG
2022-07-10 04:06:10,426 INFO [compile_hlg.py:90] LG shape after k2.connect: (7240389, None)
2022-07-10 04:06:10,426 INFO [compile_hlg.py:92] <class 'torch.Tensor'>
2022-07-10 04:06:10,426 INFO [compile_hlg.py:93] Determinizing LG
2022-07-10 04:06:23,286 INFO [compile_hlg.py:96] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:06:23,286 INFO [compile_hlg.py:98] Connecting LG after k2.determinize
2022-07-10 04:06:23,286 INFO [compile_hlg.py:101] Removing disambiguation symbols on LG
2022-07-10 04:06:34,096 INFO [compile_hlg.py:112] LG shape after k2.remove_epsilon: (4660603, None)
2022-07-10 04:06:36,115 INFO [compile_hlg.py:117] Arc sorting LG
2022-07-10 04:06:36,115 INFO [compile_hlg.py:120] Composing H and LG
2022-07-10 04:06:49,833 INFO [compile_hlg.py:127] Connecting LG
2022-07-10 04:06:49,833 INFO [compile_hlg.py:130] Arc sorting LG
2022-07-10 04:06:52,432 INFO [compile_hlg.py:132] HLG.shape: (4779936, None)
2022-07-10 04:06:52,466 INFO [compile_hlg.py:148] Saving HLG.pt to data/lang_bpe_2000
2022-07-10 04:06:54,830 INFO [compile_hlg.py:145] Processing data/lang_bpe_1000
2022-07-10 04:06:55,034 INFO [lexicon.py:179] Converting L.pt to Linv.pt
2022-07-10 04:06:55,280 INFO [compile_hlg.py:64] Building ctc_topo. max_token_id: 999
2022-07-10 04:06:55,424 INFO [compile_hlg.py:69] Loading pre-compiled G_3_gram
2022-07-10 04:06:55,771 INFO [compile_hlg.py:84] Intersecting L and G
2022-07-10 04:07:04,583 INFO [compile_hlg.py:86] LG shape: (8021681, None)
2022-07-10 04:07:04,583 INFO [compile_hlg.py:88] Connecting LG
2022-07-10 04:07:04,583 INFO [compile_hlg.py:90] LG shape after k2.connect: (8021681, None)
2022-07-10 04:07:04,583 INFO [compile_hlg.py:92] <class 'torch.Tensor'>
2022-07-10 04:07:04,583 INFO [compile_hlg.py:93] Determinizing LG
2022-07-10 04:07:18,274 INFO [compile_hlg.py:96] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:07:18,274 INFO [compile_hlg.py:98] Connecting LG after k2.determinize
2022-07-10 04:07:18,274 INFO [compile_hlg.py:101] Removing disambiguation symbols on LG
2022-07-10 04:07:29,087 INFO [compile_hlg.py:112] LG shape after k2.remove_epsilon: (5124010, None)
2022-07-10 04:07:31,136 INFO [compile_hlg.py:117] Arc sorting LG
2022-07-10 04:07:31,136 INFO [compile_hlg.py:120] Composing H and LG
2022-07-10 04:07:43,104 INFO [compile_hlg.py:127] Connecting LG
2022-07-10 04:07:43,104 INFO [compile_hlg.py:130] Arc sorting LG
2022-07-10 04:07:45,303 INFO [compile_hlg.py:132] HLG.shape: (5728399, None)
2022-07-10 04:07:45,329 INFO [compile_hlg.py:148] Saving HLG.pt to data/lang_bpe_1000
2022-07-10 04:07:47,462 INFO [compile_hlg.py:145] Processing data/lang_bpe_500
2022-07-10 04:07:47,651 INFO [lexicon.py:179] Converting L.pt to Linv.pt
2022-07-10 04:07:47,841 INFO [compile_hlg.py:64] Building ctc_topo. max_token_id: 499
2022-07-10 04:07:47,987 INFO [compile_hlg.py:69] Loading pre-compiled G_3_gram
2022-07-10 04:07:48,192 INFO [compile_hlg.py:84] Intersecting L and G
2022-07-10 04:07:56,016 INFO [compile_hlg.py:86] LG shape: (8908536, None)
2022-07-10 04:07:56,016 INFO [compile_hlg.py:88] Connecting LG
2022-07-10 04:07:56,016 INFO [compile_hlg.py:90] LG shape after k2.connect: (8908536, None)
2022-07-10 04:07:56,016 INFO [compile_hlg.py:92] <class 'torch.Tensor'>
2022-07-10 04:07:56,016 INFO [compile_hlg.py:93] Determinizing LG
2022-07-10 04:08:06,775 INFO [compile_hlg.py:96] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:08:06,775 INFO [compile_hlg.py:98] Connecting LG after k2.determinize
2022-07-10 04:08:06,775 INFO [compile_hlg.py:101] Removing disambiguation symbols on LG
2022-07-10 04:08:13,957 INFO [compile_hlg.py:112] LG shape after k2.remove_epsilon: (5632199, None)
2022-07-10 04:08:15,495 INFO [compile_hlg.py:117] Arc sorting LG
2022-07-10 04:08:15,495 INFO [compile_hlg.py:120] Composing H and LG
2022-07-10 04:08:24,955 INFO [compile_hlg.py:127] Connecting LG
2022-07-10 04:08:24,955 INFO [compile_hlg.py:130] Arc sorting LG
2022-07-10 04:08:26,417 INFO [compile_hlg.py:132] HLG.shape: (6798414, None)
2022-07-10 04:08:26,434 INFO [compile_hlg.py:148] Saving HLG.pt to data/lang_bpe_500
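compile_hlg.py stores each decoding graph with torch.save, so it can be loaded back as a k2 FSA. A minimal sketch, assuming k2 is installed and using the BPE-500 path from the log above:

import torch
import k2

# HLG.pt holds the graph as a dict of tensors; rebuild the FSA from it
hlg = k2.Fsa.from_dict(torch.load("data/lang_bpe_500/HLG.pt", map_location="cpu"))
print(hlg.shape)     # (num_states, None), as reported in the log
print(hlg.num_arcs)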
2022-07-10 04:08:27 (prepare.sh:267:main) Stage 10: Compile LG
2022-07-10 04:08:28,451 INFO [compile_lg.py:127] Processing data/lang_phone
2022-07-10 04:08:28,602 INFO [lexicon.py:176] Loading pre-compiled data/lang_phone/Linv.pt
2022-07-10 04:08:28,733 INFO [compile_lg.py:65] Loading pre-compiled G_3_gram
2022-07-10 04:08:29,134 INFO [compile_lg.py:80] Intersecting L and G
2022-07-10 04:08:53,539 INFO [compile_lg.py:82] LG shape: (19672550, None)
2022-07-10 04:08:53,539 INFO [compile_lg.py:84] Connecting LG
2022-07-10 04:08:53,539 INFO [compile_lg.py:86] LG shape after k2.connect: (19672550, None)
2022-07-10 04:08:53,539 INFO [compile_lg.py:88] <class 'torch.Tensor'>
2022-07-10 04:08:53,539 INFO [compile_lg.py:89] Determinizing LG
2022-07-10 04:09:33,547 INFO [compile_lg.py:92] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:09:33,547 INFO [compile_lg.py:94] Connecting LG after k2.determinize
2022-07-10 04:09:33,547 INFO [compile_lg.py:97] Removing disambiguation symbols on LG
2022-07-10 04:09:49,631 INFO [compile_lg.py:108] LG shape after k2.remove_epsilon: (15949565, None)
2022-07-10 04:09:54,809 INFO [compile_lg.py:113] Arc sorting LG
2022-07-10 04:09:54,822 INFO [compile_lg.py:130] Saving LG.pt to data/lang_phone
2022-07-10 04:09:56,664 INFO [compile_lg.py:127] Processing data/lang_bpe_5000
2022-07-10 04:09:56,833 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_5000/Linv.pt
2022-07-10 04:09:56,894 INFO [compile_lg.py:65] Loading pre-compiled G_3_gram
2022-07-10 04:09:57,246 INFO [compile_lg.py:80] Intersecting L and G
2022-07-10 04:10:04,471 INFO [compile_lg.py:82] LG shape: (6357921, None)
2022-07-10 04:10:04,471 INFO [compile_lg.py:84] Connecting LG
2022-07-10 04:10:04,471 INFO [compile_lg.py:86] LG shape after k2.connect: (6357921, None)
2022-07-10 04:10:04,471 INFO [compile_lg.py:88] <class 'torch.Tensor'>
2022-07-10 04:10:04,471 INFO [compile_lg.py:89] Determinizing LG
2022-07-10 04:10:16,587 INFO [compile_lg.py:92] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:10:16,587 INFO [compile_lg.py:94] Connecting LG after k2.determinize
2022-07-10 04:10:16,587 INFO [compile_lg.py:97] Removing disambiguation symbols on LG
2022-07-10 04:10:27,214 INFO [compile_lg.py:108] LG shape after k2.remove_epsilon: (4115271, None)
2022-07-10 04:10:29,118 INFO [compile_lg.py:113] Arc sorting LG
2022-07-10 04:10:29,134 INFO [compile_lg.py:130] Saving LG.pt to data/lang_bpe_5000
2022-07-10 04:10:30,557 INFO [compile_lg.py:127] Processing data/lang_bpe_2000
2022-07-10 04:10:30,711 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_2000/Linv.pt
2022-07-10 04:10:30,769 INFO [compile_lg.py:65] Loading pre-compiled G_3_gram
2022-07-10 04:10:31,125 INFO [compile_lg.py:80] Intersecting L and G
2022-07-10 04:10:39,103 INFO [compile_lg.py:82] LG shape: (7240389, None)
2022-07-10 04:10:39,103 INFO [compile_lg.py:84] Connecting LG
2022-07-10 04:10:39,103 INFO [compile_lg.py:86] LG shape after k2.connect: (7240389, None)
2022-07-10 04:10:39,103 INFO [compile_lg.py:88] <class 'torch.Tensor'>
2022-07-10 04:10:39,103 INFO [compile_lg.py:89] Determinizing LG
2022-07-10 04:10:52,056 INFO [compile_lg.py:92] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:10:52,056 INFO [compile_lg.py:94] Connecting LG after k2.determinize
2022-07-10 04:10:52,056 INFO [compile_lg.py:97] Removing disambiguation symbols on LG
2022-07-10 04:11:02,576 INFO [compile_lg.py:108] LG shape after k2.remove_epsilon: (4660603, None)
2022-07-10 04:11:04,521 INFO [compile_lg.py:113] Arc sorting LG
2022-07-10 04:11:04,538 INFO [compile_lg.py:130] Saving LG.pt to data/lang_bpe_2000
2022-07-10 04:11:05,634 INFO [compile_lg.py:127] Processing data/lang_bpe_1000
2022-07-10 04:11:05,739 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_1000/Linv.pt
2022-07-10 04:11:05,783 INFO [compile_lg.py:65] Loading pre-compiled G_3_gram
2022-07-10 04:11:05,984 INFO [compile_lg.py:80] Intersecting L and G
2022-07-10 04:11:12,252 INFO [compile_lg.py:82] LG shape: (8021681, None)
2022-07-10 04:11:12,252 INFO [compile_lg.py:84] Connecting LG
2022-07-10 04:11:12,252 INFO [compile_lg.py:86] LG shape after k2.connect: (8021681, None)
2022-07-10 04:11:12,252 INFO [compile_lg.py:88] <class 'torch.Tensor'>
2022-07-10 04:11:12,252 INFO [compile_lg.py:89] Determinizing LG
2022-07-10 04:11:22,180 INFO [compile_lg.py:92] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:11:22,180 INFO [compile_lg.py:94] Connecting LG after k2.determinize
2022-07-10 04:11:22,180 INFO [compile_lg.py:97] Removing disambiguation symbols on LG
2022-07-10 04:11:29,181 INFO [compile_lg.py:108] LG shape after k2.remove_epsilon: (5124010, None)
2022-07-10 04:11:30,632 INFO [compile_lg.py:113] Arc sorting LG
2022-07-10 04:11:30,644 INFO [compile_lg.py:130] Saving LG.pt to data/lang_bpe_1000
2022-07-10 04:11:31,969 INFO [compile_lg.py:127] Processing data/lang_bpe_500
2022-07-10 04:11:32,122 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_500/Linv.pt
2022-07-10 04:11:32,201 INFO [compile_lg.py:65] Loading pre-compiled G_3_gram
2022-07-10 04:11:32,569 INFO [compile_lg.py:80] Intersecting L and G
2022-07-10 04:11:43,294 INFO [compile_lg.py:82] LG shape: (8908536, None)
2022-07-10 04:11:43,294 INFO [compile_lg.py:84] Connecting LG
2022-07-10 04:11:43,295 INFO [compile_lg.py:86] LG shape after k2.connect: (8908536, None)
2022-07-10 04:11:43,295 INFO [compile_lg.py:88] <class 'torch.Tensor'>
2022-07-10 04:11:43,295 INFO [compile_lg.py:89] Determinizing LG
2022-07-10 04:11:57,534 INFO [compile_lg.py:92] <class '_k2.ragged.RaggedTensor'>
2022-07-10 04:11:57,535 INFO [compile_lg.py:94] Connecting LG after k2.determinize
2022-07-10 04:11:57,535 INFO [compile_lg.py:97] Removing disambiguation symbols on LG
2022-07-10 04:12:08,139 INFO [compile_lg.py:108] LG shape after k2.remove_epsilon: (5632199, None)
2022-07-10 04:12:10,250 INFO [compile_lg.py:113] Arc sorting LG
2022-07-10 04:12:10,266 INFO [compile_lg.py:130] Saving LG.pt to data/lang_bpe_500
2022-07-10 04:12:10 (prepare.sh:277:main) Stage 11: Generate LM training data
2022-07-10 04:12:10 (prepare.sh:280:main) Processing vocab_size == 5000
2022-07-10 04:12:11,261 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 04:12:35,173 INFO [prepare_lm_training_data.py:114] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 04:12:56,881 INFO [prepare_lm_training_data.py:114] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 04:13:18,864 INFO [prepare_lm_training_data.py:114] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 04:13:39,738 INFO [prepare_lm_training_data.py:114] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 04:14:01,432 INFO [prepare_lm_training_data.py:114] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 04:14:24,489 INFO [prepare_lm_training_data.py:114] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 04:14:42,084 INFO [prepare_lm_training_data.py:114] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 04:15:06,628 INFO [prepare_lm_training_data.py:114] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 04:15:07,614 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 04:15:26,413 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 40418261
2022-07-10 04:15:26,486 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 04:16:28,149 INFO [prepare_lm_training_data.py:139] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 04:17:29,864 INFO [prepare_lm_training_data.py:139] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 04:18:31,630 INFO [prepare_lm_training_data.py:139] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 04:19:33,628 INFO [prepare_lm_training_data.py:139] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 04:20:35,147 INFO [prepare_lm_training_data.py:139] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 04:21:36,576 INFO [prepare_lm_training_data.py:139] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 04:22:37,927 INFO [prepare_lm_training_data.py:139] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 04:23:39,243 INFO [prepare_lm_training_data.py:139] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 04:23:47,662 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_5000/lm_data.pt
2022-07-10 04:23:48 (prepare.sh:280:main) Processing vocab_size == 2000
2022-07-10 04:23:48,838 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 04:24:22,536 INFO [prepare_lm_training_data.py:114] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 04:24:53,371 INFO [prepare_lm_training_data.py:114] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 04:25:24,302 INFO [prepare_lm_training_data.py:114] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 04:25:53,690 INFO [prepare_lm_training_data.py:114] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 04:26:24,296 INFO [prepare_lm_training_data.py:114] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 04:26:56,986 INFO [prepare_lm_training_data.py:114] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 04:27:23,069 INFO [prepare_lm_training_data.py:114] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 04:27:57,161 INFO [prepare_lm_training_data.py:114] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 04:27:58,577 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 04:28:27,279 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 40418261
2022-07-10 04:28:27,366 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 04:31:01,948 INFO [prepare_lm_training_data.py:139] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 04:33:37,078 INFO [prepare_lm_training_data.py:139] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 04:36:11,987 INFO [prepare_lm_training_data.py:139] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 04:38:46,998 INFO [prepare_lm_training_data.py:139] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 04:41:21,867 INFO [prepare_lm_training_data.py:139] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 04:43:56,259 INFO [prepare_lm_training_data.py:139] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 04:46:31,133 INFO [prepare_lm_training_data.py:139] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 04:49:06,169 INFO [prepare_lm_training_data.py:139] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 04:49:23,583 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_2000/lm_data.pt
2022-07-10 04:49:24 (prepare.sh:280:main) Processing vocab_size == 1000
2022-07-10 04:49:25,001 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 04:49:47,929 INFO [prepare_lm_training_data.py:114] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 04:50:09,161 INFO [prepare_lm_training_data.py:114] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 04:50:30,701 INFO [prepare_lm_training_data.py:114] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 04:50:50,929 INFO [prepare_lm_training_data.py:114] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 04:51:12,170 INFO [prepare_lm_training_data.py:114] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 04:51:34,847 INFO [prepare_lm_training_data.py:114] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 04:51:52,203 INFO [prepare_lm_training_data.py:114] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 04:52:16,296 INFO [prepare_lm_training_data.py:114] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 04:52:17,252 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 04:52:35,941 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 40418261
2022-07-10 04:52:36,016 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 04:53:38,300 INFO [prepare_lm_training_data.py:139] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 04:54:40,338 INFO [prepare_lm_training_data.py:139] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 04:55:42,350 INFO [prepare_lm_training_data.py:139] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 04:56:44,303 INFO [prepare_lm_training_data.py:139] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 04:57:46,176 INFO [prepare_lm_training_data.py:139] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 04:58:48,261 INFO [prepare_lm_training_data.py:139] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 04:59:50,456 INFO [prepare_lm_training_data.py:139] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 05:00:52,358 INFO [prepare_lm_training_data.py:139] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 05:01:00,789 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_1000/lm_data.pt
2022-07-10 05:01:01 (prepare.sh:280:main) Processing vocab_size == 500
2022-07-10 05:01:01,913 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:01:35,574 INFO [prepare_lm_training_data.py:114] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 05:02:05,829 INFO [prepare_lm_training_data.py:114] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 05:02:35,733 INFO [prepare_lm_training_data.py:114] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 05:03:05,333 INFO [prepare_lm_training_data.py:114] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 05:03:36,005 INFO [prepare_lm_training_data.py:114] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 05:04:07,982 INFO [prepare_lm_training_data.py:114] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 05:04:33,926 INFO [prepare_lm_training_data.py:114] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 05:05:07,909 INFO [prepare_lm_training_data.py:114] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 05:05:09,366 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:05:37,205 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 40418261
2022-07-10 05:05:37,294 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:08:11,583 INFO [prepare_lm_training_data.py:139] Processed number of lines: 5000000 ( 12.371%)
2022-07-10 05:10:45,602 INFO [prepare_lm_training_data.py:139] Processed number of lines: 10000000 ( 24.741%)
2022-07-10 05:13:19,655 INFO [prepare_lm_training_data.py:139] Processed number of lines: 15000000 ( 37.112%)
2022-07-10 05:15:54,192 INFO [prepare_lm_training_data.py:139] Processed number of lines: 20000000 ( 49.483%)
2022-07-10 05:18:28,510 INFO [prepare_lm_training_data.py:139] Processed number of lines: 25000000 ( 61.853%)
2022-07-10 05:21:02,542 INFO [prepare_lm_training_data.py:139] Processed number of lines: 30000000 ( 74.224%)
2022-07-10 05:23:37,814 INFO [prepare_lm_training_data.py:139] Processed number of lines: 35000000 ( 86.595%)
2022-07-10 05:26:12,227 INFO [prepare_lm_training_data.py:139] Processed number of lines: 40000000 ( 98.965%)
2022-07-10 05:26:29,689 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_500/lm_data.pt
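At this point the LM training data has been generated for every BPE vocabulary size: the normalized LM corpus (about 40.4 million sentences) is BPE-tokenized and packed into a ragged-tensor archive at data/lm_training_bpe_{vocab_size}/lm_data.pt. The two progress passes visible in the log correspond to the tokenization pass (prepare_lm_training_data.py:114) and the sentence-length pass (:139). The real script works through a word-level indirection, but the core idea can be sketched as follows; the archive keys ("sentences", "sentence_lengths") used here are illustrative assumptions, not necessarily the script's exact on-disk layout:

# Simplified sketch of the LM-data preparation step. It assumes a trained
# sentencepiece model at data/lang_bpe_500/bpe.model; the archive keys
# below are assumptions for illustration.
import k2
import torch
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/lang_bpe_500/bpe.model")

token_ids = []
with open("download/lm/librispeech-lm-norm.txt") as f:
    for line in f:
        # BPE-encode one normalized sentence into integer token IDs.
        token_ids.append(sp.encode(line.strip(), out_type=int))

# Pack all sentences into one ragged tensor for compact storage.
sentences = k2.RaggedTensor(token_ids)
sentence_lengths = torch.tensor(
    [len(ids) for ids in token_ids], dtype=torch.int32
)

torch.save(
    {"sentences": sentences, "sentence_lengths": sentence_lengths},
    "data/lm_training_bpe_500/lm_data.pt",
)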
2022-07-10 05:26:30 (prepare.sh:293:main) Stage 12: Generate LM validation data
2022-07-10 05:26:30 (prepare.sh:296:main) Processing vocab_size == 5000
2022-07-10 05:26:33,220 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:33,242 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:33,254 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:33,258 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5567
2022-07-10 05:26:33,258 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:33,294 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:33,327 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_5000/lm_data-valid.pt
2022-07-10 05:26:33 (prepare.sh:296:main) Processing vocab_size == 2000
2022-07-10 05:26:34,002 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:34,051 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:34,077 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:34,082 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5567
2022-07-10 05:26:34,082 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:34,181 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:34,264 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_2000/lm_data-valid.pt
2022-07-10 05:26:34 (prepare.sh:296:main) Processing vocab_size == 1000
2022-07-10 05:26:34,976 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:35,025 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:35,050 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:35,055 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5567
2022-07-10 05:26:35,055 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:35,147 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:35,225 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_1000/lm_data-valid.pt
2022-07-10 05:26:35 (prepare.sh:296:main) Processing vocab_size == 500
2022-07-10 05:26:35,916 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:35,967 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:35,991 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:35,996 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5567
2022-07-10 05:26:35,997 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:36,090 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.889%)
2022-07-10 05:26:36,170 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_500/lm_data-valid.pt
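Stage 12 repeats the same procedure on a much smaller validation text (5567 sentences) and writes lm_data-valid.pt next to each training archive. A quick sanity check on one of them, again assuming the archive layout from the sketch above:

# Inspect a generated archive; the dict keys follow the earlier sketch
# and are an assumption about the on-disk layout.
import torch

data = torch.load("data/lm_training_bpe_500/lm_data-valid.pt")
sentences = data["sentences"]  # k2.RaggedTensor of BPE token IDs
print("num_sentences:", sentences.dim0)   # expect 5567 for the valid split
print("total tokens:", sentences.numel())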
2022-07-10 05:26:36 (prepare.sh:319:main) Stage 13: Generate LM test data
2022-07-10 05:26:36 (prepare.sh:322:main) Processing vocab_size == 5000
2022-07-10 05:26:45,702 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:45,750 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:45,773 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:45,778 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5559
2022-07-10 05:26:45,778 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:45,870 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:45,948 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_5000/lm_data-test.pt
2022-07-10 05:26:46 (prepare.sh:322:main) Processing vocab_size == 2000
2022-07-10 05:26:46,646 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:46,696 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:46,720 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:46,725 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5559
2022-07-10 05:26:46,725 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:46,816 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:46,895 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_2000/lm_data-test.pt
2022-07-10 05:26:46 (prepare.sh:322:main) Processing vocab_size == 1000
2022-07-10 05:26:47,554 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:47,607 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:47,632 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:47,637 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5559
2022-07-10 05:26:47,637 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:47,731 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:47,813 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_1000/lm_data-test.pt
2022-07-10 05:26:47 (prepare.sh:322:main) Processing vocab_size == 500
2022-07-10 05:26:48,501 INFO [prepare_lm_training_data.py:114] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:48,551 INFO [prepare_lm_training_data.py:114] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:48,575 INFO [prepare_lm_training_data.py:128] Constructing ragged tensors
2022-07-10 05:26:48,580 INFO [prepare_lm_training_data.py:135] Computing sentence lengths, num_sentences: 5559
2022-07-10 05:26:48,580 INFO [prepare_lm_training_data.py:139] Processed number of lines: 0 ( 0.000%)
2022-07-10 05:26:48,674 INFO [prepare_lm_training_data.py:139] Processed number of lines: 3000 ( 53.967%)
2022-07-10 05:26:48,756 INFO [prepare_lm_training_data.py:162] Saved to data/lm_training_bpe_500/lm_data-test.pt
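Stage 13 does the same for the test text (5559 sentences). Before the final sorting stage, it is easy to confirm that every vocabulary size produced all three archives; this check script is hypothetical, not part of the recipe:

# Hypothetical check: verify the twelve expected LM archives exist.
from pathlib import Path

for vocab_size in (5000, 2000, 1000, 500):
    d = Path(f"data/lm_training_bpe_{vocab_size}")
    for name in ("lm_data.pt", "lm_data-valid.pt", "lm_data-test.pt"):
        path = d / name
        assert path.is_file(), f"missing {path}"
print("all LM archives present")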
2022-07-10 05:26:48 (prepare.sh:345:main) Stage 14: Sort LM training data
2022-07-10 05:56:55,551 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_5000/sorted_lm_data.pt
2022-07-10 05:56:58,055 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_5000/sorted_lm_data-valid.pt
2022-07-10 05:56:58,680 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_5000/sorted_lm_data-test.pt
2022-07-10 06:27:53,878 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_2000/sorted_lm_data.pt
2022-07-10 06:27:56,342 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_2000/sorted_lm_data-valid.pt
2022-07-10 06:27:56,966 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_2000/sorted_lm_data-test.pt
2022-07-10 06:38:43,245 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_1000/sorted_lm_data.pt
2022-07-10 06:38:45,484 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_1000/sorted_lm_data-valid.pt
2022-07-10 06:38:46,119 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_1000/sorted_lm_data-test.pt
2022-07-10 07:10:09,733 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_500/sorted_lm_data.pt
2022-07-10 07:10:12,256 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_500/sorted_lm_data-valid.pt
2022-07-10 07:10:13,076 INFO [sort_lm_training_data.py:104] Saved to data/lm_training_bpe_500/sorted_lm_data-test.pt
np@np-INTEL:/mnt/speech1/nadira/stt/icefall/egs/librispeech/ASR$
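Stage 14 sorts each archive by sentence length, which lets later RNN-LM training batch sentences of similar length and waste little computation on padding; it is the slowest stage in this portion of the log (roughly an hour and three-quarters across the four vocabulary sizes). A sketch of the idea behind sort_lm_training_data.py, under the same assumed archive layout as above:

# Sketch of the sorting step: order sentences from longest to shortest so
# a dataloader can form batches with minimal padding. Archive keys follow
# the earlier sketches and are assumptions.
import k2
import torch

data = torch.load("data/lm_training_bpe_500/lm_data.pt")
lengths = data["sentence_lengths"]

# Indices that sort sentences by descending length.
order = torch.argsort(lengths, descending=True)

# Rebuild the ragged tensor in sorted order via a plain-list round trip,
# avoiding any assumptions about k2's gather/indexing API.
rows = data["sentences"].tolist()
sorted_sentences = k2.RaggedTensor([rows[i] for i in order.tolist()])

torch.save(
    {"sentences": sorted_sentences, "sentence_lengths": lengths[order]},
    "data/lm_training_bpe_500/sorted_lm_data.pt",
)

With the sorted archives written for all four vocabulary sizes, prepare.sh returns to the prompt and the data/ directory is ready for model training.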