BART: Abstractive Summarization

Video

Two Medium articles that I published for this project:

Abstract

The goal of this project is to fine-tune the BART model facebook/bart-base to produce abstractive summaries of news articles and YouTube transcripts. BART, a state-of-the-art model at the time of this project, was designed with abstractive summarization in mind, but I could not find a code base accompanying the paper. From what I found, most people simply import an already fine-tuned model through the transformers module, whereas I wanted to train the model for a new task, such as summarizing YouTube transcripts as in ChatGPT: Abstractive Text Summarization.

Also, to fine-tune the BART model I needed a new dataset. I used 100 TedTalk transcripts to create one for fine-tuning. Creating a new dataset turned out to be time-consuming, and it will need more time to complete.

A big part of this project was also learning different scoring methods for NLU tasks. To make the project more fun, I decided to compare BART, ChatGPT, and human-annotated summaries for 10 TedTalk transcripts.

Problem statement

I want to take a transcript and summarize it in about 100 words, so that someone who doesn't want to watch a video or listen to audio can just read the summary. I assume that a transcript of the video or audio is already available.

Related work

I read two papers: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, and CTRLsum: Towards Generic Controllable Text Summarization. I decided to build my own dataset from TedTalk transcripts. I have completed 100 so far and would create more in the future. The aim of this project was to learn how the BART model can be fine-tuned.

Methodology

I followed the BART fine-tuning tutorial that was written here

My code for song generation, with the deprecated PyTorch methods fixed, is here.

The idea was to learn how to fine-tune a pre-trained BART model and then apply it to a new task such as summarization.

  • Create new dataset to fine-tune BART for audio/video transcript summarization
  • Learn NLU Scoring
  • Fine-tune BART model on a new dataset [partially done]

Experiments/evaluation

I experimented with several hyperparameters:

  • batch size
  • max_length of encoding
  • self.eval_beams
  • num_workers in dataloader
  • noise_percent
  • max_length of generated output

and settled on the following values, which worked on the machine that I used:

  • batch size = 16
  • max_length of encoding = 512
  • self.eval_beams = 4
  • num_workers = 4
  • noise_percent = 0.25
  • max_length of generated output = 400
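
The settings above can be kept in one place so that a run is easy to reproduce. The sketch below is only an illustration: the class name `SummarizationConfig`, the field names other than `eval_beams`, `num_workers` and `noise_percent`, and the `generation_kwargs` helper are hypothetical and not taken from the project code. It also assumes that the beam count and the output length are eventually passed to the Hugging Face `generate()` method as `num_beams` and `max_length`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarizationConfig:
    """Hypothetical container for the hyperparameters listed above."""

    model_name: str = "facebook/bart-base"
    batch_size: int = 16
    max_input_length: int = 512    # max_length of encoding
    eval_beams: int = 4            # beam width used at evaluation time
    num_workers: int = 4           # DataLoader worker processes
    noise_percent: float = 0.25
    max_output_length: int = 400   # max_length of generated output

    def generation_kwargs(self) -> dict:
        # Assumption: these values are forwarded to model.generate(...).
        return {
            "num_beams": self.eval_beams,
            "max_length": self.max_output_length,
        }


config = SummarizationConfig()
print(config.generation_kwargs())
```

Freezing the dataclass is a design choice for this sketch: it prevents a value from being changed by accident halfway through an experiment.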

At the end, I evaluated the model using BLEU, ROUGE, and BERTScore.

split | cnn_dailymail dataset | cnn_dailymail dataset [loads]
--------------------------------------------------------------
train | 287,114               | 287,113
test  | 11,491                | 11,491
eval  | 13,369                | 11,490

Decoding time using fine-tuned BART facebook/bart-base

number of articles | minutes
-----------------------------
100                | 8
200                | 16

Time needed to train one epoch facebook/bart-base

training set | batch size | epochs | time to tokenize | time to train
----------------------------------------------------------------------
287,113      | 16         | 1      | 30 min           | 3 hours
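
The timing figures above can be turned into per-step and per-article costs with a little arithmetic. The sketch below is a rough estimate only: it assumes that training time scales linearly with the number of epochs and that tokenization is done once and not repeated. All constant names are my own.

```python
import math

# Figures reported in the timing tables above.
TRAIN_EXAMPLES = 287_113
BATCH_SIZE = 16
HOURS_PER_EPOCH = 3.0
MINUTES_PER_100_ARTICLES = 8.0

# Number of optimizer steps in one pass over the training set.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)

# Average wall-clock cost of one training step and of decoding one article.
seconds_per_step = HOURS_PER_EPOCH * 3600 / steps_per_epoch
seconds_per_article = MINUTES_PER_100_ARTICLES * 60 / 100

# Linear extrapolation to 100 epochs (tokenization time excluded).
hours_for_100_epochs = HOURS_PER_EPOCH * 100

print(f"steps per epoch:      {steps_per_epoch}")
print(f"seconds per step:     {seconds_per_step:.2f}")
print(f"seconds per article:  {seconds_per_article:.1f}")
print(f"hours for 100 epochs: {hours_for_100_epochs:.0f}")
```

Under these assumptions, 100 epochs would take about 300 hours, or roughly 12.5 days of continuous training.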

Results

test set     | BLEU | R1   | R2   | RL   | epochs
--------------------------------------------------
200 articles | 0.11 | 0.35 | 0.13 | 0.20 | 1
200 articles | 0.09 | 0.33 | 0.17 | 0.23 | 2

BERTScore

test set     | P    | R    | F1   | epochs
-------------------------------------------
200 articles | 0.86 | 0.86 | 0.86 | 1
200 articles | 0.88 | 0.87 | 0.86 | 2

The results are not as good as those in the BART paper, but I only trained the model for 2 epochs. Future work is to train the model for 100 epochs, although this can be very expensive. Even so, the model performs surprisingly well after 2 epochs.

epoch 1 scores:

BART base blue_res:  {'bleu': 0.1133575031096052, 'precisions': [0.29523809523809524, 0.11650485436893204, 0.07920792079207921, 0.06060606060606061], 'brevity_penalty': 1.0, 'length_ratio': 1.2209302325581395, 'translation_length': 105, 'reference_length': 86}
BART base rouge_res:  {'rouge1': 0.3522702104097453, 'rouge2': 0.125288950531669, 'rougeL': 0.20310077519379843, 'rougeLsum': 0.20310077519379843}
BART base bertscore_res:  {'precision': [0.8616010546684265, 0.8683019876480103], 'recall': [0.8689004182815552, 0.8693030476570129], 'f1': [0.8652353882789612, 0.8688021898269653], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.25.1)'}

epoch 2 scores:

BART base blue_res:  {'bleu': 0.09913951389093098, 'precisions': [0.36923076923076925, 0.19047619047619047, 0.09836065573770492, 0.05084745762711865], 'brevity_penalty': 0.7239181662133051, 'length_ratio': 0.7558139534883721, 'translation_length': 65, 'reference_length': 86}
BART base rouge_res:  {'rouge1': 0.33152958152958156, 'rouge2': 0.17022493328250093, 'rougeL': 0.2316017316017316, 'rougeLsum': 0.2316017316017316}
BART base bertscore_res:  {'precision': [0.8862956762313843, 0.8611895442008972], 'recall': [0.8729070425033569, 0.856441080570221], 'f1': [0.8795503973960876, 0.8588087558746338], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.25.1)'}
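
The raw outputs above also show why BLEU went down from epoch 1 to epoch 2 even though the 1-gram, 2-gram and 3-gram precisions went up. BLEU is the geometric mean of the n-gram precisions multiplied by a brevity penalty, and in epoch 2 the generated text is shorter than the reference (65 vs. 86 tokens), so the penalty applies. The sketch below recombines the reported parts using the standard BLEU formula with uniform weights. The function name `bleu_from_parts` is my own and is not part of any library, and the code assumes all precisions are greater than zero.

```python
import math


def bleu_from_parts(precisions, translation_length, reference_length):
    """Recombine n-gram precisions and lengths into (BLEU, brevity penalty)."""
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        # Outputs shorter than the reference are penalized exponentially.
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    # Geometric mean of the n-gram precisions (uniform weights).
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean), brevity_penalty


epoch1_bleu, epoch1_bp = bleu_from_parts(
    [0.29523809523809524, 0.11650485436893204,
     0.07920792079207921, 0.06060606060606061],
    translation_length=105, reference_length=86,
)
epoch2_bleu, epoch2_bp = bleu_from_parts(
    [0.36923076923076925, 0.19047619047619047,
     0.09836065573770492, 0.05084745762711865],
    translation_length=65, reference_length=86,
)

print(f"epoch 1: BLEU={epoch1_bleu:.4f}, brevity penalty={epoch1_bp:.4f}")
print(f"epoch 2: BLEU={epoch2_bleu:.4f}, brevity penalty={epoch2_bp:.4f}")
```

Without the brevity penalty the epoch 2 score would be higher than the epoch 1 score, so the drop in BLEU reflects shorter summaries rather than worse n-gram matches.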

Examples

original article:

Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk. They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger. More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans. 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee. 'It is time that the DOT and FAA take a stand for humane treatment of passengers.' But could crowding on planes lead to more serious issues than fighting for space in the overhead lockers, crashing elbows and seat back kicking? Tests conducted by the FAA use planes with a 31 inch pitch, a standard which on some airlines has decreased . Many economy seats on United Airlines have 30 inches of room, while some airlines offer as little as 28 inches . Cynthia Corbertt, a human factors researcher with the Federal Aviation Administration, that it conducts tests on how quickly passengers can leave a plane. But these tests are conducted using planes with 31 inches between each row of seats, a standard which on some airlines has decreased, reported the Detroit News. The distance between two seats from one point on a seat to the same point on the seat behind it is known as the pitch. While most airlines stick to a pitch of 31 inches or above, some fall below this. While United Airlines has 30 inches of space, Gulf Air economy seats have between 29 and 32 inches, Air Asia offers 29 inches and Spirit Airlines offers just 28 inches. 
British Airways has a seat pitch of 31 inches, while easyJet has 29 inches, Thomson's short haul seat pitch is 28 inches, and Virgin Atlantic's is 30-31.

Human Annotated Summary:

Experts question if packed out planes are putting passengers at risk .
U.S consumer advisory group says minimum space must be stipulated .
Safety tests conducted on planes with more leg room than airlines offer .

BART results version 1:

The shrinking space on planes is putting our health and safety in danger. Last week, a consumer group set up a public hearing. They are questioning if having such packed out seats is putting passengers at risk.

BART results version 2:

A U.S consumer group set up by the Department of Transportation said it will hold a public hearing this week. They are questioning whether having such packed seats is putting passengers at risk.

BART results version 3:

U.S consumer advisory group set up by the Department of Transportation at a public hearing that.
while the DOT is happy with standards for animals flying on planes, it doesn't provide a fair amount of space for humans.
'In a world where humans have more rights to space than humans,' said the committee.
United Airlines has 30 inches of room, Gulf Air has 29 inches, Asia offers 29 inches and Spirit Airlines offers 28 inches.

BART results version 4:

Consumer advisory group set up by the Department of Transportation.
Said that while the government is trying to set standards for animals on planes,.
it doesn't stipulate the amount of space humans have.
'In a world where animals have more rights to food and space than humans,' they said.

Outputs for the 200 test articles are here:

Other work:

Create new dataset

For this project I decided to work on a new TEDTalk dataset:

  • I found 2750 human-annotated TedTalk summaries online and a Kaggle dataset with transcripts, and decided to combine them into a dataset for abstractive text summarization.
  • Collected transcripts for 100/2750 summaries
  • Collected ChatGPT results on 100/2750 transcripts
  • Collected BART summaries on 100/2750 transcripts
  • Future work: get summaries from new fine-tuned model
  • Compare all of the above
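
One way to keep the transcript and all of its summaries together is one record per talk, stored as JSON Lines. The sketch below is only a suggestion for how the dataset could be laid out: the class `TalkRecord`, its field names, and the helper functions are hypothetical, and the example record contains placeholder text, not real data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TalkRecord:
    """Hypothetical layout of one entry in the TEDTalk summarization dataset."""

    talk_id: str
    transcript: str
    human_summary: str
    chatgpt_summary: str = ""  # filled in once collected
    bart_summary: str = ""     # filled in once collected


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines text back into TalkRecord objects."""
    return [TalkRecord(**json.loads(line))
            for line in text.splitlines() if line.strip()]


# Placeholder record, not real data.
records = [
    TalkRecord(
        talk_id="example-001",
        transcript="(transcript text)",
        human_summary="(human-annotated summary)",
    )
]
restored = from_jsonl(to_jsonl(records))
print(restored[0].talk_id)
```

Empty defaults for the model summaries let a record be created as soon as the transcript and human summary are matched, and completed later.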

NLU Scoring for Abstractive Summarization

There are multiple evaluation metrics for NLU tasks, and most authors combine 2 or 3 scores to draw conclusions. For extractive summarization, authors report ROUGE and perplexity (PPL) scores. For abstractive summarization, there are no good metrics to evaluate the model.

  • The original BART paper from Facebook uses R1, R2, and RL. [1]
  • The CTRLsum paper from Salesforce uses ROUGE-1/2/L together with BERTScore. [3]

Understand scoring methods

ROUGE-1: overlap of single words (unigrams) between the generated summary and the reference summary.

ROUGE-2: overlap of bi-grams (pairs of adjacent words) between the two summaries.

ROUGE-L: based on the longest common subsequence (LCS) of the two summaries.

BERTScore: instead of exact matches, BERTScore computes token similarity using contextual embeddings. [2]
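
To make the ROUGE definitions above concrete, here is a minimal sketch that computes ROUGE-1, ROUGE-2 and ROUGE-L F1 scores in plain Python. It is a simplified illustration, not the implementation used for the scores in this project: it splits on whitespace and lowercases, with no stemming or other preprocessing, so its numbers will not exactly match a full ROUGE package. The function names are my own.

```python
from collections import Counter


def _f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall, 0.0 when nothing overlaps."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens, n):
    """Count every run of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count (clipped matches).
    overlap = sum((cand & ref).values())
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    a, b = candidate.lower().split(), reference.lower().split()
    # Dynamic-programming table for the longest common subsequence.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[len(a)][len(b)], len(a), len(b))


reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(rouge_n(candidate, reference, 1))
print(rouge_n(candidate, reference, 2))
print(rouge_l(candidate, reference))
```

In this toy pair, five of the six words match, three of the five bi-grams match, and the longest common subsequence is "the cat on the mat", which is why ROUGE-2 comes out lower than ROUGE-1 and ROUGE-L.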

I created TEDTalkSum, which contains 10 YouTube transcripts with abstractive summaries, and used it to score the bart-large-cnn model on abstractive summarization.

NLU Scores for 10 TEDTalk transcripts using bart-large-cnn model

model   | R1   | R2    | RL
-----------------------------
BART    | 0.23 | 0.0   | 0.10
ChatGPT | 0.19 | 0.018 | 0.13

BERTScore for 10 TEDTalk transcripts

model   | P    | R    | F1
----------------------------
BART    | 0.82 | 0.84 | 0.82
ChatGPT | 0.84 | 0.82 | 0.83
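
The P, R and F1 columns come from the greedy matching step of BERTScore: each token in one text is matched to its most similar token in the other text by cosine similarity. The sketch below illustrates only that matching step. The 2-D vectors are made up and stand in for real contextual embeddings, and options of the real metric such as idf weighting and baseline rescaling are left out. The function names are my own.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(
        max(_cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(
        max(_cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up toy embeddings: the candidate repeats one reference token
# and never covers the second one.
candidate_vecs = [[1.0, 0.0], [1.0, 0.0]]
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]

p, r, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(p, r, f1)
```

Here precision is perfect because every candidate token has an exact match, while recall is halved because one reference token is never covered.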

BART FineTuning on Rap Songs Results

I found a few projects online, and the most promising one was fine-tuning the BART model to generate songs: Teaching BART to Rap: Fine-tuning Hugging Face’s BART Model. I managed to fine-tune BART by following the tutorial. Below are a few rap song samples generated by the fine-tuned facebook/bart-base model.

"facebook/bart-base", self.eval_beams = 4, epochs=1

| Name  | Type                         | Params
-------------------------------------------------------
0 | model | BartForConditionalGeneration | 139 M 
-------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
557.682   Total estimated model params size (MB)

"facebook/bart-base", eval_beams = 20, epochs=1

You and me forever cruising city lights
You and me with the city lights
I'm in the middle of the night, I'm forever cruising
You and me on the beach, cruising city lights

"facebook/bart-base", eval_beams = 20, epochs=2

new_song = generate_lyrics(seed_line = "You and me forever cruising city lights", num_lines = 8, model_ = model,noise_percent = 0.25, multiple_lines = False, max_line_history = 1)
You and me forever cruising city lights
You and you and you
I’m the only one that’s ever seen the lights
You're the only one that can save me forever
You and you and you
I’m the only one that’s ever seen the lights
You're the only one that can save me forever
And you and you

"facebook/bart-base", eval_beams = 20, epochs=2

You and me forever cruising city lights
You and me, we just cruising city lights
You and me, you and me
And I don’t need you, I just need you
We just cruising city lights
You and me
I need you, I just need you
We just wanna see the city lights

What didn’t work

My original goal was to use the larger pre-trained model facebook/bart-large-cnn, but after a day I gave up on fine-tuning it because the number of parameters did not match somewhere in the model architecture. As the model summary under Future work shows, this model has 406M parameters, about 3 times as many as the base BART model.

Future work:

  • experiment with facebook/bart-large-cnn
  • train on inputs of up to 1024 tokens
  • train for 100 epochs (3 hours × 100 = 300 hours of training)
  • move code from Colab to AWS Linux machines
  • the bart-large-cnn model has 406M parameters, and we can fine-tune it with our new dataset:

| Name  | Type                         | Params
-------------------------------------------------------
0 | model | BartForConditionalGeneration | 406 M 
-------------------------------------------------------
406 M     Trainable params
0         Non-trainable params
406 M     Total params
1,625.162 Total estimated model params size (MB)
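
The two model summaries in this document can be cross-checked against each other. Assuming the reported size counts 4 bytes per parameter (32-bit floats), dividing the size in MB by 4 gives back the parameter count in millions, and the ratio of the two sizes gives the "about 3 times" figure. The sketch below covers the weights only; training needs additional memory for gradients, optimizer state and activations, which is not estimated here.

```python
BYTES_PER_PARAM = 4  # assumption: weights stored as 32-bit floats

# "Total estimated model params size (MB)" from the two model summaries.
reported_size_mb = {
    "facebook/bart-base": 557.682,
    "facebook/bart-large-cnn": 1625.162,
}

# Size in MB divided by bytes per parameter gives millions of parameters.
params_millions = {
    name: size_mb / BYTES_PER_PARAM for name, size_mb in reported_size_mb.items()
}

size_ratio = (
    reported_size_mb["facebook/bart-large-cnn"]
    / reported_size_mb["facebook/bart-base"]
)

for name, millions in params_millions.items():
    print(f"{name}: about {millions:.1f}M parameters")
print(f"size ratio: {size_ratio:.2f}x")
```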

References:

  1. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  2. BERTScore: Evaluating Text Generation with BERT
  3. CTRLsum: Towards Generic Controllable Text Summarization
  4. Two minutes NLP — Learn the ROUGE metric by examples
  5. Get To The Point: Summarization with Pointer-Generator Networks https://arxiv.org/pdf/1704.04368.pdf
  6. TedTalk transcripts https://www.kaggle.com/datasets/rounakbanik/ted-talks
  7. TedTalks Summaries https://singularityhub.com/2009/10/14/master-list-of-500-ted-videos-with-summaries/
  8. Colab notebook https://colab.research.google.com/drive/1cpV6iYkzwG94fId2Zlrjlaz97ON93c0K#scrollTo=_h8QhLcyh9RJ
  9. Teaching BART to Rap: Fine-tuning Hugging Face’s BART Model https://towardsdatascience.com/teaching-bart-to-rap-fine-tuning-hugging-faces-bart-model-41749d38f3ef
  10. Some transcripts were taken from https://www.kaggle.com/datasets/rounakbanik/ted-talks
  11. Colab code: