Estonian (longer) text summarization
Self-hosted summarization models for smaller languages
What and why
The goal of this experiment is to evaluate how well different transformer-based models perform at Estonian text summarization. While it’s possible to use a large language model (LLM) like GPT-3.5 for this task, I want to explore smaller models that can be hosted in my own environment. English still dominates NLP, so methods that work well for it might not work as well for smaller languages.
I’ll attempt two things:
- Evaluate models’ performance on Estonian texts: Estonian is a smaller language with limited training data, so identifying models that handle it well is crucial.
- Increase input text size: I aim to test if input size can be increased from the usual 1024 tokens to 2048 tokens. Since summarization is often needed for longer texts, increased input size could prove beneficial.
Practical constraints include training models on a GPU with 24GB of memory. Ideally, inference should be faster and consume fewer resources than running an Estonian-capable LLM locally.
Experiment notebooks are available here.
Datasets
I use three different Estonian datasets:
- https://huggingface.co/datasets/TalTechNLP/samsum_ee which is probably a translation of https://huggingface.co/datasets/samsum. This dataset consists of textual summaries of dialogues and was easy to start with since it already existed.
- https://huggingface.co/datasets/TalTechNLP/LongSumEt which is an Estonian long-summarization dataset built from pages filtered from the CulturaX dataset. Each example consists of the text together with a machine-generated short summary, long summary, and bullet points.
- https://huggingface.co/datasets/rristo/et_parliament_stenos_summary is a toy dataset I created. It consists of short summaries of Estonian parliament dialogues. The input text is designed to not exceed 2048 tokens (for the MBart tokenizer). Summaries are in bullet point form (up to 3 per text) and were generated via GPT-3.5 for this experiment.
Both SamSum EE and LongSum Et may have input texts longer than 1024 tokens and summaries longer than 512 tokens. During training, longer texts were often truncated. Therefore, initial experimental results may have been slightly lower due to this truncation, though the effect should be relatively small.
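To make these length constraints concrete, here is a small sketch of how one might load the Parliament dataset and check how many inputs exceed the usual 1024-token limit of the MBart tokenizer. The "text" column name is an assumption about the dataset schema and may differ.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the Parliament summarization dataset and the MBart tokenizer.
dataset = load_dataset("rristo/et_parliament_stenos_summary")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

def count_tokens(example):
    # Tokenize without truncation to get the true token count of the input text.
    example["n_tokens"] = len(tokenizer(example["text"], truncation=False)["input_ids"])
    return example

train = dataset["train"].map(count_tokens)
print("longest input:", max(train["n_tokens"]), "tokens")
print("inputs over 1024 tokens:", sum(n > 1024 for n in train["n_tokens"]))
```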
Techniques
I use two techniques, applied as needed, to let models accept longer context windows and/or to reduce their vocabulary:
- Use LSG layers. The most computationally heavy part of a transformer architecture is its attention mechanism, which scales quadratically with sequence length: attention is computed between each input token and every other input token, so a sequence of length x requires on the order of x² operations. LSG layers combine three types of attention (Local, Sparse, and Global) to approximate full attention without the quadratic growth in computational cost. In the following picture, full attention would compute every element of the matrix, while LSG computes only a small fraction of them.
- Vocabulary trimming. I want to create summaries only in Estonian. Multilingual pre-trained models typically have very large vocabularies, most of which goes unused when working with a single language. To optimize these models, we can remove the unused tokens, making the models smaller. There is a package for this purpose that supports many model types: hf-trim. The crucial part is keeping the existing token weights intact and maintaining the correspondence between the model’s and the tokenizer’s token indexes. Smaller models consume fewer resources, which might also enable the use of longer context windows during training.
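To illustrate the trimming idea, below is a conceptual sketch (not the full hf-trim workflow, which also rewrites the tokenizer and the model head) showing how the token ids occurring in an Estonian corpus could be collected and how the corresponding pre-trained embedding rows would be kept:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-cc25")

# In practice this would be all texts and summaries from the training data.
estonian_corpus = ["Tere! Kuidas läheb?", "Riigikogu arutas eelnõu teist lugemist."]

# 1. Collect the token ids that actually occur in the Estonian data,
#    plus the tokenizer's special tokens, which must always be kept.
keep_ids = set(tokenizer.all_special_ids)
for text in estonian_corpus:
    keep_ids.update(tokenizer(text)["input_ids"])
keep_ids = sorted(keep_ids)

# 2. Keep only the pre-trained embedding rows of those tokens. The same
#    old-index -> new-index mapping must then be applied to the tokenizer
#    vocabulary and the output head, which is the part hf-trim automates.
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = old_embeddings[torch.tensor(keep_ids)]
print(old_embeddings.shape, "->", new_embeddings.shape)
```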
Models
I used a variety of pre-trained models; those showing more promising results were experimented with more extensively. The list of base models used includes:
- google/mt5-small. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, with each task converted into a text-to-text format. mT5 is a multilingual version of T5, pre-trained on the mC4 corpus, covering 101 languages, but excludes any supervised training tasks. The PyTorch model binary size is 1.2 GB.
- google/mt5-base. Similar to the mT5-small, but with a PyTorch model binary size of 2.33 GB.
- google/umt5-small. Similar to T5, but uMT5 aims to create a more cohesive model that not only processes multiple languages, but also enhances interchangeability and transfer learning capabilities across languages. The PyTorch model binary size is 2.37 GB.
- agemagician/mlong-t5-tglobal-base. The LongT5 model is an extension of the T5 model that enables the use of one of two efficient attention mechanisms: Local attention or Transient-Global attention. This makes handling long input sequences more efficient (and removes the need for LSG layers). This model is pre-trained on a multilingual corpus, and its PyTorch model binary size is 2.37 GB.
- facebook/mbart-large-cc25. This is a multilingual BART (Bidirectional and Auto-Regressive Transformers) model designed specifically for sequence-to-sequence tasks, such as summarization. The PyTorch model binary size is 2.44 GB.
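All of these checkpoints were fine-tuned in the same sequence-to-sequence fashion. A minimal training sketch is shown below; the hyperparameters and the "text"/"summary" column names are illustrative assumptions, not the exact settings used in the experiments.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/mbart-large-cc25"  # any of the checkpoints above works similarly
# For MBart, the source/target language codes are set to Estonian.
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="et_EE", tgt_lang="et_EE")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_dataset("rristo/et_parliament_stenos_summary")

def preprocess(batch):
    # Truncate inputs to 1024 tokens and summaries to 512 tokens.
    model_inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="mbart-et-summary",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # keeps memory usage within a 24 GB GPU
    learning_rate=5e-5,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),  # if the dataset has such a split
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```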
Metrics
In these experiments I use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), the most widely used summarization metric. It measures the overlap of n-grams between the generated and reference texts. The maximum output sequence (summary) length was fixed at 512 tokens. Specifically, the metrics used include:
- ROUGE-1 (rouge1): Measures the overlap of unigrams (single words) between the generated text and the reference text. It is calculated by comparing each word in the generated summary to the words in the reference summary and counting matches. This metric provides a basic measure of content overlap.
- ROUGE-2 (rouge2): An extension of ROUGE-1, this measures the overlap of bigrams (pairs of consecutive words) between the generated and reference texts. This metric provides a more detailed measure than ROUGE-1, as it considers the sequence of words, thereby capturing more of the text’s structure.
- ROUGE-L (rougeL): Measures the longest common subsequence (LCS) between the generated text and the reference text. The LCS does not require consecutive words to match but must appear in the same order in both texts. ROUGE-L is useful for evaluating the coherence and order of the content in the summaries.
- ROUGE-Lsum (rougeLsum): Similar to ROUGE-L, but applied to each sentence of the document separately before averaging the scores. This variant is particularly useful in scenarios where the summary consists of multiple sentences, allowing for evaluation at a sentence level rather than as a continuous piece of text.
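All four variants can be computed with the Hugging Face evaluate library. The sentences below are made up purely for illustration, and scores are shown as percentages to match the tables that follow.

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["Sven Sester annab ülevaate riigieelarve seaduse eelnõust."]
references = ["Sven Sester annab ülevaate riigieelarve seaduse eelnõu ettevalmistamisest."]

# compute() returns F-scores between 0 and 1 for each ROUGE variant.
scores = rouge.compute(predictions=predictions, references=references)
for name in ("rouge1", "rouge2", "rougeL", "rougeLsum"):
    print(name, round(scores[name] * 100, 2))
```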
Results
Overall experiment results can be seen in the following table.
Let’s analyze them from a few angles.
Vocabulary trimming
The effect of vocabulary reduction on metrics can be most clearly seen in the mlong-t5-tglobal-base model, which was trained on Parliament stenogram summaries both with and without reduced vocabulary.
The graph indicates that reducing the vocabulary slightly decreases model accuracy. For example, the ROUGE-1 score decreases from 36.82 to 36.16. This reduction is not significant and does not have a major impact on the results. A similar trend of minor decreases can also be observed in other models. Although there is a slight loss in accuracy, the reduced vocabulary has significantly decreased the number of token embeddings required by the model, from 256,300 to 18,171.
Usage of LSG layer
The usage of the LSG layer is best demonstrated with the mbart-large-cc25 model. This model was trained on the samsum_ee dataset, both with an LSG layer (allowing for a maximum sequence length of 2048) and without an LSG layer (limiting the maximum sequence length to 1024).
As seen from the previous graph, adding an LSG layer might slightly decrease model accuracy (for instance, ROUGE-1 decreases from 37.91 to 37.05), but this reduction is minimal. This decrease in accuracy is expected, as the model no longer calculates attention between all input tokens, thereby losing some information. However, we gain significant efficiency, allowing us to process input sequences twice as long.
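For reference, converting an existing checkpoint to LSG attention is done with the lsg-converter package released alongside the LSG paper (see Sources). The call below is a sketch based on that package's README; the exact arguments may differ between versions.

```python
from lsg_converter import LSGConverter

# Convert the pre-trained MBart checkpoint to use LSG attention with a
# 2048-token maximum sequence length; the converted model is then
# fine-tuned like any other seq2seq model.
converter = LSGConverter(max_sequence_length=2048)
model, tokenizer = converter.convert_from_pretrained("facebook/mbart-large-cc25")
```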
Summary type: text vs. bullet points
From the following graph, we can see that mT5 and mBART tend to perform better on textual summaries than on bullet points: ROUGE scores for textual summaries are generally higher. Note that not every model was trained on bullet point summaries, and the models were trained on different datasets, so we can only detect general trends rather than make exact comparisons. Models that were not in the top two for textual summaries were not chosen for bullet point summaries.
The graph also shows that mBART tends to be more accurate than mT5-small and uMT5-small, and has similar accuracy to mT5-base. On the bullet points dataset, the mT5-base model tends to be more accurate. Based on this, I selected the mBART model to increase the input text length and used the mlong-t5-tglobal-base model, which already supports longer sequences through a mechanism similar to the LSG layer.
Two best models for longer sequences
Two models showed promising results for bullet points: mBART and mlong-t5-tglobal-base. In the following graph, we can see that mlong-t5-tglobal-base is a clear winner. For instance, the ROUGE-1 score for mlong-t5-tglobal-base is 36.17, while for mBART it is 27.99. Other metrics follow a similar pattern.
The reason for such a significant difference may be that the training dataset is not very extensive (around 3500 texts), and mBART might benefit from having more training data. Additionally, mBART may not be as well-suited for bullet points summaries. The Parliament dataset includes texts of various sizes and, in most cases, summaries with three bullet points. The mlong-t5-tglobal-base model, which is similar to an LLM and can be used for various tasks, might be more flexible in its output. However, these are hypotheses that require further testing.
Examples
Here I’ll show some examples of mBART and mlong-t5-tglobal-base summaries on the Riigikogu (Estonian Parliament) test data.
- Text 1
mlong-t5-tglobal-base:
- Sven Sester annab ülevaate riigieelarve seaduse eelnõu ettevalmistamisest teiseks lugemiseks.
- Eelnõu eesmärk on Euroopa Nõukogu direktiiv 2011/85/EL liikmesriikide eelarveraamistiku miinimumnõuete kohta üle võtta.
- Muudatusettepanekud tehti järgmisel istungil: piirang keskvalitsuse juriidilise isiku asutamisel, ümberkorraldamisel ja lõpetamisel, tasakaalureeglid, kompenseerimismehhanismi leevendamine, eelarvepositsiooni püstitamine, eelarvepositsiooni eesmärgi püstitamine.
mBART:
- Sven Sester annab ülevaate riigieelarve seaduse eelnõust ja muudatustest
- Sotsiaaldemokraatlik Erakond esitab mitmeid muudatusettepanekuid seoses riigieelarve koostamise ja tasakaalustamisega
GPT-3.5 summary:
- Sven Sester annab ülevaate riigieelarve seaduse eelnõu ettevalmistamisest teiseks lugemiseks
- Eelnõu eesmärk on üle võtta Euroopa Nõukogu direktiiv 2011/85/EL
- Rahanduskomisjon tegi mitmeid muudatusettepanekuid eelnõule
- Text 2
mlong-t5-tglobal-base (notice the repetition of numbers):
- Sven Sester selgitab Eesti Panga seaduse muutmise seaduses esitatud muudatusettepanekuid ja muudatusi
- Muudatusettepanek nr 17, 19, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51, 52, 53, 54, 55, 56, 57, 58, 59, 51
mBART:
- Sven Sester selgitab lühiajalist laenu võtmise tingimusi ja Eesti Panga hinnangut.
- Muudatusettepanekud riigi eelarvestrateegia ja kriisiolukordade kohta.
GPT-3.5 summary:
- Sester räägib Eesti Panga hinnangust seaduseelnõule
- Arengukavade rahastamisotsused ja valitsemisalakesksuse süvendamine
- Põhiseaduskomisjon ja Riigikogu roll eelnõu küsimustes
- Text 3:
mlong-t5-tglobal-base:
- Sven Sester soovitab riigi eelarvestrateegia tuua Riigikogus arutusele, et muuta iga-aastase eelarve seaduse seletuskirja õiguslikku tähendust ja kirjeldada otsustusõigus riigi jaoks oluliste pikaajaliste eesmärkide kehtestamisel.
- Komisjon käsitles töötuskindlustusmakse määrade kehtestamise regulatsiooni ja Euroopa Keskpanga arvamust eelnõu kohta.
- Rahanduskomisjon koostas muudatusettepanekuid riigi eelarvestrateegia süsteemi regulatsioonide täiendamiseks ning otsustas võtta eelnõu teiseks lugemiseks täiskogu istungi päevakorda 15. jaanuaril.
mBART:
- Sven Sester soovitab riigi eelarvestrateegiat arutada Riigikogus
- Oluline on tagada eelarve täitmise aruandes esitatud arvandmed ja täpsed nõuded valitsuse esitatavate tegevuskavade ning stabiliseerimisreservi moodustamise ja kasutamise kohta
- Komisjon langetas konsensuslikud otsused seoses töötuskindlustusmakse muudatustega riigieelarve seaduse eelnõus
GPT-3.5 summary:
- Sven Sester soovitab muuta riigieelarvega eraldatava raha kasutamise sisulisi ja mõõdetavaid eesmärke, tuua riigi eelarvestrateegia Riigikogus arutusele ning täpsustada seletuskirja struktuuri ja nõudeid valitsuse tegevuskavadele.
- Riigikontroll leiab, et aruandlus riigieelarve kulutuste tulemuslikkuse kohta peab olema informatiivsem ja arusaadavam, ning on oluline seoste selgus raha ja tulemuste vahel.
- Rahanduskomisjon arutas töötuskindlustusmakse määrasid, Eesti Panga rolli ning riigi eelarvestrateegiat, ja tegi otsuseid eelnõu edasiseks menetlemiseks.
mBART does not produce bad summaries, but they tend to be shorter than those of mlong-t5-tglobal-base. Both models may hallucinate or produce summaries that repeat text. An example of a failed mBART summarization:
'- Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Rain Epler, Raimond Kaljulaid, Rain Epler, Raimond Raipler, Rain Epler, Rain Epler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler, Raipler'
Discussion and conclusion
In conclusion, there are a few options for (longer) Estonian text summarization. My findings are:
- LSG layers: These are helpful for increasing input sequence length.
- mBART and mlong-t5-tglobal-base: Both models have strong multilingual capabilities.
- Performance on bullet point summaries: The mlong-t5-tglobal-base model outperforms mBART on bullet point summaries by a clear margin in ROUGE, although mBART’s summaries remain reasonable on manual inspection. mBART tends to generate shorter summaries and focuses on different aspects.
The final mBART model, which had a maximum input sequence length of 2048 tokens, a reduced vocabulary, and was trained on the Parliament dataset, had a PyTorch binary size of 1.4 GB. A similar model trained from mlong-t5-tglobal-base had a PyTorch binary size of 905 MB.
The best model, trained from mlong-t5-tglobal-base, is available here.
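Inference with the fine-tuned model follows the standard transformers generation API. The model id below is a placeholder for the checkpoint linked above, and no_repeat_ngram_size is shown as one common mitigation for the repetition failures seen in the examples, not necessarily a setting used in the experiments.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "your-namespace/mlong-t5-et-summary"  # placeholder, not the real repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "..."  # a parliament stenogram of up to ~2048 tokens
inputs = tokenizer(text, max_length=2048, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    num_beams=4,
    no_repeat_ngram_size=3,  # discourages the kind of repetition shown above
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```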
Afterword
I also tried to quantize the mlong-t5-tglobal-base model, reducing its size to 287 MB, but this had a significant effect on accuracy. For instance, ROUGE-1 dropped from 36.17 to 31.48, and ROUGE-Lsum from 33.76 to 29.60. Additionally, the generated text had more grammatical mistakes, and sometimes the meaning of the bullet points was difficult to understand. Although mBART had lower metrics than the quantized model, its summaries were more accurate and understandable. This highlights the importance of not relying solely on metrics and of reviewing some summaries manually.
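For completeness, one common way to apply such quantization is PyTorch dynamic int8 quantization of the linear layers; whether this matches the exact method used here is an assumption, and the quantized model then runs on CPU.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("agemagician/mlong-t5-tglobal-base")
# Replace the linear layers with dynamically quantized int8 versions.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```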
Sources used
- hf-trim, https://github.com/IamAdiSri/hf-trim
- LSG Attention: Extrapolation of pretrained Transformers to long sequences, https://arxiv.org/pdf/2210.15497
- Mastering ROUGE Matrix: Your Guide to Large Language Model Evaluation for Summarization with Examples, https://dev.to/aws-builders/mastering-rouge-matrix-your-guide-to-large-language-model-evaluation-for-summarization-with-examples-jjg