Introduction
Text summarization (TS) is the process of shortening a text document while maintaining as much information as possible. The goal of TS is to get rid of nonessential content and condense the crucial points of a given text into an easily accessible form. Compressing written information is an essential step towards efficient work with textual data [1].
Abstractive text summarization (ATS) covers those TS strategies that typically employ semantic methods and language generation techniques to create a shorter re-phrased summary, an abstract, from scratch, in a similar manner as humans approach the task [2]. Extractive text summarization, in contrast, tries to choose the most representative sentences from the input text and return them as a so-called extract. Although ATS shows less stable results than the extractive approach, recent progress in text generation has increased the quality of abstracts, and the summaries generally tend to be more human-like [3], [4].
The current state-of-the-art results in the summarization task are achieved almost exclusively by Transformer-based language models [1], [5]. This paper presents a detailed description of the process of training a new GPT-2 generative model [6], [7], denoted as CzeGPT-2, for Czech as a representative of non-mainstream languages. The model is evaluated on the task of abstractive text summarization using the standard ROUGERAW metric and the Czech benchmark dataset SumeCzech [8], which consists of one million newspaper articles. CzeGPT-2 significantly surpasses the published SumeCzech baselines, and its results are comparable to those of the 4-times larger multilingual pretrained mBART large model [9], which holds the state-of-the-art results on this benchmark. We also include a manual error analysis that is more subjective than the conventional ROUGERAW but helps us better understand the model behavior and the mechanism of mistake generation.
It is to be noted that our goal here is not to compare with the largest current models such as GPT-4 by OpenAI [10] or the Gemini model by Google [11], due to the massive difference in computational costs between these cloud giant models and the presented in-house freely distributed model. By thorough evaluation, we show that for a new language, it is often sufficient to train a new model with fewer parameters rather than to rely purely on cloud-based commercial systems.
The research highlights of the presented article are as follows:
Detailed description of the process of training a new freely available GPT-2 language model for a language without a freely accessible large-coverage language model.
In the summarization task evaluation, the presented CzeGPT-2 model results are comparable with a 4-times larger multilingual model, resulting in savings in GPU and power usage.
Human annotations of the summarization results offer a detailed error analysis of the model's capabilities and reveal inadequacies of the ROUGERAW metric.
Related Works
In this overview, we concentrate on the process of training a GPT-2 generative transformer model for a new language with Czech taken as a representative. Researchers have made just a few attempts to create a Czech abstractive summarizer so far. This may be because previous methods were unable to provide satisfactory results even for widely used, morphologically less rich languages, such as English [12], [13].
A. SumeCzech
The first big step for Czech abstractive summarization came in May 2018 with the SumeCzech summarization dataset [8]. The dataset paper also introduced an innovative evaluation system, ROUGERAW, which is more suitable for free-word-order languages than the original ROUGE metric. The proposed dataset comprises about one million newspaper articles scraped from Czech news websites. Each article consists of a headline, an abstract of several sentences, and a full text, which implies three summarization tasks: text $\rightarrow$ abstract, text $\rightarrow$ headline, and abstract $\rightarrow$ headline.
SumeCzech was accompanied by three unsupervised and three supervised extractive summarization techniques and three abstractive models (one for each task). The abstractive summarizers were the first attempt to train a Transformer model (the vanilla neural machine translation architecture) for Czech summarization. Their performance, though, did not exceed that of the extractive methods.
B. Named Entities
Following the SumeCzech methods, in April 2021, a research group from CTU in Prague came up with the idea of integrating information about the presence of named entities into the summarization process [14]. The authors tried to improve the results on the one-sentence summary task (text $\rightarrow$ headline).
C. Fine-Tuned mBART Model
In May 2022, M. Krotil from CTU Prague published experimental results [15] of fine-tuning the mBART-large [9] multilingual model with the SumeCzech dataset and a private dataset of the Czech News Center of about 750,000 documents. The large pretrained model significantly improved the precision of the generated summaries, although, interestingly, the model recall did not reach some of the baseline values.
Methods
The training of Transformers for Czech has so far been a matter of only a few projects [8], [16], [17], [18], [19], [20]. To our knowledge, no project has yet pre-trained an autoregressive decoder-only Transformer, which is very important for numerous tasks based on text generation.
A. Architecture
The decoder-only Transformer is often represented by the GPT model family that Radford et al. have introduced since 2018 [6], [22], [23]. The only significant architectural difference across the three GPT generations is the increasing size of the models. In principle, the topology of the networks has not changed.
Input preparation: Before feeding an input sequence into the Transformer, we must tokenize the text into words and subwords. Then, we replace the tokens with their assigned token IDs, which the model maps to the corresponding embeddings. An embedding is a fixed-size trainable vector representation of a token within the model.
The Transformer, by its nature, does not recognize the order of the input tokens, which is crucial for language modeling. We provide this information using positional encoding vectors, one for each position of the input. The positional encodings are summed with the input token embeddings, and the result is sent to the network.
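To make this concrete, the following is a minimal sketch in PyTorch (not the actual CzeGPT-2 code) of how token IDs are turned into the input of the first decoder block; the sizes match the GPT-2 small configuration used later, and all names and the toy token IDs are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative GPT-2-style input preparation: token IDs are mapped to trainable
# embeddings and summed with learned positional embeddings.
vocab_size, max_len, d_model = 50257, 1024, 768

tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding table
pos_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings (GPT-2 style)

token_ids = torch.tensor([[15, 827, 3401, 9]])             # a toy batch of token IDs
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # positions 0, 1, 2, ...

x = tok_emb(token_ids) + pos_emb(positions)   # input to the first decoder block
print(x.shape)                                # torch.Size([1, 4, 768])
```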
1) Decoder
As the name suggests, the decoder-only Transformers utilize only the decoder half of the Transformer encoder-decoder architecture. A decoder is composed of a stack of decoder blocks followed by a text prediction block working over the model vocabulary (see Figure 1 and the Tokenizer section below for details). The text prediction block consists of a linear and a softmax layer.
The objective of the decoder stack is to generate an output sequence autoregressively based on the input and the previously generated tokens. Each decoder block comprises a masked multi-head self-attention layer, which ensures that the model does not incorporate information about future tokens into the prediction during training, normalization layers, and feed-forward layers that share weights across all positions within a block but are independent across blocks.
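The masking in the self-attention layer can be illustrated with the following hedged sketch of a causal (look-ahead) mask: attention scores of future positions are set to minus infinity before the softmax, so their attention weights become zero.

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend only to positions j <= i.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)              # raw attention scores (toy values)
scores = scores.masked_fill(~mask, float("-inf"))   # hide future tokens
weights = torch.softmax(scores, dim=-1)             # each row sums to 1 over allowed positions
print(weights)
```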
The linear and softmax layers create a probabilistic distribution over the token IDs. Based on this distribution, we choose the token to append to the forming output sequence. For the generation to work well, we need to introduce a certain level of randomness; too much as well as too little randomness deteriorates the final result. Hyperparameters such as top-k or top-p are used to balance this behavior [24].
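As an illustration, the following is a hedged sketch of combined top-k and top-p (nucleus) filtering over the next-token distribution; the exact decoding code of CzeGPT-2 may differ, and the parameter values here are only examples.

```python
import torch

def sample_next_token(logits, top_k=50, top_p=0.95, temperature=1.0):
    """Sketch of top-k + nucleus (top-p) sampling over next-token logits."""
    logits = logits / temperature
    # Top-k: keep only the k most probable tokens.
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    # Top-p: keep the smallest set of tokens whose cumulative probability exceeds p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    outside_nucleus = cumulative - sorted_probs > top_p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

next_id = sample_next_token(torch.randn(1, 50257))   # toy logits over a GPT-2-sized vocabulary
```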
B. Model
As a foundation for CzeGPT-2, we have chosen the GPT-2 small model with 117 million trainable parameters. That means a 1024-token input/output sequence, embeddings of size 768, 12 decoder blocks, and 12 attention heads per block. The model is both reasonably large for the main summarization task and small enough to run on weaker GPUs or even CPUs.
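Under the assumption that the standard Hugging Face implementation is used, the configuration below reproduces the stated sizes; it is a sketch for orientation, not the exact CzeGPT-2 training script.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# GPT-2 small sizes as stated above: 1024-token context, 768-dim embeddings,
# 12 decoder blocks, 12 attention heads per block.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```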
The implementation was supported by open-source libraries with a robust training environment.
1) Tokenizer
GPT-2 uses a Byte-level byte pair encoding (BBPE [25]) tokenizer that has to be trained on textual data first. The CzeGPT-2 tokenizer was trained on the full plain-text pre-training dataset (see Section III-D) using the Hugging Face Tokenizers library. The vocabulary size was set to 50 257, which corresponds to 256 bytes in the initial alphabet, 50 000 available slots for learning, and one end-of-document token. The chosen vocabulary size is usually regarded as sufficient [26] to cover all frequent words and word prefixes and suffixes. In the case of CzeGPT-2, the texts used for tokenizer training are Czech only, which improves the model's capability to process new texts in this language.
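A minimal sketch of such tokenizer training with the Hugging Face Tokenizers library might look as follows; the corpus file name is a placeholder, and the end-of-document token follows the GPT-2 convention.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["cs_plaintext_corpus.txt"],     # hypothetical pre-training plain-text file
    vocab_size=50257,                      # 256 bytes + 50,000 learned merges + 1 special token
    special_tokens=["<|endoftext|>"],      # end-of-document token (GPT-2 convention)
)
tokenizer.save_model("czegpt2-tokenizer")  # writes vocab.json and merges.txt
```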
C. Metrics
Measuring the quality of a pre-trained language model is not easy. Usually, the best choice is to select a downstream task and see how the model performs, but in the case of summarization, the evaluation is yet again quite challenging.
If we want to state a single quality score, the most common metrics for generative language models are perplexity and accuracy. For the summarization task in free-word-order languages, ROUGERAW [8] is the current standard.
1) Perplexity
Perplexity is the benchmark metric for autoregressive models that predict a token based only on the preceding sequence. For a tokenized sequence $X = (x_1, \ldots, x_t)$, we can calculate the perplexity as the exponentiated average negative log-likelihood (cross-entropy) of the predicted tokens: \begin{equation*}\text{PPL}(X) = \exp\left\{-\frac{1}{t} \sum_{i=1}^{t} \log p_{\theta}(x_{i} \mid x_{<i})\right\}\end{equation*}
The metric intuitively illustrates the uncertainty of the language model when predicting the next token. We can understand it as an expression of how consistently the model predicts tokens from a fixed vocabulary. This mainly means that the tokenization process directly affects perplexity, which we should take into account when comparing different models, especially across languages. Otherwise, when using the same tokenizer, the lower the perplexity, the better [28], [29].
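In practice, perplexity can be obtained directly from the cross-entropy loss of a causal language model, as in this hedged sketch; the checkpoint name is a stand-in, not the CzeGPT-2 identifier.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # placeholder; a Czech model would use its own checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name).eval()

text = "A short evaluation text."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels provided, the model returns the mean cross-entropy loss
    # over shifted next-token predictions.
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", math.exp(loss.item()))
```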
2) Accuracy
Along with perplexity, the value of accuracy is often stated. This metric expresses the share of correctly predicted tokens in the output sequence [30].
Even though the metric has its positives, such as straightforward interpretability and a bounded range of values, it does not describe the capabilities of the language model thoroughly, since it does not consider the predicted probabilities of tokens other than those included in the output [28].
3) ROUGERAW
The original ROUGE [31], [32] is an English-specific set of metrics and a software package that measures the similarity between generated and target summaries according to overlaps between them. The technique is based on English stemming, stop-words, and synonyms, where specifically the stemming part is too aggressive for morphologically rich languages, and stop-words and synonyms condition the metric results on extra data for each new language.
The ROUGERAW metrics proposed with the SumeCzech dataset do not include any additional language-dependent steps, so they are language-agnostic [8]. There are several types of the ROUGERAW metric depending on what overlaps we compute. Usually, it is either the overlap of unigrams, bigrams, or the longest common subsequence between the generated and the reference summary.
The suggested variants for Czech summarization in SumeCzech are ROUGERAW-1, ROUGERAW-2, and ROUGERAW-L. Each variant is evaluated via its Precision, Recall, and F1-score, which support detailed interpretation. The F1-score as the harmonic mean of Precision and Recall is the most indicative of the three since it is robust against varying lengths of the summaries.
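For intuition only, the following sketch computes a ROUGERAW-1-style unigram overlap with Precision, Recall, and F1 on raw tokens, with no stemming, stop-words, or synonyms; for comparable numbers, the official ROUGERAW package released with SumeCzech should be used.

```python
from collections import Counter

def overlap_f1(candidate_tokens, reference_tokens):
    """Unigram overlap on raw tokens, reported as (precision, recall, F1)."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    precision = overlap / len(candidate_tokens) if candidate_tokens else 0.0
    recall = overlap / len(reference_tokens) if reference_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(overlap_f1("vláda schválila nový rozpočet".split(),
                 "vláda včera schválila rozpočet".split()))
```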
D. Data and Training
The summarizer training procedure comprises two steps. First, we pre-train the model on unlabeled data with a broad domain to improve the world knowledge of the model. Next, we fine-tune the neural network on an annotated dataset for the abstractive summarization task.
1) Initialization
Inspired by Pierre Guillou's experiment with Portuguese [33], we have evaluated several techniques for the initialization of the model's embedding vectors. This kind of approach often accelerates the training of neural networks. We tried mapping the pre-trained embeddings of tokens shared with the English GPT-2 vocabulary, as well as an initialization with FastText [34] embeddings trained specifically for this purpose on a 5 GB plain-text corpus. Neither of these techniques significantly improved the pre-training speed.
2) Pre-Training
The first phase of training CzeGPT-2 should provide a general overview of the Czech language to the model, especially syntactic and semantic relationships between words.
The CzeGPT-2 pre-training text dataset was based on the largest Czech corpus csTenTen17 [35], [36]. The dataset is composed of Czech documents crawled from the internet, including Czech Wikipedia. During post-processing, the corpus was deduplicated, and other languages were filtered out. Overall, the dataset comprises 12.5 billion tokens. For the pre-training itself, we used a 5 GB random slice from the corpus. We split the data into train/test/validation sets with a 90:5:5 ratio.
The model was pre-trained for 135 hours on an A100 GPU card using the 1cycle learning rate policy.
Figure 2. Data length distribution of the train set; all lengths were rounded down to hundreds. Orange bars denote suitable data points. Less than 0.6% of the data are longer than 3000 tokens and are not displayed.
3) Fine-Tuning
The CzeGPT-2 fine-tuning was performed with the SumeCzech summarization dataset [8], which is composed of about one million newspaper articles divided into train, validation, test, and out-of-domain (OOD) test sets. The OOD test set is a cluster comprising 4.5% of the data, extracted with the K-Means algorithm; the rest of the data was divided into train/test/validation sets using an 86.5:4.5:4.5 split.
Because the CzeGPT-2 model has a fixed-size input layer that can take only 1024 tokens, we need the article and the abstract of each data point to fit together within 1023 tokens, with one token left for a separator. Statistics of the content lengths reveal that 87.9% of the data meets this requirement (see Figure 2). We also discovered that the length distribution is nearly equal across all splits, which means that the evaluation is not negatively impacted by test inputs of different lengths than the training data.
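A hedged sketch of this filtering step, assuming the trained CzeGPT-2 tokenizer is available as `tokenizer`, could look like the following:

```python
MAX_CONTEXT = 1024  # fixed input size of the model

def fits_context(article: str, abstract: str, tokenizer, max_context: int = MAX_CONTEXT) -> bool:
    """Keep an example only if article + abstract fit into 1023 tokens,
    leaving one position for the separator token."""
    n_tokens = len(tokenizer.encode(article)) + len(tokenizer.encode(abstract))
    return n_tokens + 1 <= max_context
```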
The fine-tuning process was implemented using the Hugging Face Transformers library. We trained the model for 100 hours (15 epochs) on an A100 GPU, but the crucial drop of the evaluation loss happened in the first six epochs. This time, the maximal possible batch size was 4, so we increased it with gradient accumulation to 64. We did not use the 1cycle policy; the learning rate was maintained by the Adam optimizer.
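The optimization setup described above could be expressed roughly as follows with the Hugging Face Trainer API; the output path, learning rate, and scheduler choice are illustrative, and the actual fine-tuning script may have differed in details.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="czegpt2-abstract-summarizer",   # illustrative path
    num_train_epochs=15,
    per_device_train_batch_size=4,              # maximal batch fitting into memory
    gradient_accumulation_steps=16,             # 4 * 16 = effective batch size of 64
    learning_rate=5e-5,                         # illustrative value, not taken from the paper
    lr_scheduler_type="constant",               # stands in for "no 1cycle schedule"
    evaluation_strategy="epoch",
)
```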
From the general pre-trained model, two summarizers were fine-tuned, one for the text $\rightarrow$ abstract task and one for the text $\rightarrow$ headline task.
To separate the two parts of the input, a special separator token was inserted between the article and its summary.
In the case of fine-tuning, we aim not to model the language in general but to target the final task. Therefore, we want to punish the network only for its mistakes while summarizing, not while generating the rest of the article. For this purpose, we mask the target labels so that only the summary tokens and separators contribute to the error function.
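This masking corresponds to setting the ignored target positions to the PyTorch cross-entropy ignore index, as in the following hedged sketch with a toy sequence; the helper and its arguments are illustrative.

```python
import torch

IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss skips targets with this value

def build_labels(input_ids, summary_start):
    """Only the separator and summary tokens contribute to the loss;
    `summary_start` is the index of the separator token in `input_ids`."""
    labels = input_ids.clone()
    labels[:summary_start] = IGNORE_INDEX
    return labels

input_ids = torch.tensor([11, 42, 7, 99, 3, 18, 25])  # toy sequence: article | sep | summary
labels = build_labels(input_ids, summary_start=4)      # positions 0-3 are the article
print(labels)                                          # tensor([-100, -100, -100, -100, 3, 18, 25])
```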
Also, for the abstract generator, we decided not to train the embedding of the separator token.
Later, based on tuning on the validation set, we decided to generate three-sentence abstracts and use top-k of 50 and top-p of 0.5.
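Putting the decoding settings together, a hedged generation sketch could look like this; the checkpoint path and the separator token string are placeholders, not the published identifiers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Placeholders for the fine-tuned abstract summarizer and its tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("path/to/czegpt2-abstract-summarizer")
model = GPT2LMHeadModel.from_pretrained("path/to/czegpt2-abstract-summarizer").eval()

article = "Plný text novinového článku..."
inputs = tokenizer(article + "<|sep|>", return_tensors="pt")  # "<|sep|>" is a hypothetical separator
with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,      # decoding hyperparameters tuned on the validation set
        top_p=0.5,
        max_length=1024,
    )
summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```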
Results
For evaluating the CzeGPT-2 summarizer, two approaches have been used. The first is the automatic evaluation using the ROUGERAW software package provided by the authors of SumeCzech, and the second is a detailed error analysis performed manually by human annotators on a subset of the data.
A. ROUGERAW Evaluation
Apart from the standard test set, SumeCzech provides an out-of-domain test set composed of articles with a different topic than the rest of the partitions. Since the models cannot accept inputs longer than 1024 tokens, we had to filter out approximately 12% of the data for abstract generation and 9% for headline generation.
The ROUGERAW results of CzeGPT-2 on the entire test set and on the out-of-domain test set, compared to the approaches of [8], the named-entities method [14], the mT5-small multilingual model by Google Research [38] fine-tuned with SumeCzech, and the fine-tuned mBART model [15] mentioned in Section II, are in Tables 1 and 2.
1) Text $\rightarrow$ Abstract
In the abstract generation task, the CzeGPT-2 model outperformed all the SumeCzech baselines, including the fine-tuned mT5-small model, and achieved stable results across precision, recall, and F1-score. This stability is crucial because it means that the length of the summary does not bias the score: short summaries tend to have higher precision with lower recall, longer summaries the other way around, while the F1-score is robust against this behavior. This may be the reason why both the TextRank summarizer and CzeGPT-2 have reached a higher recall in ROUGERAW-1 and ROUGERAW-L than the state-of-the-art mBART large model (see Table 1).
With the OOD test set, the CzeGPT-2 model moves the bar above the baselines for almost all metrics, too. The deterioration of the results in the unknown domain is noticeable, but the model apparently generalizes without issues.
2) Text $\rightarrow$ Headline
In the headline generation task, the CzeGPT-2 model also did a very good job. It beats the Named Entity RNN summarizers and all the other compared state-of-the-art results except the pretrained mBART large model, for all metrics on both the test and OOD test sets (see Table 2). Examples of gold and generated headlines are presented in Table 3.
B. Error Analysis
Even though ROUGERAW is the best metric for summarization we have right now, it is far from ideal. ROUGERAW only tells us whether the model used the same words, in some form, as the ground truth. Unfortunately, with the advent of abstractive summarizers, factual and grammatical errors occur in the output, which ROUGERAW cannot reveal. With the advancements in the largest current models such as GPT-4 [10] or Gemini [11], discourse coherence can be assessed by processing the summarization results with these models [40]. A disadvantage of this approach lies in the increased evaluation cost and the still limited ability to judge subtle factual errors when compared to human evaluation.
To see how good the summaries our model generates actually are, what the most frequent types of errors are, and how these errors arise, we decided to perform a manual annotation of a subset of the generated summaries and classify the mistakes.
1) Methodology
We use the methodology suggested by [39] that assigns each error a category in two dimensions denoted as mapping and meaning.
Mapping describes the surface level – what mechanism the model used to compose the erroneous sentence, what words or phrases it combined or omitted. This dimension reveals the source of a mistake and can help us avoid it. The four mapping categories and their definitions are listed in Table 4.
The second dimension, meaning, focuses on the effect of the error. It tells the impact of the error on the syntax, semantics, and meaning of the sentence. The Meaning dimension is divided into two subdimensions – malformed and misleading – and each of them has three categories (see Table 5). The annotators choose only one of these six options for each error.
2) Course of the Analysis
To cover all aspects of the summarization dataset in an annotation subset, we have identified four groups of the generated data – the abstracts in the test set, the abstracts in the OOD set, the headlines in the test set, and the headlines in the OOD set. From each of these groups, we took the best 15 and the worst 15 summaries for the purpose of the annotation of errors. The selection was made based on the ROUGERAW-1 F1-score, so we can also inspect which summaries are considered good and bad by ROUGERAW and how this is reflected in the actual errors.
After this step, we were left with 120 summaries (60 abstracts and 60 headlines) for annotation.
The annotators were provided with the input text (the whole article), the golden summary (the original abstract or headline), and the generated summary divided into sentences. Since they did not evaluate the summary's overall quality but only searched for and categorized errors, showing the golden summary was not a problem and could help the annotators orient themselves in the input text.
The raters went through all the sentences of all generated summaries, and for each, they had two options: either they found an error and classified it in both the mapping and meaning dimensions, or they picked one of the so-called Special cases. The Special cases were Sentence missing, OK, and Repetitive (otherwise OK). The first value was for cases when the number of sentences was incorrect, OK was used when no error occurred, and Repetitive was added for situations when the sentence is entirely correct but repeats exactly what one of the previous sentences in the generated summary already says. Such a scenario was not covered in the original methodology, and it has proved helpful in a few cases. The raters were encouraged to explain their error classification using a text box prepared for this purpose within the answer table.
The final error class was decided by a majority of votes. In the case of a draw, we inspected the sentence once again and tried to find an agreement. A generated abstract example with Semantically implausible sentence annotation can be seen in Table 6.
We used the Qualtrics platform to create and distribute the analysis. At the beginning of each set of summaries, we provided the annotators with our detailed guidelines on the methodology.
3) Interpretation
a: Fleiss’ $\kappa$
To see the consistency of evaluation between the raters, we have computed the inter-annotator agreement (Fleiss' $\kappa$).
We have also calculated the agreement separately for the individual groups of summaries.
We can also notice that the agreement is much better for the part of summaries marked as best. This might be caused by a higher number of Special cases and undersampled error categories (see Best vs. Worst sets on the facing page).
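Fleiss' $\kappa$ can be computed, for example, with statsmodels; the ratings below are toy data only, standing in for the category IDs assigned by the raters (the number of raters shown here is illustrative).

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: rows are annotated sentences, columns are raters, values are category IDs.
ratings = np.array([
    [0, 0, 1],   # two raters chose category 0, one chose category 1
    [2, 2, 2],   # full agreement
    [1, 0, 1],
    [3, 3, 2],
])
table, _ = aggregate_raters(ratings)         # per-sentence counts of each category
print("Fleiss' kappa:", fleiss_kappa(table))
```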
b: Interaction of Mapping and Meaning
Next, we inspected the relationships between the categories falling under Mapping and Meaning (Figure 4). In the Sankey diagram, we see a strong connection between the Fabrication mapping category and the Contradiction meaning category.
Figure 4. Sankey diagram showing the interaction between the Mapping and Meaning error dimensions.
c: Abstractiveness Vs. Error
Since the amount of rewriting grows with increasing abstractiveness (and rewriting tends to be erroneous), it could be expected that these two variables would correlate. As suggested in [39], we have computed the abstractiveness from the ROUGERAW-L F1 score between the generated sentence and the input: the longer the common sequence, the lower the abstractiveness. The trend is visible in Figure 5. Even though the data is biased towards higher abstractiveness, we see a significant predominance of errors on the right side of the plot, while sentences with abstractiveness close to zero are likely to be correct.
Figure 5. Distribution of error types according to the abstractiveness of the generated text.
d: Best vs. Worst sets
Further, we provide a comparison of the errors within the best and worst subgroups (Figure 6). In the best row, we see a high proportion of correct sentences, accompanied by a pair of small peaks in Fabrication and Contradiction, which seem relevant according to the results from the Sankey diagram. In the worst row, the percentage of correct sentences is understandably lower; however, the relationship between Fabrication and Contradiction remains visible.
e: Error ratios
Finally, we show the ratios of erroneous sentences within different subgroups of the data (see Table 8). The numbers are not directly comparable to any published results because of the unique nature of the subgroups. However, they conclusively confirm the difficulty of correct abstractive summary generation and the insufficiency of the ROUGERAW metric as an evaluation tool.
Discussion
As can be seen from the manual error analysis results, the biggest problem of CzeGPT-2 is adding information to the abstract that is not in the input text. This behavior consequently creates situations where the abstract claims something that we do not find in the original article, or even something that contradicts the article's meaning. Unfortunately, this weakness seems to be paradoxically caused by one of the main advantages of the model – the ability to draw knowledge from the pre-training phase. The model is often easily carried away by information it has encountered in the past and deviates from the sense of the original article. In our view, this issue would not necessarily be solved by a larger model, as the "hallucination issue" is present in all current large generative models [42], but rather by an adjusted task-oriented architecture, e.g., a specific focus attention mechanism or a dual attention in an encoder-decoder [43].
The results also show that an indisputable advantage of the neural abstractive method is that the model itself decides when to use the extractive and when the abstractive approach. The error analysis shows that with increasing extractiveness, the sentences are generally less error-prone, but in some cases, the extractive technique alone is too weak and the model needs the ability to improvise. Such a combination is definitely the right direction; however, the model must learn when to use which approach.
A. Dataset Notes
During the manual evaluation, we found out that error analysis is not only helpful in evaluating the summarizer itself. It also pointed out some shortcomings in the dataset, which can affect the results and possibly favor some methods. We often encountered articles that, at the beginning of the section marked as text (body of the article), contained a copy of its abstract. This makes it an easy job for summarizers that incorporate particular heuristics into their pipeline. An extreme example is the First [8] summarizer that only takes the first three sentences of the input text as an abstract.
The dataset also contains structured descriptions of movies or games, where the neural models learn how to answer correctly, but this does not develop a general ability to summarize text.
B. Best/Worst Selection
We would also like to comment on our debatable decision to include the best and worst summaries in the manual evaluation process instead of the random selection suggested by the Lux et al. [39] methodology.
The advantage of our approach is that the choice of the candidate is an exact operation – the set of articles is uniquely determined (if there are several articles with the same ROUGE value, the order depends on the type of sorting algorithm and the dataset order, which is fixed). In contrast, with the original methodology, since the selected set is relatively small, it can easily happen that we do not choose sufficiently representative examples, and the result will be skewed. With different seeds, we can get very different outcomes.
On the other hand, the selection of such two extreme groups may not properly reflect the standard abilities and behavior of the summarizer. At the same time, it is tempting to choose certain types of candidates that, for example, are abundant in the training set. Such types can be the movie or game descriptions mentioned above.
Conclusion
In this paper, we have presented the process of training CzeGPT-2, a new autoregressive model by which we expand the ranks of Czech pre-trained transformer models. The model can be broadly utilized and fine-tuned for various downstream tasks, which in some form involve text generation. Here, we have used it as a building block to create a new abstractive summarizer that is compared with the current leading models on the largest Czech summarization dataset.
In the standard metrics, CzeGPT-2 surpasses most of the previously published methods, except for the significantly larger pretrained mBART large model that holds the state-of-the-art results, on both the abstract and headline generation tasks. The CzeGPT-2 model is freely available on the standard Hugging Face model-sharing website. With more than 4,000 downloads, the model has already proved useful for various Czech language processing tasks.
Further, we have provided a detailed error analysis of the CzeGPT-2 abstractive summarization results that brings us closer to revealing the mechanisms of error generation and their effects on the summary. Although such an analysis is time- and human-resource-intensive, we want to appeal for tuning a suitable methodology that would make it possible to compare abstractive summarizers more accurately in the future. The main reason is that the current ROUGERAW metric is not strong enough to reasonably determine the appropriateness of a summary or to reveal factual errors in it.
Even though we reached nearly state-of-the-art results, the CzeGPT-2 summarizer can still be improved. Possible further paths can lie in enlarging the model, using task-specific architectures, or augmenting, cleaning, and expanding the dataset. The experience gained on the project is a strong impulse for our future work on this task.