Template-Based Headline Generator for Multiple Documents

In this paper, we develop a neural multi-document summarization model, MuD2H (Multi-Document to Headline), to generate an attractive, customized headline from a set of product descriptions. To the best of our knowledge, multi-document summarization techniques have not previously been used to generate headlines, so multi-document headline generation can be considered a new problem setting. Our model implements a two-stage architecture consisting of an extractive stage and an abstractive stage. The extractive stage is a graph-based model that identifies salient sentences, whereas the abstractive stage uses existing summaries as soft templates to guide a seq2seq model. A series of experiments are conducted on the KKday dataset, and the results show that the proposed method outperforms the others in both quantitative and qualitative terms.


I. INTRODUCTION
In the era of information explosion, people are eager to find ways to acquire knowledge efficiently. To quickly capture the ideas behind articles, people tend to read headlines first and then decide whether an article is worth reading. In this paper, we aim at generating headlines for multiple documents. Generating a headline for a text can be considered a subproblem of summarization [1] [2]: the given sentence has to be both representative and attractive. TemPEST [3] is a model designed to generate personalized headlines that try to catch the attention of electronic direct mail recipients. TemPEST is a soft template-based seq2seq model [4] with three stages: Retrieve, Rerank and Rewrite. The model builds an Information Retrieval (IR) system for indexing and search, reranks the search results, and then selects a suitable template. Besides summarizing and abbreviating the input document, a title-generating model needs to produce a suitable output. However, TemPEST is designed for a single document; when we want to generate a representative sentence for a set of documents, the current model fails.
To solve this problem, we propose a model "from Multi-Document to Headline", which generates a personalized headline for a set of input documents. Our model involves two stages, handling multi-document summarization and headline generation for multiple documents, respectively. Instead of filling user names and destinations into a hard template, our model generates a truly customized headline close to the user's preference. Different from TemPEST [3], our Rerank step adds the user's click history to help find a template style matching the user's preference. Since we start directly from the user's favorite template to avoid the problem of sparse encoder input, our Rewrite step uses a single selective encoder [5].
The proposed model is evaluated on a new dataset from KKday, an e-commerce platform for tourism products. The dataset we used includes product descriptions, product introductions and blog articles. A product introduction presents the highlights of a product, whereas a product description introduces its usage and notices in detail; the difference between them is shown in Table 1. Blog articles introduce an attraction and mention multiple products related to it. Hence, blog articles and their headlines serve as the baseline for comparison with the headlines we generate.
Our contributions are summarized as follows:
• We propose a model MuD2H to generate a headline for a set of documents. This is the first work to use a graph neural network to learn representative embeddings for multi-document headline generation.
• MuD2H not only outperforms other baselines in terms of Rouge scores but also generates headlines users prefer, as shown by human evaluation.

Table 1 (product introduction vs. description, American Museum of Natural History):
Introduction: Book with KKday in advance and gain access to the American Museum of Natural History. Admire a world-class collection of around 36 million specimens and cultural artifacts.
Description: Highlights: avoid crowds and long lines by booking your tickets in advance; gain entry to both permanent and temporary exhibitions; explore one of the world's best scientific, educational and cultural institutions. What You Can Expect: beat the lines and crowds by booking your tickets with KKday in advance; step into the Hintze Hall and be greeted by a colossal blue whale skeleton that hangs suspended from the soaring ceilings of the rotunda; from there, explore the exhibition halls and take your time admiring the star specimens and highlights of the museum.

The rest of the paper is organized as follows. Section II reviews relevant prior work, including extractive and abstractive summarization methods, multi-document summarization, and two-stage architectures. Section III introduces the proposed model in detail. Section IV presents the experimental results, including baselines, evaluation, and a case study of our generated headlines. Finally, Section V concludes.

II. RELATED WORK
In this section, we discuss three related topics: extractive summarization, abstractive summarization, and hybrid summarization.

A. EXTRACTIVE SUMMARIZATION
Early studies of extraction-based summarization were inspired by PageRank [6]. For example, TextRank [7] and LexRank [8] compute sentence salience over a similarity graph of sentences. Li et al. [9] applied a support vector regression (SVR) model for feature selection and weighting to tackle the semantic-repetition problem. GreedyKL [10] used Kullback-Leibler (KL) divergence as the criterion for selecting a summary for a given text. Newer extractive summarization approaches are learning-based. G-Flow [11] handles sentence selection and ordering separately; to evaluate coherence, it estimates quality with an approximate discourse graph (ADG) based on hand-crafted features. R2N2 [12] developed a ranking framework for redundant sentences: it transforms the input sentences into a binary tree and then processes the tree recursively with an RNN at each node. Yasunaga et al. [13] applied a graph convolutional network [14] to their proposed personalized discourse graph and used a GRU to calculate sentence embeddings; the model generates cluster embeddings with a document-level GRU to fully aggregate features between sentences. HSG [15] constructs a heterogeneous graph containing semantic nodes of different granularity levels beyond sentences; additionally, HSG can flexibly extend from a single-document to a multi-document setting. SgSum [16] is a graph-based extractive method for multi-document summarization: it treats sentences as nodes, generates a sentence relation graph for the input documents, and outputs a subgraph as the summarization result. ThresSum [17] is a recently published method that applies a powerful encoder to represent each sentence; it utilizes supervised variables to select sentences as close to the original article's meaning as possible, without a fixed limit of k selected sentences.
1 Our dataset and code are available at: https://github.com/klks0304/mud2h

B. ABSTRACTIVE SUMMARIZATION
General abstractive summarization approaches have recently shown promising results with sequence-to-sequence neural network architectures [18], which encode documents and then decode the learned representations into an abstractive summary. Rush et al. [19] first applied an attention-based sequence-to-sequence model to abstractive summarization. Nallapati et al. [20] further changed the sequence-to-sequence model into a fully RNN-based model and achieved outstanding performance. RNN-based encoder-decoder structures have been used repeatedly since then; for example, DRGD [21] uses a recurrent generative decoder to learn latent information of the text. On the other hand, Cao et al. [4] observed that seq2seq models tend to copy source words in order, so they proposed the soft template-based summarization model Re3Sum. In traditional template-based approaches [1] [22], a template built from manually defined rules is an incomplete sentence to be filled with keywords; because templates are manually defined, this is time-consuming and requires a great deal of manual effort. Re3Sum [4] proposed a novel soft template-based architecture that uses existing summaries as templates to guide the seq2seq model. BiSET [23], the state-of-the-art template-based abstractive summarization method, follows this architecture; to improve output expression, BiSET uses a bidirectional selective layer with two gates to select key information. TemPEST [3] proposes a personalized subject generation model that adds a user-aware sequence encoder to generate user-specific article representations, helping the machine generate user-specific subjects. Most abstraction-based summarization methods are seq2seq models; the NATS toolkit [24] collects these methods and evaluates them on the CNN/Daily Mail dataset. Usually, abstraction-based methods are suitable only for single-document summarization.
The recently published BASS [25] applies a semantic graph to connect words in the input documents. Because BASS can connect words between different documents, it also works on multi-document summarization, and it successfully narrows the gap between the multi-document summarization problem and abstractive summarization models.

C. HYBRID SUMMARIZATION
Liu et al. [26] proposed a two-stage model, T-DMCA, for multi-document summarization, concatenating extractive and abstractive summarization methods. This work is known less for its model than for proposing WikiSum, a well-known dataset used by subsequent summarization work. T-DMCA achieves its best result by applying term frequency-inverse document frequency (tf-idf) to rank sentences in the extractive stage, followed by a transformer decoder with memory-compressed attention in the abstractive stage. HierSumm [27], also a two-stage model, adopts logistic regression to rank paragraphs, then applies a global transformer [28] layer to exchange information across paragraphs and outputs an abstractive summary. ESCA [29] applies a matrix layer after the sentence encoder that efficiently controls the outcome of the extractor; since extractive summarization yields more human-like sentences, adjusting this outcome and combining it with the abstractor gives a higher-quality summary. TG-MultiSum [30] extracts the topic of each document, constructs a heterogeneous graph representing the documents, and then learns to produce a summary. CABSD [31] works similarly: it extracts sentences from learned subtopics and then generates an abstractive summary. The most recent works, such as ESCA, TG-MultiSum and CABSD, are two-stage: they first extract from the input and then abstract an output summary, which is the trend in two-stage multi-document summarization.

III. PROPOSED MODEL
Our proposed model is designed in two stages. Before generating a representative headline for the input documents, we extract sentences that capture the overall meaning of the documents. In short, the first stage is an extractor and the second stage is an abstractor. Figure 1 shows the structure of the proposed model.

A. THE EXTRACTOR
In the extractor, given a collection of documents D, our goal is to extract salient sentences from these documents. Let D = {d_1, d_2, . . . , d_N} denote the set of input documents. Traditional approaches for extraction-based summarization rely on human-crafted features. To address this problem, we propose a data-driven approach that adopts a graph-based learning model. We build a sentence relation graph to capture the relations among sentences, and feed each sentence into a recurrent neural network to generate a sentence embedding. The next step is to apply a Graph Convolutional Network [14] to the sentence relation graph, with the sentence embeddings as input node features; this produces a high-level hidden feature for each sentence. After that, we use a linear layer to estimate a salience score for each sentence, which helps the model extract suitable sentences from the documents. Finally, instead of simply taking the sentences with the top salience scores, we use a greedy method to select salient sentences to represent the input set of documents.

1) Sentence Relation Graph
In the sentence relation graph, each vertex represents a sentence s_{i,j}, the j-th sentence of document d_i. The weight of the undirected edge between s_{i,j} and s_{i',j'} indicates their degree of similarity. We use the cosine similarity between each sentence pair (s_{i,j}, s_{i',j'}) and construct a complete graph. However, the model does not work well if we feed in this complete semantic graph directly, because a complete graph carries too much redundant information. To emphasize sentences with higher similarity, we set a threshold t_g and remove the edges whose weight is under the threshold. The sentence relation graph is represented as an adjacency matrix A, which the graph convolutional network [14] uses for salience estimation. The algorithmic form of the graph-generating process is given in Algorithm 1.
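The graph construction above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `relation_graph` is hypothetical, and the sentence embeddings are assumed to be given (in the paper they come from the GRU encoder of the next subsection).

```python
import numpy as np

def relation_graph(embeddings, t_g=0.1):
    """Build the adjacency matrix of the sentence relation graph.

    Start from a complete graph weighted by pairwise cosine similarity,
    then drop edges whose weight falls below the threshold t_g.
    """
    X = np.asarray(embeddings, dtype=float)
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = normed @ normed.T                 # pairwise cosine similarity
    A = (sim >= t_g).astype(float)          # keep only sufficiently similar pairs
    np.fill_diagonal(A, 0.0)                # self-loops are added later as A + I
    return A
```

With t_g = 0 this degenerates back to the complete graph the text warns about; the threshold is what removes the redundant low-similarity edges.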

2) Sentence Embedding
Given a collection of documents D, we encode all sentences appearing in each document. For each word in sentence s_{i,j}, we convert the word into a word embedding, then feed the word embeddings of s_{i,j} into the sentence encoder to generate its sentence embedding s'_{i,j} of dimension d_s. We use a recurrent neural network (RNN) with Gated Recurrent Units (GRU) as the sentence encoder, taking the last hidden state as the sentence embedding. All sentence embeddings from the given collection of documents are stacked into a matrix X ∈ R^{M×d_s}, where M is the total number of sentences in the document set D. The matrix X serves as the feature matrix for the graph convolutional network [14] in salience estimation.

3) Salience Estimation
A Graph Convolutional Network is a multi-layer neural network that operates directly on a graph and induces embedding vectors of nodes based on the properties of their neighborhoods. The layer-wise linear formulation allows the model to capture higher-level hidden features of sentences. We use the adjacency matrix A to represent the sentence graph and X as its feature matrix in this step:
• A ∈ R^{M×M}, the adjacency matrix of the sentence relation graph, where M is the number of vertices. In particular, a_{ij} = 1 if the i-th node is adjacent to the j-th node, and a_{ij} = 0 otherwise.
• X ∈ R^{M×d_s}, the input node feature matrix, where d_s is the dimension of the feature vectors.
The output of this stage is a high-level hidden feature for each node, S'' ∈ R^{M×F}, where F is the dimension of the output embedding. To include each node's own features in the aggregation, we add self-loops to the adjacency matrix A such that Ã = A + I_M, where I_M is the identity matrix. Our propagation rule follows

H^{(l+1)} = ELU(D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)} + b^{(l)}),

where D̃^{-1/2} Ã D̃^{-1/2} is the normalized symmetric adjacency matrix, D̃ is the degree matrix whose i-th diagonal element is the sum of the i-th row of Ã, W^{(l)} is the input-to-hidden weight matrix learned in the l-th layer, and b^{(l)} is the bias vector. We use the Exponential Linear Unit (ELU) [32] instead of the Rectified Linear Unit (ReLU) [33] as the activation function, because ELU tends to converge faster and handles the vanishing gradient problem better. Subsequently, we use a linear layer to project the high-level hidden feature of each sentence to a salience score, normalized via softmax:

salience(s_{i,j}) = softmax(W_p s''_{i,j} + b_p),

where s_{i,j} is the j-th sentence of the i-th document and s''_{i,j} is its high-level hidden feature.
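The propagation rule and salience projection can be sketched numerically as below. This is an illustrative single-layer forward pass under the standard GCN formulation, not the trained model; the function names and the shape of the projection parameters are assumptions.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential Linear Unit activation."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gcn_layer(A, H, W, b):
    """One GCN propagation step: ELU(D^-1/2 (A + I) D^-1/2 H W + b)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)                     # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return elu(A_hat @ H @ W + b)

def salience_scores(H_out, W_p, b_p):
    """Project each node's hidden feature to a scalar, normalize via softmax."""
    logits = (H_out @ W_p + b_p).ravel()
    e = np.exp(logits - logits.max())           # numerically stable softmax
    return e / e.sum()
```

Stacking two `gcn_layer` calls with different weight matrices reproduces the two-convolution architecture used in the experiments.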

4) Training
Previous works [34] [13] use cross-entropy loss for training. When we trained our model with cross-entropy, the loss tended to push scores close to 0 or 1, which hinders ranking. To overcome this problem, we train the model with a contrastive loss. Since we select sentences by salience score, the sentence-selection problem can be cast as a ranking problem; the relative ranking between sentences matters more than absolute scores, and a contrastive loss captures this relative ranking. Thus, referring to the contrastive loss [35], we define the ranking loss for a pair of sampled sentences as

L = (1 − y) · D² + y · max(0, μ − D)²,

where y is a label indicating whether the rankings of the two sampled sentences are similar (y = 0) or far apart (y = 1) compared with σ = Var(R(S)), and μ is a preset margin. D represents the distance between the two given sentences; we use the Euclidean distance here. More precisely, R(s) = softmax(r(s)), where r(s) is the ROUGE-1 recall score of sentence s measured against the ground truth.
The objective function states that if two data points are considered similar (y = 0), we minimize the distance between them. Far pairs contribute to the loss only if their distance is within the specified margin: when two data points are considered far apart (y = 1) and their distance is less than the margin, the loss penalizes the remaining gap up to the margin.
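The two cases above can be written as a few lines of code. This is a sketch of the standard pairwise contrastive loss form described in the text, applied to a precomputed distance; how pairs are sampled and labeled in the actual training loop is not shown.

```python
def ranking_loss(d, y, margin):
    """Contrastive loss on the distance d between two sentences' scores.

    y = 0 (similar ranking): pull the distance toward zero.
    y = 1 (far ranking):     penalize only if d falls inside the margin.
    """
    if y == 0:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Note that a far pair whose distance already exceeds the margin contributes zero loss, exactly as described above.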

5) Sentence Selection
After sorting the sentences in descending order of the predicted scores, we start choosing sentences. Rather than naively selecting the top-k sentences, we apply a greedy strategy, which can select diverse sentences instead of sentences with repeated meanings [36] [12]. Each time we take one sentence from the top of the list, we check whether it is redundant with the already selected sentences, using tf-idf cosine similarity.
For an input sentence s and the selected sentence set C, if the cosine similarity between s and any sentence in C is above a threshold t_s, the sentence is considered redundant and skipped; otherwise, we select it. We repeat this step until the expected number of sentences n is reached. The algorithmic form is shown in Algorithm 2.
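The greedy selection loop can be sketched as below. It is a minimal illustration of the redundancy check, assuming precomputed tf-idf vectors per sentence; the function names are hypothetical stand-ins for Algorithm 2.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two tf-idf vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def greedy_select(ranked_ids, tfidf, n, t_s):
    """Walk the score-sorted sentence list; keep a sentence only if its
    tf-idf cosine similarity to every already-selected sentence is below t_s."""
    selected = []
    for i in ranked_ids:
        if any(cosine(tfidf[i], tfidf[j]) >= t_s for j in selected):
            continue                    # redundant with an earlier pick
        selected.append(i)
        if len(selected) == n:
            break
    return selected
```

Lowering t_s makes the summary more diverse at the cost of possibly skipping highly salient but similar sentences, which mirrors the diversity-versus-relevance trade-off discussed in the experiments.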

B. THE ABSTRACTOR
In the abstractor, our goal is to generate a headline that is personalized, attractive, faithful, and within the length constraint. We therefore follow previous template-based summarization frameworks [4] [23] in the abstractor. The input of the abstractor is the collection of sentences produced by the extractor; these sentences are concatenated and can thus be treated as an article. Each article A_r consists of n words {x^a_i | i ∈ [1, n]}. Let T denote the set of templates in the training corpus, T = {t_i | i ∈ [1, p]}, where p is the number of template candidates in our dataset.
For the given article, we use an Information Retrieval (IR) platform to find soft template candidates from T, and then choose the best template T' = (x^t_1, x^t_2, . . . , x^t_n) through Rerank and the user's click history. Subsequently, we extend a seq2seq model to generate a headline by learning the important information from A_r and T'.

1) Retrieve and Rerank
The goal of Retrieve and Rerank is to choose the best template for A_r. Retrieve returns template candidates from the training corpus. We assume that similar sentences hold similar summary patterns; therefore, given an article, we find its analogues in the training corpus and take their headlines as template candidates. Given A_r, we use the widely used IR system Pylucene to retrieve a set of similar articles, whose headlines are treated as the template candidates. For each A_r, we keep the top 30 search results as template candidates.
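The Retrieve step can be sketched as follows. This is a simplified stand-in: the paper uses Pylucene's word-matching search, whereas this sketch ranks corpus articles by cosine similarity over document vectors; the function name and inputs are assumptions for illustration.

```python
import numpy as np

def retrieve_templates(article_vec, corpus_vecs, corpus_headlines, k=30):
    """Return the headlines of the k corpus articles most similar to the
    input article; these serve as the soft template candidates."""
    a = np.asarray(article_vec, dtype=float)
    sims = []
    for v in corpus_vecs:
        v = np.asarray(v, dtype=float)
        sims.append(float(np.dot(a, v) / (np.linalg.norm(a) * np.linalg.norm(v))))
    order = np.argsort(sims)[::-1][:k]        # highest similarity first
    return [corpus_headlines[i] for i in order]
```

The subsequent Rerank step then re-scores these candidates with Doc2Vec similarity and the user's click history to pick the final template.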
The Retrieve process is based only on word matching or surface text similarity and does not measure deep semantic relationships. Therefore, we use Doc2Vec [37] embeddings to compute cosine similarity and identify the best template among the candidates. Additionally, since we expect the generated headline to be personalized, we add the user's click history to template selection: we record the titles of the products the user has clicked as the click history. As a result, in the Rerank process, we combine the user click history with the article to compute cosine similarity against the template candidates and select the desired template T' for A_r.

2) Rewrite
Our implementation of the Rewrite step is inspired by BiSET [23] and the selective mechanism [5]. Entering the Rewrite step, recall that we have a source article A_r and a suitable template T' obtained from Retrieve and Rerank. We use a two-layer Bidirectional Long Short-Term Memory (BiLSTM) encoder to encode the article and the template into hidden states h^a_i and h^t_j, respectively. The role of Rewrite is to select important information. As shown in Figure 2, there are two selective gates: the Template-to-Article (T2A) gate and the Article-to-Template (A2T) gate. The T2A gate uses the template to filter the article representation. We concatenate the last forward hidden state →h^t_n and backward hidden state ←h^t_1 as the template representation h^t. For each time step i, the gate takes h^t and h^a_i as inputs and outputs a gate vector g_i to select from h^a_i:

g_i = σ(W_t h^t + W_a h^a_i + b_g),  h^{a'}_i = h^a_i ⊙ g_i,

where σ denotes the sigmoid activation function and ⊙ is element-wise multiplication. After the T2A gate, we obtain a sequence of vectors (h^{a'}_1, h^{a'}_2, · · · , h^{a'}_n). The goal of the A2T gate is to control the proportion of h^{a'} in the final article representation.
We assume the source documents are credible, which implies that the current article A_r is credible, and learn a confidence degree d to decide the proportion of h^{a'}_i:

d = σ(W_d [h^a; h^t] + b_d),

where h^a is generated in the same way as h^t, by concatenating the forward hidden state →h^a_n and backward hidden state ←h^a_1. The final article representation is computed as the weighted sum of h^{a'}_i and h^a_i:

z^a_i = d · h^{a'}_i + (1 − d) · h^a_i.

This finishes the encoding of the input article: it selects important information and yields a vector representation. In the decoder, we stack two layers of a recurrent neural network with Long Short-Term Memory (LSTM) units and use an attention mechanism [38] to generate the headline. At each time step t, the LSTM reads the previous word embedding w_{t−1} and the hidden state h^c_{t−1} from the previous step, then outputs a new hidden state for the current step:

h^c_t = LSTM(w_{t−1}, h^c_{t−1}),

where the initial hidden state of the LSTM is the original article representation h^a. The context vector c_t for the current time step t is computed through the concatenate attention mechanism [38], which uses h^c_t and z^a to obtain importance scores; the scores are normalized and used in a weighted sum to produce the context vector. Subsequently, we use a concatenation layer to combine the hidden state h^c_t and the context vector c_t into a new readout hidden state h^o_t:

h^o_t = tanh(W_c [h^c_t; c_t]).

In the final stage, h^o_t is fed into a softmax layer to output the target word distribution for predicting the next word w_t over the existing words w_1, w_2, . . . , w_{t−1}:

p(w_t | w_1, . . . , w_{t−1}) = softmax(W_o h^o_t + b_o).
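The two selective gates can be sketched numerically as below. This is an illustrative BiSET-style gating computation under assumed parameter shapes, not the trained model; the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def t2a_gate(h_t, h_a, W_t, W_a, b):
    """Template-to-Article gate: one gate vector per article time step,
    computed from the template summary h_t and each article state h_a[i]."""
    g = sigmoid(h_t @ W_t + h_a @ W_a + b)   # broadcasts over time steps
    return h_a * g                            # element-wise selection

def a2t_mix(h_a, h_a_sel, d):
    """Article-to-Template gate: blend the gated and raw article states
    with a scalar confidence degree d in [0, 1]."""
    return d * h_a_sel + (1.0 - d) * h_a
```

With d close to 1 the final representation trusts the template-filtered states; with d close to 0 it falls back to the raw article encoding, which is the proportion-control role the A2T gate plays.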

3) Training
The objective function includes two parts. To learn headline generation, we minimize the negative log-likelihood between the generated headline w and the human-written headline w*:

L_h = − Σ_t log p(w*_t | w*_{<t}, A_r, T').

To learn the style of the template, we minimize the negative log-likelihood between the generated headline w and the template w^t:

L_t = − Σ_t log p(w^t_t | w^t_{<t}, A_r, T').

In other words, minimizing L_h optimizes how well the model captures information from the input documents: when L_h is small, the output stays close to the original meaning of the input set. On the other hand, minimizing L_t optimizes the personalized style of the headline: when L_t is small, the output is closer to the user's preferred template and is thus more personalized. The final objective function combines the two:

L = L_h + L_t.

IV. EXPERIMENT
The goal of this work is to generate a suitable headline for an input set of documents. More specifically, we break our problem into the following questions:
• How should the sentence relation graph be constructed to achieve the model's best performance?
• Can our extractor outperform other extraction-based summarization models?
• Is the complete two-stage architecture better than using the abstractor alone?

A. DATASETS
We used a real-world dataset provided by the traveling e-commerce platform KKday (https://www.kkday.com/zh-tw). KKday provides over 30,000 products from over 90 countries, including local tours, activities, and tickets. We trained the extractor using product introductions, descriptions, and titles. The KKday blog dataset, in which each article mentions several different products, provides the material for studying MuD2H on multiple documents; on average, each blog mentions eight products. The headlines generated by MuD2H were compared with the original headlines of the blog articles. 80% of the dataset was assigned for training, and 20% for validation and testing. Figure 3 describes the relationship between product introductions and blog information. Dataset and implementation details are provided in the supplementary material. An overview of our dataset is given in Table 2.

B. IMPLEMENTATION DETAILS
To set the edge weights of our sentence relation graph, we use t_g = 0.1 according to the experiment below. Each document is tokenized with the Chinese Knowledge and Information Processing (CKIP) toolkit. Word2Vec [39] and Doc2Vec [37] embeddings are implemented with gensim and pretrained on the latest Chinese Wikipedia dump. The output dimension of sentence embeddings equals the word embedding dimension, i.e., 250. For the graph convolutional network, we set the embedding size of the first convolution layer to 400 and that of the second convolution layer to 128. The batch size is 16. The objective function is optimized using Adam [40] stochastic gradient descent with a learning rate of 0.0075 and early stopping with a window size of 10. We apply dropout with probability 0.2 before the linear layer. The threshold t_s in sentence selection is 0.8 (tuned on the validation set). For the abstractor, we construct our architecture following BiSET [23], which extends the popular seq2seq framework OpenNMT [41]. The sizes of the word embeddings and LSTM hidden states are set to 500. Additionally, the objective function is optimized using the Adam optimizer with a learning rate of 0.001. For all baseline models, we use the default parameter settings from their original papers or implementations.
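For reference, the settings above can be collected into a single configuration sketch; the dictionary names and key names are our own, but the values are the ones stated in the text.

```python
# Hyperparameters as reported in the implementation details above.
EXTRACTOR_CONFIG = {
    "t_g": 0.1,              # edge-weight threshold of the sentence relation graph
    "embedding_dim": 250,    # word embedding size = sentence embedding size
    "gcn_dims": (400, 128),  # first / second convolution layer sizes
    "batch_size": 16,
    "learning_rate": 0.0075,
    "early_stop_window": 10,
    "dropout": 0.2,
    "t_s": 0.8,              # redundancy threshold in sentence selection
}

ABSTRACTOR_CONFIG = {
    "embedding_dim": 500,    # word embedding and LSTM hidden state size
    "hidden_size": 500,
    "learning_rate": 0.001,
}
```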

C. EVALUATION METRICS
To analyze the influence of different construction methods for the sentence relation graph, we use Normalized Discounted Cumulative Gain (NDCG) [42] for evaluation. NDCG is a ranking evaluation metric; since we treat extractor training as a ranking problem, we use NDCG for performance comparison.
For the summarization task, we adopt the Rouge [43] score for automatic evaluation. Rouge-1 and Rouge-2 measure unigram and bigram overlap, respectively, while Rouge-L is based on the longest common subsequence of words between the original summary and the predicted summary. Additionally, we use Word2Vec [39] cosine similarity to measure the average similarity between the output and each document, because we expect the output to express the meaning of every document.
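The two ingredients of these metrics can be sketched directly: n-gram recall for Rouge-n and the longest common subsequence for Rouge-L. This is an illustrative implementation of the metric definitions, not the evaluation toolkit used in the experiments.

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-n recall: overlapping n-grams / total reference n-grams."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    ref, cand = Counter(ngrams(reference)), Counter(ngrams(candidate))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

def lcs_len(a, b):
    """Length of the longest common subsequence (the basis of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]
```

Rouge-L precision and recall divide `lcs_len` by the candidate and reference lengths, respectively, and combine them into an F-measure.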

D. SENTENCE RELATION GRAPH COMPARISON
Different methods for converting the relations between sentences into numeric edge weights will influence the sentence relation graph. We try several methods, varying t_g from 0 to 0.2 to observe the impact. The considered methods include:
1) Cosine: calculate the Word2Vec cosine similarity between each sentence pair.
2) TextRank [7]: a weighted graph is created where nodes are sentences and edges are defined by a similarity measure based on word overlap. An algorithm similar to PageRank [6] then calculates the importance of each sentence and the precise edge weights; the transition matrix describing the Markov chain used in PageRank is extracted.
3) LexRank [8]: a widely used multi-document extractive summarizer based on eigenvector centrality in a graph of sentences. We build a graph with sentences as nodes and edges weighted by tf-idf cosine similarity, then run a PageRank-like algorithm to set the edge weights.
4) tf-idf: treat each sentence as a query and all sentences in the multi-document set as the documents; the weight corresponds to the cosine similarity between each pair.
Table 3 shows the experimental results. We choose the best method and parameters for the rest of the MuD2H model. The results show that using cosine similarity to build the sentence relation graph is significantly better than the other methods on the NDCG evaluation. The likely reason is that cosine similarity relies on the semantics of the sentences rather than surface word matching.

E. QUALITATIVE RESULTS
First, we compare our extractor with several extraction-based summarization models in terms of ROUGE recall scores. Random represents randomly choosing k sentences from our sentence selection set, and Top-k takes the most similar sentences by cosine similarity. Compared with traditional methods such as TextRank [7] and Continuous LexRank (Cont. LexRank) [8], our model performs better on the Rouge scores. The state-of-the-art graph-based approach SemSentSum [34] is a fully data-driven model that uses cross-entropy as its objective function. As expected, it outperforms the other traditional baselines on Rouge-2 and Rouge-L, but our model still performs better. Because SemSentSum applies cross-entropy as its objective, sentence ranking becomes unstable in deeper layers as the loss fades; this is where our contrastive loss function plays its role. Maximal Marginal Relevance (MMR) [44] is a well-known greedy algorithm for multi-document summarization [45], and improvements of MMR have been proposed; for comparison, we use the state-of-the-art phrase embedding-based MMR [46] as a baseline. It focuses on producing a non-redundant summary, so its output has relatively high word diversity. The Rouge-1 score of Top-k is higher than its Rouge-2 and Rouge-L scores, and these scores are close to those of the proposed method, which means Top-k can also include sentences with close meanings. However, Table 6 presents a case study demonstrating the limitation of Top-k: in brief, Top-k selects sentences with repeated meanings. As shown in Table 6, the first and second sentences selected by Top-k are both about the Universal Express Pass. In contrast, the proposed method selects sentences by taking both diversity and relevance into account, which demonstrates its advantage. It remains challenging, however, to determine whether finding diverse sentences is the key, because more off-topic sentences could also be found.
Overall, graph-based methods, including our model and SemSentSum, achieve better ROUGE recall scores than MMR.
In the multi-document summarization task, an important goal is that the generated result expresses the focus of each document. This is a semantic-level problem, so we adopt Word2Vec similarity: we measure the average similarity and standard deviation between our outputs and each input document. The average similarity should be as large as possible, and the standard deviation as small as possible. Table 5 shows the cosine similarities for the different models; our model has the highest average cosine similarity. Since the input documents have a clear relation (for example, products mentioned in the same blog must come from the same city), a successful multi-document model should at least capture this shared characteristic. Any model that captures the common characteristic easily obtains a high score in this experiment; in other words, it is difficult for the selected models to get low scores.
To demonstrate that our two-stage model is useful, we separately feed the model the outputs of different extractors and the result of directly concatenating the documents. Table 7 shows the results. We report Rouge F1 scores between the generated and human-written headlines, and the average Word2Vec cosine similarity between the generated headline and every document. Our model performs better when the extractor is used in the first stage; we attribute this to the extractor capturing the cross-document focus. Furthermore, our complete model beats all the baseline models, showing the best result on the real-world dataset.

1) Human Evaluation
In addition to the automatic evaluation, we also assess model performance by human evaluation on a real case. We conducted a user survey with 31 users, including computer science graduate students and web users. Each sample includes a set of product introductions and the headlines generated by different methods, and we asked the users to rank each headline on a scale of 1 to 4. The results in Table 8 show that the most attractive headlines are human-written, with those generated by our model in second place. In our statistics, 65% of participants considered the human-written headlines the best, and 50% considered the headlines generated by our model second only to the human-written ones. Nevertheless, our model is the best among the auto-generated headlines.

2) Case Study
Table 9 and Table 10 show a case study of the customized headline generation task. Given the multi-product introductions in Table 9, our proposed model can generate headlines in different styles according to different templates, as shown in Table 10. Users may favor different templates and thus be attracted by different headlines; we tailor the user-specific headlines according to the click history of other products.

V. CONCLUSION
In this study, we propose a two-stage model MuD2H that generates a summary and headline for multiple documents.
To the best of our knowledge, this is the first model to generate headlines for multiple documents. To evaluate the proposed model MuD2H, we collected a new dataset from an e-commerce site for tourism products, which contains product descriptions, product introductions, blog articles, and user browsing records. The first stage of our approach is graph-based extractive summarization: we apply a graph convolutional network to learn sentence features for salience estimation, and our cross-document calculation ensures that the output summary covers the meanings of the input document set rather than repeating words or sentences. The second stage is template-based abstractive summarization: we learn users' text preferences from their browsing history and then apply their favorite headline style as a soft template to guide the seq2seq model. MuD2H outperforms the existing summarization models and meets the company's requirement of generating personalized headlines for different users. In addition, we present human evaluations and case studies to illustrate our results.

ACKNOWLEDGMENT
We thank KKday's data group for their help and for providing the dataset, and Yu-Chien Tang for help with the experiments.