ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112-times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8-states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs.


INTRODUCTION
H IGH-PERFORMANCE COMPUTING (HPC) has recently been advancing hand-in-hand with Deep Learning (DL) to achieve new scientific breakthroughs in both fields. More powerful supercomputers [1], [2] and advanced libraries [3], [4], [5], [6], [7] enable the training of ever more complex models on bigger data sets using advanced processing units such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) at increasing speeds and efficiency. HPC hardware is advancing both through infrastructure of supercomputers, such as Fugaku [8], Summit [1] or the SuperMUC-NG [9], and through its components, such as TPU pods [2], specifically designed to ease large scale neural network training for users. Concurrent software improvements in form of more efficient libraries such as Horovod [6] allow executing general purpose code on large distributed clusters with minor code changes.
Through contextualized Language Models (LMs) [10], [11], Natural Language Processing (NLP) has been benefiting more from advances in HPC than other fields. In particular Transformers [12] have reached state-of-the-art performance in several tasks including translation, summarization and question answering [13], [14]. LMs are trained on unlabelled data; this independence of expensive validated data opened vast sets of raw big data allowing to up-scale LMs in NLP by orders of magnitude. The self-supervised training exclusively relies upon the sequential order of the input. Two approaches make use of this information, namely autoregressive (predict next token in a sequence, given all previous tokens) and auto-encoding (reconstruction of corrupted input) training. Once trained, LMs can extract features, referred to as embeddings, to use as input in subsequently trained supervised models (transfer-learning). This two-step training outsources the computationally expensive LM pretraining to the HPC infrastructure while the computationally simple inference can be done on commodity hardware.
Protein research provides an excellent use-case for transfer-learning as large amounts of exponentially growing but unlabelled data contrast much more limited sets with experimental annotations. One example for this is the "sequence-structure" gap [15], i.e. the gap between the number of proteins for which one-dimensional (1D) sequences are known and the orders of magnitude smaller subset of proteins for which their three-dimensional (3D) structures are known. Knowing these structures is crucial for understanding their function. Such understanding is needed, e.g.
to possibly disrupt the binding of the spiky S1 protein of the SARS-Cov-2 virus that by binding to the human receptor ACE2 caused the COVID-19 pandemic. The sequencestructure and sequence-function gaps, or more generally the sequence-annotation gaps keep growing exponentially. Closing those gaps through prediction methods based on artificial intelligence (AI) is one of the crucial challenges for computational biology and bioinformatics.
Recently, the leap of NLP through advanced LMs have successfully been generalized toward understanding the language of life through advanced LMs trained on proteins [16], [17], [18], [19], [20], [21], [22], [23], [24]. The main concept behind these approaches is to interpret protein sequences as sentences and their constituent -amino acidsas single words. Protein sequences are constrained to adopt particular 3D shapes (referred to as protein 3D structure) optimized for accomplishing particular functions. These constraints mirror the rules of grammar and meaning in natural language thereby allowing to map algorithms from NLP directly onto protein sequences. During training, the LM learns to extract those constraints from millions of examples and store the derived knowledge in its weights. While existing solutions in Protein Bioinformatics [25], [26], [27], [28], [29], [30] usually have to search for evolutionary related proteins in exponentially growing databases, LMs offer a potential alternative to this increasingly timeconsuming database search as they extract features directly from single protein sequences. On top, the performance of existing solutions deteriorates if not a sufficient number of related sequences can be found, e.g. the quality of predicted protein structures correlates strongly with the number of effective sequences found in today's databases [31]. Additionally, some proteins are intrinsically hard to align (e.g. intrinsically disordered proteins [32] or proteins which do not have any related sequences (dark proteome, [33]).
In this project (named by ProtTrans), we pursued two objectives. Firstly, we explored the limits of up-scaling language models trained on proteins as well as protein sequence databases used for training. Secondly, we compared the effects of auto-regressive and auto-encoding pre-training upon the success of the subsequent supervised training, and compared all LMs to existing state-of-the-art solutions using evolutionary information [34].

Data for Language Models (LMs)
In this work, we assessed the impact of database size on performance through two data sets: UniRef100 [35] (with 216M protein sequences) and BFD [36], [37] (with 2,122M sequences). The latter merged all protein sequences available in UniProt [38] and proteins translated from multiple metagenomic sequencing projects, making it the largest collection of protein sequences available at the time of writing. The original BFD set contained several copies of identical sequences; only one of those was kept, resulting in a subset with 2.1 billion (2.1B) protein sequences (with >393B amino acids requiring 527GB of disk space as text); we dubbed this set as BFD. This compared to UniRef100 with 216M proteins (80B amino acids, 150GB disk space; Fig. 1. Overall, BFD was about eight times larger than the largest data sets used previously [19]. Despite the 8-fold increase in data, the number of tokens increased only fivefold (Fig. 1b, because UniRef100 sequences were longer than those in BFD (1.6-fold). A similar trend held for disk storage (Fig. 1c. Translating LMs from NLP to proteins interprets amino acids as words. Thereby, protein databases contain several orders of magnitude more tokens than corpora used in NLP, e.g., Google's Billion Word data set [39] is one of the biggest for NLP with about 829 million tokens (words), i.e. about 500-times fewer than BFD with 393 billion tokens. Both UniRef100 and BFD were tokenized with a single space (indicating word-boundaries) between each token. Each protein sequence was stored on a separate line, with lines/proteins representing the equivalent of "sentences". Additionally, an empty line was inserted between each protein sequence in order to indicate the "end of a document" as some LMs such as Bert use consecutive sequences for an auxiliary task, i.e. next-sentence prediction, which was not used in this work. As a minor filtering step, all non-generic or unresolved amino acids (B, O, U, Z) were mapped to 'unknown' (X). After this pre-processing, Uniref100 required 150GB GB of storage, BFD 734 GB. For training ProtTXL, the data was transformed to pytorch tensors on the fly. For Prot-Bert and ProtAlbert, the data had to be pre-processed and stored as tensorflow records, raising the storage to 2.3TB and 22TB for UniRef100 and BFD, respectively. Given tensorflow records with terabytes, data sets had to be chunked into 6000 files for thousands of parallel workers. We also compared the amino acid frequencies between databases as shown in Fig. 1d in order to detect potential biases.

Data for supervised training
The information learnt by the LMs was condensed in form of embeddings which were compared quantitatively through their value for subsequent 2nd-step supervised training. Toward this end we used previously published data sets for ease of comparison to state-of-the-art methods based on evolutionary information and to methods extracting features through pre-trained LMs.
Per-residue prediction: When predicting properties on the level of single residues, the data set published alongside NetSurfP-2.0 [25] was used for 3-and 8-state secondary structure prediction. The NetSurfP-2.0 dataset was created through PISCES [40] selecting highest resolution protein structures (resolution <=2.5A) from the PDB [41]. The set was redundancy-reduced such that no pair of proteins had >25% pairwise sequence identity (PIDE), leaving 10791 proteins to train. About 500 proteins were randomly removed from this set and used as validation set to determine hyperparameters such as early stopping. The final performance was evaluated on three different data sets, each with <25% PIDE to the training set: CB513 (513 proteins; [42]), TS115 (115 proteins; [43]) and CASP12 (21 proteins; [44]).
Per-protein prediction: For the prediction of features of entire proteins, the DeepLoc [26] data set was used to classify proteins into membrane-bound and water-soluble and for classifying proteins into ten classes of subcellular localization (also referred to as cellular compartments). This DeepLoc data set was created by pulling all proteins with experimentally annotated localization from UniProt (release: 2016_04). Proteins in this set were redundancy reduced at a level of PIDE<30% and split into 6621 proteins for training and 1841 for testing.

Data: unsupervised embeddings
The embeddings extracted by the LMs were also evaluated visually by projecting the high-dimensional representations down to two dimensions using t-SNE [45]. A non-redundant (PIDE<40%) version of the SCOPe database [46] (release 2.07 with 14323 proteins) served as one way to interpret the t-SNE plots. For a subset of those proteins, we used experimentally annotated EC (Enzyme Commission [47]) numbers for functional classifications. Taxonomic identifiers from UniProt mapped proteins into one of the three major domains of life (archaea, bacteria, or eukarya) or to viruses (removing all proteins with missing classifications). The number of iterations for the t-SNE projections was set to 3000 and the perplexity to 30 for all plots with the exception of the amino acid plot for which we used a perplexity of 5. In this work, four LMs which achieved significant improvements in NLP (Bert [48], Albert [49], Transformer-XL [50] and XLNet [13]) were trained on protein sequences. Bert was the first bidirectional model in NLP which tried to reconstruct corrupted tokens, and is considered the de-facto standard for transfer learning in NLP. Albert reduced Bert's complexity by hard parameter sharing between its attention layers which allows to increase the number of attention heads (64 chosen here). Transformer-XL was chosen because it overcomes the problem of having a maximum sequence length, which was inherent to all previous Transformer based models (including Bert and Albert). With the average length of an English sentence around 15-30 words [51], an upper sentence length limit is no problem for sentencelevel NLP tasks but many proteins are more than 10-times longer resulting in an average length of about 350 residues (residues is the term used to describe amino acids joined in a protein sequence, i.e. the sentence length measured in number of words). For example, around 20% of the sequences in UniRef100 (216M sequences) are longer than 510. Transformer-XL still cuts sequences into fragments but allows for flow of information between fragments for longer proteins by re-using hidden states of fragments which have already been processed. This memory is uni-directional as fragments are processed sequentially. XLNet uses the memory mechanism introduced by Transformer-XL to also allow for processing of sequences of arbitrary length. While the memory remains uni-directional for both, Transformer-XL and XLNet, only XLNet allows to gather bidirectional context within one memory fragment while Transformer-XL has only access to uni-directional context. All these models were trained on UniRef100 and Transformer-XL was additionally trained on BFD (Table  1 for model parameters). Largely, we used configurations successfully transferred from NLP to protein sequences [21], [24], [52], with the exception of the number of layers that was increased to optimize memory utilization. Bert, TransformerXL and XLNet were trained with a hidden layer size (dimensionality of the features which can be extracted) of 1024 while Albert was trained with a hidden layer size of 4096. Models which use positional encoding like Bert and Albert, can process only sequences shorter or equal to the length of the positional encoding which has to be set before training. Setting the length of the positional encoding to 40k allowed the models to process protein sequences up to a length of 40k. Albert, Bert and Transformer-XL were optimized using the Lamb optimizer [53] designed for large batch sizes, while XLNet was optimized using Adam. No auxiliary tasks like Bert's next-sentence prediction were used for any model described here.

ProtTXL:
The Transformer-XL versions trained here on protein sequences are referred to as to ProtTXL (only Prot-TXL when trained on UniRef100 and ProtTXL-BFD when trained on BFD). Both LMs were trained with the configuration shown in Table 1, sharing a dropout rate of 15%, a memory length of 512 tokens and using mixed precision. The number of layers, number of heads, batch size, learning rate, weight decay, training steps and warm-up steps were adjusted according to training set size as well as GPU utilization. We focused especially on the complex interplay between learning rate and the number of warm-up steps which was shown to be crucial to prevent deeper layers of creating instability during training [54] and speed-up model convergence [55]. Here, the number of warm-up steps was set to cover at least one epoch for each data set. We tested initial learning rates between 0.001 and 0.005 which were increased linearly at every training step over the warmup period. To avoid model divergence during training, the learning rate had to be (i) reduced along with the warm-  up steps (for BFD), or (ii) increased for both (for Uniref100). Even after increasing the warm-up steps to two epochs, the maximum learning rate remained at 0.0025 for both data sets. Beyond this point, the training diverged. Using weight decay to regularize the network increased the GPU memory usage as it required to compute the norm of all weight vectors on our models, thus reducing the batch size. ProtTXL-BFD was trained for 40k steps in total, with 13.6k warmup steps using a learning rate of 0.0005, while ProtTXL was trained for 31k steps with 5k warm-up steps using a learning rate of 0.002. The Lamb optimizer was able to handle the resulting batch sizes of 44k and 22k for ProtTXL-BFD and ProtTXL, respectively, without divergence.
ProtBert: Bert was trained using both UniRef100 and BFD-100 datasets, these two versions are referred as ProtBert and ProtBert-BFD, respectively. Both LMs were trained with the configuration shown in Table 1. Compared to the original Bert publication, the number of layers was increased in order to potentially reach better performance in supervised downstream tasks, while keeping inference time as well as GPU memory consumption at a reasonable level. Unlike Transformer-XL which was trained on Nvidia GPUs, mixedprecision was not used to train other models because those were trained on TPUs. Similar to the Bert version trained in the Lamb paper [53], ProtBert was first trained for 300k steps on sequences with a maximum length of 512 and then for another 100k steps on sequences with a length of a maximum length of 2k. While ProtBert-BFD was trained for 800k steps, then for another 200k steps for sequences with maximum length of 512 and 2k, respectively .This allows the model to first extract useful features from shorter sequences while using a bigger batch size, which makes training on longer sequences and thus overall training more efficient.

ProtAlbert:
We referred to Albert trained on UniRef100 as to ProtAlbert. We used the configuration from the official GitHub repository for Albert (version: xxlarge v2) with 12 attention layers. For Albert the number of layers is increased through the number of times that Albert stacks its single layer. Compared to the original publication, we were able to increase the global batch size from 4096 to 10752 despite using the same hardware. The reason for this counter-intuitive effect is the reduced vocabulary size in protein sequences because the entire diversity of the protein universe is mapped to 20 different amino acids, compared to tens of thousands of different words. As ProtAlbert was also trained on TPUs, no mixed-precision was used for training. Similar to ProtBert, ProtAlbert was first trained for 150k steps on sequences with a maximum length of 512 and then for another 150k steps on sequences with a maximum length of 2k.
ProtXLNet: XLNet was trained on UniRef100 (ProtXL-Net) using the original NLP configuration [13] (Table 1) except for the number of layers that was increased to 30 layers which reduced the global batch size to 1024. Due to the relatively small batch-size, we used the original optimizer: Adam with a learning rate of 0.00001. The model was trained through more steps, i.e. 20k warm-up and 847k steps to compensate for the smaller batch-size of this model.

Models stage 2: supervised models
The second-stage supervised models using the embeddings from the LMs as input were deliberately kept relatively minimal to focus the differential analysis on the power of the LM embeddings. All our experiments used the pretrained LMs as feature extractors without fine-tuning, i.e. without gradient back-propagating to the LMs. Thereby, we could proxy the information contained in the embeddings through the performance of the supervised tasks. The su-pervised models have been described before [17]. To briefly summarize: we applied tasks on two different levels, namely per-residue and per-protein predictions. For the per-residue prediction a simple two-layer convolutional neural network (CNN) was trained on the embeddings. The first layer of our CNN compressed the output of the language models down to 32 dimensions using a window size of 7 (1024 for ProtBert, ProtTXL and ProtXLNet, 4096 for ProtAlbert). The compressed representation was fed to two different CNNs each having again a window size of 7. One of these CNNs was trained on predicting secondary structure in 3-states, the other was trained on predicting 8-states. The network was trained on both outputs simultaneously by adding their losses (multi-task learning). For the per-protein prediction features were also extracted from the last layer of the LMs. However, for this task the representations were averaged (mean-pooled) over the length-dimension of the protein resulting in a fixed-size representation for all proteins. The resulting vector (1024-dimensional for ProtBert and Prot-TXL, 4096-dimensional for ProtAlbert) was used as an input to a single feed forward layer with 32 neurons which compressed information before making the final predictions for both per-protein tasks simultaneously (multi-task learning).

ORNL Summit & Rhea:
The Oak Ridge National Laboratory (ORNL) provides several clusters for researchers who need computational resources not provided by research facilities such as universities. Here, we used Summit and Rhea. Summit was used to train the deep learning models, while Rhea was used for the pre-processing of data sets including the distributed generation of tensorflow records.
Summit is the world's second fastest computer, consisting of approximately 4618 nodes. Each node has two IBM POWER9 processors and six NVIDIA Volta V100 with 16GB of memory each ( Figure 2 [1]). Every POWER9 processor is connected via dual NVLINK bricks, each capable of a 25GB/s transfer rate in both directions. A single node has 0.5 TB of DDR4 main memory and 1.6TB of non-volatile memory that can be used as a burst buffer. Summit is divided into racks with each rack having 18 nodes. In all of our experiments we reserved 936 nodes for training. As having nodes on the same rack decreases the communication overhead, we reserved entire racks.
The smaller cluster (Rhea) contains two partitions: Rhea and GPU. The Rhea partition has 512 node, each with 128 GB of memory and two Intel R Xeon R E5-2650. The GPU partition has only 9 nodes, each with 1 TB of memory and two Intel R Xeon R E5-2695. Reha reduced the time needed for creating tensorflow records for the BFD dataset from 7.5 months (!) to fewer than two days, by converting the original sequential script to distributed processing using MPI. The generation script used two nodes of the GPU partition, with a total of 112 parallel threads.
Google TPU Pod: In 2016, Google introduced tensor processing unit (TPU) as its application-specific integrated circuit optimized for training neural networks. TPUs can be accessed through Google Cloud. Training the protein LMs used the latest TPU generation (V3) with 512 cores. These cores are divided into hosts with each host having access to 8 cores. Consequently, we had access to 64 hosts, and each core had 16 GiB of high-bandwidth memory. Training on the TPUs required access to a virtual machine on Google Cloud and storage on Google Bucket [56]. The workflow as well as the different scales of TPUs are depicted in Fig. 3.

Software
Summit integrates several pre-configured modules which include the most popular libraries and tools required for simulation, deep learning, distributed training and other purposes. We used the IBM Watson Machine Learning module versions 1.6.0 and 1.6.2 for our deep learning training. In contrast to this, the Google Cloud server, which we used for the TPU Pod training, had to be configured manually because only the operating system was installed.
Pytorch was used to train ProtTXL, tensorflow to train ProtBert, ProtAlbert and ProtXLNet. Both libraries used the Horovod framework [6] to train the models on distributed clusters such as Summit. Horovod supports distributed GPU training with minimal change in the code. It supports different backends including MPI, NCCL and IBM PowerAI distributed deep learning (DDL). We tested all three backends and found DDL to be the fastest for our training purpose on Summit. The time needed to finish a single batch with ProtTXL-BFD increased from one to two nodes due to the communication overhead (Fig. 4). After two nodes the communication overhead plateaued, even when scaling up to 936 nodes with 5616 GPUs. Summit has integrated DDL in their Watson Machine Learning module which comes with most DDL libraries including pytorch, tensorflow, apex, DDL and horovod. However, Summit has only a license for using DDL up to 954 nodes. Contrary to Summit, training on TPU Pods did not require any changes in the Tensorflow code to use either a single TPU host or to distribute workload among multiple TPU hosts.
Mixed precision allows to fit bigger models and batch sizes into GPU memory by using 16-bit precision only or a mix of 16-bit and 32-bit precision. Nvidia's APEX library [57] was used for mixed precision training of ProtTXL, due to its pytorch support. As ProtTXL training became instable when training with 16 Bit precision, we switched to almost half precision training (storing all model weights at 16 Bit precision; exception: batch-normalization layers), while keeping a master copy of the model's weights in 32 Bit. We did not use mixed-precision for models trained on TPUs.
Another optimization technique/library crucial for our training on Summit was IBM's large model support (LMS) [58]. Similar to gradient checkpointing [59], LMS virtually extends the GPU memory by outsourcing parts of the model from GPU to main memory. This allows training models larger than the GPU memory. The obvious drawback of LMS is the increase in training time due to shuttling data between CPU and GPU and back. However, the reduced memory consumption of the model allows to increase the batch size, potentially compensating for the communication overhead. Compared to gradient checkpointing, LMS provides easier integration into existing code by operating directly on a computational graph defined by users and automatically adds swap-in and swap-out nodes for transferring tensors from GPU memory to main memory and vice versa. We have tested LMS on ProtTXL as well as ProtBert ( Figure  4). As Pytorch and tensorflow have different strategies to integrate LMS, we also compared the effect of LMS on batchsize, model size and training time using the two different libraries. ProtTXL was used to evaluate the effect of Pytorch's implementation of LMS while ProtBert was trained for a few steps BFD using Summit to evaluate tensorflow's implementation of LMS. Training ProtBert for a few steps was sufficient to assess the effect of LMS on batch-size, model size as well as an estimate of training time. In the end, we used LMS only for ProtTXL to strike a balance between model size and training time. The number of LM parameters could be increased by about 15.6% for ProtTXL-BFD and to 6.6% for ProtBert (5a). Additionally, we could increase the batch size by 700% for ProtTXL-BFD (Figures 5b and 5c). The NV-Link between CPU and GPU on Summit-nodes, reduced the training time for ProtTXL by 60%while it increased by 72% for ProtBert ( Figure 5d).

Unsupervised embeddings from LMs informative
The embeddings extract some of the information learned by the LMs in the first stage of unsupervised learning. To establish that our protein LMs have extracted an understanding akin to the grammar in NLP, we projected the highdimensional embedding space down to two dimensions using t-SNE [45] and visualized proteins according to annotated structural, functional or evolutionary information.
Capturing biophysical features of amino acids. Applying t-SNE to the first embedding layer visualized information extracted by the LMs representing individual amino acids irrespective of their surrounding context (residues next to it). As previously established for another protein LM [24], the t-SNE projections (e.g. ProtBert Fig. 6A) suggested that all LMs captured essential biophysical aspects of amino acids. These included charge, polarity, amino acid Capturing protein structure classes. To assess which aspects of protein structure were captured by the unsupervised LMs, we averaged over the length-dimension of the representations derived from the last layer of each model. This created fixed-size representations for each protein. These we applied to the SCOPe database [46] classifying proteins by their 3D structures (Methods). On the most coarse-grained level, SCOPe distinguishes between all-alpha, all-beta, alpha|beta, alpha&beta, multi-domain, membrane/cell surface and small proteins. ProtTXL (SOM Fig. 14D) and ProtBert (SOM Fig. 15D) produced higher entropy embeddings, while ProtAlbert (SOM Fig. 12D), ProtXLNet (SOM Fig. 13D) and ProtBert-BFD (6D) packed proteins into denser clusters. Consequently, ProtAlbert, Pro-tXLNet and especially ProtBert-BFD embeddings visually separated the proteins better than ProtTXL embeddings. Although sequence length was not explicitly encoded and our pooling squeezed sequences to a fixed vector size, all models separated small from long proteins (light blue, e.g. ProtBert-BFD Fig. 6D). All models also to distinguished between water-soluble and transmembrane proteins (brown, e.g. ProtBert-BFD Fig. 6D) and to some extent secondary structure composition, i.e. all-alpha versus all-beta (dark blue vs. dark green, e.g. ProtBert-BFD Fig. 6D).
Capturing aspects of protein function. Using the same proteins as for SCOPe but different annotations (ECnumbers [60]), we assessed whether the LM embeddings captured aspects of protein function, namely EC numbers (proteins from SCOPe without known ECs were removed, making Figs. 6F and 6D not directly comparable). Although most proteins were scattered for all LMs, ProtTXL clustered some proteins into transferases, hydrolases and oxidoreductases (particular types of enzymes, SOM 14F).
Capturing domains of life and viruses. The following three domains of life are distinguished: archaea, bacteria, and eukarya, while viruses are not considered as life. We used the same SCOPe proteins and fixed-size representations for this analysis. Despite being trained differently (ProtTXL/ProtXLNet predicting next token vs. Prot-Bert/ProtAlbert reconstructing noise), all models captured domain-specific aspects (e.g. ProtBert-BFD Fig. 6E). In general, Eukarya and bacteria were separated best by all LMs, while viruses and archaea formed less homogeneous clusters. When comparing the different LMs, the same trend as for protein structure classes could be observed: ProtTXL (SOM 14E) and ProtBert (SOM 15E) produced higher entropy clusters while ProtAlbert (SOM 15E) and ProtXLNet (SOM 13E) produce visually easier separable clusters.
Using a different protein set [26], we analyzed whether or not the embeddings captured aspects of protein function as proxied by the cellular compartment (also referred to as subcellular localization) and membrane-association. All LMs distinguished some aspects of localization with nuclear and extracellular proteins forming the most coherent clusters (e.g. ProtBert-BFD Fig. 6C). The LMs also picked up the membrane-association, clustering most proteins homogeneously (Fig. 6B).

Embeddings good input for supervised predictions
Successful protein predictions exclusively using embeddings as input constitutes an even more important acid test than any statistical clustering analysis could. Toward this end, we compared secondary structure (per-residue level) and localization (per-protein level) predictions, along with the classification into membrane/non-membrane proteins (per-protein level). Protein LMs were only used as feature extractors, i.e. LMs were not fine-tuned on a specific task.
Per-residue prediction of secondary structure. Secondary structure was predicted by CNNs using only embeddings extracted from the last layer of our pre-trained LMs. All models were evaluated using standard measures for performance (Q3/Q8: three/eight-state per-residue accuracy, i.e. percentage of residues predicted correctly in either of the 3/8 states). Performance differed slightly between different data sets: from Q3(CASP12)=71-76% (interval marks one standard error), over Q3(CB513)=74-83%, to Q3(TS115)=75-84% ( Fig. 7; results for 8-state predictions confined to Fig.  10 Supplementary Material). The computed standard error intervals fail to completely reflect the real spread of the data, because the three data sets were not consistent, i.e. the average over their performance differed by some level of statistical significance. Ultimately, this reflects problems    with each of those data sets: CASP12 was too small, but completely new to all methods compared; CB513 was the largest set (513 proteins), but allowed for substantial redundancy, and TS115 (115 proteins) allowed for even more redundancy. Despite these shortcomings, these data sets enabled direct comparison to state-of-the-art methods using evolutionary information. For simplicity, we use the worst and the best performance among the three data sets in the following to highlight the performance variation depending on the test set. For the four LMs trained on UniRef100 this resulted in Q3(ProtTXL)=71-75, Q3(ProtBert)=75-83, Q3(ProtAlbert)=74-82, and Q3(ProtXLNet)=73-81 (for 8-states: Q8(ProtTXL)=59-63, Q8(ProtBert)=63-72, Q8(ProtAlbert)=62-70 and Q8(ProtXLNet)=62-69). For ProtTXL and ProtBert we could also analyze the influence of the size of the database used to train the LMs: the 10-times larger BFD improved slightly over UniRef100, i.e. Q3(ProtBert-BFD) -Q3(ProtBert)= +1.3 and Q8(ProtBert-BFD) -Q8(ProtBert)= +2.3. For ProtTXL only a minimal improvement could be observed for TS115, i.e. Q3 and Q8 improved by one percentage point. However, none of these differences was statistically significant, especially, in the light of the relatively high variation between test sets.
However, none of the solutions improved in any way over the state-of-the-art methods using evolutionary information (methods left of the dashed vertical line in Figs. 7 and 10), with ProtBert-BFD reducing the gap between those different approaches.
For the binary classification into membrane/nonmembrane proteins (Q2), the trend observed for localization (Q10) largely remained: ProtBert and ProtAlbert performed best (Q2(ProtBert)=89, Q2(ProtAlbert)=88, Fig. 8). However, for Q2 ProtXLNet largely closed the performance gap from Q2 (Q2(ProtXLNet)=87) while ProtTXL again performed worst (Q2(ProtTXL)=85). As for localization, there was little difference between the small (UniRef100) and large (BFD) data set used for generating the LMs: Q2(ProtTXL)  [26]). A simple two-layer neural network is trained on top of fixed-size representations for each protein which were derived by averaging over the length dimension of embeddings extracted from the last layer of the language models. The performance of all our LMs falls short when being compared to an existing approach which uses evolutionary information (DeepLoc). However, transformer-based protein LMs introduced here outperform previously published LSTMbased protein LM approaches (DeepSeqVec) as well as uncontextualized approaches using word2vec (DeepProtVec).
On one hand, the per-protein predictions using only embeddings as input, like those for secondary structure, remained behind the best state-of-the-art methods using evolutionary information (methods left of the dashed vertical line in Fig. 8). On the other hand, performance was substantially and statistically significantly higher for ProtAlbert/ProtBert/ProtTXL/ProtXLNet than for the word2veclike solutions (DeepProtVec in Fig. 8). However, in contrast to the per-residue solutions, the per-protein predictions outperformed some popular methods that did use evolutionary information (Fig. 8), specifically ProtBert-BFD reached a value only a few percentage points below the current stateof-the-art using evolutionary information (Q2(ProtBert-BFD)-Q2(Deeploc)=-3, Q10(ProtBert-BFD)-Q10(DeepLoc)=-4).

Fast predictions from embeddings
Although embedding-based predictions were less accurate than those using evolutionary information, one crucial advantage of representations derived from protein LMs is their speed-up compared to database searches required to generate evolutionary information. This speed-up was quantified by comparing the time required to generate representations for each protein in the human proteome (20.353 proteins with a median sequence length of 415 residues) using our protein LMs or mmseqs2 [62], the fastest tool to gather evolutionary information from protein sequence databases at the moment. The same parameters as in NetSurfP-2.0 [25] were used to search with mmseqs2 the human proteome against two large protein sequence database (UniRef90=113M and UniRef100=216M proteins), i.e. the number of iterations was set to two (profile search) and the maximum number of sequences passing the prefiltering was set to 2.000. For the database search we used an IntelR c XeonR c Scalable Processor "Skylake" Gold 6248 with 40 threads, SSD and 377GB main memory, while protein LMs were run on a single Nvidia P100 with 16GB memory using dynamic batch size based on the variable sequences length. Using the experimental setup described above, mmseqs2 is around 8-or 4-times slower than the fastest LMs (SeqVec and ProtBert, Fig. 9 (a)) when searching UniRef100 or UniRef90, respectively.
When checking the effect of protein sequence length on the inference speed of protein LMs (Fig. 11 (b)), we noticed that SeqVec is the slowest model (9.92s) for long proteins (up to 4096 residues), while ProtBert is the fastest (0.91s). We used only single sequence processing on a Nvidia Titan V with 12GB vRAM.
We also investigated the cross-effect of sequence length and batch-size (see Table 2) on the inference speed of different protein LMs. When using a single Nvidia Titan V on varying batch-sizes (1,16,32) as well as sequence lengths (128, 256, 512), SeqVec provided the fastest inference with an average of 0.02 seconds per protein when using a batch size of 32, followed by ProtBert (0.03s). However, the batchsize of ProtBert could have been further increased on the The time required to generate protein representations for the human proteome (20.353 proteins) is compared using either our protein LMs or mmseqs2 (protein sequence search tool [62] used to generate evolutionary information; NetSurfP-2.0 [25] parameters are used). Here, we used mmseqs2 (red bar) to search each protein in the human proteome against two large protein sequence database (UniRef90 and UniRef100 with 113M and 216M proteins, respectively). Only embedding or search time is reported, i.e. no pre-processing or pre-training was measured. mmseqs2 was run on a Intel R Xeon R Scalable Processor "Skylake" Gold 6248 with 40 threads, SSD and 377GB main memory, while protein LMs were run on a single Nvidia P100 with 16GB memory using dynamic batch size depending on sequence length (blue bar). same hardware but was limited to allow a direct comparison between all models.

DISCUSSION
Supercomputers such as the green energy-driven Summit [1] and Google's cloud TPU Pod [2], combined with optimized libraries such as IBM DDL [7] and Horovod [6] set the stage for training LMs with billions of free parameters on large corpora with terabytes of data in hours or days. Increasing model size improves performance for some NLP applications [14], although the massive data challenges the communication between thousands of nodes and divergence between large batches during training. Here, we presented some solutions to overcome these challenges by fully utilizing 20% of Summit for the training of Trans-formerXL [50], as well as, by using one TPU Pod V3-512 for the training of Bert [48], Albert [49] and XLNet [13] on protein sequences. This translated into the parallel use of 5616 GPUs on Summit or 512 TPU cores on a TPU Pod, while avoiding training divergence with specialized optimizers such as LAMB [53] up to a global batch size of 44K samples (here: proteins).

HPC challenges for larger protein LMs on Summit
Up-scaling LMs to the enormous sizes of protein databases (our largest data set of BFD contained 112-times the number of words in the English Wikipedia) on Summit threw up six main challenges that we addressed as follows.
(1) Architecture: Summit is based on IBM Power processors; most libraries and software tools are written for Intel and AMD. This makes finding compatible tools directly from the developers often challenging. However, the IBM Watson Machine Learning Module, included almost all necessary deep learning libraries, for others common package management tools such as Anaconda [63] were available.
(2) Communication overhead: large-scale training increased the communication overhead. IBM DDL used the least computation time on Summit.
(3) Distributed training: using thousands of GPUs with Tensorflow [3] and Pytorch [4] required extremely efficient distributed communication between nodes and assignment of work loads (tokenized text files) to workers (GPUs). Horovod [6] provided the best training for both of these frameworks on Summit. (4) File sharing: parallel file access can increase run-time. During training, multiple nodes access the same files holding model parameters and logs. To address this, separate log copies for each node were used, while storing a single copy of the model on the master node. Data set files remained shared, not impairing file reading. (5) Pre-processing:, especially tokenization, of batches on the fly increased GPU waiting and CPU processing while reducing storage. For small data sets (few GBs), preprocessing and disk-storing batches before training appeared optimal. For large data sets (TBs), there was a tradeoff between disk space and training time. In our hands, the best solution was using ORNL's Rhea cluster, cutting preprocessing from >215 to <2 days through MPI. (6) Deep learning library: The integration of LMS into Pytorch (ProtTXL) required adjusting only a few parameters; in contrast, Tensorflow (ProtBert) required more code changes. Tensorflow might compensate for this problem by autotuning certain parameters such as the memory usage; however, for our use-case, this failed. The different parameters for Pytorch and Tensorflow resulted in different behaviors with respect to swapping in and out nodes between GPU and CPU. This in turn varied speed and model/batch sizes.

Unsupervised LMs learned simple protein features
Some rudimentary information about how proteins are formed, shaped, and function has been learned by the LMs because all models (ProtBert, ProtAlbert, ProtTXL, ProtXLNet) extracted valuable information as revealed by the embeddings. The basic understanding extended from biophysical features of the amino acid building blocks (e.g. hydrophobicity, charge, and size, Fig. 6A), over classifications of protein structure (Fig. 6D), and protein function (Fig. 6F), to the macroscopic level of the domains of life (Fig. 6E). Global structural properties (e.g. overall secondary structure content, Fig. 6D) and global biochemical properties (e.g. membrane-boundness, Fig. 6B) appeared most distinctive. In contrast, local features relying on short motifs were less separated (EC-numbers: Fig. 6F, localization: Fig. 6C).

Bi-directional beats uni-directional for proteins
The t-SNE and UMAP analyses suggested that the LMs had extracted some level of understanding of the language of life. However, any such statistical differences have ultimately limited validity unless they are predictive. In this sense prediction is the acid test of understanding. To pass this test, we extracted the embeddings learned by the LMs directly as input to predict aspects of protein structure and function (per-residue prediction of secondary structure and per-protein prediction of localization and membrane/nonmembrane). Overall, the supervised results confirmed [17] that evolutionary information scientifically and statistically significantly outperformed LMs not using such information (on all per-residue 7,10 and per-protein tasks 8). However, ProtBert-BFD reduced the gap from embeddings-only input to those approaches. Newer contextual models improved both over previous LM-based approaches [17] (3-5 percentage points in Q3) and over non-contextualized word2vectype approaches [64], [65], [66] (12-17 percentage points in Q3). A merger of models using evolutionary information and embeddings might bring the best.

Bigger data not always better?
LMs were trained on the largest protein database ever used for this purpose, namely BFD [36], more than an order of magnitude larger than UniProt [38], the standard in the field. Although bigger did not equate better for all 2 nd stage predictions, some models clearly improved through more data. Nevertheless, given the immense increase, the highest performance increase remained rather limited with respect to existing LMs [17] (∆Q3=Q3(ProtBert-BFD)-Q3(SeqVec)=4.7%) despite a significant increase in model size (SeqVec=93M vs. ProtBert=420M) and data size (SeqVec=30M vs. ProtBert=216M). Although a ∆Q3 of 4-5 percentage points might imply an improvement that is crucial for the methods using such predictions [68], the value has also to be put into relation to the GPU/TPU hours needed to train those models: while SeqVec needed around 1680 GPU hours, ProtTXL needed 202176 GPU hours and ProtBert-BFD needed 116736 TPU core hours.

Protein LMs reached a ceiling?
Applying techniques from NLP to proteins opens new opportunities to extract information from proteins in a selfsupervised, data-driven way. New protein representations may complement existing solutions, most successful when combining evolutionary information and machine learning [34], [69], [70], [71]. The gain in inference speed for protein LMs compared to traditional models using evolutionary information is so significant that some analyses might prefer much faster and slightly less accurate to better but much slower, for instance, when time or resources for much slower are amiss. Nevertheless, given the experiments described here and in previous work [17], [18], [19], [20], [21], [22], [24], we might expect an upper limit for what protein LMs can learn when using auto-regressive or auto-encoding exclusively. Although this work explicitly addressed the possibility of reaching that limit, we could only conclude: bidirectional models appeared superior over uni-directional models. Answers to the following questions might advance from the status-quo. (1) Why do LSTM-based approaches require fewer parameters and resources while performing similarly at downstream prediction tasks (Q3(ProtBert-BFD)-Q3(SeqVec)=4.7%) compared to Transformer-based approaches? (2) Would the addition of auxiliary tasks such as next-sentence or sentence-order prediction offered by BERT or Albert suit protein sequences? A suggestion might be the usage of structure information [72] or evolutionary relationship [20]. (3) Addressing model vs. data parallelism: Were the large models introduced here still too small to capture all data? Unfortunately, this brings up training efficiency as recently investigated by sparse Transformers [73] or attention optimized with locality-sensitive hashing (LSH) [74] as introduced recently by the Reformer model [75]. (4) Might full precision training stabilize training and speed up convergence by leveraging 32-bit floats? Mixed precision training, employed in this evaluation, uses 16 Bit as well as 32 Bit vectors; this made it more difficult for the model to converge during training. Training the models presented here in full precision might stabilize training and thus provide more informative representations. Overall, our results established that the combination of HPC solutions for building protein LMs and subsequent training of supervised prediction methods scaled up to the largest data sets ever used in the field.

Acknowledgments
The authors thank primarily Tim Karl (TUM) and Jian Kong (TUM) for invaluable help with hardware and software; Inga Weise and Aline Schmidt (both TUM) for support with many other aspects of this work; Florian Matthes (TUM) for his invaluable support and encourage for us. Thanks for invaluable support and feedback from NVIDIA, in particular to to Ulrich Michaelis, Ada Sedova, Geetika Gupta, Axel Koehler, Frederic Pariente, Jonathan Lefman, and Thomas Bradley. No aspect of this work could have been realized without strong support from many at ORNL: thanks; these include John Gounley,Hong-Jun Yoon, Georgia Tourassi, Bill, Brian, Junqi, Graham and Verónica for helping us on fixing issues that occurred while training on Summit. Furthermore, special thanks to Jack Wells for giving us the opportunity to access and work with Summit. From IBM, we would like to thank Nicolas Castet and Bryant Nelson for their help to fix issues and enhance the performance of IBM PowerAI. From Google, we would like to deeply thank Jamie Kinney, Alex Schroeder, Nicole DeSantis, Andrew Stein, Vishal Mishra, Eleazar Ortiz, Nora Limbourg, Cristian Mezzanotte and all TFRC Team for their invaluable support to setup our project on Google Cloud and solve all the related Google TPU and servers issues. Last, not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
This work was supported by a grant from Software Campus through the German Ministry for Research and Education (BMBF: Bundesministerium fuer Bildung und Forschung), a grant from the Alexander von Humboldt foundation through the German Ministry for Research and Education (BMBF: Bundesministerium fuer Bildung und Forschung),and by a grant from the Deutsche Forschungsgemeinschaft (DFG-GZ: RO1320/4-1). We gratefully acknowledge the support of NVIDIA Corporation with the donation of two Titan GPU used for this research development phase. We also want to thank LRZ (Leibniz Rechenzentrum) for providing us access to DGX-1(V100) for the testing phase.
Finally and most importantly, this research used resources of the

SUPPLEMENTARY ONLINE MATERIAL (SOM) -1.1 Supervised Learning
On the level of single residues we also compared the prediction performance of the LMs introduced here on 8-state secondary structure prediction performance (Fig. 10). On the level of single residues, we also compare our protein LMs on results for secondary structure prediction in 8-states as shown in Fig. 10.

-1.2 Protein LM inference speed
The effect of varying sequence lengths (128, 256, 512) and different batch sizes (1,16,32) on the inference time of the protein LMs introduced here is reported in table 2. The effect of sequence length on different LM architectures (LSTM-based SeqVec and transformer-based ProtTrans) was also visualized in figure 11. The x-axis represents different sequence length from 128 up to 4096, while the y-axis represents the time of inference in ms for a single protein with a batch size of 1 on a Nvidia Titan V with 12GB.

-1.3 Unsupervised Learning
Using t-SNE projections, the information content stored within the novel embeddings was qualitatively assessed on various levels, ranging from different aspects of protein function (E.C. numbers, subcellular localization and membrane-boundness) to the level of kingdoms of life, i.e. Eukaryota, Bacteria and Archaea (for completeness here also including Viruses).  (Fig. 7), the features learnt by the proposed Language models (LMs) trained here (ProtBert, ProtAlbert, ProtTXL, ProtXLNet) were also evaluated on eight-state secondary structure prediction (y-axis: Q8). The same datasets (NetSurfP-2.0 [25]), pre-processing steps as well as the same supervised models were used for this analysis, confirming the trend suggested by the three-state secondary structure prediction.   The effect of protein sequence length on the inference time of the protein LMs trained here and a previously published LM (SeqVec) were compared using a Nvidia Titan V with 12GB memory (batch-size=1). Longer proteins take disproportionate long to embed for all language models. In particular, SeqVec was affected due to the sequential nature of the LSTMs used this LM. Transformer-based models require longer for proteins because the attention maps that need to be computed square with sequence length. In contrast to LSTMs, the computation of attention can be parallelized, resulting in lower inference time for long proteins when using transformer-based LMs.  We used t-SNE projections to assess which features the LMs trained here learnt to extract from proteins. Exemplarily for ProtAlbert, we showed that the protein language models trained here captured biophysical-and biochemical properties of single amino acids (Panel A). A redundancy reduced version (30%) of the DeepLoc ( [26]) dataset was used to assess whether ProtAlbert learnt to classify proteins into membrane-bound or water-soluble (Panel B) or according to the cellular compartment they appear in (Panel C). Not all proteins in the set had annotations for both features, making Panels B and C not directly comparable. Further, a redundancy reduced version (40%) of the Structural Classification of Proteins -extended (SCOPe) database was used to assess whether ProtAlbert captured structural (Panel D), functional (Panel F) or lineage-specific (Panel E) features of proteins without any labels. Towards this end, contextualized, fixed-size representations were generated for all proteins in both datasets by mean-pooling over the representations extracted from the last layer of ProtAlbert (average over the length of the protein). The high-dimensional embeddings were projected to 2D using t-SNE. Compared to the some of the other protein LMs trained here (ProtTXL 14 and ProtBert 15, ProtAlbert formed more dense clusters, especially for the projections based on the SCOPe dataset (Panels D, E and F).  We used t-SNE projections to assess which features the LMs trained here learnt to extract from proteins. Exemplarily for ProtTXL, we showed that the protein language models trained here captured biophysical-and biochemical properties of single amino acids (Panel A). A redundancy reduced version (30%) of the DeepLoc ( [26]) dataset was used to assess whether ProtTXL learnt to classify proteins into membrane-bound or water-soluble (Panel B) or according to the cellular compartment they appear in (Panel C). Not all proteins in the set had annotations for both features, making Panels B and C not directly comparable. Further, a redundancy reduced version (40%) of the Structural Classification of Proteins -extended (SCOPe) database was used to assess whether ProtTXL captured structural (Panel D), functional (Panel F) or lineage-specific (Panel E) features of proteins without any labels. Towards this end, contextualized, fixed-size representations were generated for all proteins in both datasets by mean-pooling over the representations extracted from the last layer of ProtTXL (average over the length of the protein).

Soluble
The high-dimensional embeddings were projected to 2D using t-SNE. While generally forming the least dense clusters compared to other LMs trained here, ProtTXL captured certain aspects about protein function (e.g. Transferases, dark green Panel F) that other LMs trained here did not capture.  We used t-SNE projections to assess which features the LMs trained here learnt to extract from proteins. Exemplarily for ProtBert, we showed that the protein language models trained here captured biophysical-and biochemical properties of single amino acids (Panel A). A redundancy reduced version (30%) of the DeepLoc ( [26]) dataset was used to assess whether ProtBert learnt to classify proteins into membrane-bound or water-soluble (Panel B) or according to the cellular compartment they appear in (Panel C). Not all proteins in the set had annotations for both features, making Panels B and C not directly comparable. Further, a redundancy reduced version (40%) of the Structural Classification of Proteins -extended (SCOPe) database was used to assess whether ProtBert captured structural (Panel D), functional (Panel F) or lineage-specific (Panel E) features of proteins without any labels. Towards this end, contextualized, fixed-size representations were generated for all proteins in both datasets by mean-pooling over the representations extracted from the last layer of ProtBert (average over the length of the protein). The high-dimensional embeddings were projected to 2D using t-SNE. ProtBert formed less dense clusters compared to the same model trained on a larger dataset (ProtBert-BFD Fig. 6).