Is attention all you need in medical image analysis? A review

Medical imaging is a key component in clinical diagnosis, treatment planning and clinical trial design, accounting for almost 90% of all healthcare data. CNNs have achieved performance gains in medical image analysis (MIA) over recent years. CNNs can efficiently model local pixel interactions and can be trained on small-scale MI data. The main disadvantage of typical CNN models is that they ignore global pixel relationships within images, which limits their ability to generalise to out-of-distribution data with different 'global' information. The recent progress of Artificial Intelligence gave rise to Transformers, which can learn global relationships from data. However, full Transformer models need to be trained on large-scale data and involve tremendous computational complexity. Attention and Transformer compartments (Transf/Attention), which retain the ability to model global relationships, have been proposed as lighter alternatives to full Transformers. Recently, there has been an increasing trend to cross-pollinate the complementary local-global properties of CNN and Transf/Attention architectures, which has led to a new era of hybrid models. The past years have witnessed substantial growth in hybrid CNN-Transf/Attention models across diverse MIA problems. In this systematic review, we survey existing hybrid CNN-Transf/Attention models, review and unravel key architectural designs, analyse breakthroughs, and evaluate current and future opportunities as well as challenges. We also introduce a comprehensive analysis framework on generalisation opportunities of scientific and clinical impact, based on which new data-driven domain generalisation and adaptation methods can be stimulated.

In the era of rapid artificial intelligence (AI) developments, and to establish the clinical translation of AI, it is important to review and develop guidelines for innovative AI models. Since the first "deep" convolutional neural network (CNN), developed by Krizhevsky et al. in 2012, outperformed the previous state-of-the-art (SOTA, non-deep learning) algorithms on the ImageNet dataset [7], CNNs have demonstrated numerous performance gains across all MIA tasks: segmentation, reconstruction, synthesis, denoising, registration, classification, and pathology detection [2]. However, typical CNNs model information through small convolutional filter footprints and shared weights, which results in local receptive fields and thus limits their ability to directly model long-range (global) pixel interactions within images. Hence, despite their important advances, CNN-based networks still focus on local-scale modelling, with limited generic "local-global" modelling capabilities. Their limited ability to model both local and global information from images adds barriers to the generalisability (e.g. across MIA domains or pathology settings) and transfer learning (from one MI modality to another) properties of pure CNN models [2].
B. Hybridisation with attention convolutions
First introduced by Bahdanau et al. in 2014, the attention mechanism was initially designed to learn long-range dependencies in natural language processing and improve machine translation [8]. The attention mechanism (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated, encouraging the model to predict a target word based on the context vectors associated with these source positions and all the previously generated target words [8]. Following attention, the self-attention mechanism was developed in 2016 so that each position (building block) within a self-attention layer (known as query, key and value) can attend to all positions in the output of the previous layer [9][10], as an additional technique to enhance modelling of long-range dependencies. The introduction of self-attention and attention mechanisms in Transformer models made it possible to increase the receptive field and thus became an efficient solution for modelling long-range dependencies from images [10][11][12], with promising results in the field of MIA [13][14][15][16]. Vision Transformer (ViT) models apply consecutive multi-head self-attention and attention mechanisms to image patches and have been suggested to even fully replace pure CNN models [10]. The basic concept in ViT is to convert input images to a series of image patches, which in turn are transformed into vectors and can be represented as "words" in a normal Transformer. However, as the relationships between an image patch and all other image patches are computed, the computational complexity of the multi-head self-attention modules in ViT becomes quadratic with respect to image size, adding substantial challenges when analysing high spatial resolution images. Swin Transformers (ST) were designed to overcome these challenges by performing self-attention within non-overlapping windows of image patches [11,12]. Despite this, ST need to consecutively learn a stack of two successive self-attention blocks with regular and shifted windowing configurations, respectively. This adds computational complexity and limits their applicability in MIA tasks such as segmentation, pathology detection, denoising, reconstruction and registration, where dense predictions at the pixel level and learning representations from high content images are necessary. This is one of the main reasons why full ViT and ST models have been limited to medical image classification and object detection tasks [10][11][12][13][14][15][16][17].
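The patch-to-token conversion and the quadratic cost of global self-attention described above can be illustrated with a minimal NumPy sketch. It is not taken from any reviewed model: the image size, patch size, embedding width `d_model`, the random projection matrices and the helper names (`image_to_patches`, `self_attention`, `pairwise_scores`) are all assumptions made for illustration, and the attention is single-head without positional encoding.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W) image into flattened, non-overlapping patch vectors."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    blocks = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (N, N): every token against every token
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row maximum for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights

def pairwise_scores(num_tokens, window=None):
    """Query-key pairs scored: global attention, or attention inside fixed-size windows."""
    if window is None:
        return num_tokens ** 2
    return (num_tokens // window) * window ** 2

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))              # toy single-channel image (assumed size)
tokens = image_to_patches(image, patch=8)          # 16 tokens, each of length 64
d_model = 16                                       # assumed embedding width
w_q, w_k, w_v = (rng.standard_normal((tokens.shape[1], d_model)) * 0.1 for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)

print(tokens.shape, out.shape)
print(pairwise_scores(16), pairwise_scores(64), pairwise_scores(64, window=16))
```

In this sketch, quadrupling the number of tokens from 16 to 64 multiplies the number of scored pairs by 16 under global attention, whereas restricting attention to windows of 16 tokens makes the count grow only linearly with the number of tokens. This is the motivation for windowed designs, at the price of needing extra mechanisms (such as shifted windows) to exchange information between windows.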
To reduce the computational complexity and to address both local and global learning in MIA, self-attention and Transformer blocks were incorporated into CNN model architectures (hereafter called "hybrid" CNN-Transf/Attention models). Current evidence shows that by combining local and global modelling capabilities, these hybrid CNN-Transf/Attention models consistently outperform previous SOTA techniques across different MIA tasks [13][14][15][16][17][18][19][20][21]. Hybrid models can potentially also be used to improve model interpretability [22][23][24][25]. However, the main drawback of these hybrid models is that they are enormously complex, as they have been developed to address particular problems in MIA, which means that domain generalisation (e.g. from CT to MRI, or from lung to cardiac applications) and transfer learning can be challenging. Given their substantial growth, it is important to methodically assess whether these techniques can generalise across imaging modalities, MIA tasks and clinical applications, or whether they may be over-engineered to specific MIA problems. In this work, we review the evolution of hybrid models for in vivo MI: magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), ultrasound, x-Rays and retinal imaging. There are numerous recent surveys that describe technical details of CNN models and how these were used to address specific needs in MI [2][26][27][28][29], as well as some recent surveys on ViT in MI [17][30][31][32][33].
Differing from previous reviews, we developed a comprehensive systematic review based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for hybrid CNN-Transf/Attention models in MI. We categorised published work on hybrid CNN-Transf/Attention models in MI, analysed key architectural designs and quantitatively as well as qualitatively unravelled the evolution of CNN-Transf/Attention models. To improve clarity and understanding of these novel techniques, we critically review whether such hybrid models outperform their pure CNN counterparts. We review technical and computational complexities and discuss domain generalization strategies, based on the MI modality, downstream task, and clinical application. We focus on unravelling the importance and potential drawbacks of hybrid models. Finally, we discuss opportunities, challenges (with mitigations, where applicable) and future perspectives of the post-hybrid model era. We consider these review concepts as important pathways towards harmonising and translating these novel techniques into clinically meaningful MIA.

A. Literature review strategy
We performed a systematic review of CNN-Transf/Attention models in MI published between January 1, 2019 and July 1, 2022 using Scopus, Web of Science and PubMed, based on the PRISMA framework [34]. In our review, we refer to all hybrid models that involve any CNN and Transformer modules, including adaptations of self-attention and attention mechanisms, as hybrid CNN-Transf/Attention models. We only considered MI modalities that involve in vivo body imaging and thus excluded microscopy and digital pathology slide imaging studies. Therefore, we focused on MRI, CT, PET-CT, ultrasound, retinal imaging and x-Rays.
Initial filtering: To broaden the search, we initially mined all publications by searching the following keywords in the abstract, title, and manuscript keywords: (transformer OR self-attention) AND (deep AND learning) OR (convolutional AND neural AND network). This led to 5,222 papers (see PRISMA flow in Fig. 1a). Subsequently, we focused the search by considering all different combinations of relevant keywords in the abstract, title and keywords of each paper, as follows: (transformer OR self-attention) AND (deep AND learning) OR (convolutional AND neural AND network) AND (medical AND imaging) OR (magnetic AND resonance AND imaging) OR (MRI) OR (computed AND tomography) OR (CT) OR (ultrasound) OR (positron AND emission AND tomography) OR TITLE-ABS-KEY (retin) OR TITLE-ABS-KEY (x-ray) OR TITLE-ABS-KEY (ray). Adding these terms removed all papers irrelevant to MI, which led to 656 papers from all three digital libraries. We then excluded conference, review and archived (non-peer-reviewed) papers, leaving 352 journal papers for subsequent analysis.
Title and abstract screening: All authors screened the titles and abstracts of all 352 peer-reviewed journal papers and removed all papers irrelevant to the field of study, leaving 128 papers for full text review.
Full text screening: Following full paper review, the authors removed 16 journal papers (14 not relevant to MI or hybrid model studies and 2 papers not written in English). In total, 112 journal papers (hereafter referred to as "articles") were included in our review analysis. See also the data extraction section for the paper content that was reviewed.
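The screening arithmetic above can be checked with a short Python sketch. The counts are those reported in the text; the stage labels and the helper name `removed_per_stage` are illustrative assumptions.

```python
# Record counts reported for each screening stage in the text above.
prisma_flow = [
    ("initial keyword search", 5222),
    ("restricted to MI keywords", 656),
    ("journal papers only", 352),
    ("after title and abstract screening", 128),
    ("after full text screening", 112),
]

# Reasons reported for the full text exclusions.
full_text_exclusions = {
    "not relevant to MI or hybrid models": 14,
    "not written in English": 2,
}

def removed_per_stage(flow):
    """Number of records removed between each pair of consecutive stages."""
    return [(later[0], earlier[1] - later[1]) for earlier, later in zip(flow, flow[1:])]

removed = removed_per_stage(prisma_flow)
for stage, count in removed:
    print(f"{stage}: removed {count}")

# The full text exclusions should account for the whole final drop (128 to 112).
assert removed[-1][1] == sum(full_text_exclusions.values())
```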

B. Data extraction
During evaluation of article full texts, we considered the following aspects: (1) [...] (12) size of training and testing data. We also considered (13) whether computational expense (total number of parameters) was calculated and (14) whether performance was improved against non-hybrid baseline methods.
Furthermore, we assessed the articles in terms of generalisability following 2 objective criteria: whether a CNN-Transf/Attention architecture a) was trained and/or evaluated on large unseen testing data, and b) analysed data from heterogeneous modalities (e.g., different MRI or CT sequences, or MRI and CT, etc.) and/or performed multi-modal analysis (image and text, images and genetics) and/or was implemented in more than one organ area (e.g., brain and heart). Further, we identify challenges, opportunities and future trends that can be used as suggestions for future work in this field.
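As a rough illustration, the two criteria can be written as simple predicates over an article record. The record fields, the function names and the example record are hypothetical; the threshold of more than 2,000 images for "large" data is the one stated in the Results, and the two criteria are reported separately here because the text does not state whether both must hold.

```python
from dataclasses import dataclass
from typing import Optional, Set

LARGE_TEST_SET = 2000  # "large" means more than 2,000 images, as stated in the Results

@dataclass
class Article:
    """Hypothetical record of the fields needed for the two criteria."""
    unseen_test_images: Optional[int]   # None when the data size was not reported
    modalities: Set[str]                # imaging modalities or sequences analysed
    organs: Set[str]                    # organ areas covered
    multimodal_inputs: bool = False     # e.g. image and text, or images and genetics

def criterion_a(article: Article) -> bool:
    """a) trained and/or evaluated on large unseen testing data."""
    n = article.unseen_test_images
    return n is not None and n > LARGE_TEST_SET

def criterion_b(article: Article) -> bool:
    """b) heterogeneous modalities, multi-modal analysis, or more than one organ area."""
    return len(article.modalities) > 1 or article.multimodal_inputs or len(article.organs) > 1

def generalisation_flags(article: Article) -> dict:
    return {"a": criterion_a(article), "b": criterion_b(article)}

# Made-up example record, not one of the reviewed articles.
example = Article(unseen_test_images=5000, modalities={"MRI", "CT"}, organs={"brain"})
print(generalisation_flags(example))
```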

III. RESULTS
A. Research trends
We studied published work on hybrid CNN-Transf/Attention models in MI and observed a consistent increase of these models in 2021 and 2022, compared with the first 2 years of our observation window. In Fig. 1a, we present the PRISMA flow used to search and review articles. Fig. 1b shows the country of origin derived from each affiliation across all articles (all affiliations were considered across publications). Considering the entire review period (2019-2022), the top ten countries were: China (74 publications), USA (34), UK (14), India (9), Germany (5), Hong Kong (4), Canada (3), Taiwan (3), South Korea (3) and Italy (3). Table I lists all the articles grouped by MI modality, CNN backbone, Transf/Attention model and clinical application (organ) [13][14][15][16][17][18][19][20][21]. Implementation details about the data augmentation technique, optimizer, loss function and the metrics used to evaluate the performance of each hybrid model across studies are presented in Table II. Table III presents whether public and/or private data were analysed, together with information about the data size.

B. Experimental settings and key architectural designs

Medical imaging modality recruited
We reviewed the publication record of the MI modality used per year (Fig. 2a). Most of the studies involved MRI (50 studies), followed by CT (42), retinal imaging (14), x-Rays (12), ultrasound (7) and PET-CT (5). Although MRI was less frequent than CT in the first 2 years of our observation time frame, it outnumbered CT in the last two years (2021 and 2022).

CNN model used
In Fig. 2b, we show all CNN backbone models used across studies. Standard CNN architectures were implemented in most of the studies (40 articles), followed by UNet (30), GAN (14), ResNet (14), DenseNet (7), None (i.e., no CNN backbone; only a Transformer model was used) (7), fully connected networks (FCN) (6) and VGG (3).

Transformer and attention mechanisms
Fig. 2c illustrates the evolution of Transf/Attention models recruited per year. Self-attention mechanisms were the most widely used (64 studies out of 112 in total), followed by Transformer (22 studies), ViT (9 studies), channel- and spatial-attention (6), ST (4), attention (2) and other (11). It is known that, to exploit the performance capabilities of full Transformer models, a combination of large data and supercomputer facilities is necessary [10][11][12]. In our review, numerous studies analysed relatively small (i.e., <2,000 images) data (~27%), and/or private data alone (~29%), and/or did not report the data size (~21%). Details are presented in Table III. Furthermore, computational resources were not reported in most studies, with only 29 out of 83 reporting the number of model parameters and/or time for training. Of note, only 8 out of 112 studies (~7%) described use of full original Transformer, ViT or ST models, with the remaining ~93% involving self-attention, channel- and spatial-attention, attention, and simplified and light Transformer versions including transformer blocks (of stacked layers), layers, or encoders (Table I).

CNN and Transf/Attention combinations
In Fig. 4b, we show that the incorporation of self-attention mechanisms was the dominant choice, distributed across all CNN model types. Transformers were the second most frequent type and were used across all CNN model types, apart from VGG. ST was the third most common type and was only used in conjunction with standard CNN and UNet structures. Based on our findings, mainly "light" (simplified) Transformer blocks, encoders or layers were used across studies (Table I). Novel transfer learning strategies, multi-centre data and/or increasingly available supercomputer facilities may encourage the use of full Transformer architectures in future work [140,141]. However, the current hybrid models have shown performance breakthroughs across studies, highlighting them as powerful and relatively simple (compared with large pre-trained models) techniques for the MIA tasks reviewed. For standard CNN and UNet structures, all Transf/Attention mechanisms were used, except for standard attention mechanisms and ViT (Fig. 4b). For GAN models, only self-attention mechanisms were implemented. These results demonstrate that there was a large degree of variability in terms of CNN-Transf/Attention combinations across studies. Moreover, large variability was observed in the data augmentations, loss functions and metrics used to evaluate findings (Table II).

Downstream tasks and clinical applications
Fig. 4c illustrates all Transf/Attention components used across each downstream task. Self-attention was implemented across all organ areas (Fig. 4d). Transformer architectures were used in the brain, lung, multiple organs, heart, retina, neck, and pancreas. ViT and ST were mainly applied to a relatively limited clinical application space: lung and retina, and brain and heart, respectively. Similarly, channel- and spatial-attention have been used in multiple organs, brain, lung and chest.

C. Performance and generalization opportunities
Most proposed hybrid models have outperformed baseline and previous SOTA comparison methods across downstream tasks. Although the evaluation metrics used differed considerably across image analysis tasks and studies, making direct comparisons challenging (Table II), there was a clear performance improvement when Transf/Attention mechanisms were used across studies. Some of the studies demonstrated either large (≥ 5%) differences against the best baseline models [21,35,42,46,49,78,79,94,101,108,117,121,122,126,127,135], or moderate (<5%) but consistent improvements across different metrics evaluated [13,18,39,53,54,56,57,62,70,91,105] and/or data used [98,100,103,105,108]. In the following paragraphs, we detail studies that followed our 2 objective generalisation criteria (see Methods): whether a hybrid model a) was trained and/or evaluated on large unseen testing data (>2,000 images, Table I), and b) analysed data from heterogeneous modalities, and/or multiple modalities and/or multiple organ areas. The term "whole body" (Table I, Fig. 3b), used by Xue et al. [118] and Dong et al. [136], describes simultaneous imaging covering the entire body using a single PET-CT session, a dedicated full body technique that has been recently developed [2]. The term "multiple organs" describes all studies that imaged more than one organ in the same imaging setting [15,19,35,41,65,66,76,70,71,76,100,108,109,111,113,117,124,127], including whole body studies [118,136].

Segmentation
Image segmentation is an important aspect of MIA, as it is a necessary intermediate step towards extracting a region of interest within the organ under investigation [142][143][144][145][146]. Although UNet models revolutionised medical image segmentation [147,148], image segmentation remains an open challenge as it relies on strong supervision and hence requires a large fraction of labelled data. However, there is a considerable "data challenge" barrier, as labels are commonly limited for MI data [2,146]. To address this, several approaches have been proposed, such as disentangled representations for semi-supervised learning, which can generate accurate segmentations by only using a small fraction of labelled data [146], or GAN techniques to obtain accurate paired synthesis of images and segmentation masks [149].
Cheng et al. proposed a multi-task methodology for simultaneous glioma segmentation from MRI images and parallel classification of genetic profiles for neuro-oncology patients [53]. They developed a CNN model with serial ResNet blocks in the encoder and decoder. Between the encoder and decoder, 2 Transformer layers were engineered. Unlike most of the MRI and CNN studies, the authors used multi-parametric MRI data for image segmentation (4 different MRI sequences). The authors compared their method against 10 baseline CNN models and demonstrated superior performance for both tasks. In the context of small but heterogeneous data analyses, Wang et al. designed a CNN encoder-decoder model with residual connections and self-attention modules connected with CNN layers in the encoder [57]. The authors demonstrated that their method outperformed all baseline models in identifying COVID-19 lung abnormalities from CT images. They also developed a zero-shot learning strategy based on the same hybrid model, in which a UNet model was applied to predict pseudo-labels in a non-labelled CT dataset, which in turn guided semi-supervised learning.

Next to limited labelled data, another challenge in medical image segmentation is the analysis of "less anatomical" and more "biophysical" imaging data, in which imaging physics are modified so that anatomical information at the pixel level is "sacrificed" to "emphasize" perfusion, functional, temporal or other biophysical information [2][150][151][152][153]. Most segmentation algorithms focus on imaging sequences that contain enough anatomic (to efficiently guide semantic) representations during training [2,153]. Shi et al. developed a powerful method that is capable of analysing 4 different parametric perfusion maps: a) cerebral blood volume, b) cerebral blood flow, c) time to maximum peak and d) mean transit time (of contrast enhancement). They developed two parallel subnetworks to analyse blood flow (a, b) and time (c, d) parameters simultaneously. Each subnetwork included a CNN model with skip connections between the encoder and decoder. A cross-attention module was incorporated between the encoder and decoder for feature fusion. The model was compared against baseline methods (achieving higher or comparable performance, depending on the metric) and evaluated on both public and in-house data. Their method may be promising for other types of perfusion imaging data (MRI, PET, ultrasound) from which similar perfusion maps can be extracted and jointly analysed.

Rajamani et al. [18] engineered a deformable attention module into a UNet model. Their model (called "DDANet") was trained and tested on a large publicly available CT COVID-19 dataset, achieving superior performance for lung infection segmentation compared to baseline models. The authors suggest that their model can be adapted to generalise in detecting small and irregular lesions for other disease areas. In a retinal image analysis study, Mou et al. developed a versatile curvilinear structure segmentation network, based on dual self-attention modules, which can address both 2D and 3D retinal imaging data. In their model, named "CS2Net", they devised two channel- and spatial self-attention mechanisms to generate attention-aware features and capture long-range contextual information. By performing extensive experiments on 9 (2D and 3D) datasets, they demonstrated SOTA performance in detecting curvilinear structures from different imaging modalities. They showed that their technique can work as a generalized approach for retinal morphology analyses. Of note, such hybrid models can be impactful, since retinal imaging is not only used to assess ophthalmic pathologies, but also changes in retinal morphologies that may occur early in a broad spectrum of diseases, such as Alzheimer's [154], cardiac pathologies [155], cerebral small vessel disease [156] and others. Xu et al. replaced 2 layers in the encoder and 1 layer in the decoder of a UNet model with self-attention mechanisms [108]. Their hybrid model achieved SOTA performance in segmenting several fetal anatomies, when compared to 6 other models. Segmenting fetal structures from ultrasound is particularly challenging due to moving and fuzzy anatomical organ boundaries [157].

Sinha et al. developed a generalizable hybrid model for segmentation of numerous abdominal, cardiovascular and brain structures by analysing different MRI sequences [111]. The authors used a ResNet model for initial feature extraction; the features were then fed into a stack of spatial and channel self-attention mechanisms. They demonstrated superior performance against 6 previous SOTA baselines. The model was capable of perceiving a broad spectrum of anatomical (different organs) and semantic (different MRI sequences) information and can therefore potentially benefit future single- and multi-centre analyses [2,29]. Xie et al. developed a 3D UNet architecture which consisted of 2 cascading UNets, both enhanced with self-attention [121]. The overall model was trained on a chronic obstructive pulmonary disease (COPD) CT dataset. Following training, the hybrid model was evaluated on COPD data and on an unseen COVID-19 dataset. The model outperformed previous techniques in detecting several lung nodules in COPD and COVID-19 data. Following further validation using CT data from other organs, this hybrid approach may have broad applicability in terms of detecting small and irregular lesions across different diseases and organ areas.
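Several of the segmentation models above apply spatial and channel self-attention to CNN feature maps. The following NumPy sketch shows the general idea only and is not the implementation of any cited model: it omits the learned query, key and value projections, and the scaling factors and the residual summation used to fuse the two branches are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the maximum for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(features):
    """Each spatial position attends to all positions. features: (C, H, W)."""
    c, h, w = features.shape
    x = features.reshape(c, h * w)                 # (C, N) with N = H * W positions
    affinity = softmax(x.T @ x / np.sqrt(c))       # (N, N) position-to-position weights
    return (x @ affinity.T).reshape(c, h, w)       # weighted average over positions

def channel_self_attention(features):
    """Each channel attends to all channels. features: (C, H, W)."""
    c, h, w = features.shape
    x = features.reshape(c, h * w)
    affinity = softmax(x @ x.T / np.sqrt(h * w))   # (C, C) channel-to-channel weights
    return (affinity @ x).reshape(c, h, w)         # weighted average over channels

def dual_attention(features):
    """Residual fusion of both branches by summation (an assumed fusion rule)."""
    return features + spatial_self_attention(features) + channel_self_attention(features)

# Toy feature map standing in for the output of a CNN encoder stage.
features = np.random.default_rng(0).standard_normal((4, 6, 6))
print(dual_attention(features).shape)
```

The spatial branch builds an N x N affinity matrix over positions, so its cost grows quadratically with feature-map size, whereas the channel branch builds only a C x C matrix. This is one reason such modules are typically placed on downsampled CNN features rather than on full-resolution images.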

Pathology detection
Pathology detection is ultimately the end goal of any MIA task, with segmentation, localisation and classification commonly being designed as parallel joint tasks. In their noteworthy study, Zhou et al. proposed a cross-supervised method called REFERS, which generates x-Ray image labels from radiology reports, to perform lung pathology detection through image classification [52]. The authors employed ViT blocks composed of multi-head self-attention mechanisms to learn joint representations from multiple radiograph views and corresponding radiology reports. Subsequently, the model performs feature fusion and employs two additional subnetworks for bidirectional visual to textual feature mapping. REFERS was first pre-trained on a source domain x-Ray dataset and then fine-tuned on 4 well-established datasets (target domain with text labels). In their transfer learning strategy, the authors performed fully supervised learning by using structured radiology report labels. Differing from other models, their technique did not require labels during pre-training. The authors also showed that their model outperformed powerful baseline models on all datasets under extremely limited supervision (1% labelled images during fine-tuning). Their model was consistently accurate in detecting several lung pathologies and thus has tremendous potential for real-world applications where labelling is substantially limited.

TABLE II
All the articles grouped based on the downstream MIA task, the number of data augmentation techniques, optimiser, loss function and metric used to evaluate results. To keep the information concise, details are prioritised for the top 2 downstream tasks. All other MIA tasks are organised under the "Other MIA tasks". MIA: medical image analysis, NR: not reported, SGD: stochastic gradient descent, ACC: accuracy.
To address large-scale analysis from different domains, Wood et al. developed a DenseNet-based supervised learning framework for detecting clinically relevant abnormalities from clinical T2-weighted and diffusion-weighted head MRI scans [39]. The DenseNet model was trained using a Transformer-based neuroradiology report classifier to generate a labelled dataset of 70,206 examinations from 2 UK hospitals. The Transformer model was trained using a small dataset (N = 5,000) of neuroradiology reports. The authors showed accurate, fast and generalisable classification of abnormal against normal brain MRI between hospitals. This work demonstrated the merit of CNN and Transformer synergy when combined under the same MIA pipeline. In the context of retinal image analysis, Wang et al. proposed MsTGANet, a UNet model enhanced with a Transformer block that consists of a series of multi-head self-attention mechanisms incorporated in the encoder, to capture both local and global pixel interactions early in the learning process [45]. A series of channel- and spatial-attention modules were also inserted between different positions of the encoder and decoder, to efficiently fuse feature semantics during training. At inference, the model predicted labels in non-labelled data, which were then used as pseudo-labels to augment the dataset and guide semi-supervised learning. The model outperformed previous SOTA methods in supervised and semi-supervised segmentation tasks. Given the improved performance during semi-supervision, the model can potentially be used to analyse further retinal images and ophthalmic pathologies.
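The pseudo-labelling step described above can be sketched generically as follows. This is not the cited authors' procedure: the confidence threshold, the function name `select_pseudo_labels` and the toy probabilities are assumptions, and the example works on per-sample class probabilities, whereas a segmentation model would apply the same idea per pixel.

```python
import numpy as np

def select_pseudo_labels(probabilities, threshold=0.9):
    """Keep confident predictions on unlabelled data and use them as pseudo-labels.

    probabilities: (num_samples, num_classes) model outputs for unlabelled samples.
    threshold: assumed minimum confidence for a prediction to be kept.
    Returns the indices of the kept samples and their pseudo-labels.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    confidence = probabilities.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, probabilities[keep].argmax(axis=1)

# Toy predictions for three unlabelled samples (made-up numbers).
unlabelled_probs = [[0.97, 0.03], [0.55, 0.45], [0.08, 0.92]]
keep, pseudo = select_pseudo_labels(unlabelled_probs)
print(keep.tolist(), pseudo.tolist())
```

The kept samples and their pseudo-labels would then be appended to the labelled training set for a further round of training. Filtering by confidence is one common way to limit the propagation of wrong labels, although it does not remove that risk.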

TABLE III
All the articles grouped based on the downstream MIA task, data set (public, private, both) and the data size. To keep the information concise, details are prioritised for the top 2 downstream tasks. All other MIA tasks are organised under the "Other MIA tasks". MIA: medical image analysis, NR: not reported.

Zhang et al. devised a 3D ResNet block that operated as the initial feature extractor before feeding feature information into a self-attention block [21]. The authors performed several classification experiments for identifying Alzheimer's disease and mild cognitive impairment from MRI data, showing superior performance against baseline methods. Although they focused on T1-weighted data (which are mainly used for anatomical imaging and do not contain "functional" [158] or "perfusion" [159,160] tissue information), they analysed data from both 1.5T and 3T MRI scanners, which are known to differ in signal-to-noise ratio, imaging content and artefacts [1,2][158][159][160]. Since their technique was assessed on public data (ADNI) for 2 different brain pathologies and analysed data from MR scanners of different field strengths, it is a promising candidate for assessment across further MRI data, organ areas and pathologies. Another study by Let et al. [35] proposed a CNN encoder-decoder network connected with a self-attention mechanism (called PreSANet) to detect cancer recurrence, distant metastasis and overall patient survival for head and neck cancer patients. The model was trained on public data and was validated on various unseen datasets, demonstrating good (~70%) generalisability. Chen [93]. Other hybrid model studies demonstrated innovative architectures and high diagnostic accuracies in the setting of pathology detection, although using smaller datasets [47,49,57,84].

Reconstruction
Medical image reconstruction aims to form an image representation from raw signals acquired by the scanner [2]. Reconstruction of fast acquisitions (of periodically moving organs such as the heart) and/or low doses (e.g., CT) has important clinical applications. Using relatively small but highly diverse data, Zhou et al. developed a CNN-based method enhanced with self-attention for ultrasound image reconstruction of various organs and tissues [117]. Another study demonstrated accurate brain reconstruction by using a CNN with Transformer layers on large MRI data (>30,000 MR images) [56]. Tan et al. devised a CNN model with residual connections in which channel- and spatial-attention modules were engineered to reconstruct x-Ray images of the lung from a large dataset (>55,000 images) [80]. Other studies focused on MRI reconstruction and demonstrated accurate and generalisable hybrid models by analysing large and diverse imaging data [114,129,139].

Synthesis
Image synthesis is an important field as it can address the need for data augmentation across different modalities [2,161]. Yang et al. developed a CycleGAN with self-attention for unsupervised MR-to-CT synthesis, outperforming 2 plain CycleGAN baselines [16]. In the field of MR-to-CT synthesis, Dalmaz et al. developed a series of residual Transformer blocks between the encoder and decoder of a CNN [66], and Tomar et al. developed a GAN model with ResNet blocks and self-attention modules for cardiac and brain image synthesis [107]. Wei et al. developed a first-of-its-kind GAN model with self-attention in the generator and discriminator that was able to synthesise PET-derived myelin content through the analysis of multi-sequence MRI data [20].

Denoising
Denoising is an important step prior to image quantification as it can enhance the signal-to-noise ratio and remove artefacts [2][26][27][28][29]. Li et al. combined a 3D CNN model with self-attention blocks and an autoencoder perceptual loss (used as a self-supervised learning module) with CNN-based and GAN-based models. They achieved improved denoising performance against baseline models for chest and abdominal CT images [124]. Huang et al. proposed an end-to-end CycleGAN model with criss-cross self-attention and channel-attention mechanisms to reduce noise, remove artefacts and preserve anatomical structures in low-dose dental and abdominal CT images [19]. Following further validation, this can be a valuable method for future applications across multiple organs and/or modalities. To denoise low-count PET images, Xue et al. developed a 3D GAN model with self-attention, achieving improved performance against baseline methods [118]. Their method was evaluated on large-scale PET data, showing that it can improve PET image quality, reduce motion artefacts and provide accurate diagnostic information.

Localisation
Image localisation focuses on detecting the location of an area of interest within MI data [26,27]. Tao et al. proposed a ResNet model for initial feature extraction followed by a series of self-attention and cross-attention mechanisms for vertebrae CT localisation and segmentation [13]. They demonstrated accurate and generalisable performance across 2 CT datasets. Li et al. developed a DenseNet model parallelised with a ViT block to extract local and global pixel dependencies, which were fused before being fed into a CNN model. Their technique outperformed baseline models on classification and localisation of several lung abnormalities when trained and tested on a large x-Ray dataset (of >112,000 images) [81]. Xie et al. used a pre-trained VGG model with self-attention to enhance feature extraction before feeding this information into 2 subsequent CNN models [112]. They demonstrated accurate fovea localisation in 2 different retinal imaging datasets.

Registration
Image registration is the process of aligning the spatial coordinates of different images into a common geometrical coordinate system. Image registration has wide applications in multi-modal and longitudinal MIA [162,163]. Yang et al. developed a plain Transformer encoder with an attention-based decoder model for brain MRI registration, demonstrating accurate results against baselines across 3 different datasets [75]. Song et al. proposed a CNN model with Transformer blocks consisting of modified multi-head self-attention for brain MRI registration, producing state-of-the-art registration performance [77]. Although analysing brain images from different MRI sequences is challenging, the brain is a static organ that is less prone to misregistration across modalities. Further work is required to expand towards organ areas that are subject to periodic (e.g., heart) and non-periodic (e.g., abdomen) motion, and to register images from different modalities.
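
To make the notion of a "common geometrical coordinate system" concrete, the following minimal sketch aligns a set of 2D landmarks with a known rigid transform and measures the misalignment before and after. It is a deliberately simplified, landmark-based illustration and not the learned registration used in the reviewed studies [75,77]; the landmark coordinates, the transform parameters and the function names are all hypothetical.

```python
import math

def rigid_transform(points, angle_rad, tx, ty):
    """Rotate 2D points about the origin by angle_rad, then translate by (tx, ty)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def mean_squared_distance(a, b):
    """Mean squared Euclidean distance between corresponding landmarks."""
    return sum((xa - xb) ** 2 + (ya - yb) ** 2
               for (xa, ya), (xb, yb) in zip(a, b)) / len(a)

# Hypothetical landmarks expressed in the fixed image's coordinate system.
fixed = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

# Simulate a "moving" image: the same landmarks rotated by 90 degrees and shifted.
moving = rigid_transform(fixed, math.pi / 2, 3.0, -1.0)

# Undo the misalignment: remove the shift first, then rotate back.
unshifted = rigid_transform(moving, 0.0, -3.0, 1.0)
registered = rigid_transform(unshifted, -math.pi / 2, 0.0, 0.0)

error_before = mean_squared_distance(moving, fixed)
error_after = mean_squared_distance(registered, fixed)
print(f"misalignment before: {error_before:.3f}, after: {error_after:.3e}")
```

In practice the transform is unknown and must be estimated, and organs subject to motion typically require non-rigid (deformable) transforms rather than the single rotation and shift assumed here.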

Discussion
A. Current opportunities and challenges
We studied all the articles from the perspective of 4 professionals (co-authors GP, ND, CW, GY) with extensive experience in deep learning and MI. We identify general challenges and opportunities from the multi-disciplinary perspective of developers and end-users of these hybrid models in MI. To the best of our knowledge, there is no previous review focusing on these topics and, given the heterogeneous architectures of these models, more extensive studies are required in the future to develop data-driven generalisation best practices for both developers and end-users. The following points can therefore guide future work and systematic reviews towards solidifying these hybrid models in further, larger and multi-centre studies.

Challenges (with mitigations, where applicable):
1) We highlighted studies that have the potential to work as generalisation frameworks. However, additional validation is required to transfer a method from one organ and/or imaging modality to another, due to data content differences.
2) Model architectures varied considerably when similar hybrid models were compared. For example, in studies for which a UNet with self-attention was developed, there were large disparities in terms of how these individual components were combined.
3) The previous point indicates that a trial-and-error logic is currently followed for model development, based on which architecture performs optimally for a given dataset. Nevertheless, this runs in the opposite direction from developing generalised models and best practices. It is important for the community to initiate discussions about the development of generalisation frameworks, based on certain data-driven boundary conditions: e.g., UNet-full Transformer for cardiac segmentation would be an optimal design if particular conditions on data size, data content (e.g. T1-weighted MRI) and in-house computational capabilities are satisfied. Thus, solid domain generalisation strategies that methodically address "why" and "how" to develop model X for data Y are required.
4) Developing harmonised implementation protocols is particularly challenging. Implementation aspects such as data augmentation, optimisers, loss functions and pre-processing differed substantially even between studies working on the same problem (e.g., CT for lung segmentation).
5) It will be challenging to develop robust interpretation mechanisms for complex local-global pattern recognition models that are not solely based on visualisation maps.
6) There is an increasing trend towards developing causal logic in novel deep learning models, a field known as "Causal Representation Learning, CRL" [164]. The aim of CRL is to address open problems in the field such as model generalisation and transfer learning [164,165]. Central to CRL is the discovery of high-level causal variables (objects in an image) from low-level observations (embeddings) [164]. One of the main challenges that must be addressed is how to factorise causal structures from deep learning embeddings [164,165]. CNN-Transf/Attention models have an additional level of complexity due to learning embeddings from both local and global interactions. Thus, careful consideration must be given to how CNN-Transf/Attention models and causality can be combined so that each benefits from the advances of the other [164].

Opportunities:
1) Based on the performance gains achieved, hybrid model studies can place emphasis on studying generalisation perspectives and standardisation protocols for multi-centre large-scale analyses.
2) Given diagnostic performance improvements across diverse studies, there is a potential to enhance early diagnosis and preventative medicine.
3) As of 2022, cardiovascular diseases, cancer, stroke, COVID-19, chronic respiratory diseases, diabetes and neurological diseases are the leading causes of death in the USA [166]. Most studies (>90%) in our review focused on at least 1 of the organ/pathology areas corresponding to these leading causes, showing the potential to improve diagnosis and patient outcomes.
4) Technical versatility on multi-modal analyses can be achieved through CNN-Transf/Attention (images, natural language, molecular profiles, clinical history), which can yield useful complementary information.
5) From a clinical perspective, multi-modal data analysis can enhance precision medicine by combining patient-level information from different modalities.
6) Redirect healthcare funds towards improving treatment design and optimisation, based on multi-modal patient characteristics.
7) Accelerate the pace to introduce regulations and processes, to establish that AI for MIA can be generalisable and reproducible for certain MI and clinical applications.
8) Focus on integrating CNN-Transf/Attention with CRL to enhance model generalisability and trustworthiness in the clinical domain.
9) Develop robust transfer learning methods to fully explore the benefits of CNN-Transf/Attention models on out-of-distribution datasets.

Importance and drawbacks
The combination of local and global receptive fields together with reasonable computational power requirements highlights the development of CNN-Transf/Attention models as an important research direction in MIA. The large diversity of architectures, even across the same downstream tasks or applications, means that for some of these methods limited scalability may be one of the main drawbacks [52]. Furthermore, full Transformer architectures were limited in our reviewed work, mainly due to the relatively small datasets analysed in some studies, limited computational power and/or lack of solid transfer learning approaches for pixel-level predictions [52,55,141,167]. Further work is required in the field of transfer learning techniques for model generalisation on out-of-distribution data, to utilise the benefits of full Transformer-based hybrids.
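
The contrast between local and global receptive fields can be illustrated with a small sketch. The code below compares the weights of a convolution-like local averaging filter with those of single-head scaled dot-product self-attention on a toy sequence of tokens. It is a sketch under stated assumptions rather than any reviewed model: queries, keys and values are taken to be the tokens themselves (real models learn separate projections), the local filter is a uniform average, and the token values and function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_weights(tokens):
    """Weights of single-head scaled dot-product self-attention.
    Assumption: queries and keys are the tokens themselves (identity projections)."""
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    return weights

def local_filter_weights(n, radius=1):
    """Uniform local averaging: position i only sees positions within `radius`."""
    weights = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        row = [0.0] * n
        for j in range(lo, hi):
            row[j] = 1.0 / (hi - lo)
        weights.append(row)
    return weights

def mix(weights, tokens):
    """Apply a weight matrix to the tokens (values are the tokens themselves)."""
    d = len(tokens[0])
    return [[sum(w * t[c] for w, t in zip(row, tokens)) for c in range(d)]
            for row in weights]

# Hypothetical toy "image" flattened into 5 tokens with 2 features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]

attn = self_attention_weights(tokens)
local = local_filter_weights(len(tokens), radius=1)
attn_out = mix(attn, tokens)

# Number of positions that contribute to each output position.
attn_reach = [sum(w > 0 for w in row) for row in attn]
local_reach = [sum(w > 0 for w in row) for row in local]
print("attention reach:", attn_reach)
print("local filter reach:", local_reach)
```

Every attention output draws on all positions, whereas each local output draws only on its immediate neighbours; the price is that the attention weight matrix grows quadratically with the number of tokens.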

B. Future perspectives of the post-hybrid model era

Full transformers, ChatGPT and beyond
Recent developments in ChatGPT large language models (LLMs) have induced a phenomenal disruption in the field of data analysis and AI. To date, the latest ChatGPT version is based on GPT-4 (launched in March 2023), reported as the largest LLM trained (>170 trillion parameters) [168-170]. The main strength of the GPT-4 model is that it has been trained on a diverse and broad (in terms of topics) set of internet text including books, articles and websites, using reinforcement learning from human feedback that either rewards or "punishes" the model [170]. One of its main capabilities is that it can perform data predictions through conversational tasks ("responses" to user "queries"). ChatGPT models perform Transformer-based and self-supervised learning-derived predictions [170]. There have been some first promising approaches involving GPT models for MIA, although mainly limited to image-to-text mapping [104,171,172]. Wang et al. used pre-trained CNN models to extract outputs from x-Rays of the lung and applied report generator GPT models to summarise the results and derive a diagnosis in text [171]. Another study by Chen et al. used a pre-trained GPT-2 model with a visual encoding part that involved attention, to perform accurate image captioning as evaluated on natural images and x-Ray data [172]. Jeblick et al. developed a GPT-based technique that focused on simplifying radiology reports but without using imaging data as inputs [173]. Although it can be anticipated that GPT models may expand towards MIA, there are several limitations that need to be considered. First, to the best of our knowledge, there is no GPT-based MIA yet on dense image-level predictions for the MIA tasks we have presented. Local receptive fields that are based on CNN feature extractors may be necessary to perform detailed image analyses, pointing towards the direction of heavier "hybrid models" in the future (CNN-GPT). On that note, it is unknown whether existing self-supervision modules within GPT models may be enough to predict complex organ and tissue pathologies from "high-content" data such as medical images, without the incorporation of "computer vision" CNN components. Furthermore, one important limitation of GPT models is the so-called "hallucination effect", which describes the tendency of GPT models to "invent" a term, eventually giving "incorrect" responses [174]. This can be the case for domains in which GPT models have been less or not yet specialised. Due to regulatory, ethical and organisational considerations from clinical and private MI data owners, we are still at an early stage regarding the multi-centre large-scale data analyses that need to be available as data sources for such open code or multi-centre fine-tuning strategies. In addition, MI data and text data are rarely available together. To date, given the demonstrated performance of current hybrid models in our review, convolutional Transf/Attention may be "all you need" in medical imaging.

Transfer learning coming from the future
An important yet unsolved aspect in MIA is the democratisation of modelling techniques and data. Transfer learning strategies focusing on increasing performance and generalisability while reducing computational power [141] can serve as democratisation vehicles. However, it is known that transfer learning for image-level predictions has been limited in MIA, compared to "smaller new" models that are trained de novo. Among a large number of studies demonstrating new models, we highlighted articles that showed robust pre-training with wide fine-tuning on large domain datasets with SOTA performance on testing data [52,55]. In their influential study, Liu et al. recently proposed "ConvNeXt" as a new pure CNN technique which involved several ST-inspired adaptations in the model design and transfer learning method [141]. Some of these ST-inspired adaptations were: same augmentation protocol, network width increase, bottleneck model inversion, kernel size enlargement, use of fewer activation functions and normalisation layers. Using ConvNeXt, they outperformed ST on ImageNet classification tasks while using comparable computing resources. Radford et al. adapted Transformer and ResNet/ViT models for jointly pre-training paired text and images, respectively [167]. By training on web data of 400 million image-text pairs, they demonstrated that the model can learn from image captions, which can be used as labels for image classification, showcasing a scalable and efficient process to learn image representations. Following pre-training, the text model can describe new visual concepts, allowing zero-shot transfer to new tasks and data. Further work in these directions will be particularly important to improve model design, pre-training (on out-of-domain data) and fine-tuning (on domain data) techniques towards efficiently democratising large hybrid models and data access.
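
The zero-shot transfer described for jointly pre-trained image and text models can be sketched as a nearest-prompt lookup in a shared embedding space: the image is assigned the class whose text prompt embedding is most similar to the image embedding. The sketch below assumes the embeddings have already been produced by pre-trained encoders; the 3-dimensional vectors, the prompts and the function names are made up for illustration and do not come from any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is closest to the image embedding,
    together with all similarity scores. No task-specific training is involved."""
    scores = {prompt: cosine_similarity(image_embedding, emb)
              for prompt, emb in prompt_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings; in practice these would come from the pre-trained
# text encoder (for the prompts) and image encoder (for the image).
prompt_embeddings = {
    "an x-ray of a lung with pneumonia": [1.0, 0.0, 0.0],
    "an x-ray of a normal lung": [0.0, 1.0, 0.0],
}
image_embedding = [0.9, 0.1, 0.0]

label, scores = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)
```

New classes can be added by writing new prompts, with no change to the encoders; whether encoders pre-trained on web data produce useful embeddings for medical images is exactly the open transfer learning question raised above.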

Conclusions
In conclusion, hybrid models led to performance gains while demonstrating a wide range of generalisation opportunities based on their large-scale, multi-modal, heterogeneous and/or broad span of clinical applications. The main challenge of these techniques is to align their large architectural diversity with the current technical and clinical needs in precision and preventative medicine. Based on the opportunities that we have emphasised, we aim to encourage further work on data-driven generalisation frameworks, to develop criteria for the future design of these powerful hybrid techniques. We also seek to inspire further work in the field of transfer learning for generalisation on out-of-distribution data so that models and data can be further democratised. Our review demonstrates the benefits from the co-pollination of CNN and Transformer-inspired models, which can open new horizons to further exploit CNNs, full Transformers and LLMs. Next to these opportunities, our review demonstrated that the benefits of CNN-Transf/Attention outweigh the challenges and may therefore be "all you need" for future validation and standardisation processes in clinical imaging.

Fig. 1a. PRISMA flowchart. The flowchart illustrates inclusion and exclusion of papers at each review stage. b. Publications per year across the top 18 countries (in terms of publication record). Publications with affiliations from multiple countries have been accumulated on a per-country basis.

Fig. 2. Publication record over time for the Medical Imaging modality (a), the CNN model type (b), and Transf/Attention architectures (c). In b), the CNN (ALL) term describes all standard CNN models captured across studies: CNN encoder-decoder (E-D), CNN layers, CNN decoder only and

All articles grouped based on the clinical application (organ), MI modality, CNN and Transf/Attention model. To keep the information concise, details are prioritised for the top 4 organ areas (brain, lung, multiple organs and retina) in terms of prevalence, the top 2 MI modalities present per organ and the top 3 CNN (ALL) model types per organ. Transf/Attention modules were categorised into: a) Self-Attention, b) Transformer, ViT or ST (full models) and c) Transformer, ViT or ST Encoder(s), Block(s) or Layer(s). All other organs, MI modalities, CNN and Transf/Attention modules are grouped under the term "Other". Studies occurring in >1 Table cell correspond to model combinations. Missing rows of CNN models correspond to absence of this technique per MI modality. MI: medical imaging, ViT: Vision Transformer, ST: Swin Transformer.

Fig. 3. Publication record over time for the medical image analysis task (a), and the organ/anatomical area (clinical application) under investigation (b), respectively. The term "whole body" (Table I, Fig. 3b) by Xue et al. [118] and Dong et al. [136] describes simultaneous imaging covering the entire body using a single PET-CT session, a dedicated full body technique that has been recently developed [2]. The term "multiple organs" describes all studies that imaged more than one organ in the same imaging setting [15,19,35,41,65,66,76,70,71,76,100,108,109,111,113,117,124,127], including whole body studies [118,136].

Fig. 4. Publication record showing combinations between the Transformer model/component type and a) the Medical Imaging

et al. developed a ResNet model enhanced with a U-Transformer with multi-level skip connections and outperformed SOTA baselines on anomaly detection (for pathology detection) from large MRI, CT and retinal imaging data [41]. Mondal et al. pre-trained a ViT encoder connected with an FCN layer, to discriminate COVID-19 positive cases from other pneumonia types and normal controls [55]. The model was trained on the ImageNet dataset, fine-tuned on a large collection of chest x-Rays and tested on both CT and x-Ray lung data. Zhao et al. proposed a UNet model with residual blocks enhanced with self-attention, to classify malignant from benign thyroid nodules from ultrasound images. The model was evaluated on a large-scale dataset via extensive experiments and achieved high performance (89%) [86]. Wu et al. developed a ViT encoder and performed accurate diabetic retinopathy grading from retinal images using a large Kaggle dataset [89]. Duong et al. developed an Efficient backbone model connected with a full ViT and demonstrated accurate and generalisable detection of tuberculosis from heterogeneous x-Ray public sources [90]. Lin et al. developed a deformable ResNet model with self-attention incorporated to detect irregular and diffused lung nodules due to COVID-19 infection and showed SOTA performance in large and diverse public datasets [92]. Shome et al. developed a Transformer encoder connected with an MLP block to perform multi-class classification of COVID-19 infection against other pneumonia types and normal lung, from large x-Ray datasets