CT Perfusion is All We Need: 4D CNN Segmentation of Penumbra and Core in Patients With Suspected Ischemic Stroke

Precise and fast prediction methods for ischemic areas, comprising dead tissue (core) and salvageable tissue (penumbra), in acute ischemic stroke (AIS) patients are of significant clinical interest: they play an essential role in improving diagnosis and treatment planning. Computed Tomography (CT) is one of the primary modalities for early assessment in patients with suspected AIS. CT Perfusion (CTP) is often used as a primary assessment to determine stroke location, severity, and volume of ischemic lesions. Current automatic segmentation methods for CTP mostly use as input the already-processed 3D parametric maps conventionally used for clinical interpretation by radiologists. Alternatively, the raw CTP data is used on a slice-by-slice basis as 2D+time input, where the spatial information over the volume is ignored. In addition, these methods only segment core regions, while predicting penumbra can be essential for treatment planning. This paper investigates different methods to utilize the entire 4D CTP as input to fully exploit the spatio-temporal information, leading us to propose a novel 4D convolution layer. Our comprehensive experiments on a local dataset of 152 patients divided into three groups show that our proposed models generate more precise results than the other methods explored. With the proposed 4D mJ-Net, a Dice Coefficient of 0.53 and 0.23 is achieved for segmenting penumbra and core areas, respectively. The code is available at https://github.com/Biomedical-Data-Analysis-Laboratory/4D-mJ-Net.git.


I. INTRODUCTION
An acute ischemic stroke (AIS) generally occurs when a segment of the supplying arteries of the brain is occluded by a blood clot, preventing the regular flow of oxygen-rich blood to the capillaries in the brain tissue. The ischemic area can roughly be divided into two different types: 1) penumbra, areas where the tissue is still vital but critically hypoperfused [1]; and 2) core, referring to non-salvageable tissue. If blood flow is not restored in time, penumbra regions may rapidly develop into irreversibly damaged core regions. Therefore, a fast and accurate understanding of the ischemic areas is fundamental to plan the treatment and tailor further procedures to every single patient.
The recommended modalities for diagnostic imaging in AIS patients are Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) [2]. In the initial stages of an acute ischemic stroke, CT Perfusion (CTP) has proven to be a fast and beneficial tool for both diagnosis and prognosis [3]. MRI with Diffusion-weighted imaging (DWI) or non-contrast CT (NCCT) are commonly utilized, after treatment, to assess the final infarct areas (FIAs) [4]. These imaging modalities are obtained hours or days after the patient's treatment.
CTP is a four-dimensional (4D) spatio-temporal examination to assess the passage of blood in the brain. It is performed by acquiring a series of three-dimensional (3D) CT scans of a specific portion of the brain at time intervals during contrast agent injection. By using an iodinated contrast agent, density changes in the brain tissue over time can be analyzed. The shape and height of the time density curve depend on the brain tissue's perfusion [5]. The raw 4D CTP contains a vast amount of data; hence, detecting ischemic strokes can be challenging for neuroradiologists, primarily due to the need for a precise diagnosis and rapid treatment decisions.
To overcome this challenge, medical doctors rely on a set of clinically interpretable parametric maps (PMs) generated from the 4D CTP scan [5], [6]. The generated PMs reduce the temporal dimension and produce 3D volumes. Commonly used PMs are cerebral blood flow (CBF), cerebral blood volume (CBV), time-to-maximum (TMAX), and time-to-peak (TTP). CBF represents the blood supply in the brain at a given time; CBV refers to the blood volume present at a given time in a brain region; TMAX is the flow-scaled residue function in the tissue; while TTP shows the time until the contrast agent reaches the tissue [5]. A maximum intensity projection (MIP) is also usually generated. MIP images are calculated as the maximum Hounsfield unit (HU) value over the time sequence of the CTP, providing a 3D volume from the 4D CTP acquisition. Although PMs provide helpful information about ischemic brain tissue, extracting them from the 4D CTP scans limits the spatio-temporal information to specific subsets of it [7]. Several methods [4], [8]-[11] have used thresholding techniques to predict the ischemic areas from the PMs. However, simple thresholding approaches oversimplify the complexity of AIS [12], [13].
Several DNNs have been proposed for AIS applications to predict and segment only the FIAs, using CT studies in combination with PMs derived from CTP scans as input [30]-[34]. Other researchers have proposed architectures to segment the ischemic lesion (i.e., the core) from the images obtained at hospital admission. Kasasbeh et al. [35] were the first to implement a CNN with a set of PMs as input for ischemic core segmentation. Tomasetti et al. [36] proposed a few-shot self-supervised architecture for hypoperfused (core + penumbra) tissue segmentation using a combination of PMs and raw scans as input to the model. The work demonstrated the feasibility of using self-supervised techniques for the segmentation of this type of tissue. Werdiger et al. [37] introduced a machine learning segmentation method to delineate hypoperfused tissue, demonstrating the capabilities of this methodology over classic thresholding approaches; they used four PMs as input features for their model. However, a general problem with all the methods mentioned above is that they rely on commercial CTP software and use heavily pre-processed information (i.e., PMs) rather than taking advantage of the totality of the raw 4D CTP scans.
DNNs are more suitable for discovering information from raw data. Nevertheless, relying on raw data (directly exploiting the temporal and spatial dimensions) is scarcely explored in the literature for AIS applications. The task is challenging because of the low contrast and low signal-to-noise ratio of CTP scans. Relatively few studies have proposed DNN models with encouraging results that exploit the temporal dimension to assess acute stroke lesions using 4D CTP scans. Soltanpour et al. [7] utilized CTP images to create 2D matrices in which each row is a voxel and each column is a time point. The 2D matrices are used as input for a model that shows encouraging results in differentiating healthy tissue from FIAs. Vries et al. [38] proposed a 2D+time symmetry-aware CNN-based architecture to segment FIAs using solely CTP scans. Their work estimated the irreversibly damaged areas, demonstrating the possibilities of using 4D CTP images for this task. Bertels et al. [39] used a U-Net-like structure for segmenting FIAs using CTP scans as input plus contra-lateral information. Results were promising, but further research is needed due to their far-from-ideal registration of the contra-lateral information. Rosa et al. [40] introduced a two-step model for estimating FIAs using the 4D CTP series as input: they first generate an arterial input function and later deconvolve it with a singular value decomposition approach to find the infarction. Amador et al. [41] designed a framework based on a Temporal Convolutional Network to predict AIS FIAs from 4D CTP studies. Due to memory constraints, they independently processed each 2D slice of the 4D CTP dataset. In their recent work, Amador et al. [42] also proposed an extension of their model in which 3D+time tensors of the ipsilateral stroke hemisphere are used as input to predict FIAs. Robben et al.
[43] proposed a DNN that predicts the FIAs directly from the raw 4D CTP plus patients' metadata. Their proposed architecture relied on a series of 3D Convolution layers; the input is a list of 4D CTP scans sampled at different resolutions. Their method presented promising segmentation results; however, their main target was to estimate the final infarct volume, allowing clinicians to simulate different treatments and gain insight for the procedures. They did not take the penumbra into consideration in their study. In addition, the quality of the ground truth images is debatable since they rely on NCCT follow-up images acquired between 24 hours and five days after the patient's admission; it has been reported that FIAs can grow after 24 hours in NCCT measurements [44].
All of the segmentation methods mentioned above rely on ground truth labels obtained from DWI and/or NCCT hours or days after the patient's admission, since they predict FIAs. Even though follow-up images (DWI and NCCT) represent the gold standard for estimating the core [4], these techniques have some limitations [45], [46]. Follow-up images can only be used to assess FIAs, not penumbra regions. Moreover, some studies have demonstrated that the detected FIAs can be partially reversed in DWI performed in an early time window [46]-[48].
The studies of [7], [38]-[43] indicate the potential of 4D data in AIS prediction. However, they all consider FIAs, and an appropriate method is still required to simultaneously handle the spatio-temporal information for segmenting the ischemic core and penumbra regions. Understanding the penumbra's extension during the first stages of an ischemic stroke is crucial for treatment decisions [49], [50]. To the best of our knowledge, our work in Tomasetti et al. [50] using machine learning, and later in Tomasetti et al. [51], [52] using DNNs, were the first to segment both core and penumbra areas. In [50], [52] the PMs were used as input; in [51] 2D+time CTP images were segmented slice-by-slice.
In this paper, we present and investigate three novel models to segment the two ischemic regions (core and penumbra), where the input is the entire 4D CTP scan arranged in different ways to exploit the spatio-temporal nature of the data. We compare all models with previous work based on PMs [52] and slice-by-slice CTP [51], and with two methods proposed by Amador et al. [41], [42].
The main contributions of this work can be summarized in four points: 1) We develop a novel 4D convolution layer that processes the 4D raw data. 2) We implement the 4D convolution layer and propose a DNN model, 4D mJ-Net, to segment ischemic core and penumbra areas from 4D CTP scans. 3) We extend previous methods [41], [51] to be used directly with 4D CTP scans as input. 4) To assess the results, we use manual annotations obtained by two expert neuroradiologists from the 4D CTP data upon patients' admission. We also demonstrate the feasibility of our proposed methods by comparing their performance with existing models that rely on different inputs.

II. DATA MATERIAL
A section of the brain is repeatedly scanned during the passage of 40 ml of iodine-containing contrast agent (Omnipaque 350 mg/ml) and 40 ml of isotonic saline, injected in a cubital vein with a flow rate of 6 ml/s, to highlight changes in the tissue; the scan delay was four seconds. Each brain slice contains a fixed number of time points t_max representing the temporal dimension. The width and height of each image are 512×512 pixels with a resolution of 0.4258 mm/pixel and a slice thickness of 5 mm. The first twenty time points are acquired every 1 s, while the remaining ten are acquired every 2 s. CTP scans from 152 patients collected between January 2014 and August 2020 formed the dataset; 137 of these patients had an AIS with a visible perfusion deficit. The remaining 15 patients were admitted with suspected stroke but were determined not to have suffered a stroke episode after the diagnostic workup. Raw perfusion data from the CTP examination was used to generate PMs with the software ''syngo.via'' from Siemens Healthineers, with manufacturer default settings. The arterial input function was automatically selected, with few exceptions where it was chosen manually (e.g., severe cardiac failure).
The patients were divided into the following groups: 77 patients with large vessel occlusion (LVO), 60 patients with non-large vessel occlusion (Non-LVO), and the remaining 15 patients without ischemic stroke (WIS). Based on CT angiography, LVO was defined as occlusion of any of the following arteries: the internal carotid artery, the M1 and proximal M2 segments of the middle cerebral artery, the A1 segment of the anterior cerebral artery, the P1 segment of the posterior cerebral artery, the basilar artery, and the vertebral artery. Non-LVO was defined as patients with a perfusion deficit with a more distal artery occlusion or with a perfusion deficit without a visible artery occlusion. The dataset is randomly split as reported in Table 1.

A. GROUND TRUTH
The manual annotations are based on the entire CT dataset, including the PMs derived from CTP; MRI performed during the first days after hospital admission was also utilized. Two expert neuroradiologists manually annotated ground truth images by utilizing the complete set of the CT examination (NCCT, CT angiography, and CTP), which includes PMs from the CTP (CBV, CBF, TTP, TMAX) and the MIP images. The PMs were visually assessed. In general, ischemic regions with increased TTP and TMAX and reduced CBF but preserved CBV were considered penumbra, while areas with additionally reduced CBV were deemed core. Additionally, the MRI examination, including DWI, conducted within 1 to 3 days after the CT examination, and clinical information were used to assist in generating the ground truth images. The annotations were performed using in-house software developed in Matlab.¹
Table 2 presents the formal notation adopted in the remainder of the paper. Let the data obtained from a CTP scan be defined as a 4D tensor V ∈ R^(X×Y×Z×T). After a series of pre-processing steps (details in Sec. III-B), we define the resulting 4D tensor as Ṽ ∈ R^(X×Y×Z×T). The four dimensions of a CTP scan are width (X), height (Y), depth (Z), and time (T). The list of time points in the time dimension is given by t, where t_max is the last time point of the list. The notation superscript indicates how the time dimension is adopted in the various inputs. Furthermore, we define z as the list of brain slices in the depth dimension, where z_max corresponds to the last slice. The notation subscript illustrates how the depth dimension is used in the inputs. Fig. 3 displays the input combination of all the techniques.

TABLE 2: List of formal notations used in the paper.
t: list of all the time points; t_max is the last time point.
z: list of all the brain slices; z_max is the last slice in the depth dimension.
z_I: set of indexes i, plus its neighbours i − 1 and i + 1.
P_{z_i}: 2D probability output of brain slice z_i.
Ṽ: input after the pre-processing steps.
Ṽ^t_{z_i}: list of 2D images Ĩ_{z_i} for all the time points t; input for 2D-TCN (Sec. IV-A3).
V^{t_j}_{z_I}: 3D volume of slices z_I for a time point t_j.
Ṽ^t_{z_I}: list of 3D volumes of slices z_I for all the time points t; input for 3D-TCN (Sec. IV-B1).
[Ṽ^t_{z_{i−1}}, Ṽ^t_{z_i}, Ṽ^t_{z_{i+1}}]: list of 2D+time volumes for slices z_I; input for 3D+time mJ-Net (Sec. IV-B2).
V^t_{z_I}: 4D tensor of slices z_I over all the time points t; input for 4D mJ-Net (Sec. IV-B3).

¹The code is publicly available at https://github.com/Biomedical-Data-Analysis-Laboratory/CTP-Matlab
All methods return a 3D output, segmenting the images P_{z_i} slice-by-slice. The segmented 2D image P_{z_i} corresponds to a brain slice z_i at index i. The predicted image P_{z_i} contains brain tissue segmented with the classes C (if any): healthy brain, penumbra, and core.

B. PRE-PROCESSING STEPS
The 4D CTP dataset underwent a series of pre-processing steps to extract brain tissue from the raw CTP scans. The raw CTP studies are saved as DICOM files. The pre-processing steps can be summarized as follows:
1) Co-registration of all the images in the 4D CTP scan using the first time point image as the frame of reference, in order to correct possible motion artifacts. An intensity-based image registration with a similarity transformation was used in this step.
2) All the registered CTP scans were encoded into HU values to obtain a known quantitative scale of radiodensity. The HU value for a voxel V, given the rescale slope (RS) and rescale intercept (RI) extracted from the DICOM header, is: V_HU(x, y, z, t) = V(x, y, z, t) · RS + RI.
3) Brain extraction of CT studies plays an essential role in stroke imaging research [53], [54]. The automatic brain extraction method designed by Najm et al. [54] was selected for this purpose due to its proven efficiency with CT datasets and its public availability.
4) Gamma correction (γ = 0.5) and histogram equalization were also performed after step 3.
5) Finally, z-score normalization is applied to the enhanced 4D tensor to normally distribute the data.
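A minimal NumPy sketch of steps 2, 4, and 5 above (HU encoding, gamma correction, z-score normalization). Histogram equalization is omitted, and the min-max scaling applied before gamma correction is an assumption, as the paper does not specify the intensity range used:

```python
import numpy as np

def preprocess_ctp(volume, rescale_slope, rescale_intercept, gamma=0.5):
    """Sketch of pre-processing steps 2, 4 and 5 (HU encoding, gamma
    correction, z-score). Co-registration and brain extraction (steps
    1 and 3) are assumed to have been applied already."""
    # Step 2: encode raw DICOM values into Hounsfield units.
    hu = volume * rescale_slope + rescale_intercept
    # Step 4: gamma correction on intensities scaled to [0, 1]
    # (the scaling range is an assumption for illustration).
    lo, hi = hu.min(), hu.max()
    corrected = ((hu - lo) / (hi - lo + 1e-8)) ** gamma
    # Step 5: z-score normalization of the enhanced 4D tensor.
    return (corrected - corrected.mean()) / (corrected.std() + 1e-8)
```

The rescale slope and intercept come from the DICOM header, as described in step 2.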
The input for all the methods (except for the Multi-input PMs approach) follows the same pre-processing steps. These steps were performed to improve the quality of the images by enhancing their contrast. An additional re-sampling step was performed on all images to ensure a uniform distribution in the temporal dimension. The effects of the pre-processing and re-sampling steps are examined in an ablation study in Sec. V-C.
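The temporal re-sampling step can be sketched as a per-voxel interpolation onto a uniform 1 s grid, using the acquisition protocol from Sec. II (twenty frames every 1 s, then ten every 2 s). The choice of linear interpolation is an assumption; the paper does not state the scheme used:

```python
import numpy as np

def resample_time(volume, acq_times, step=1.0):
    """Linearly resample the temporal dimension of an (X, Y, Z, T)
    tensor onto a uniform grid of one frame per `step` seconds."""
    new_times = np.arange(acq_times[0], acq_times[-1] + 1e-9, step)
    flat = volume.reshape(-1, volume.shape[-1])        # (voxels, T)
    out = np.empty((flat.shape[0], new_times.size))
    for i, voxel_curve in enumerate(flat):
        out[i] = np.interp(new_times, acq_times, voxel_curve)
    return out.reshape(volume.shape[:-1] + (new_times.size,))

# Acquisition grid from the protocol: 20 frames at 1 s, then 10 at 2 s.
times = np.concatenate([np.arange(20), 19 + 2 * np.arange(1, 11)])
```

With this grid, a 30-frame scan covering 0 to 39 s becomes a 40-frame scan sampled every second.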

C. CONVOLUTION IN MANY DIMENSIONS
A handful of DNN methods have been proposed to exploit 4D data with a full 4D Convolution (4D-Conv) layer. Gessert et al. [55] and Bengs et al. [56] proposed 4D spatio-temporal convolutional networks for optical coherence tomography-based force estimation. They demonstrated that using the full 4D data information yields better performance than using 3D data. Myronenko et al. [57] introduced a 4D CNN to segment cardiac volumetric sequences in CT scans, showing the advantages of their proposed architecture compared to a classic 3D CNN.
In the remainder of this manuscript, let I(x, y, z, t) ∈ R^4 and H(w, h, d, p) ∈ R^4 be a 4D tensor and a 4D kernel, respectively. The x and w indicate the width of the 4D structures; y and h express the height dimension; z and d define the depth dimension; while t and p represent the time dimension. Just as a 3D Convolution (3D-Conv) can be represented as the sum of multiple 2D Convolutions (2D-Conv) along the depth dimension, a 4D-Conv operation can be described as the sum of multiple 3D-Conv operations along the temporal dimension. A loop rearrangement that avoids repeated 3D-Conv operations allows a true (non-separable) 4D convolution operation [57].
Thereafter, let us define a 2D-Conv g″(x, y, z) with a 2D kernel H(w, h) ∈ R^2 and a 3D input I(x, y, z) ∈ R^3. Since the convolution operation is performed slice by slice over the third dimension, the 3D input can be seen as a list of 2D inputs I(x, y, z) = {I(x, y, z_m) | ∀m ∈ {1, . . ., z_max}}, where I(x, y, z_m) ∈ R^2 are the coordinates (x, y) at slice z_m:

g″(x, y, z_m) = Σ_w Σ_h H(w, h) · I(x − w, y − h, z_m)

Furthermore, let us define a 3D-Conv operation g‴(x, y, z) of a 3D kernel H(w, h, d) ∈ R^3 and a 3D input volume I(x, y, z) ∈ R^3 as:

g‴(x, y, z) = Σ_w Σ_h Σ_k H(w, h, k) · I(x − w, y − h, z + d̄ − k)

where d̄ ≡ ⌊(d − 1)/2⌋ is the half-depth of H. Fig. 1b displays an example of a 3D-Conv with a 3D input and 3D kernel. If we define the 3D input I(x, y, z) as before, then the output of a 3D-Conv g‴(x, y, z) can be rewritten as the sum of multiple 2D-Conv operations over the third dimension, each taken with the 2D kernel H(w, h, k):

g‴(x, y, z) = Σ_k g″(x, y, z + d̄ − k)

Moreover, let us define a 3D-Conv operation g‴(x, y, z, t) of a 3D kernel H(w, h, d) ∈ R^3 with a 4D input tensor I(x, y, z, t) ∈ R^4. The 4D tensor can be seen as a list I(x, y, z, t) = [I(x, y, z, t_n) | ∀n ∈ {1, . . ., t_max}], where each element in the list corresponds to the coordinates (x, y, z) at time point t_n. Then, the 3D-Conv operation can be seen as:

g‴(x, y, z, t_n) = Σ_w Σ_h Σ_k H(w, h, k) · I(x − w, y − h, z + d̄ − k, t_n)

If we instead use a 2D+time kernel H(w, h, p) ∈ R^3 and define I(x, y, z, t) as a list [I(x, y, z_m, t) | ∀m ∈ {1, . . ., z_max}], where each element in the list corresponds to the coordinates (x, y, t) of a slice z_m, then the 3D-Conv operation can be rewritten as:

g‴(x, y, z_m, t) = Σ_w Σ_h Σ_n H(w, h, n) · I(x − w, y − h, z_m, t + p̄ − n)

where p̄ ≡ ⌊(p − 1)/2⌋ is the temporal dimension of the kernel H halved.
A 4D-Conv g⁗(x, y, z, t) of a 4D input I(x, y, z, t) ∈ R^4 and a 4D kernel H(w, h, d, p) ∈ R^4 can be defined as:

g⁗(x, y, z, t) = Σ_w Σ_h Σ_k Σ_n H(w, h, k, n) · I(x − w, y − h, z + d̄ − k, t + p̄ − n)

Finally, in a similar way, we define the 4D-Conv g⁗(x, y, z, t) with a 4D kernel H(w, h, d, p) as the sum of multiple 3D-Conv operations over a specific dimension, i.e., the third dimension. The 4D kernel H(w, h, d, p) can be seen as a list {H(w, h, k, p) | ∀k ∈ {1, . . ., d}}, where each H(w, h, k, p) is a 2D+time volume of the kth slice over the entire p elements in the temporal dimension:

g⁗(x, y, z, t) = Σ_k g‴(x, y, z + d̄ − k, t)

where I(x, y, z + d̄ − k, t) ∈ R^3 is a 2D+time volume at slice z + d̄ − k over all t elements in the temporal dimension, and each g‴ is taken with the 2D+time kernel H(w, h, k, p). Fig. 2 gives a visual explanation of the proposed 4D-Conv layer: a series of 3D-Conv operations is calculated over a 4D input, and several groups (G_{i−1}, G_i, G_{i+1}) are generated, one for each volume involved in the operation. The input for our 4D-Conv layer is a 4D tensor V^t_{z_I}: the 2D+time volume of the ith brain slice Ṽ^t_{z_i} over all the time points t, and the two 2D+time volumes of the neighboring brain slices (Ṽ^t_{z_{i−1}} and Ṽ^t_{z_{i+1}}). The 4D-Conv layer uses a loop rearrangement with three 3D-Conv groups (G_{i−1}, G_i, G_{i+1}), one for each volume slice involved in the operation. In each group G_i, several 3D-Conv layers are created, and all convolution layers in a group share their weights. Each 3D-Conv layer is used for a single input volume. The number of layers depends on the legal subscript indexes; e.g., for the group G_{i−1}, there are two 3D-Conv layers since the legal subscript indexes are {i − 1, i}: the indexes are given by the current volume slice z_{i−1} and its only neighboring volume slice z_i. The output of each group is a set of feature volumes that are summed together. The resulting feature volumes are stacked together to keep the same dimension as the input.
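The decomposition of a 4D convolution into a sum of 3D convolutions over the temporal kernel dimension can be sketched densely in NumPy/SciPy. This is an illustrative reference implementation of the mathematical decomposition, not the grouped, weight-shared Keras layer the paper proposes:

```python
import numpy as np
from scipy.ndimage import convolve

def conv4d_as_sum_of_3d(inp, kernel):
    """4D convolution of an (X, Y, Z, T) input with a (w, h, d, p)
    kernel, expressed as a sum of 3D convolutions over the kernel's
    temporal dimension p (odd kernel sizes assumed, zero padding)."""
    p = kernel.shape[3]
    half_p = (p - 1) // 2
    t_len = inp.shape[3]
    out = np.zeros_like(inp, dtype=float)
    for n in range(p):                  # loop over temporal kernel taps
        shift = n - half_p
        for t in range(t_len):
            src = t - shift             # temporal index contributing to t
            if 0 <= src < t_len:
                # one 3D convolution per temporal tap, summed into out
                out[..., t] += convolve(inp[..., src], kernel[..., n],
                                        mode="constant")
    return out
```

With zero padding, this sum of 3D convolutions equals a direct dense 4D convolution of the same input and kernel.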

IV. EXISTING METHODS & PROPOSED 4D METHODS
In this paper, we present three novel deep learning (DL) approaches that accommodate 4D input data. In the remainder of the paper, they are called 3D-TCN (Sec. IV-B1), 3D+time mJ-Net (Sec. IV-B2), and 4D mJ-Net (Sec. IV-B3).
Together with the proposed approaches, we implemented and compared the 2D-TCN [41] (Sec. IV-A3), the 3D-TCN-SE [42] (Sec. IV-A4), and the mJ-Net [51] (Sec. IV-A2) to validate the performance of our models. We also compare the models with a method that uses a set of PMs as input [52]; in the remainder of the paper, we call this architecture Multi-input PMs (Sec. IV-A1). Fig. 3 visually compares the input utilized for the various approaches.

A. EXISTING METHODS
1) Approach 1: Multi-input PMs
The Multi-input PMs model was proposed in [52]. This architecture was used as a baseline because all the PMs were given as input to the model. The input for the architecture is a list of PMs for each brain slice z_i: PMs_{z_i}, as shown in Table 3. The loss function implemented is the Focal Tversky loss (FTL) [58]; for a specific class c, the FTL is defined as:

FTL_c = (1 − TI_c)^(1/γ)

where γ ≥ 1 is a hyper-parameter that forces the loss function to focus more on less accurate predictions that have been misclassified [58]. Denoting x_{i,c} ∈ [0, 1] as the probability of the ith predicted pixel belonging to class c, and y_{i,c} ∈ {0, 1} as pixel i having class c in the ground truth image, TI_c is the Tversky index (TI) for a class c, defined as:

TI_c = (Σ_i x_{i,c} y_{i,c} + ε) / (Σ_i x_{i,c} y_{i,c} + α Σ_i x̄_{i,c} y_{i,c} + β Σ_i x_{i,c} ȳ_{i,c} + ε)

where x̄_{i,c} = 1 − x_{i,c} is the probability that the ith pixel is not of class c, and ȳ_{i,c} = 1 − y_{i,c} represents the complement of pixel i in the ground truth image. The hyper-parameters α and β control the trade-off between precision and recall. We refer the reader to [52] for a more extensive explanation and discussion of this approach.
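A NumPy sketch of the FTL following the definition above, assuming one-hot ground truth; the hyper-parameter values shown are illustrative defaults, not the ones used in [52]:

```python
import numpy as np

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3,
                       gamma=4.0 / 3.0, eps=1e-7):
    """Focal Tversky loss summed over classes. y_true is one-hot with
    shape (pixels, classes); y_pred holds class probabilities."""
    loss = 0.0
    for c in range(y_true.shape[1]):
        x, y = y_pred[:, c], y_true[:, c]
        tp = np.sum(x * y)                 # true positives
        fn = np.sum((1.0 - x) * y)         # x_bar * y term, weighted by alpha
        fp = np.sum(x * (1.0 - y))         # x * y_bar term, weighted by beta
        ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
        loss += (1.0 - ti) ** (1.0 / gamma)
    return loss
```

A perfect prediction drives every Tversky index to one and the loss to zero, while γ > 1 emphasizes classes with a poor index.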

2) Approach 2: mJ-Net
The mJ-Net approach was proposed in [51]. As presented in Table 3, the input Ṽ^t_{z_i}(x, y) for mJ-Net is a 2D+time volume of the same brain slice z_i at index i over all the time points t. We define the dimension of this input as 2D+time; the first dimension of the input is time.
The loss function used for this method is the soft Dice Coefficient loss (SDCL) [59]. The SDCL is a modified version of the Dice Coefficient score, mainly used in medical domains where the classes to predict are highly unbalanced due to a small region of interest compared to the background of the scans. The SDCL can be written as:

SDCL = 1 − (2 Σ_i x_i y_i + ε) / (Σ_i x_i + Σ_i y_i + ε)

where x_i is the ith predicted pixel and y_i the corresponding ground truth pixel. The first section of the model contains 3D-Conv layers to extract information from the temporal dimension, while the second part follows the classic U-Net structure [24]. For more details about the mJ-Net approach, we refer the reader to [51].
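A NumPy sketch of the SDCL matching the formula above, assuming one-hot ground truth and an average over classes (the exact reduction over classes is an assumption, as [59] admits several variants):

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, eps=1e-7):
    """Soft Dice Coefficient loss: one minus the soft Dice score
    between a probability map and one-hot ground truth, averaged
    over classes. Inputs have shape (pixels, classes)."""
    loss = 0.0
    n_classes = y_true.shape[1]
    for c in range(n_classes):
        x, y = y_pred[:, c], y_true[:, c]
        dice = (2.0 * np.sum(x * y) + eps) / (np.sum(x) + np.sum(y) + eps)
        loss += 1.0 - dice
    return loss / n_classes
```

Because the soft Dice score is computed on probabilities rather than hard labels, the loss stays differentiable and is robust to the class imbalance described above.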
3) Approach 3: 2D Temporal Convolutional Network
For comparison reasons, we implemented the method proposed by Amador et al. [41]. We call this architecture 2D-TCN in the remainder of the paper.

FIGURE 3: Visual comparison of the input for each implemented approach. Every 4D CTP patient's study V ∈ R^(X×Y×Z×T) undergoes a series of pre-processing steps to enhance each CTP scan. Approach 1 (Multi-input PMs) [52] accepts a list of PMs generated from a CTP study as input. Approach 2 (mJ-Net) [51] uses a 2D+time volume Ṽ^t_{z_i}(x, y) as input. Approach 3 (2D-TCN) and Approach 4 (3D-TCN-SE) follow the models proposed by Amador et al. [41] and Amador et al. [42], respectively. The proposed Approach 5 (3D-TCN), Approach 6 (3D+time mJ-Net), and Approach 7 (4D mJ-Net) take as input the entire 4D CTP processed data Ṽ.

A TCN has proven competitive with conventional neural networks in different tasks [60]. Moreover, a TCN has a lower memory requirement for training than Recurrent Neural Networks [60]. The 2D-TCN was trained with the exact implementation of the original work (see Table 3 for more details). The 2D-TCN model receives as input the 4D CTP scans re-sampled to 1 second per time point. The 4D input is processed as a list of 2D brain slices z_i for each time point t. Thus, the actual input for the 2D-TCN is the list Ṽ^t_{z_i}, as mentioned in Table 3. The list Ṽ^t_{z_i} contains all the time points of the brain slice z_i. Every 2D input image Ĩ^{t_j}_{z_i} of the list at time point t_j is fed to a 2D encoder. For more details about the 2D-TCN approach, we refer the reader to [41].

4) Approach 4: 3D-TCN-SE
We implemented a similar method from Amador et al. [42].
We call this approach 3D-TCN-SE due to its use of a single encoder (SE). Fig. 4b shows a simplified version of the architecture, emphasizing the input difference between the 2D-TCN and this model. The 3D-TCN-SE model receives the 4D CTP scans re-sampled to 1 second per time point. The input is a list of t 3D volumes Ṽ^t_{z_I} (Table 3). Each 3D input volume V^{t_j}_{z_I}(x, y) in the list corresponds to the concatenation of the ith brain slice z_i plus its neighbouring slices z_{i−1} and z_{i+1} at a specific time point t_j. The 3D-TCN-SE approach uses a single encoder E_{3D-TCN} for all the elements in the input list. It is worth mentioning that, for comparison reasons, the 3D-TCN-SE model is trained with the entire brain images and not with just the ipsilateral hemisphere as in the original paper [42].
The output is a 2D image P_{z_i}(x, y). The first max-pooling layer of each block in the convolution section has a pool size of (2,1,1) to reduce the first dimension by a factor of 2. The second max-pooling layer uses a pool size of (3,1,1), while the third has a pool size of (5,1,1); the selection of these sizes is due to the time dimension. The convolution layers use a kernel of dimension 3 and a Leaky ReLU activation function, and Attention layers are included in the decoder. The 2D Upsampling layers have an upsampling factor of 2. The last convolution layer has a kernel of 1 and a Softmax activation function to produce a probability score for every class.

B. PROPOSED 4D METHODS
All the proposed methods² use the entire 4D CTP scan as input; the main difference lies in how the 4D input is processed. The 3D-TCN is based on the 2D-TCN [41], modified to receive a list of 3D input volumes. The 3D+time mJ-Net takes as input a list of 2D+time brain volumes from a CTP dataset, while the 4D mJ-Net uses the entire 4D structure of a CTP dataset as input. Fig. 5 compares these architectures with their respective inputs.

1) Approach 5: 3D Temporal Convolutional Network
We extend the architecture proposed by Amador et al. [41] for our application to further exploit the information in the depth dimension. In the remainder of the paper, we call our architecture 3D-TCN. The main differences between the proposed 3D-TCN and the 3D-TCN-SE (Sec. IV-A4) lie in the usage of a 3D encoder for each input element, instead of a single 3D encoder, plus the possibility to segment both core and penumbra regions rather than only the core areas. The 4D CTP scans for the 3D-TCN are all re-sampled to 1 second per time point. As described in Sec. IV-A3, the 3D-TCN architecture feeds each element of the input list V^{t_j}_{z_I} at time point t_j to a specific 3D encoder E^{t_j}_{3D-TCN} to extract low-level features. Each encoder returns a (4×4×C) feature vector, where C corresponds to the number of channels. The feature vectors are merged to create a single input E^{TOT}_{3D-TCN} = [E^{t_j}_{3D-TCN} | ∀t_j ∈ t]. The E^{TOT}_{3D-TCN} is used in the TCN, which generates a one-dimensional vector O_{3D-TCN} of 64 elements. The TCN's output O_{3D-TCN} is then given as input to the decoder to create the final predicted 2D image P_{z_i}(x, y) of a brain slice z_i at index i.

²The code is publicly available at https://github.com/Biomedical-Data-Analysis-Laboratory/4D-mJ-Net

2) Approach 6: 3D+time mJ-Net
We propose a model called 3D+time mJ-Net, an extension of the work of Tomasetti et al. [51]. The proposed model takes as input a list of 2D+time matrices; thus, the dimension of this input can be defined as 3D+time. The input and output are presented in Table 3, whereas a visual example of the input for the model is given in Fig. 3. Each element of the input list coincides with a possible input for the mJ-Net (details in Sec. IV-A2). The input consists of a list of 2D+time volumes: the concatenation of a 2D+time volume Ṽ^t_{z_i} of a brain slice z_i at index i over all the time points t plus its neighbouring brain slice volumes (Ṽ^t_{z_{i−1}}, Ṽ^t_{z_{i+1}}). Two Monte Carlo dropout layers [61] are added at the end of the 4D and 2D blocks, with the rate set to 50%; these layers were added to reduce uncertainties in the final predictions. The last convolution layer has a kernel of (1×1) and a Softmax activation function to produce a probability score for every class.
The input list thus contains the ith slice z_i under analysis and its neighboring slices z_{i−1} and z_{i+1}. In case the index i corresponds to the first (or last) brain slice, Ṽ^t_{z_{i−1}} (or, equivalently, Ṽ^t_{z_{i+1}}) is set equal to Ṽ^t_{z_i}. Every 2D+time volume Ṽ^t_{z_i}(x, y) from the input list is processed separately in the model through a series of encoders (E_{z_{i−1}}, E_{z_i}, E_{z_{i+1}}) composed of 3D-Conv and 2D-Conv layers. Each section is independent of the others: the convolution layers have no shared weights.
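The edge-replication rule for the first and last brain slice can be sketched as follows; the function and variable names are illustrative, not taken from the paper's code:

```python
import numpy as np

def neighbour_volumes(ctp, i):
    """Build the input [V_{z_{i-1}}, V_{z_i}, V_{z_{i+1}}] for brain
    slice i from a pre-processed (X, Y, Z, T) tensor. At the first and
    last slice, the missing neighbour is replaced by the slice itself."""
    z_max = ctp.shape[2]
    prev_i = max(i - 1, 0)           # clamp below: replicate first slice
    next_i = min(i + 1, z_max - 1)   # clamp above: replicate last slice
    return [ctp[:, :, prev_i, :], ctp[:, :, i, :], ctp[:, :, next_i, :]]
```

Each returned element is a 2D+time volume, matching the three encoder paths described above.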
Fig. 6 illustrates the model's architecture. Attention layers [62] and 2D Upsampling layers were implemented in the decoder section. Attention layers benefit the architecture by focusing on target structures and help increase the segmentation performance. With the sole exception of the last convolution layer, each convolution layer uses a kernel of dimension 3 and a Leaky ReLU activation function [63] with α = 1/3.

3) Approach 7: 4D mJ-Net
We propose another model called 4D mJ-Net. We introduce this method to avoid the three separate paths used to process the 4D data in the previous architecture (Sec. IV-B2). We still use a sliding-window technique over the depth dimension, using three consecutive brain slices at a time, to limit the amount of input data fed to the model. Like the 3D+time mJ-Net, this approach is also an extension of the work of Tomasetti et al. [51]. Information on this approach is given in Table 3. The 4D input tensor V^t_{z_I} contains both the time dimension and the neighboring slices of the ith brain slice: it is the concatenation of a 2D+time volume Ṽ^t_{z_i} of a brain slice z_i at index i over all the time points t together with its neighbouring 2D+time volumes Ṽ^t_{z_{i−1}}, Ṽ^t_{z_{i+1}}. This model can be considered an early-fusion approach since the tensors Ṽ^t_{z_{i−1}}, Ṽ^t_{z_i}, Ṽ^t_{z_{i+1}} are concatenated before being fed to the model's encoder.
The proposed 4D mJ-Net model is a combination of both the 3D+time mJ-Net (Sec. IV-B2) and the mJ-Net (Sec. IV-A2). The proposed approach uses the same input type as the 3D+time mJ-Net. However, rather than a list of 2D+time volumes, the model concatenates the input into a single 4D tensor of dimensions (X × Y × Z × T). Unlike 1D, 2D, and 3D convolution layers, 4D convolution layers are not available in public DL frameworks (e.g., Keras or PyTorch). Thus, for this model, we implemented a novel 4D-Conv layer (details in Sec. III-C), which uses the convolution layers defined in the public DL frameworks to replicate a 4D convolution operation.
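The idea of replicating a 4D convolution with framework-provided lower-dimensional operations can be illustrated with a minimal NumPy/SciPy sketch. This is not the paper's Keras implementation, only a demonstration of the underlying decomposition: the kernel spanning (X, Y, Z, T) is split along the temporal axis, each temporal slice is applied as a 3D operation, and the partial results are summed (cross-correlation convention, as used by DL frameworks):

```python
import numpy as np
from scipy.signal import correlate

def conv4d_via_3d(x, k):
    """4D cross-correlation (valid mode) built from 3D operations.
    x: input of shape (X, Y, Z, T); k: kernel of shape (kx, ky, kz, kt).
    The kernel is split along its temporal axis; each temporal slice is
    applied as a 3D correlation and the contributions are summed."""
    kx, ky, kz, kt = k.shape
    X, Y, Z, T = x.shape
    out_t = T - kt + 1
    out = np.zeros((X - kx + 1, Y - ky + 1, Z - kz + 1, out_t))
    for tau in range(kt):
        for t in range(out_t):
            # 3D correlation of the (t + tau)-th time frame with the
            # tau-th temporal slice of the kernel
            out[..., t] += correlate(x[..., t + tau], k[..., tau], mode='valid')
    return out
```

The result matches a direct 4D correlation, which is what allows a 4D-Conv layer to be assembled from the 3D-Conv primitives available in public frameworks.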
The architecture of the 4D mJ-Net is displayed in Fig. 7. No Attention layers [62] were included in this model due to a considerable performance decline. The output of the 4D-Conv layers is a tensor where the temporal dimension has been squeezed and reduced; information is extracted from the temporal dimension. Thus, the output of the 4D-Conv layers contains only three dimensions (X × Y × Z) plus the channel dimension. 3D-Conv layers are then used to reduce the depth dimension Z and produce a 2D output (X × Y) plus the channel dimension.
A weighted categorical cross-entropy (WCC) loss [64] was the loss function implemented for this method. The loss can be written as: WCC(x, y) = − Σ_i Σ_{c∈C} w_{i,c} · y_{i,c} · log(x_{i,c}), where w_{i,c} corresponds to the weight of the ith pixel for a class c ∈ C; x_{i,c} is the ith predicted pixel, and y_{i,c} is the corresponding ground truth pixel.
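The WCC loss above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' Keras code; the flattened (N, C) layout and mean-over-pixels normalization are assumptions:

```python
import numpy as np

def weighted_categorical_crossentropy(y_true, y_pred, weights, eps=1e-7):
    """Weighted categorical cross-entropy.
    y_true: one-hot ground truth, shape (N, C)
    y_pred: softmax outputs, shape (N, C)
    weights: per-class (C,) or per-pixel-per-class (N, C) penalties
    Returns the mean weighted cross-entropy over the N pixels."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(weights * y_true * np.log(y_pred)) / y_true.shape[0]
```

Class weights like these are also what allow a higher misclassification penalty for specific sub-groups during training.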

C. IMPLEMENTATION DETAILS
Table 3 provides information about all the methods. All the methods mentioned in Sec. IV-A and Sec. IV-B utilize Adam as the optimizer [65] with a learning rate of 0.0003 and a step-based decay rate of 0.95 every ten epochs. The batch size is set to 2. An early stopping function is called if there is no decrement in the validation loss after 25 epochs. During training, L1 and L2 regularizations are applied to the kernels, plus a max-norm constraint is used on the kernel and bias weights. All experiments were implemented in Python using Keras (2.3.1) with TensorFlow as the backend and trained using an NVIDIA Tesla V100 GPU (32 GB memory).
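The step-based decay schedule described above (learning rate multiplied by 0.95 every ten epochs, starting from 0.0003) corresponds to a function like the following sketch; the function name is ours, and in Keras such a function would typically be wrapped in a `LearningRateScheduler` callback:

```python
def step_decay(epoch, initial_lr=3e-4, drop=0.95, epochs_per_drop=10):
    """Return the learning rate for a given epoch under step-based decay:
    the rate is multiplied by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)
```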

V. EXPERIMENTS & RESULTS
We assess the proposed methods on a local dataset of CTP scans from 152 patients (Sec. II). All experiments are performed with the same training set and evaluated over the validation set (details in Table 1). The test set is used only to make predictions with the best models on CTP scans that the methods have not seen before. Since the Non-LVO group has smaller ischemic areas than the LVO patients, we set a higher penalty for every misclassification of the penumbra and core classes for this sub-group during training.

A. EVALUATION METRICS
Three evaluation metrics are used to assess the various experiments' models: the Dice Coefficient (DC), the Hausdorff Distance (HD) [66], and the absolute difference in the volumes (∆V). We employ the DC to compare the model predictions with the ground truth segmentations. The DC between two segmentations x and y is given by: DC(x, y) = 2|x ∩ y| / (|x| + |y|), where the range for the DC is [0, 1]; thus, DC(x, y) = 1 corresponds to a perfect match between the prediction x and the ground truth y.
The HD measures how distant two subsets (A, B) are from each other, and it is formulated as: HD(A, B) = max{h(A, B), h(B, A)}, where h(A, B) = max_{a∈A} min_{b∈B} ||a − b||. The range of the HD is [0, ∞).
The absolute difference in the volumes ∆V between the prediction volume V_x and the ground truth volume V_y can be expressed as: ∆V(V_y, V_x) = |V_y − V_x|. The range for ∆V is [0, ∞), and ∆V(V_y, V_x) = 0 represents a perfect match between the two volumes. ∆V is an essential evaluation metric for the WIS group due to the lack of ground truth segmentations in this group: the other metrics are not informative here, since the ground truth is always empty.
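The three metrics can be computed with standard tools; the sketch below is an illustrative NumPy/SciPy version, not the authors' evaluation code. Note that ∆V is expressed here as an absolute voxel-count difference; converting to millilitres would require the scan's voxel size, which we omit:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(x, y):
    """Dice Coefficient between two binary masks: 2|x ∩ y| / (|x| + |y|)."""
    inter = np.logical_and(x, y).sum()
    return 2.0 * inter / (x.sum() + y.sum())

def hausdorff(a_pts, b_pts):
    """Symmetric Hausdorff distance HD(A, B) = max{h(A, B), h(B, A)}
    between two point sets given as (n_points, n_dims) arrays."""
    return max(directed_hausdorff(a_pts, b_pts)[0],
               directed_hausdorff(b_pts, a_pts)[0])

def delta_v(v_x, v_y):
    """Absolute difference in volume (here: voxel counts) between
    a predicted mask and a ground truth mask."""
    return abs(int(v_x.sum()) - int(v_y.sum()))
```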
The best scenario for a model is to produce a high DC with low HD and ∆V: this implies a strong correlation between the predicted areas and the ground truth regions. If the results show a high ∆V (or HD) combined with a low DC, the ischemic areas are over-segmented. On the other hand, a promising ∆V (or HD) combined with a mediocre DC implies an under-segmentation of the predicted regions.
Table 4 presents the evaluation metrics' results over the validation set. Results are presented for each group distinctly (LVO, Non-LVO, WIS, and All) to highlight the strengths and weaknesses of each model over the various groups composing the dataset. An extensive number of experiments were performed for all the analyzed models. However, to present a fair comparison among the various models, we only report the methods with the combination of parameters that yields the best results, omitting the other combinations tested during the experiments. Qualitative comparison results of random brain slices extracted from the validation set are provided in Fig. 8.

C. ABLATION STUDY
To demonstrate the effects of the pre-processing steps (Sec. III-B), we conduct an ablation study on the 4D mJ-Net architecture. Moreover, we re-sampled the CTP scans to handle the irregular temporal dimension and studied the effect of using re-sampled scans during the model's training. Different CT scan vendors have different imaging acquisition protocols; thus, re-sampling the scans to a fixed time-sampling rate is a reasonable step to increase versatility and usability across hospitals. DC results are presented in Table 5, showing the performance of the network for all the groups trained with datasets using different types of pre-processing steps and re-sampled scans. The study aims to systematically analyze the contribution of each pre-processing step toward improving the overall results. We begin by defining a baseline configuration consisting of the raw input images without pre-processing (first row in Table 5). Subsequently, we incrementally introduce and evaluate individual pre-processing steps, such as histogram equalization (HE), gamma correction (γ), and z-score (z).

TABLE 4: Experiment results for the validation set. Values in bold exhibit the best results for each column and class. Mean results plus standard deviation for Dice Coefficient (DC), Hausdorff Distance (HD), and ∆V are presented. Results are for the penumbra and core areas divided by the distinct patient groups (LVO, Non-LVO, WIS, and All). Note that for the DC, higher values are better (⇑), while for HD and ∆V, lower values are preferable (⇓).
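A minimal sketch of the three pre-processing steps named above (HE, γ, z) on an intensity image scaled to [0, 1]. The exact parameters, implementations, and ordering used by the authors are not reproduced here, so treat this only as an illustration:

```python
import numpy as np

def zscore(img, eps=1e-8):
    """z-score normalization: zero mean, unit standard deviation."""
    return (img - img.mean()) / (img.std() + eps)

def gamma_correct(img, gamma=0.5):
    """Gamma correction on a [0, 1] image; the gamma value is illustrative."""
    return img ** gamma

def hist_equalize(img, bins=256):
    """Simple histogram equalization on a [0, 1] image: map each
    intensity through the normalized cumulative histogram (CDF)."""
    hist, edges = np.histogram(img.ravel(), bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(float)
    cdf /= cdf[-1]
    return np.interp(img.ravel(), edges[:-1], cdf).reshape(img.shape)
```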

D. INTER-OBSERVER VARIABILITY
Two expert neuroradiologists (NR_1, NR_2) manually annotated the scans of 33 randomly selected patients: 19 from the LVO group, 11 from the Non-LVO group, and 3 from the WIS subset. The manual annotations were generated using the same criteria endorsed for creating the ground truth images, as explained in Sec. II-A. An investigation of the inter-observer variability between NR_1, NR_2, and the two best-proposed models is presented in Table 6.

VI. DISCUSSION
Early detection and intervention in AIS patients are of vital importance [67]-[69]. In this study, we have proposed different architectures that utilize the 4D CTP input to exploit the spatio-temporal information better than existing approaches. We expand the mJ-Net and show two ways of segmenting ischemic areas in patients suspected of AIS. In addition, we expand another method (3D-TCN) for comparison purposes. We use the entire raw 4D CTP data and feed different combinations as input to our proposed approaches to prevent possible loss of spatio-temporal information. Studying the data as an independent volume and neglecting its spatio-temporal nature can lead to the loss of relevant information. All proposed approaches return a series of 2D segmented images as output, later stacked together to produce a 3D volume. Returning a list of 2D images is less computationally expensive and less memory-intensive than directly returning 3D volumetric data [70].
Few studies have adopted 4D datasets in DNN models to detect ischemic lesions in patients affected by a stroke [7], [41]-[43]. This is rooted in the high computational complexity of 4D data and the lack of ground truth for the whole set. The limitations that these approaches encounter are as follows: 1) the datasets used for training and evaluation take into account only a subset of the entire population; 2) segmentations are only performed on the core areas, excluding penumbra regions; 3) ground truth images derived from follow-up DWI or NCCT present some limitations [45], [46].
To our knowledge, this is the first study using 4D CTP data to segment both the ischemic regions, penumbra and core. Additionally, we include data from all patients, regardless of stroke severity, to train our models. Rather than relying on ground truth images from follow-up DWI or NCCT studies, which are usually acquired 24 hours or several days after the onset of stroke, our proposed methods were trained with ground truth images obtained from the CTP captured at admission, PMs, and follow-up scans (Sec. II-A).

TABLE 6: Inter-observer variability results for the test set. Values are presented for the two best-proposed architectures (3D+time mJ-Net, 4D mJ-Net) in relation to manual annotations generated separately by two expert neuroradiologists (NR_1, NR_2) over the test set. An investigation of the inter-observer variability between NR_1 and NR_2 is performed (last row of each class). Note that for the DC, higher values are better (⇑), while for HD and ∆V, lower values are preferable (⇓).

We use three evaluation metrics to assess the models' performance: DC, HD, and ∆V, compared with our previously developed algorithms and other state-of-the-art algorithms. Results in Table 4 demonstrate that increasing the input dimension helps achieve more precise segmentation, especially for the Non-LVO and WIS groups, regardless of the class. Thus, when a smaller portion of the brain is affected, using the whole dataset helps achieve better segmentation results. The ablation study (Table 5) shows how including the pre-processing steps and not re-sampling the CTP scans helped improve the overall segmentation performance. It is worth mentioning that using the pre-processing steps together with re-sampled CTP scans yields the second-best overall results (last row in Table 5), establishing the validity of the pre-processing sequence.
Visual results of random validation brain slices are shown in Fig. 8, where we can see that our proposed approaches (3D+time mJ-Net, 4D mJ-Net) are less prone to over-segment, especially in the Non-LVO and WIS groups. It is reported that LVO cases are less common than Non-LVO cases; on average, LVOs are estimated to represent around 30% of all AIS cases [71]. Thus, a neural network that can accurately segment patients in the Non-LVO group can be valuable in a real-life scenario. Nonetheless, patients with LVO represent a clinically significant proportion of patients presenting with AIS, especially considering the grim natural course of the disease.
The results presented in Table 4 indicate that all mJ-Net models improved as the input data dimension increased, regardless of the patients' group. Fig. 8 shows that using 2D+time input for the mJ-Net [51] led to over-segmentation of the penumbra class in separate brain tissue sections (brain slices 4-6). The visual results for the Non-LVO and WIS groups highlight the limitations of this model: the over-segmentation of the penumbra regions might affect its usage in a real-life scenario, and an overestimation of the penumbra area can generate uncertainties for treatment decisions.
Adding depth as an extra dimension to the input of the models (3D+time mJ-Net and 4D mJ-Net) increases performance for both classes in the three patient groups. A significant increase is noticeable for the DC metric in the Non-LVO group, regardless of the class. An essential difference between these two architectures is how they exploit the input in their structures. The 3D+time mJ-Net is considered a late-fusion approach, as the data sources are processed independently and fused close to decision-making. Statistical results presented for the 3D+time mJ-Net show promising general performance for the LVO group. However, an underestimation of the core class, regardless of the patient group, can be noticed from the visual results in Fig. 8 and the low HD metric. Nevertheless, the 3D+time mJ-Net achieved the best HD for the core class in all the groups. The 3D+time mJ-Net can precisely segment ischemic regions with large areas, as shown by the first three slices in Fig. 8. This is also reflected in the high DC score achieved for the LVO group for both classes (Table 4).
The 4D mJ-Net network has learned to precisely segment the ischemic regions even without re-sampling the CTP scans, as shown in Table 5. The model fuses the data before they are fed to the network. Visual results in Fig. 8 and values in Table 4 indicate that the 4D mJ-Net model segments the penumbra class more precisely than the other approaches that use raw CTP as input. This promising performance holds in all patient groups. The 4D mJ-Net achieved the highest DC metric for core and penumbra regions in patients with Non-LVO. This approach gives the best HD for penumbra in all the groups. The 4D mJ-Net showed high precision in detecting small ischemic areas, as shown in sample brain slices 3 to 5 in Fig. 8. Furthermore, the 4D mJ-Net model can correctly predict no ischemic regions in WIS patients, as demonstrated by the results for the ∆V in the WIS group. However, it over-segments the core class in patients with LVO. This means that including the complete spatio-temporal information of the data and following an early-fusion approach leads to better predictions in the Non-LVO and WIS groups, where small areas are of interest. Models based on TCN, in general, showed poor results, statistically in Table 4 and visually in Fig. 8. They extremely over-segment the penumbra class and poorly segment the core class. The original 2D-TCN and 3D-TCN-SE were designed to segment only one class, the ischemic core, which can explain their poor performance when segmenting two classes. Besides, in Amador et al. [42] (3D-TCN-SE), the model was trained using only the ipsilateral hemisphere. For a fair comparison, the model's training was done over both hemispheres, which can cause over-segmentation in penumbra regions.
As the name indicates, the Multi-input PMs model [52] takes parametric maps and pre-processed data obtained from CTP scans. The experimental results of this model show a high DC value for the penumbra class in the LVO group, as also seen in the first three brain slices of Fig. 8. This highlights that the method presents satisfactory results for large ischemic areas. However, when the region's volume is small or vacant, the predictions are not optimal: see brain slices 4 and 6 in Fig. 8. Although the HD and ∆V are encouraging, the DC values show under-segmentation of the core and penumbra classes for the Non-LVO set. Using PMs derived from CTP scans limits the model to learning only from specific pre-processed information.
The inter-observer variability results, highlighted in Table 6, show promising outcomes for the proposed methods compared with the results achieved by the two expert neuroradiologists (NR_1, NR_2). Similar statistical values can be observed between NR_1 vs. NR_2 and the 4D mJ-Net for the penumbra class in the LVO group. The 4D mJ-Net achieved the same ∆V results as the neuroradiologists in the WIS group. The proposed 3D+time mJ-Net model produces higher DC results than the 4D mJ-Net for the penumbra class; however, its results for the core regions are less satisfactory. The inter-observer variability outcomes for the HD and ∆V highlight substantial similarity between the proposed approaches and the neuroradiologists, except for the core class in the LVO group. The difference can be due to an over-segmentation of this particular class, which can be highly complex for the models to detect considering its small size.

A. COMMON LIMITATIONS
All the assessed approaches face some common limitations. The images used during the training of each model are from CT scanners of the same vendor, which causes a lack of diversity in the data. The annotations used as ground truth surround the essential ischemic regions (penumbra and core) but do not represent the areas perfectly: they might leave out small parts of the core spread into the penumbra tissue, and details of the penumbra can be misclassified as healthy brain tissue [51].

VII. CONCLUSIONS
Fast and precise diagnosis and treatment are of vital importance for AIS patients. In this paper, we proposed to use 4D CTP as input to extract spatio-temporal information for segmenting core and penumbra areas in patients with AIS. This is presented primarily by expanding the mJ-Net in two ways. Furthermore, we introduced a novel 4D-Conv layer to exploit spatio-temporal information. Two of our approaches (3D+time mJ-Net and 4D mJ-Net) achieved promising results for all the classes involved. The 3D+time mJ-Net can precisely delineate large ischemic areas, while underestimating the core class. Our best network (4D mJ-Net) correctly segments penumbra regions, regardless of patient group, with a 0.53 DC score on average. However, with an average DC score of 0.23, it overestimates the core class for the LVO group.
We used the entire 4D CTP dataset of all patients and compared models using different input types. We demonstrated that relying only on images derived from the CTP scans (i.e., PMs) or on a restricted number of dimensions (i.e., 2D, 2D+time, 3D) limits the prediction accuracy of DNN-based approaches. Moreover, we segmented both penumbra and core regions in ischemic brain tissue, since an accurate and fast understanding of both is essential for fast treatment decisions in AIS.
Further studies with larger datasets, including images from different vendors and various acquisition parameters, are still needed to validate our methods. The ISLES18 dataset [72] can be used in future work; however, this dataset uses FIAs as ground truth labels, which is not in the scope of the aforementioned architectures, so some changes must be implemented to validate the methods on it. Due to the complex and time-consuming work of manual annotation, further work on optimizing the segmentation using unsupervised neural networks is encouraged.
Fig. 1a presents a visual representation of the 2D-Conv g''(x, y, z) with a 3D input and a 2D kernel; Fig. 1b shows a 3D-Conv with a 3D input and a 3D kernel.

FIGURE 1 :
FIGURE 1: Visual examples of (a) 2D-Conv and (b) 3D-Conv. Both examples utilize a 3D input, and no padding is applied. Each box corresponds to a pixel value.

FIGURE 2 :
FIGURE 2: Visual representation of a 4D-Conv layer. The input is V^t_{z_I} = φ(V^t_{z_{i-1}}, V^t_{z_i}, V^t_{z_{i+1}}) ∈ R^(X×Y×Z×T). A series of 3D-Conv operations are calculated over a 4D input. Several groups (G_{i-1}, G_i, G_{i+1}) are generated, one for each volume involved in the operation.
Fig. 4a provides a general architecture overview. They proposed a Temporal Convolutional Network (TCN), which has been shown to outperform the 2D-TCN architecture proposed by Amador et al. [41]. The 3D-TCN-SE was proposed by Amador et al. [42].
Each time point t_j is processed by an encoder E^{t_j}_{2D-TCN} to extract features in the latent space. Each E^{t_j}_{2D-TCN} encoder returns a (4 × 4 × Ch) feature vector, where Ch corresponds to the number of channels. The architecture merges the low-level feature vectors across the different t_j time points to capture the spatio-temporal information. The merged feature vector E^{TOT}_{2D-TCN} = [E^{t_j}_{2D-TCN} | ∀t_j ∈ t] is used as input to the TCN, which yields a one-dimensional vector O_{2D-TCN} of 64 elements. Finally, a decoder takes O_{2D-TCN} and generates a final 2D image P_{z_i}(x, y). The Dice Coefficient loss (DCL), the same as in the original paper, was implemented as the loss function: DCL(x, y) = Σ_{c∈C} (1 − 2|x_c ∩ y_c| / (|x_c| + |y_c|)).
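The multi-class DCL can be written as a short NumPy sketch. This uses a soft-overlap formulation on flattened (N, C) arrays and an additive smoothing term `eps`, which are our assumptions, not the authors' exact implementation:

```python
import numpy as np

def dice_loss(y_pred, y_true, eps=1e-7):
    """Multi-class Dice Coefficient loss: sum over classes c of
    (1 - DC_c), where DC_c = 2|x_c ∩ y_c| / (|x_c| + |y_c|).
    y_pred, y_true: arrays of shape (N, C)."""
    loss = 0.0
    for c in range(y_true.shape[1]):
        inter = (y_pred[:, c] * y_true[:, c]).sum()
        loss += 1.0 - 2.0 * inter / (y_pred[:, c].sum() + y_true[:, c].sum() + eps)
    return loss
```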

FIGURE 7 :
FIGURE 7: Illustration of the 4D mJ-Net architecture. The 4D input V^t_{z_I} = φ(V^t_{z_{i-1}}, V^t_{z_i}, V^t_{z_{i+1}}) is the concatenation of a 2D+time volume V^t_{z_i} of a brain slice z_i at index i over all the time points t plus its neighbouring brain slice volumes (V^t_{z_{i-1}}, V^t_{z_{i+1}}). Two Monte Carlo dropout layers [61] with a rate of 50% are added at the end of the 4D and 2D blocks to reduce uncertainties in the final predictions. The last convolution layer has a (1×1) kernel and a Softmax activation function to produce a probability score for every class.

FIGURE 8 :
FIGURE 8: Qualitative comparisons for the tested models.The brain slices are taken randomly from distinct patients from the validation set, divided by group (left to right).Each row represents the results for each model involved in the study.

TABLE 1 :
Division in training, validation, and test dataset.
z_I is a set of brain slices containing the ith brain slice z_i and its neighboring slices.

TABLE 3 :
Summary of the approaches.

TABLE 5 :
Ablation study for the 4D mJ-Net model showing how various pre-processing steps (HE, γ, z) and re-sampling (⊎) affect the Dice Coefficient (DC) for the validation results. Penumbra and core DC scores are shown for all the classes together. Note that for the DC, higher values are better (⇑).