Modeling of Electronic Health Records for Time-Variant Event Learning Beyond Bio-Markers—A Case Study in Prostate Cancer

Electronic health records (EHR) of large populations constitute a vast untapped resource for data-driven diagnosis and disease progression. We develop a model capable of predicting future steps in a patient's journey for prostate cancer (PC) and its metastases without relying on direct biomarker measurements, using a set of $18\,529$ EHR. To this end, we 1) harmonise EHR without presumptions: events are sorted and grouped by fundamental a priori principles; 2) develop a new Long-Short-Term Memory (LSTM) recurrent neural network node for learning temporal relations, on which we build an autoencoder based model; 3) derive a graph representation based on unsupervised $k$-means clustering of events related to PC in the autoencoder's latent layer. We report $88\,\%$ prediction accuracy for the targeted metastasis-related events, and lower accuracies for more general events. The model gains interpretability with a graph representation illustrating the patient journey. Most importantly, we predict, one visit ahead of time, that $20\,\%$ of all PC diagnosed patients will progress into metastatic disease. For the remaining patients we can predict the next step in their journey. We conclude that the model based on the new LSTM node provides a valuable tool for earlier diagnosis of life-threatening metastases and quality assurance of the procedure.


I. INTRODUCTION
Prostate cancer (PC) is the second most common cancer in men with 19 % prevalence, over 4 500 diagnoses in Denmark in 2019, and an absolute number of 1.3 M diagnoses worldwide. In 2019, PC was also the second most frequent cause of cancer related death, with a lifetime risk of 15 % and accounting for 4.7 % of all registered causes of death for men in Denmark [1], [2]. Although these numbers underline the risk related to PC, PC patient screening is under-prioritised. Guidelines in Denmark [3], [4] recommend neither systematic nor opportunistic early screening, given that PC seldom manifests itself before the age of 50, and half of men at the age of 60 will be diagnosed with a clinically insignificant PC. However, autopsy studies have shown that PC can be detected significantly earlier than that [5]. Looking back over the last decades, only a minor reduction in PC mortality has been observed, even though advances in detection, treatment, life prolongation, and palliation have been made. One of the main challenges with PC is the inability to predict which patients will develop metastatic and thereby lethal PC, and which patients will continue to have an indolent PC. Biomarker panels and pathology nomograms have yet to show that they can predict the course of the disease for a PC patient, which has led to massive over-treatment with only marginal effect on PC mortality. In consequence, foreseeing the course of a PC patient, or any cancer patient, is of tremendous clinical relevance for selecting interventions with the best patient outcome while minimising side effects.
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
Since the introduction of prostate specific antigen (PSA), diagnosis and disease progression of PC have been guided by this bio-marker. Under normal conditions, only low levels of PSA can be detected in the blood, and the increase of serum PSA found in PC can represent abnormalities in prostate gland architecture. Historically, PSA has been utilised for monitoring the progression of patients already diagnosed with PC, or for recurrent curative therapy. In 1987, a large study demonstrated PSA as the most sensitive bio-marker for monitoring PC progression [6]. Here, it was shown that the PSA level increases with advancing clinical stage and that it is useful for detecting recurrence after therapy. Subsequent studies have explored PSA's ability for early disease prediction. In 1991, it was demonstrated that a combination of PSA measurements of more than 4.0 ng/mL with other clinical findings improved the early detection of PC among 1 653 healthy men without predisposition to cancer [7], [8].
However, like other diagnostic tools, PSA has its limitations. For instance, the diagnostic test performance of PSA is volatile. In particular, the specificity ranges from 20% to 40% [9]. The relatively low specificity can be explained by other non-cancerous circumstances, such as inflammation, infection, and benign prostatic hyperplasia, which can elevate the PSA level. Furthermore, up to 15% of men with low levels of PSA have PC [10], [11]. It is therefore not possible to reliably predict the risk of severe cases of PC with PSA alone. Adding to this, PSA has led to an increase in detection of insignificant PC findings [5].
Despite its shortcomings, PSA remains an inexpensive and sensitive bio-marker for PC detection and disease progression monitoring. These features have made PSA usage common in screening procedures, such that additional bio-markers for clinical evaluation of PC are often obtained after the initial diagnoses. As a consequence, PSA retains its place as a primary clinical tool for PC diagnostics alongside imaging and biopsy based approaches, unless new methodology can be put forward to include other historic medical data previously thought to be unrelated to PC.
While PSA measurements retain a role in diagnosing a patient with PC, the Danish National Patient Registry (DNPR) offers an untapped resource in terms of historic clinical recordings for diagnostic procedures other than those related to PC. These records are digitised and can be accessed as electronic health records (EHR). Schmidt et al. [12] provide a comprehensive review of the content, quality, and research potential of the DNPR. Between 1977 and 2012, the DNPR registered more than 8 million persons with detailed administrative and clinical data. In addition, the DNPR provides data sources for disease identification, examination records, in-hospital medical treatments, and surgical procedures. Schmidt et al. [12] value the DNPR as a source for long-term temporal trend analysis.
Under the hypothesis that EHR contain information about disease progression, these data can be used to develop predictive models of patient trajectories under different medical contexts. These models are expected to help improve the quality of medical assessment in general, but also to propel the development of personalised medicine based on individual, i.e. per patient, predictions of disease progression [13]. Based on such predictions, individual risks can be estimated or potential treatment options evaluated. A good model should be applicable to unfiltered data and ideally be able to identify relevant elements and time scales in EHR, to allow for generalised application to different diseases without manual tuning.
In this paper, we present a generalised predictive model derived from EHR data using PC as a case study. We thereby aim to predict the series of events (diagnostic, treatment, and procedure codes) up to and from a diagnosis of PC. Subsequently, we propose a way in which this model can abstract higher level patient journeys without domain knowledge. This model is free of pre-selected diagnostic biomarkers, such as PSA or testosterone. The contribution of this work is twofold. i) After an introduction to the patient data included in this study (Section II), we present a tool to model and predict non-uniformly sampled EHR events in Section III. We offer an adaptation of long-short-term-memory (LSTM) recurrent neural networks (RNNs) that is able to learn relevant temporal relations between EHR events while discarding redundant and unrelated events. ii) In Section IV, we present the results, validation, and discussion of the model outcomes with respect to PC, highlighting the clinical relevance of data driven solutions such as the proposed model. In this part of the manuscript, we focus on the model's ability to generalise the prediction of patient journeys.
We conclude that the presented model is able to learn from unfiltered datasets based on EHR, thereby achieving superior performance in domain relevant predictions and highlighting the ability to extract relevant time scales and events in an unsupervised manner.

II. CLINICAL DATA DESCRIPTION AND PREPARATION
This study is based on a dataset composed of 18 529 patients diagnosed with PC at least once between January 1, 2004, and December 31, 2019, in a state-funded clinic in the Region of Southern Denmark. The average age of a patient in this study on January 1, 2004 is 62 years.
50296 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
A diagnosis of PC is defined as the presence of a DC619 code in a patient's journal using the Danish Care Classification System (SKS, from Danish, Sundhedsvæsenets Klassifikations System). The SKS codes associated with PC and a possible metastatic disease are listed in Table 1. The average age at the first PC diagnosis is 71 ± 9 years, with the youngest patient being 30 and the oldest 98 years old at the time of diagnosis. Based on the metastasis related codes in Table 1, 12.6 % of the patients are identified as having a metastasis related PC diagnosis, which is below the estimated 20−30 % worldwide [14]. For these patients, the average age at their first metastasis related diagnosis is 75 ± 8 years. The average time between the first diagnosis of PC and a confirmation of a metastatic disease is 1 132 days (approx. 3 years), with the lower and upper quartile measuring 0 and 1 988 days (approx. 5.5 years), respectively.

A. DATA DESCRIPTION
The data provided by the Region of Southern Denmark contains a wide range of SKS codes. These codes are used for sharing and delivering structured information for different information systems. This study does not exclude any SKS codes that might be present in the underlying dataset. For an exhaustive list of all SKS codes we refer the interested reader to the medinfo.dk database [15].
The SKS database consists of 17 classes, ranging from diagnostic codes, over medical procedures, to administrative codes. These are further divided into chapters and sections. Table 2 shows this hierarchy using the example of malignant plasma cells neoplasms (DC90). DC90 is categorised under the class Classification of disease and health related conditions (D), in the chapter Neoplasms (Chapter 2), in the section Cancer in lymphatic and hematopoietic tissue (Section 15), and has four sub-classifications: DC900, DC901, DC902, and DC903.
In a Danish patient journal, these codes are stored as what we refer to as an event. An event is a code recorded at a specific time; see Table 3a for an example.

B. DATA PREPARATION
For each possible event in the dataset, we assigned a unique variable $E_n$, with $1 \le n \le N$, where $N$ is the number of unique event codes. Table 3 shows how the aforementioned unstructured SKS codes are converted to a running variable $E_n$. Given the sparse and volatile nature of patient data, it is likely that some $E_n$ are less frequent than others, or simply appear by chance. In order to avoid modelling sparse events which are not represented well in the dataset, we agglomerated (clustered) these events, exploiting the hierarchical structure of the SKS codes. We arrived at a data vector $x^{(p)}_t = [E_t \mid \delta_t]$ coding the $t$-th visit of patient $p$, with $P$ the number of patients:

$$x_2 = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 45]^\top$$
$$x_3 = [0, 0, 1, 0, 0, \ldots$$
$$x_4 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 41]^\top$$
$$x_1 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]^\top \qquad \text{(1e)}$$

We denoted the dimension of these vectors as $N + 1$, which is the number of unique events in all patient journals plus one entry for the time since the last visit. Note that the first visit is always coded with $\delta_t = 0$. Collecting all data vectors which code for patient $p$'s visits, we can define a matrix whose $i$-th column contains the data vector $x^{(p)}_i$:

$$X_p = [x^{(p)}_1, \ldots, x^{(p)}_{T_p}]$$

As the number $T_p$ of visits in patient $p$'s journal depends on the patient under consideration, the matrices $X_p$ have varying numbers of columns. For further processing, we zero-padded them to have the same number $T = \max_{1 \le p \le P} T_p$ of columns and then turned them into a tensor.
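The construction of the data vectors and the zero padding can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `build_visit_vectors` and `zero_pad`, the toy journals, and the nested-list layout (patient, visit, feature) are all hypothetical.

```python
from datetime import date

def build_visit_vectors(visits, n_events):
    """visits: list of (date, set of event indices in 1..N), sorted by date.
    Returns one vector of length N + 1 per visit: a multi-hot eventset
    followed by the days since the previous visit (0 for the first visit)."""
    vectors = []
    previous_day = None
    for day, events in visits:
        multi_hot = [1 if n in events else 0 for n in range(1, n_events + 1)]
        delta = 0 if previous_day is None else (day - previous_day).days
        vectors.append(multi_hot + [delta])
        previous_day = day
    return vectors

def zero_pad(journals, n_events):
    """Pad every journal with all-zero vectors up to the longest journal."""
    t_max = max(len(journal) for journal in journals)
    blank = [0] * (n_events + 1)
    return [journal + [list(blank) for _ in range(t_max - len(journal))]
            for journal in journals]

# Hypothetical toy data: two patients and N = 4 event codes.
patient_1 = build_visit_vectors(
    [(date(2010, 1, 1), {2}), (date(2010, 2, 15), {1, 3})], n_events=4)
patient_2 = build_visit_vectors([(date(2012, 6, 1), {4})], n_events=4)
tensor = zero_pad([patient_1, patient_2], n_events=4)
```

In this toy case both journals end up with two visit vectors of length five, so they can be stacked into one rectangular array.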

III. METHODOLOGY
Our model is based on the hypothesis that there exists a causal relation, or at least a correlation, between consecutive events in EHR. In particular, we make no assumption on why the data for these events were acquired.

TABLE 2. SKS hierarchy example for Malignant plasma cells neoplasms (DC90), including the absolute frequency # of the codes in the underlying dataset. Note that the colour codes the frequency of the event as high (green) or low (red, below a threshold of 50 events). Based on the frequency, we agglomerated the low frequency events into events on a higher hierarchy level, until the sum of event frequencies crosses the threshold. Events DC900 and DC903, for example, would not be taken into account individually, but can be agglomerated into a higher level event DC90. Appendix A contains a detailed description of the procedure.
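The agglomeration described in the caption of Table 2 could look roughly like the sketch below. It is not the procedure of Appendix A: it assumes that the parent of a code is obtained by dropping its last character, the function name `agglomerate` and the `min_length` stop criterion are hypothetical, and the counts are invented for illustration.

```python
from collections import Counter

def agglomerate(counts, threshold=50, min_length=2):
    """Map each code to itself or to a shorter prefix.
    Codes whose pooled frequency stays below the threshold are moved one
    hierarchy level up (assumed here: drop the last character) until the
    pooled frequency reaches the threshold or the prefix gets too short."""
    mapping = {code: code for code in counts}
    while True:
        totals = Counter()
        for code, target in mapping.items():
            totals[target] += counts[code]
        rare = {target for target, total in totals.items()
                if total < threshold and len(target) > min_length}
        if not rare:
            return mapping
        mapping = {code: (target[:-1] if target in rare else target)
                   for code, target in mapping.items()}

# Hypothetical frequencies; only the code names come from Table 2.
counts = {"DC900": 12, "DC901": 300, "DC902": 80, "DC903": 45}
mapping = agglomerate(counts)
```

With these invented counts, DC900 and DC903 are pooled into DC90 (12 + 45 = 57 events), while DC901 and DC902 remain individual events.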
Modelling EHR is no new endeavour; rule-based and regression methods have long played a role in diagnostic decision support and provide accurate results in selected disease detection and prediction studies [16], [17], [18], [19]. With the increase in data volume and computational power of the last decades, neural networks are gradually replacing regression and statistical models as the workhorse in data mining and data driven modelling. However, big data analytics comes at the price that the interpretability of the models often decreases. Models such as those presented by Lipton et al. and Choi et al. [20], [21] allow for modelling patient data with high accuracy. Nonetheless, these models do not inform about the patient journey, i.e. they do not give an account of when information is relevant in time. Pham et al. [22] include the time between events as a frequency component in their modelling. Pham's DeepCare model weights the time between events with a decreasing function $h(\delta_t)$. Baytas et al. [23] proposed a time-aware LSTM structure by adding a discounted forget gate to the LSTM node (see Section III-A). While this study follows the same approach on the LSTM node level, Baytas et al. focus on modelling decreasing times, similar to Pham et al. Further, they propose a parametric approach to weight the times for informed time scales measured in days [23]. Still, this approach implies and requires knowledge of the dominant time scales for the diseases modelled. In contrast to the assumptions Baytas et al. and Pham et al. implemented when including time in their models, the proposed model maintains an unsupervised approach to learning relevant time scales, which reflects the lack of knowledge concerning the driving time scales within PC modelling. Therefore, we use Pham's DeepCare model as a state of the art reference model with time-awareness and compare the proposed model's performance to it.
In the following, we lay out the foundation for a specific kind of neural network, a so-called autoencoder. While the fundamental operations are similar to those of existing models, and temporal considerations have been introduced as hard rules by Pham et al. and Baytas et al. [22], [23], we arrive at a model interpretation that generalises the patient journey for PC without assuming any underlying decaying structure or parameterisation on the temporal side; hence we call the proposed model sample frequency independent. It is thus the Iν-LSTM's main contribution to learn relevant time scales in an unsupervised manner from the presented data.

A. INDEPENDENT FREQUENCY LSTM (Iν-LSTM)
There are many ways in which event patterns can be abstracted, ranging from a priori associated rule mining [24] to RNNs [25]. Given the recurrent and successive nature of the data, and in particular the volume at hand, RNNs are a suitable tool to handle the task. LSTM RNNs, proposed first in 1997 [26] and in their current version by Graves et al. [27], are now widely used in a variety of applications.
TABLE 3. Conversion of SKS codes (Table 3a) to a running variable (Table 3b) and from there to an eventset $E_t$. We concatenated the time $\delta_t$ (also Table 3b) to the eventset to create a data vector which the presented method can work on.
LSTMs are explicitly designed to retain long-term temporal dependencies. Like all RNNs, they are composed of repeated computational nodes and can produce a sequence-to-sequence output. Fig. 1a shows a high-level view of how a (recurrent) LSTM node conserves historic data for reasoning. Here $x_t$ and $h_t$ denote the input and output sequences, respectively, up to sequence length $S$. In contrast to naïve neural network nodes, i.e. nodes that are composed of only one activation function, LSTMs are composed of a gated structure whose components can be characterised as input, output, and forget gates, a current and a candidate memory cell, as well as a hidden state variable. Given weight matrices $W$, $U$ and bias vector $b$ in a computational node structure, as shown in Fig. 1b, an LSTM neural network node is defined through four operations, among them the LSTM forget gate (Eq. (6)), the LSTM memory gate (Eqs. (7a) and (7b)), and the LSTM node activation (Eqs. (9a) and (9b)).
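For orientation, the widely used textbook form of an LSTM node is sketched below. This is an assumption about the kind of operations Eqs. (6) to (9b) express, not a reproduction of them: the gate-specific subscripts on $W$, $U$, and $b$, the symbol $\operatorname{sigm}$ for the logistic function, and the grouping of the lines are illustrative choices.

```latex
\begin{align*}
f_t         &= \operatorname{sigm}(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t         &= \operatorname{sigm}(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)               && \text{candidate memory cell} \\
c_t         &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t        && \text{current memory cell} \\
o_t         &= \operatorname{sigm}(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
h_t         &= o_t \odot \tanh(c_t)                             && \text{hidden state (node output)}
\end{align*}
```

None of these operations takes the elapsed time between inputs as an argument, so consecutive inputs $x_{t-1}$ and $x_t$ are treated the same way whether they lie one day or several years apart.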
An RNN composed of LSTM nodes described by Eqs. (6) to (9b) assumes temporal regularity between events. This implicit assumption in the RNN architecture, i.e. that the elapsed time between patient visits is uniformly sampled, makes it unsuited for the dataset at hand. In general, patient visits are not scheduled to follow any distribution and are subject to random influences from their environment. Non-uniform time between events is in itself not of concern if the variability of one patient's behaviour can be generalised to many patients. However, patients come with their very individual schedules; thus, temporal variability in the times between the same visits of different patients is to be expected. In the following, we propose an adaptation to the LSTM structure in order to include the aforementioned time between events, $\delta_t$.
We treat the irregularities in the timing of patient visits as an inter- and intra-patient independent sample frequency ν.
The independent sample frequency model, Iν-LSTM, adds a separate temporal input to the LSTM structure that amplifies or suppresses an eventset depending on the time that passed since the previous visit. In order to account for long gaps between events, an additional forget gate is introduced. The proposed Iν-LSTM is laid out in Fig. 2, following the LSTM node structure as proposed by Baytas et al. [23].
Given the same weight matrices as for an LSTM node, an Iν-LSTM is defined through five operations and a temporal map. The Iν-LSTM memory gate is the same as in Eqs. (7a) and (7b), and the Iν-LSTM node activation is the same as in Eqs. (9a) and (9b).
Many neural networks are concerned with multiple-input-single-output or multiple-input-multiple-output maps. In this section we address the prediction of the next events in a sequence. In this study, the input space is of size $N + 1$ and we wish to retain the option to predict any event in the dataset; thus the desired output space is of dimension $N$. As we can argue that the diagnostic procedure takes place in fewer dimensions and that we aim for an abstraction of the underlying diagnostic process, a desired model should be able to encode information in a lower dimensional latent space $L$ of dimension $l$.
Sequence-to-sequence autoencoders [28] have been used to learn a representation of data by mapping from an input $X$ to a desired output $Y$. In order to fulfil Eqs. (12), we are thus seeking a map composed of an encoder ψ and a decoder φ, with latent space $L$ (Fig. 3). For an input in $\mathbb{R}^{N+1}$, the encoder consists of a layer of $m$ Iν-LSTM nodes with sequence length $S$ and $k$ subsequent LSTM layers with a decreasing number of nodes, $m_0 > m_1 > \ldots > m_k$, that map into $L$ of size $l$. The latent space is defined by an element-wise tanh activation function, σ, with weight matrix $W$ and bias vector $b$. The decoder then employs a sequence of LSTM layers with an increasing number of nodes to map to the desired output in $\mathbb{R}^{N}$.

FIGURE 3. Schematics of an autoencoder. The high dimensional input on the left side is transferred into an intermediate representation, the dense layer $l$, before being projected again into a higher dimensional representation, a prediction of the next eventset. Thus, the dense representation $l$ has to retain relevant information to be able to project back to a high dimensional eventset.
From here on we use Iν-LSTM$(k, k', [m_0, \ldots, m_k], S, l)$ for autoencoders of different topology, and LSTM$(k, k', [m_0, \ldots, m_k], S, l)$ when only LSTM layers are used.
Training of the autoencoders is achieved by minimising the root-mean-squared error. For an exhaustive list of different loss functions and optimisation techniques we refer the interested reader to Bianchi et al., Goodfellow et al., and Bishop [25], [28], [29], [30].
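As a reminder of what the training objective measures, the root-mean-squared error between a predicted and a target sequence of eventsets can be computed as in the sketch below. The helper name `rmse` and the choice to average over all entries of all time steps are assumptions; the normalisation used in the study is not specified here.

```python
import math

def rmse(predicted, target):
    """Root-mean-squared error over all entries of two equally shaped
    sequences of vectors (one vector per time step)."""
    squared = [(p - t) ** 2
               for p_row, t_row in zip(predicted, target)
               for p, t in zip(p_row, t_row)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical two-step sequence with two events per eventset.
error = rmse([[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```

Here one of four entries is off by one, so the error is the square root of 1/4.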

C. MODEL TRAINING AND VALIDATION
We propose topologies for Iν-LSTM and LSTM autoencoders. The models are constrained to be symmetric w.r.t. the number of layers ($k = k'$) and nodes in each layer, and allow the width of each layer to be decreasing or increasing. Further, the latent dimension $l$ is limited to be either 5 or 3 nodes wide, and the sequence length $S$ is limited to 5 or 2 time instances (Table 4). Alongside, we train DeepCare [22], Associated Rule Mining [24], and Random Forest classification for comparison.
Model performance is evaluated in terms of a Jaccard coefficient based accuracy (i) for predicting the next PC related eventset including metastases (Acc. PC w. Metastases), i.e. the intersection over union of PC metastases related events only, (ii) for predicting the next PC related eventset excluding metastases (Acc. PC), i.e. the intersection over union of PC related events only, and (iii) for predicting the next following eventset (Acc. $E_t$), i.e. the intersection over union between the predicted eventset and the actual eventset. For the first case, the sensitivity (Sen. PC w. Metastases) and specificity (Spe. PC w. Metastases) are provided as well.
TABLE 5. Model comparison for test results in terms of accuracy, sensitivity, and specificity for predicting events related to prostate cancer with metastases (first three columns), accuracy for events related to prostate cancer with and without metastases (fourth column), and prediction accuracy for all events in the dataset (in the last column). The best performance in each category is highlighted in bold.
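The Jaccard based accuracy, sensitivity, and specificity described above can be illustrated for a single predicted eventset as follows. This is a sketch under assumptions: the helper names, the convention of returning 1.0 for empty denominators, and the toy eventsets (including the placeholder codes X1 and X2) are hypothetical, and the way scores are aggregated over visits and patients is not shown.

```python
def jaccard(predicted, actual):
    """Intersection over union of two event sets (1.0 if both are empty)."""
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0

def restricted_scores(predicted, actual, subset):
    """Jaccard accuracy, sensitivity, and specificity on a subset of events."""
    p, a = predicted & subset, actual & subset
    tp = len(p & a)                     # predicted and present
    fn = len(a - p)                     # present but not predicted
    fp = len(p - a)                     # predicted but not present
    tn = len(subset) - tp - fn - fp     # neither predicted nor present
    sensitivity = tp / (tp + fn) if tp + fn else 1.0
    specificity = tn / (tn + fp) if tn + fp else 1.0
    return jaccard(p, a), sensitivity, specificity

# Hypothetical eventsets; X1 and X2 stand for events unrelated to PC.
pc_codes = {"DC619", "AZCD40", "AZCD41", "AZCD49"}
predicted = {"DC619", "AZCD41", "X1"}
actual = {"DC619", "AZCD49", "X2"}
acc_pc, sen_pc, spe_pc = restricted_scores(predicted, actual, pc_codes)
acc_all = jaccard(predicted, actual)
```

In this toy case the restricted accuracy is 1/3 and the accuracy over the full eventset is 1/5, which shows that the three accuracies in Table 5 can differ for the same prediction.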

A. MODEL PERFORMANCE AND COMPARISON
After data preparation, the models are trained on 10 000 out of 18 529 patients, in a 70 % to 30 % training to validation split. The remainder of 8 529 patients is reserved for testing. The results for model validation of the proposed and literature models are summarised in Table 5.
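The resulting group sizes can be checked with a short sketch. Splitting at the patient level after a seeded random shuffle is an assumption made for illustration; how patients were actually assigned to the groups is not stated, and the function name `split_patients` is hypothetical.

```python
import random

def split_patients(patient_ids, n_model=10_000, train_fraction=0.7, seed=0):
    """Shuffle patient identifiers and split them into training,
    validation, and test groups."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    model_ids, test_ids = ids[:n_model], ids[n_model:]
    n_train = round(train_fraction * len(model_ids))
    return model_ids[:n_train], model_ids[n_train:], test_ids

train, validation, test = split_patients(range(18_529))
print(len(train), len(validation), len(test))  # 7000 3000 8529
```

Under these assumptions, 7 000 patients are used for training, 3 000 for validation, and 8 529 for testing.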
Concerning the accuracy for the prediction of events with respect to metastatic PC, the proposed models scored between 0.59 and 0.88. Apart from Iν-LSTM$_6$, the independent frequency models performed significantly better than the simple LSTM or literature models. In terms of sensitivity, the independent frequency models range from 0.69 to 0.76, while specificity ranges from 0.87 to 0.92, not including the measure for Iν-LSTM$_6$ with 0.62. Iν-LSTM$_1$ and Iν-LSTM$_2$ perform better overall than the other tested models, where Iν-LSTM$_2$ presents marginally better accuracy and specificity than Iν-LSTM$_1$. Conditioned on the combinations of the encoder topology, the latent space dimension, and the input sequence length, as given in Table 4, the model accuracy is most sensitive to changes in the latent dimension and, to a smaller degree, the sequence length. We conclude that three latent dimensions are not as suited to facilitate and discriminate an embedded representation of the sequential data as higher dimensional latent spaces are. The lower influence of the sequence length $S$ could be understood in terms of the LSTM's internal memory capacity already providing a trace of the previous sequence. The same behaviour can be observed for Acc. PC and Acc. $E_t$. Due to the high accuracy in predicting metastatic PC and the lower computational complexity compared to Iν-LSTM$_1$, we select Iν-LSTM$_2$ as the best candidate model for further investigation in the remainder of this work. For a 10-fold cross-validation of the two best performing models we refer the reader to Appendix C.
When comparing the accuracies between predicting events for the sets PC w. Metastases, PC, and $E_t$, we can see that the accuracy is higher the more specialised the subset is, i.e., eventsets for PC w. Metastases show the best prediction accuracy, whereas the generic eventsets $E_t$ are hardest to predict. This observation holds for almost all models, with the exceptions of DeepCare, Random Forests, and Iν-LSTM$_6$. This effect is most pronounced for the Iν-LSTM family of models presented here. We argue that this effect represents the Iν-LSTM models' ability to learn relevant time frames from the data. As all patients were selected by the property of having a PC diagnosis, we can expect that all models should be able to pick up common sequences of PC events, whereas all models should have trouble properly predicting eventsets which are not part of the common set of diagnoses but randomly contributed by individual patients, and which therefore do not correlate with common events. To be able to provide better predictions, a model needs to be able to focus on the relevant events and neglect irrelevant data. The ability to extract the time scales relevant for a disease's progression and treatment can help to make this distinction and is a unique feature of the Iν-LSTMs presented here.
For DeepCare, we observe that the accuracy for predicting events related to PC w. Metastases is in fact lower than the accuracy for predicting events related to PC. Otherwise, the performance is close to that of standard LSTMs. In comparison to Iν-LSTMs, DeepCare models have fixed time scales which are included with a $1/\delta_t$ characteristic. Thus, apart from the topology, the proposed model is the more flexible approach; the weighting of time contributions is the only difference which can be taken into account when explaining the difference in the ability to learn the specialised underlying structure in the EHR. It is noteworthy that the introduced temporal map is learned unsupervised and is therefore able to adapt to time scales inherent to the presented data. The question of whether this advantage of the Iν-LSTMs can be used to improve the prediction of, e.g., risk of metastases or PC progression in general, is the subject of ongoing investigations.
The other literature candidates, Associated Rule Mining and Random Forests, scale poorly with large data, thus they are unsuited in this case.

B. LATENT LAYER INTERPRETATION
Given the nature of autoencoders, we can investigate the lower dimensional embedded structure when passing information through the network. For each eventset $E_t$, we store the latent vector $l$. Fig. 4 shows the latent space $L$ as a scatter matrix for all eventsets in the training data. Here the data is represented as a density plot of all captured latent vectors $l$ over all pairwise permutations of latent dimensions, and as histograms when projected onto single dimensions. For a more intuitive interpretation of the latent space we have labelled each cluster, for visual purposes only (for the full clustering procedure see Appendix B), according to a target eventset as:
DC619: Latent space leading to the first eventset containing a DC619 diagnostic code.
AZCD40: Latent space leading to the eventset containing the combination DC619 and AZCD40.
AZCD41: Latent space leading to the eventset containing the combination DC619 and AZCD41.
AZCD49: Latent space leading to the eventset containing the combination DC619 and AZCD49.
Prefix AZ: Latent space leading to diagnostic codes DC619 with the prefix appendix code AZ that are not AZCD40, AZCD41, or AZCD49.
others: Latent space leading to eventsets that are not any of the above.

FIGURE 5. Collective disease progression of patients in the test set towards metastatic and non-metastatic PC. Based on the agglomerative event clustering, the dynamics of patient journeys in the latent space, labelled after k-means clustering, allow the relation between diagnostic codes to be reconstructed. To focus on the relevant transitions, all transitions with a frequency below 15 % have been neglected in the construction of this graph. The graph clearly shows that PC with metastasis (AZCD41) is the definite end of the diagnostic chain. The non-metastatic code (AZCD40), on the other hand, can be the final diagnostic code, but can also transition to AZCD41 directly or via the AZCD49 code, which indicates a state of waiting for a result which can lead to either an AZCD40 or an AZCD41 diagnostic code.

The accuracy of the model predictions already gives a good indication that information relevant to disease progression is present in the abstract, latent representation. Furthermore, clear cluster separation and structure, for example along the dimensions $l_1$ and $l_2$ for prefix AZ vs others, or along the dimensions $l_1$ and $l_3$ for DC619 vs others, are evidence that the network has embedded correlations and associations between different eventsets. It is also evident that some dimensions, for example $l_4$ and $l_5$, carry no visually interpretable information. Generally, the label "others" comes from several clusters, and their projections do not necessarily form continuous shapes. However, the decrease in accuracy for a lower dimensionality $l$ of the latent space is a clear indication that the network has learned along the additional dimensions, although presumably carrying noise.

C. EVENT PREDICTION
While this way of interpreting the embedded space explores how well related eventsets can be discriminated, it lacks an interpretation of how the network reasons over time. In the following, we attempt to answer this question by interpreting the latent space in terms of a spatial-temporal representation. We now consider the sequence of events for one patient only, i.e. $E^{(p)}_t$, $t = \{1, \ldots, T_p\}$, and track the change of $l$ over time. Fig. 4 shows the mentioned latent space with the distribution of all eventsets for all patients, overlaid by one patient's journey as a sequence of eventsets. For this patient, each successive eventset is connected by an arrow, forming a graph representation of which regions activate. The selected patient's journey can most easily be followed in the top left panel of Fig. 4, with the latent dimensions $l_1$ and $l_2$ on the axes. We conclude that the clusters used in Fig. 4 correspond to the relevant milestones in a PC patient journal. The journey starts with several diagnoses unrelated to PC (large, blue distribution), before the first "DC619" diagnosis (green distribution) appears. Then, the transition goes over "AZCD49" to "AZCD41", the final metastasis diagnosis of this journey.
Repeating this exercise for all patients in the test set, and introducing a node for each cluster, we can generate a spatial-temporal graph displaying the collective disease progression of all patients (Fig. 5). When evaluating all patients from the test dataset, we can derive that 20.4 % of all patient journeys end in a metastatic stage. This is significantly higher than the 12.6 % found in the descriptive statistics of the patient data. As noted earlier, 12.6 % is low compared to the estimated worldwide 20 to 30 % [14]. Calculating the median over PC diseases with metastasis from 2017 to 2019, based on data from The Danish Prostate Cancer Database (Dansk Prostata Cancer Database, [31]) and the Danish Health Data Authority (Sundhedsdatastyrelsen, [1]), metastatic PC stages can be found in 26.1 % of all PC diagnoses in Denmark, which is still higher than the patients' average. Danish urologists and nurses believe the underlying reasons to be varying practice in recording SKS codes and changing systems over time. The issue of this discrepancy in coding and reporting is thus well known and is iteratively handled via updated reporting guidelines [32]. We thus hypothesise that the proposed network can generalise past recorded data, even if they were erroneously recorded, thereby painting a picture that more closely resembles the individual patient journey as it should have been recorded. Extending on this observation, the ability to use the presented method to check the self-consistency of the dataset in an unsupervised manner makes it a candidate for quality assurance in addition to modelling patient journeys, providing input to policy makers to improve health care on an administrative level.
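One way such a collective transition graph and the share of journeys ending in a given stage could be derived from per-patient sequences of cluster labels is sketched below. The function names, the toy journeys, and in particular the normalisation of the 15 % threshold by the number of transitions leaving each node are assumptions made for illustration, not a description of how Fig. 5 was built.

```python
from collections import Counter, defaultdict

def transition_graph(journeys, min_share=0.15):
    """Count label-to-label transitions and keep the frequent ones.
    A transition is kept if it accounts for at least min_share of all
    transitions leaving its source label (an assumed normalisation)."""
    counts = defaultdict(Counter)
    for labels in journeys:
        for source, target in zip(labels, labels[1:]):
            if source != target:        # ignore visits that stay in a cluster
                counts[source][target] += 1
    graph = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        kept = {t: c / total for t, c in targets.items()
                if c / total >= min_share}
        if kept:
            graph[source] = kept
    return graph

def share_ending_in(journeys, label):
    """Fraction of journeys whose last cluster label equals `label`."""
    return sum(1 for labels in journeys if labels[-1] == label) / len(journeys)

# Hypothetical journeys, given as sequences of cluster labels.
journeys = [
    ["others", "DC619", "AZCD49", "AZCD41"],
    ["others", "DC619", "AZCD40"],
    ["others", "others", "DC619", "AZCD40"],
    ["others", "DC619", "AZCD40", "AZCD49", "AZCD40"],
    ["DC619", "AZCD40"],
]
graph = transition_graph(journeys)
metastatic_share = share_ending_in(journeys, "AZCD41")
```

In this toy example one of five journeys ends in AZCD41; the edge weights in `graph` are the relative frequencies of the retained transitions.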
Discussing these graphs, we have to keep in mind that patients in this study were selected by having a PC related diagnosis, and that the labels were chosen accordingly (Section IV-B) to reflect diagnoses around PC. Based on this dataset, the presented method can clearly reproduce a meaningful representation of disease progression from the initial diagnosis to the possible metastasis diagnoses at the end of the patient journey. In this light, the higher accuracy for predicting PC, and especially PC with metastasis, indicates that the model could learn PC specific disease progression, while it performed significantly worse when predicting seemingly unrelated events, which are presumably randomly represented in the history of the selected patients. Therefore, in comparison to DeepCare, Associated Rule Mining, and Random Forests, this model seems to have the ability to prioritise events based on common elements in the patient history, which allows higher accuracy eventset predictions with regard to the common diagnostic complex, i.e. PC. Whether this ability to prioritise events based on common elements in the patient history can be used to find additional indicators for PC diagnosis remains to be answered in a future study.

V. CONCLUSION
In this study, we present a modified LSTM node which is able to extract time scales from non-uniformly sampled inputs. This unsupervised approach to identifying relevant time scales in a dataset distinguishes the model from other state of the art methods, which use functional or parametric approaches to weight relevant time scales. Applied to a dataset of patient journals from the Danish National Patient Registry, selected for exhibiting a prostate cancer related diagnostic code, the model based on these modified LSTM nodes achieves high accuracy, sensitivity, and specificity for metastasis related event prediction, surpassing state of the art methods significantly in absolute numbers. Yet the main characteristic is that the presented model does not predict all events in the dataset with the same accuracy. While all tested state of the art models perform slightly better for prostate cancer related events, the presented method shows a significant improvement of ≈10 percentage points when predicting PC related events compared to the whole event set, and a further improvement of ≈10 percentage points when restricting the evaluation to PC with metastasis. In this case, the tested state of the art methods do not show a significant change.
The model includes an autoencoder whose dimension has a clear influence on the resulting performance of the model. Furthermore, we demonstrated that the latent variables can be used to create an abstract representation of the patient journey, from which typical patient journeys can be reconstructed, and found the reconstruction to be in good agreement with the guidelines. Evidence that the model is able to predict the course of PC is a step towards filling the gap for predictive models in a clinical setting.
In consequence, we conclude that the presented model can focus learning on a subset of events in a problem specific dataset that matches the problem at hand.This sets the presented method apart from the state of the art and allows for many applications.
In future studies, we want to focus on the question of whether this domain specific knowledge, which is implicitly accumulated in the model, can be used to improve the prediction of patient risk factors and to identify relevant diagnostic codes which are not directly related to PC. The identification of such markers would make it possible to improve the diagnostics of PC by harnessing knowledge about other, seemingly unrelated medical incidents and to suggest additional procedures or treatment options. Based on the observation that graphs are a powerful basis for the understanding of cause-effect relations [33], we will specifically investigate to what extent the relations identified with this method can be used for causal modelling and for the identification of, e.g., relevant preconditions, interventions, and life-style options, by including relevant data in the generated graphs. Extending beyond prostate cancer, future activities will clarify whether the method can be applied to model other disease complexes and how it handles data selected by more than one disease complex as criterion. We base this on the hypothesis that, for any dataset drawn from a larger set by selecting for a specific disease, the model should similarly extract a meaningful latent space representation, provided that the data and the selected disease are correlated. Generalising further, research into the applicability as a quality assurance tool offers the chance of vastly improving health care on an administrative level.

APPENDIX A AGGLOMERATIVE EVENT CLUSTERING
For each event in the dataset we encode a unique variable $E_n$ such that
$$E = \{E_1, \dots, E_n, \dots, E_N\} \tag{18}$$
is the set of all $N$ possible recorded events (e.g. diagnoses, procedures, treatments), with $t = \{1, \dots, T_p\}$ being the times at which events can be observed for a patient $p$. For each $t$ there then naturally exists a subset of $E$ (including the empty set), denoted
$$E^{(p)}_t \subseteq E, \quad t = \{1, \dots, T_p\}, \ \forall p. \tag{19}$$
We demand a minimum support, i.e. a minimum frequency of the corresponding code, before a given SKS code can be stored in $E$. For each $E_n$ whose support is smaller than the minimum support, we calculate the support of an agglomerative event $E^{\uparrow}_n$ by accumulating the support of all lower-hierarchy SKS codes $E^{\downarrow}_n$ that do not fulfil the minimum support, until it exceeds the minimum support required.
In the example of Table 2, for $\mathrm{supp}_{\min} = 15$, the codes DC901 and DC902 would become separate events, whereas DC900 and DC903 would not cross the threshold. Codes which do not cross the threshold are collected on higher hierarchy levels. In this case, going up one level (DC90) and combining DC900 and DC903 into one agglomerated event would cross the frequency threshold of 15.
For the present analysis, we chose $\mathrm{supp}_{\min} = 50$.
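The agglomeration step described above can be sketched as follows. This is an illustrative reading of the procedure and not the code used in this study: it assumes that the parent of a code is obtained by dropping its last character (as in DC900 to DC90), that all input codes are leaf codes, and that codes which never reach the threshold are stored under the shortest allowed prefix. The function name `agglomerate`, the parameter `min_len`, and the counts are hypothetical; the counts are chosen only to reproduce the behaviour described for a threshold of 15.

```python
from collections import defaultdict


def agglomerate(counts, supp_min, min_len=1):
    """Map every code to the event label it is stored under.

    Codes whose support reaches `supp_min` become their own event. The
    remaining codes are pooled one hierarchy level up (the code minus its
    last character, an assumption of this sketch) and checked again.
    """
    mapping = {}                         # original code -> event label
    pending = {c: [c] for c in counts}   # candidate label -> pooled original codes
    support = dict(counts)               # candidate label -> accumulated support
    while pending:
        next_pending = defaultdict(list)
        next_support = defaultdict(int)
        for label, members in pending.items():
            if support[label] >= supp_min or len(label) <= min_len:
                for member in members:
                    mapping[member] = label
            else:
                parent = label[:-1]
                next_pending[parent].extend(members)
                next_support[parent] += support[label]
        pending, support = dict(next_pending), dict(next_support)
    return mapping


# Invented counts, not the values of Table 2.
counts = {"DC900": 8, "DC901": 20, "DC902": 17, "DC903": 9}
events = agglomerate(counts, supp_min=15)
```

With these counts, DC901 and DC902 remain separate events, while DC900 and DC903 are pooled under DC90, whose accumulated support of 17 crosses the threshold.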

A. TIME BETWEEN EVENTS
We refer to $\delta_t$ as the time between two events. It is always non-negative and unbounded, $\delta_t \in [0, \infty)$, and without loss of generality can be continuous or discrete. Carrying on with the example of Table 3 results in the vectors given in equation (1).
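As a minimal sketch of how such time differences could be computed, the snippet below derives a discrete $\delta_t$ in days from a list of dates. The function name `time_deltas`, the unit of days, and the convention of assigning 0 to the first event are assumptions of this illustration and are not taken from the study.

```python
from datetime import date


def time_deltas(event_dates):
    """Return the elapsed days between successive events.

    The first event has no predecessor; assigning it a delta of 0 is an
    assumption of this sketch.
    """
    ordered = sorted(event_dates)
    if not ordered:
        return []
    deltas = [0]
    for previous, current in zip(ordered, ordered[1:]):
        deltas.append((current - previous).days)
    return deltas


# Hypothetical visit dates for one patient.
visits = [date(2018, 1, 10), date(2018, 3, 1), date(2019, 3, 1)]
deltas = time_deltas(visits)
```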

APPENDIX B k-MEANS LATENT SPACE CLUSTERING
We label the learned eventsets by using k-means clustering. We will first summarise k-means clustering, before we explain the label selection below.
To find the clusters, we partition all latent vectors $l_i$ into $k$ sets $S \in \{S_1, \dots, S_k\}$ so as to minimise the total cluster Euclidean distance $(l|k)$ between the latent samples $l_i$ and the cluster centres $C_j$, while maximising the cost for how well the samples $l_i$ lie within a cluster.
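For reference, the textbook form of the k-means objective that is minimised can be written as below. This is a sketch in the notation of this appendix and not necessarily the exact expression used in the study; in particular, the squared Euclidean norm and the definition of $C_j$ as the cluster mean are assumptions taken from the standard algorithm.

```latex
\begin{equation*}
  \underset{S}{\arg\min} \sum_{j=1}^{k} \sum_{l_i \in S_j} \lVert l_i - C_j \rVert_2^2 ,
  \qquad
  C_j = \frac{1}{|S_j|} \sum_{l_i \in S_j} l_i .
\end{equation*}
```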

FIGURE 2. Illustration of the proposed independent frequency adjusted LSTM (Iν-LSTM) node. The modifications to the original LSTM (Fig. 1b) are added on the left side, to incorporate the temporal information $\delta_t$ into the memory gate update. During training, the temporal map $h(\cdot)$ learns relevant timespans from the presented data.
the support of an event as countable $\mathrm{supp}(E_n) \equiv |\{t \mid E_n \in E^{(p)}_t\}|$
$$\sum_{l_i, l_m \in S_j,\; m \neq i} \lVert l_i - l_m \rVert_2 \tag{25}$$

TABLE 1. SKS codes associated with PC and metastatic PC, translated from Danish.

TABLE 3. Example of the conversion process from SKS codes (Table

TABLE 4. Model selection, including the depth $k$ of the encoder and $k'$ of the decoder ($k = k'$), the LSTM layer sizes $m_i$, the sequence length $S$, and the latent space dimension $l$.