SHAPE: A Sample-Adaptive Hierarchical Prediction Network for Medication Recommendation

Effective medication recommendation for patients with complex multimorbidity conditions is a critical yet challenging task in healthcare. Most existing works predict medications from longitudinal records, assuming that intra-visit medical events can be encoded as a serialized sequence and that the information-transmission patterns used to learn longitudinal sequence data are stable across patients. However, two conditions are often ignored: 1) a more compact encoder for the intra-relationships among medical events within a visit is urgently needed; 2) the strategies required to learn accurate representations of patients' variable-length longitudinal sequences differ. In this article, we propose a novel Sample-adaptive Hierarchical medicAtion Prediction nEtwork, termed SHAPE, to tackle these challenges in the medication recommendation task. Specifically, we design a compact intra-visit set encoder to model the relationships among medical events and obtain a visit-level representation, and then develop an inter-visit longitudinal encoder to learn the patient-level longitudinal representation efficiently. To endow the model with the capability of modeling variable visit lengths, we introduce a soft curriculum learning method that automatically assigns a difficulty to each sample according to its visit length. Extensive experiments on a benchmark dataset verify the superiority of our model over several state-of-the-art baselines.


I. INTRODUCTION
RECENTLY, massive health data have offered the opportunity to assist clinical decision-making through deep learning [1]-[6]. Effective and safe medication combination recommendation for patients who suffer from multiple diseases is an essential task in healthcare [7]-[9], and the medication recommendation task has attracted considerable research interest [10]-[19]. The goal of medication recommendation is to predict the medication sequence for a particular patient based on his or her complex health conditions. Existing strategies for medication recommendation can be categorized into two types. 1) Instance-based methods recommend medication sequences based only on the current hospital visit (e.g., diagnosis and procedure codes) [20]-[23]; this setting ignores the temporal dependencies in the patient's health records. 2) To overcome this issue, longitudinal-based methods were proposed to leverage longitudinal patient records to predict personalized medications. Most longitudinal methods pursue enhanced representations of patient health status based on historical health records (e.g., diagnoses and procedures) and use this patient representation to conduct medication recommendation [24]-[31].
Despite the significance and value of longitudinal methods, they still suffer from two critical limitations. 1) Existing longitudinal works neglect the compact intra-relationships between medical events within each visit; in other words, they ignore the relationships among medical codes of the same type during a visit. 2) Existing longitudinal models are static: all samples go through the same fixed computation flow. This may be powerless on shorter records, which lack historical information.
On the one hand, existing longitudinal methods use the historical code sequences (e.g., medications and diagnoses) within each visit to represent the patient's complex health condition, where medical events are encoded independently with sparse representations so that every code contributes equally to the current record. Most of them use multi-hot embeddings to encode the structured data sequences. However, the impact of each medical event varies from patient to patient, especially for patients with multimorbidity. For instance, during a visit, the health condition of a patient diagnosed with both chronic systolic heart failure and septic shock differs greatly from that of a patient diagnosed with both septic shock and acute respiratory failure. Previous methods ignore the compact intra-relationships among these medical events and the variable importance of each code for the patient.
On the other hand, such longitudinal patterns rely on historical health information and are powerless for short visit histories that lack historical records. As shown in Figure 1, we compute statistics on the MIMIC-III [32] dataset and observe that most patients have fewer than three visits. For each visit, we calculate the Jaccard similarity between the current medications and past medications. A large portion of prescribed medicines are similar to those recommended before, which means the results of medication recommendation rely heavily on historical medication records. Additionally, we conduct fine-grained statistics on the MIMIC-III dataset, as shown in Figure 2, calculating the proportion of medications that have appeared in history and the Jaccard similarity under various visit windows. In longer visit histories, a large portion of the drug sequence has been recommended before. However, short visit records, which are prevalent in real-world clinical scenarios, often lack the crucial historical medication information that could be referenced for treatment decisions. This phenomenon illustrates the urgent need for a more robust strategy that can model accurate representations of variable-length longitudinal sequences.
To overcome these challenges, we propose a novel Sample-adaptive Hierarchical medicAtion Prediction nEtwork, named SHAPE, to learn more accurate patient representations. SHAPE adopts a hierarchical patient representation framework. Concretely, we first tailor an intra-visit set encoder to learn the visit-level representation and then design an inter-visit longitudinal encoder to learn the patient-level longitudinal representation. By stacking the intra-visit set encoder and the inter-visit longitudinal encoder, the collaborative information latent in longitudinal historical interactions is explicitly encoded in a hierarchical manner. To enhance the ability to represent visit records of various lengths, we adopt a soft curriculum learning method that helps SHAPE learn these data patterns by assigning a difficulty weight to each sample. Experiments on a public dataset demonstrate the effectiveness of our proposed model.
The main contributions of this work are three-fold:
• We present a hierarchical encoder mechanism for medication recommendation that can extract more accurate representations from variable patient records. In particular, we first design an intra-visit set encoder to encode the medical events and obtain a visit-level representation, and then develop an inter-visit longitudinal encoder to learn patient-level longitudinal information.
• We design an adaptive curriculum learning module for variable-length patient visit records, especially short ones, which adapts the learning strategy over time and to the length of the patient record to improve the effectiveness of medication recommendation.
• Extensive experimental results on the public benchmark dataset validate the effectiveness and superiority of our proposed method.

II. RELATED WORK

A. Medication Recommendation
Existing medication recommendation algorithms can be categorized into instance-based and longitudinal approaches. Instance-based algorithms extract patient information only from the current visit. For example, LEAP [22] extracts the patient representation from the current visit record and decomposes medication recommendation into a sequential decision-making process. Longitudinal-based methods are designed to leverage temporal dependencies within the patient's historical information. For example, RETAIN [24] uses two-level attention to model longitudinal information based on recurrent neural networks (RNNs). GAMENet [26] uses augmented memory neural networks to fuse drug-drug interactions and stores the historical drug records to model the patient representation. MICRON [27] focuses on changes in patient health records and uses residual-based network inference to update the sequential representation. COGNet [29] conditionally generates medication combinations, either copying from the historical drug records or directly generating new drugs. These existing efforts, however, still suffer from a common limitation: they ignore that intra-visit medical events may have variable effects on the patient's health status. Most of them use multi-hot embeddings to encode the medical events in the current visit and ignore the differences among medical events in intra-visit records. In this paper, we propose a hierarchical architecture to learn a comprehensive patient representation: an intra-visit set encoder learns a more accurate representation of intra-visit medical events, and an inter-visit longitudinal encoder learns longitudinal information about the patient.

B. Curriculum learning
Conventional curriculum learning methods formalize the organized learning process of humans and animals, which proceeds gradually from simple examples to more complex ones [33]. Graves et al. derived two distinct indicators of the learning process (i.e., the rate of increase in prediction accuracy and the rate of increase in network complexity) as reward signals to maximize learning efficiency automatically [34]. Hacohen et al. sorted samples with different scoring functions to assign a learning difficulty to each instance [35]. Recently, curriculum learning has been applied to various medical tasks. Basu et al. propose a curriculum inspired by human visual acuity, which reduces texture biases for gallbladder cancer detection [36]. Guo et al. demonstrate the application of curriculum learning to drug molecular design [37]. Gu et al. utilize curriculum learning to improve the training efficiency of molecular graph learning [38]. According to Figures 1 and 2, short and new visit samples account for most of the dataset. Conventional longitudinal methods struggle to fit this pattern because they lack the flexibility to model scenarios in which patients do not have enough historical medication records and diagnostic information about their health condition. In this paper, we propose a sample-adaptive curriculum learning algorithm that automatically assigns a difficulty to each instance.

III. PROBLEM FORMULATION

A. Electronic Health Records (EHR)
Patient EHR data contains comprehensive medical information about the patient. Formally, the EHR of patient $j$ can be represented as a sequence $X_j = (x_j^1, x_j^2, \dots, x_j^T)$, where $T$ is the total number of visits of patient $j$. For a single visit $x_j^t$ of patient $j$ at the $t$-th visit, where $t \in \{1, 2, \dots, T\}$, we omit the patient index $j$ to simplify notation. Each visit record is then represented as $x^t = (D^t, P^t, M^t)$, where $D^t$, $P^t$, and $M^t$ denote the sets of diagnosis, procedure, and medication codes of the visit, respectively.
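As a concrete illustration of this data layout, the sketch below builds a toy EHR for one patient as a list of visits, each holding the three code sets. All code identifiers here are invented for illustration and are not taken from MIMIC-III.

```python
# Hypothetical toy EHR for one patient: a list of visits, each visit holding
# sets of diagnosis (D), procedure (P), and medication (M) codes.
# Every code string below is illustrative, not real patient data.
patient_j = [
    {"D": {"428.0", "785.52"}, "P": {"96.71"}, "M": {"B01A", "C07A"}},  # visit 1
    {"D": {"518.81"},          "P": {"96.04"}, "M": {"N02A"}},          # visit 2
]

T = len(patient_j)          # total number of visits for patient j
x_t = patient_j[1]          # record x^t of the t-th visit (t = 2 here)
```

Each visit is self-contained, so downstream encoders can process the three sets of a visit independently before the longitudinal step.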

B. DDI Graph
Medications prescribed together may interact with one another, and the adverse drug-drug interaction (DDI) graph records such adverse drug events. The DDI graph can be denoted as $G_d = (\mathcal{M}, E_d)$, where the node set $\mathcal{M}$ is the set of medications and $E_d$ is the edge set of known DDIs between pairs of drugs. An adjacency matrix $A_d \in \{0, 1\}^{|\mathcal{M}| \times |\mathcal{M}|}$ is defined to construct the graph, where $A_d[i, j] = 1$ means the $i$-th and $j$-th medications can interact with each other.
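Constructing $A_d$ from a list of known interacting pairs can be sketched as follows; the function name and the toy pair list are our own, chosen for illustration.

```python
import numpy as np

def build_ddi_adjacency(num_meds, ddi_pairs):
    """Build the symmetric DDI adjacency matrix A_d from known interacting pairs."""
    A = np.zeros((num_meds, num_meds), dtype=int)
    for i, j in ddi_pairs:
        A[i, j] = 1
        A[j, i] = 1  # drug-drug interactions are mutual
    return A

# Toy example: 4 medications, drugs (0, 2) and (1, 3) are known to interact.
A_d = build_ddi_adjacency(4, [(0, 2), (1, 3)])
```

The symmetry enforced here matches the definition above: $A_d[i, j] = A_d[j, i] = 1$ whenever drugs $i$ and $j$ interact.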

C. Medication Recommendation Problem
Given a patient's EHR sequence $[x^1, x^2, \dots, x^t]$ and the DDI graph $G_d$, the goal of medication recommendation is to generate the medication combination $\hat{Y}^{(t)}$ that is appropriate for the patient's current ($t$-th) visit.

IV. THE SHAPE FRAMEWORK
In this section, we present the technical details of the proposed SHAPE framework. As illustrated in Figure 3, our model includes three components: (1) an intra-visit set encoder that learns the visit-level representation of the patient from the EHR data; (2) an inter-visit longitudinal encoder that takes the visit-level representation as input to learn the longitudinal information of the patient; and (3) an adaptive curriculum learning module that cooperates with the prediction phase during training to dynamically assign a difficulty weight to each instance according to the patient's visit length, improving the effectiveness of medication recommendation. Finally, the recommended drugs are obtained from the sigmoid output representation.

A. Patient Representation
Patient representation aims to learn a dense vector that comprehensively represents the patient's status. During a clinical visit, physicians recommend medications based on the current diagnosis and procedure information; when the patient has historical visit records, the clinician also references the history of diagnoses, procedures, and medication records. Since SHAPE is designed for the generic patient, we use all three code types as model input in the following, with the medication codes always placed after the other two medical event types. Note that for a patient with only a single visit containing diagnosis and procedure records, we apply a padding embedding as the medication input.
1) Code-level embedding: To predict the medications of a multi-visit patient, we use $[D^t, P^t, M^{t-1}]$ as the current input, where $M^{t-1}$ is the previous medication record. We design three corresponding embedding tables, where $dim$ is the dimension of the embedding space. For the $t$-th visit, each medical event $d^{(t)} \in D^t$, $p^{(t)} \in P^t$, and $m^{(t-1)} \in M^{t-1}$ is transferred to the embedding space.
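The embedding lookup can be sketched as below; the table sizes, the random initialization, and the code indices are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                        # embedding dimension ("dim" in the text; illustrative)
num_diag_codes = 100           # diagnosis vocabulary size (illustrative)

# Diagnosis embedding table; procedure and medication tables are built the same way.
E_diag = rng.normal(size=(num_diag_codes, dim))

diag_codes = [3, 17, 42]       # indices of the codes in D^t (illustrative)
code_level = E_diag[diag_codes]  # one dense vector per medical code in the visit
```

The result is a small matrix of code-level vectors, one row per medical event, which becomes the input set for the intra-visit set encoder.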
2) Intra-visit Set Encoder: Unlike previous works [27], [28], which use the code embedding representation of the medical events directly as the patient representation, we employ the code-level embeddings as input to a set encoder to learn code-level relationships and then integrate the code-level information into the visit-level representation. Inspired by the Set-Transformer [39], we utilize the inducing-point method to compress medical code representations into a more compact space for modeling the impact of medical events. The set encoder contains two Induced Set Attention Blocks (ISAB). In ISAB, along with the set $X \in \mathbb{R}^{m \times d}$, a new trainable parameter matrix $I \in \mathbb{R}^{n \times d}$, called the inducing points, is defined. The ISAB has two major sub-layers, Multi-Head Attention (MHA) and a row-wise FeedForward layer (rFF), combined in the Multihead Attention Block (MAB):

$$\mathrm{MAB}(X, Y) = \mathrm{LN}(H + \mathrm{rFF}(H)), \quad H = \mathrm{LN}(X + \mathrm{MHA}(X, Y, Y)),$$

where LN is the layer normalization operation. The ISAB is then defined as:

$$\mathrm{ISAB}(X) = \mathrm{MAB}(X, \mathrm{MAB}(I, X)),$$

and the set encoder is defined as:

$$\mathrm{SetEnc}_*(X) = \mathrm{ISAB}(\mathrm{ISAB}(X)),$$

where $* \in \{d, p, m\}$.
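A minimal NumPy sketch of the ISAB computation follows. It makes several simplifying assumptions: single-head attention stands in for MHA, a `tanh` stands in for the row-wise feed-forward layer, there are no learned projection weights, and the inducing points are fixed rather than trained. It shows the data flow, not a trainable implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention (stand-in for MHA)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mab(X, Y):
    """Multihead Attention Block: H = LN(X + Att(X, Y, Y)); out = LN(H + rFF(H))."""
    H = layer_norm(X + attention(X, Y, Y))
    return layer_norm(H + np.tanh(H))      # tanh as a toy row-wise feed-forward

def isab(X, I):
    """Induced Set Attention Block: compress set X through n inducing points I."""
    H = mab(I, X)                          # inducing points attend to the m codes
    return mab(X, H)                       # codes read the compressed summary back

rng = np.random.default_rng(0)
m, n, d = 5, 2, 8                          # 5 medical codes, 2 inducing points
X = rng.normal(size=(m, d))                # code-level embeddings of one visit
I = rng.normal(size=(n, d))                # inducing points (trainable in the model)
out = isab(isab(X, I), I)                  # two stacked ISABs, as in the set encoder
```

Because the codes only interact through the $n$ inducing points, the cost per block is $O(nm)$ rather than the $O(m^2)$ of plain self-attention, which is the "compact" property the paper relies on.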
Given the code-level embedding representation $E_d^{(t)}$ of the diagnosis codes, the output of the diagnosis set encoder is formulated as $O_d^{(t)} = \mathrm{SetEnc}_d(E_d^{(t)})$. Similarly, the outputs of the procedure and medication set encoders are $O_p^{(t)} = \mathrm{SetEnc}_p(E_p^{(t)})$ and $O_m^{(t-1)} = \mathrm{SetEnc}_m(E_m^{(t-1)})$. After obtaining the code-level set representations of the three medical event types, we combine them into the visit-level representation $V^{(t)}$, which describes the health status of the patient at the current visit:

$$V^{(t)} = [\bar{O}_d^{(t)}, \bar{O}_p^{(t)}, \bar{O}_m^{(t-1)}],$$

where $\bar{O}$ denotes the summation of the code-level representations within each set and $[\cdot]$ is the concatenation operation.
3) Inter-visit Longitudinal Encoder: Previous works usually employ recurrent neural networks (RNNs) to model the dynamic patient history for learning longitudinal patient representations. Given the success of the attention mechanism in sequence tasks [40]-[42], it is helpful to combine the attention mechanism with the RNN pattern. We take inspiration from the Block-Recurrent Transformer (BRT) [43], which applies a transformer layer in a recurrent fashion along the sequence input. Differing from the basic BRT, we follow GPT [42] and add a mask vector to prevent information leakage while modeling the patient's longitudinal visit records; we name the resulting block the Recurrent Attention Block (RAB). The RAB mainly consists of an update stream between the hidden state vector and the visit-level representation: the hidden state vector carries the patient's temporal information, and the visit-level representation updates its information based on the historical state representation. For the state vector, the update function is formulated as:

$$C_{t+1} = C_t \odot g + \mathrm{MLP}(C_t') \odot (1 - g),$$

where MLP is a multi-layer perceptron, $\odot$ is the Hadamard product, $g$ is a learned gate, and $C_t'$ is the combination of masked self-attention (MSA) on the current hidden state $C_t$ and masked cross-attention (MCA) with the visit-level representation $V^{(t)}$:

$$C_t' = [\mathrm{MSA}(C_t), \mathrm{MCA}(C_t, V^{(t)})].$$

The update stream of the visit-level representation selects longitudinal information from the hidden state and visit-level information from the current visit, and is defined as:

$$V^{(t)} \leftarrow \mathrm{MLP}(V^{(t)\prime}),$$

where MLP is a multi-layer perceptron.
Here $V^{(t)\prime}$ is the concatenation of masked self-attention on the visit-level representation and masked cross-attention with the current hidden state:

$$V^{(t)\prime} = [\mathrm{MSA}(V^{(t)}), \mathrm{MCA}(V^{(t)}, C_t)].$$

A central feature of this design is that a considerable portion of the information-update responsibility is delegated to the process that generates the attention weights.

4) Adaptive Curriculum Learning Module: This module includes the prediction layer and the adaptive curriculum manager. After obtaining the updated patient-level representation $V^{(t)}$, the final medication representation is generated through an output layer:

$$\hat{y}^{(t)} = \sigma(W_o V^{(t)} + b_o),$$

where $\sigma$ is the sigmoid function and $W_o$, $b_o$ are learnable parameters.

• Supervised Multi-label Classification Loss. The recommendation of medication combinations can be treated as a multi-label prediction task. We use the binary cross-entropy loss $l_{bce}$ as the multi-label task loss function:

$$l_{bce} = -\sum_{i=1}^{|\mathcal{M}|} \left[ m_i^{(t)} \log \hat{y}_i^{(t)} + (1 - m_i^{(t)}) \log \left(1 - \hat{y}_i^{(t)}\right) \right],$$

where $m_i^{(t)}$ and $\hat{y}_i^{(t)}$ denote the ground-truth label and the predicted probability of the medical code at the $i$-th coordinate of the $t$-th visit.
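The multi-label BCE computation can be sketched directly; the toy multi-hot target and probabilities below are illustrative, and the loss is averaged over the vocabulary for readability.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Multi-label binary cross entropy over the medication vocabulary."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # multi-hot ground-truth medications m^(t)
y_prob = np.array([0.9, 0.1, 0.8, 0.2])   # sigmoid outputs y_hat^(t)
loss = bce_loss(y_true, y_prob)
```

Confident correct predictions give a small loss; an uninformative predictor (all probabilities 0.5) gives a strictly larger one.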
• Drug-Drug Interaction Loss. The DDI loss is designed to control the DDI rate of the generated medication combinations. Following previous work [28], it is formulated as:

$$l_{ddi} = \frac{1}{|\mathcal{M}|^2} \sum_{i=1}^{|\mathcal{M}|} \sum_{j=1}^{|\mathcal{M}|} A_d \odot \left( \hat{y}^{(t)} \left(\hat{y}^{(t)}\right)^{\top} \right)_{ij},$$

where $\odot$ is the Hadamard product; the loss penalizes the expected co-prescription probability mass placed on known DDI pairs.
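A sketch of this penalty, under the assumption (stated above) of a SafeDrug-style formulation averaging $A_d \odot (\hat{y}\hat{y}^{\top})$ over all drug pairs:

```python
import numpy as np

def ddi_loss(y_prob, A_d):
    """Expected pairwise-interaction mass of predicted medication probabilities.

    Assumed SafeDrug-style form: mean of A_d (Hadamard) y y^T over all pairs.
    """
    pairwise = np.outer(y_prob, y_prob)      # co-prescription probabilities
    return float((A_d * pairwise).sum() / A_d.size)

A_d = np.array([[0, 1],
                [1, 0]])                     # drugs 0 and 1 are a known DDI pair
loss_risky = ddi_loss(np.array([0.9, 0.9]), A_d)  # both drugs likely prescribed
loss_safe = ddi_loss(np.array([0.9, 0.0]), A_d)   # interacting partner suppressed
```

Suppressing either drug of an interacting pair drives the penalty to zero, which is exactly the gradient signal used to control the DDI rate.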
The final training objective combines the two losses:

$$l = (1 - \alpha) \, l_{bce} + \alpha \, l_{ddi},$$

where $\alpha$ is a pre-defined hyperparameter. By presetting different values of $\alpha$, our SHAPE model can meet different levels of DDI requirements (the details of selecting $\alpha$ are given in the DISCUSSION section).

• Adaptive Curriculum Manager. As shown in Figure 2(a), although the medication combinations of most long visit records have been recommended before and are easy to predict, short records lacking historical medication information are the most frequent situation in real-life clinical scenarios and may be hard to predict accurately.
To address this issue, we propose an adaptive curriculum manager that adaptively assigns a complexity coefficient to each patient and adopts the curriculum learning framework to train our SHAPE model. Specifically, we incorporate the patient's visit length into the training schema by calculating $\frac{I + l_t}{I_{max}}$ (i.e., Eq. (28)) to adjust the learning rate of the Adam [44] optimizer. Intuitively, by assigning a lower learning rate to shorter patient visit lengths, the model is guided to learn more elaborate parameter patterns for those shorter visit records. In the adaptive curriculum manager, $\epsilon$ is a constant added to the denominator for numerical stability, $\gamma$ is the learning rate, $I$ is the current training iteration number, $l_t$ is the current visit length, $I_{max}$ is the pre-defined maximum iteration number, $\mu_t$ and $\eta_t$ are the first- and second-moment estimates of Adam, $\beta_1$ and $\beta_2$ are the moment coefficients, $f(\theta)$ is the objective function, $\theta$ are the parameters to be updated, and $\nabla(\cdot)$ is the derivative operation. The adaptive curriculum manager is bound to the parameter update: Eq. (28) is the critical step of the optimization of the objective, using the current iteration and the current patient visit length to select the learning difficulty automatically.
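The learning-rate adjustment can be sketched as below. The cap at 1 is our assumption (the paper states only that the factor $(I + l_t)/I_{max}$ scales the Adam learning rate, with shorter visits receiving lower rates early in training).

```python
def curriculum_lr(base_lr, iteration, visit_len, max_iter):
    """Scale the base Adam learning rate by (I + l_t) / I_max, capped at 1.

    The cap at 1 is an assumption for this sketch; shorter visit lengths
    receive a smaller factor early in training, as described in the text.
    """
    return base_lr * min(1.0, (iteration + visit_len) / max_iter)

lr_short = curriculum_lr(1e-3, iteration=10, visit_len=1, max_iter=100)
lr_long = curriculum_lr(1e-3, iteration=10, visit_len=8, max_iter=100)
```

Early in training, a patient with one visit trains with a smaller step than a patient with eight visits; once the iteration count approaches $I_{max}$, all samples recover the full base learning rate.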

B. Inference
SHAPE is trained end-to-end. In the inference phase, the safe drug combination recommendation is generated from the sigmoid output $\hat{y}^{(t)}$, where we fix the threshold at 0.5 to predict the label set. The final predicted medication combination then corresponds to $\hat{Y}^{(t)} = \{ i \mid \hat{y}_i^{(t)} \geq 0.5 \}$.

V. EXPERIMENTS

In this section, we introduce the experimental details and conduct evaluation experiments to demonstrate the effectiveness of our SHAPE model.

A. Dataset
We use the EHR data from the Medical Information Mart for Intensive Care (MIMIC-III), which contains 46,520 patients and 58,976 hospital admissions from 2001 to 2012. For a fair comparison, we conduct experiments on the benchmark released by COGNet [29], which is based on the MIMIC-III dataset. Following COGNet, we select the Top-40 severity DDI types from TWOSIDES [45] and convert the drug codes into ATC third-level codes to align with the DDI graph nodes. Finally, we follow the COGNet setting and divide the dataset into training, validation, and test sets at a ratio of 4:1:1. The statistics of the post-processed data are reported in Table 1.

B. Metrics
We use three efficacy metrics, Jaccard, F1, and Precision-Recall Area Under Curve (PRAUC), to evaluate recommendation efficacy. Additionally, we report the DDI rate and the number of predicted medications, following previous works [28], [29].
The Jaccard coefficient for a patient is calculated as:

$$\mathrm{Jaccard} = \frac{1}{T} \sum_{t=1}^{T} \frac{|M^{(t)} \cap \hat{Y}^{(t)}|}{|M^{(t)} \cup \hat{Y}^{(t)}|},$$

where $M^{(t)}$ is the ground-truth medication set at the $t$-th visit and $\hat{Y}^{(t)}$ is the predicted medication combination.
The F1 score of a patient is calculated from the set-level precision $P^{(t)}$ and recall $R^{(t)}$ of each visit:

$$\mathrm{F1} = \frac{1}{T} \sum_{t=1}^{T} \frac{2 \, P^{(t)} R^{(t)}}{P^{(t)} + R^{(t)}}.$$

The PRAUC is calculated from the predicted probabilities of the ground-truth medication codes:

$$\mathrm{PRAUC} = \sum_{k} P(k)_t \left( R(k)_t - R(k-1)_t \right),$$

where $P(k)_t$ and $R(k)_t$ are the precision and recall at the cut-off $k$-th threshold in the ordered retrieval list. The DDI rate measures the interactions within the recommended medication combinations and is calculated as the fraction of recommended medication pairs that appear as known DDI pairs:

$$\mathrm{DDI\ rate} = \frac{\sum_{t} \sum_{i, j \in \hat{Y}^{(t)},\, i \neq j} A_d[i, j]}{\sum_{t} \left| \{ (i, j) : i, j \in \hat{Y}^{(t)},\, i \neq j \} \right|}.$$
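For a single visit, the set-based metrics above reduce to a few lines; the code sets and the toy DDI pair list below are illustrative.

```python
def jaccard(truth, pred):
    """Jaccard similarity between ground-truth and predicted medication sets."""
    return len(truth & pred) / len(truth | pred) if truth | pred else 0.0

def f1(truth, pred):
    """F1 = 2PR / (P + R) from set-level precision and recall."""
    if not pred or not truth:
        return 0.0
    p = len(truth & pred) / len(pred)   # precision
    r = len(truth & pred) / len(truth)  # recall
    return 2 * p * r / (p + r) if p + r else 0.0

def ddi_rate(pred, ddi_pairs):
    """Fraction of recommended medication pairs that are known DDI pairs."""
    pairs = [(a, b) for a in pred for b in pred if a < b]
    if not pairs:
        return 0.0
    hits = sum((a, b) in ddi_pairs or (b, a) in ddi_pairs for a, b in pairs)
    return hits / len(pairs)

truth, pred = {"B01A", "N02A", "C07A"}, {"B01A", "N02A"}
```

Averaging these per-visit scores over a patient's $T$ visits, and then over patients, yields the reported numbers.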

C. Baseline
We compare SHAPE with the following methods from different perspectives. Conventional machine learning: Logistic Regression (LR). Instance-based methods: LEAP [22] and 4SDrug [23]. Longitudinal-based methods: RETAIN [24], DMNC [25], GAMENet [26], MICRON [27], SafeDrug [28], and COGNet [29]. Specifically, LEAP [22] uses an attention mechanism to encode the diagnosis sequence step by step. 4SDrug [23] designs an attention-based method to augment the symptom representation and leverages the DDI graph while generating the current drug sequence. RETAIN [24] employs an attention gate mechanism to model the patient's longitudinal information. DMNC [25] proposes a memory network to capture more interactions in the patient's EHR record. GAMENet [26] combines an RNN and a graph neural network to recommend medication combinations. MICRON [27] leverages a residual-based network to update the patient representation according to new feature changes. SafeDrug [28] utilizes drugs' molecular structures for medication recommendation. COGNet [29] proposes a conditional generation model to copy or predict drugs according to the patient representation.

D. Parameter Setting
Here, we list the implementation details of SHAPE. We set the hidden dimension to 128 and use the Adam optimizer [44] with an initial learning rate of 1 × 10^-3 for 50 epochs. We fix the random seed to 2023 to ensure the reproducibility of the model. Our model is implemented in PyTorch 1.7.1 on Python 3.8.13 and trained on two GeForce RTX 3090 GPUs, with an early-stopping mechanism. For a fair comparison, in the testing stage we follow the previous work COGNet [29], which randomly samples 80% of the test data for one round of evaluation; we repeat this process 10 times and report the mean and standard deviation as the final result.

E. Result Analysis
As shown in Table 2, our proposed model SHAPE outperforms all baselines, with Jaccard, F1, and PRAUC improved by nearly 2% over the previous best model. The conventional LR and the instance-based methods perform poorly, as they only consider the patient's health condition at the current visit. The performances of RETAIN and DMNC are comparable because both use RNN architectures to capture longitudinal information. GAMENet introduces an additional DDI graph and fuses it with the EHR co-occurrence graph, resulting in further performance improvement. SafeDrug leverages drugs' molecular structures to improve the performance of medication recommendation. Unlike most longitudinal algorithms, which focus on the historical record, MICRON uses a residual network to capture changes in medications. COGNet proposes a copy-or-predict mechanism to generate the medication sequence, motivated by the statistic that most medication codes have been recommended in historical EHR records. However, it fails to consider short visit histories, which may not provide enough historical reference, especially for newly and secondly admitted patients.
Compared with the baseline methods, our SHAPE model achieves state-of-the-art performance. On the one hand, its intra-visit set encoder automatically collects the most informative medical events of each patient. On the other hand, its inter-visit longitudinal encoder captures the longitudinal pattern, inheriting the merits of both the RNN and the attention mechanism. Besides, our adaptive curriculum manager assigns the difficulty of each sample according to its visit length. Hence, SHAPE performs better than the other methods.
We also notice in Table 2 that 4SDrug achieves the lowest and seemingly most attractive DDI rate among the predicted medication combinations. However, combined with the results shown in Figure 4, 4SDrug probably achieves the lowest DDI rate because it predicts fewer medication codes than the other methods, since we observe that the DDI rate increases with the number of predicted medications. The same low-DDI phenomenon appears in the MICRON model, which also predicts few medications.
Furthermore, we notice that the MIMIC-III dataset itself has an average DDI rate of 0.0875, which means DDI phenomena are common in real-world practice. Against this backdrop, our SHAPE achieves both a lower DDI rate and higher accuracy of medication recommendation, indicating the effectiveness of our proposed method.
To further validate that our SHAPE model can better handle short and even new visits and recommend medications effectively, we investigate the performance of different models across visit positions. As shown in the right panel of Figure 1, the MIMIC-III dataset exhibits a severe long-tail phenomenon, and most patients have fewer than five admission records. We take patients' first five visit records in the test set for visualization. We compare SHAPE with COGNet and 4SDrug because (1) COGNet achieves the best performance among existing methods, and (2) 4SDrug uses a set-oriented method to learn code-level representations and uses the DDI loss to control the predicted output. As shown in Figure 4, our SHAPE model is superior to COGNet on all three metrics (i.e., Jaccard, F1, and PRAUC). In particular, SHAPE achieves higher performance on short visit lengths and shows an increasing trend. These results directly demonstrate the power of SHAPE on the problem highlighted in Figure 1, in which short visit records are the critical samples; higher accuracy on these samples is helpful for most situations in real-world clinical practice. On the contrary, 4SDrug is consistently below COGNet and SHAPE, likely because it is an instance-based method that ignores temporal longitudinal information.

VI. DISCUSSION
Upon analyzing the results in Table 2, we conclude that our proposed model SHAPE achieves the best performance compared with LR and the instance-based and longitudinal-based methods. The success of SHAPE is ascribed to the three proposed modules (i.e., the Intra-visit Set Encoder (ISE), the Inter-visit Longitudinal Encoder (ILE), and the Adaptive Curriculum Learning Module (ACLM)), and it achieves a lower DDI rate with our proposed combined loss function. To verify the effectiveness of each module, we design the following ablation experiments. SHAPE w/o ISE removes the intra-visit set encoder and sums the code-level representations directly into the visit-level representation. SHAPE w/o ILE replaces the inter-visit longitudinal encoder with a recurrent neural network for learning the longitudinal information. SHAPE w/o ACLM removes the step of Eq. (28) and uses the basic Adam optimizer to optimize SHAPE. SHAPE w/o DDIloss uses only the multi-label classification loss as the training objective. To investigate the effectiveness of our proposed compact intra-visit set encoder, we also compare against self-attention (SA): SHAPE w SA replaces the set encoder with self-attention.
Table 3 shows the results for the different variants of SHAPE. As expected, removing any of the three proposed modules significantly deteriorates the complete SHAPE model. The DDI rate of SHAPE w/o DDIloss illustrates the effectiveness of the combined loss function. Overall, SHAPE outperforms all variant models, which means each component is integral to SHAPE. Compared with SHAPE, SHAPE w SA drops in performance on all metrics, demonstrating that a more compact encoder is better suited to modeling the complex medical event code sequences. Moreover, the performance drop of SHAPE w/o ACLM observed in Table 3 indicates the importance of using visit length as guidance to assign each patient's complexity coefficient in the model. To explore the impact of the ACLM module, we conducted experiments to visualize the loss trajectories of SHAPE and SHAPE w/o ACLM. As shown in Figure 5, compared to SHAPE w/o ACLM, SHAPE's loss decreases markedly and converges quickly. This demonstrates the importance of the ACLM module, which automatically assigns a difficulty coefficient to each sample and learns parameters better suited to various visit records.
Furthermore, to achieve a satisfactory trade-off in the DDI rate of the medication combinations generated by SHAPE, we explore the hyperparameter α in Eq. (26). The details are shown in the second half of Table 3, from which we conclude that: (1) the DDI rate of the predicted medication combinations gradually increases as α declines; (2) for α > 0.05, the other metrics are suppressed, indicating that both the DDI rate and the accuracy of the predicted medication combinations decrease almost linearly with the penalty weight, whereas for α < 0.05 the performance of SHAPE fluctuates. Combined with the previously mentioned fact that the MIMIC-III dataset itself has a DDI rate of 0.0875, this suggests that the lowest DDI rate is not necessarily the optimal choice for clinical practice.
To intuitively demonstrate the advantages of SHAPE over the two baseline models, we analyze several examples of predicted results. We choose short or new visit patients to demonstrate model behavior on harder cases. Due to space constraints, we use International Classification of Diseases (ICD) codes to represent the diagnosis and procedure information and ATC codes to represent the medications. As shown in Table 4, Case 1 is a newly admitted patient, for whom the doctor prescribed the ground-truth medications based on the diagnosis and procedure information of the patient's current visit. Case 2 is a secondary-admission patient, and we list the second record; here the physician combined the current health condition with the patient's historical record to prescribe medications. Overall, SHAPE performed best, with 14 and 19 correct medications in the two cases, respectively, and achieved the fewest misses and errors. Furthermore, we notice that in the new-visit Case 1, the instance-based method 4SDrug achieves performance comparable to COGNet, probably because the instance-based approach is well suited to the single-visit setting.
As shown in Figure 6, we visualize the DDI status of each model in the two cases, where the symmetric matrix shows the drug-drug relationships within the recommended medication combination. The point GT normal means there is no DDI in the ground-truth medication combination, and GT ddi means there is probably a DDI in the ground-truth combination. Empty rows and columns indicate codes that do not appear in the ground-truth medications. We notice that in Case 1 our SHAPE generates only two medication pairs that may suffer from drug-drug interactions; in contrast, 4SDrug and COGNet generate five pairs (i.e., [A01A, R03A], [A06A, R01A], [N02A, B01A], [B01A, N02A], [B01A, R01A]) and eight pairs (i.e., [A01A, R03A], [A06A, R01A], [C07A, A12B], [C07A, R01A], [A12B, C07A], [N02A, B01A], [B01A, N02A], [B01A, R01A]), respectively. In the DDI matrix of Case 2, we find that the DDI phenomenon in this real-life scenario exceeds ten medication pairs. Our SHAPE hits most of the situations found in the ground-truth medications prescribed by doctors, hinting that SHAPE can recommend medication combinations in a safer way.
There are also several limitations to the current study. First, we used only diagnosis and procedure information as side information to infer the medications, ignoring other sources such as vital signs and laboratory test records. Second, we evaluated SHAPE only on a public dataset, which limits the generalizability of the model.

VII. CONCLUSION
In this paper, we proposed a sample-adaptive hierarchical medication prediction network, named SHAPE, to better learn accurate patient representations. Concretely, we first presented an intra-visit set encoder to capture relationships among medical events from the code-level perspective, which is usually ignored in most current works. Then, we developed an inter-visit longitudinal encoder to learn the visit-level longitudinal representation, which inherits the merits of both attention and the RNN. Additionally, we designed an adaptive curriculum learning module that draws on each patient's individual characteristics to automatically assign a per-patient difficulty, improving the performance of medication recommendation. Experimental results on the public benchmark dataset demonstrate that SHAPE outperforms existing medication recommendation algorithms by a large margin. We also investigated the performance on short-visit and new-visit samples, which shows that SHAPE can effectively handle medication recommendation for patients with short admission histories. Further ablation results also confirm the effectiveness of each module of our proposed SHAPE.

Fig. 1. The histogram of visit counts of the MIMIC-III dataset (left) and the histogram of the Jaccard similarity between current medications and historical medications (right).

Fig. 2. The statistics of (a) medication overlap rate and (b) Jaccard coefficients across visits with different window sizes.

Fig. 3. The framework of our proposed SHAPE. There are three components: (1) the intra-visit set encoder captures the intra-relationship of code-level medical events and summarizes it into the current visit-level representation; (2) the inter-visit longitudinal encoder models the longitudinal information of the patient; (3) the adaptive curriculum learning module automatically assigns each sample's difficulty according to the patient's visit length.
where A_d is the known DDI adjacency matrix, Ŷ(t)_i denotes the i-th recommended medication in the t-th visit, and 1{•} is the indicator function, returning 1 when the condition {•} is true and 0 otherwise.
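The DDI rate defined by this indicator function can be sketched in a few lines; `ddi_rate` and its arguments are hypothetical names, and the computation simply measures the fraction of recommended medication pairs flagged in the known DDI matrix A_d:

```python
from itertools import combinations

def ddi_rate(recommended_sets, ddi_matrix):
    """Fraction of recommended medication pairs flagged in the DDI matrix.

    recommended_sets: list of per-visit lists of medication indices.
    ddi_matrix: symmetric 0/1 matrix; ddi_matrix[i][j] == 1 means a known
    interaction between medications i and j (the matrix A_d in the text).
    """
    total_pairs, ddi_pairs = 0, 0
    for meds in recommended_sets:
        for i, j in combinations(meds, 2):
            total_pairs += 1
            ddi_pairs += ddi_matrix[i][j]  # indicator 1{A_d[i, j] = 1}
    return ddi_pairs / total_pairs if total_pairs else 0.0

# toy example: only medications 0 and 1 interact
A_d = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
print(ddi_rate([[0, 1, 2]], A_d))  # pairs (0,1),(0,2),(1,2): one flagged
```

A lower value indicates a safer recommended combination, which is how the metric is read in the case study above.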

Fig. 4. The performance of the various models across different visit lengths.

Fig. 5. Loss comparison between SHAPE and SHAPE w/o ACLM across different numbers of training epochs.

Fig. 6. Visualization of DDI in the case study. Case 1 is a new admission patient; Case 2 is a secondary admission patient. In the chessboard, a red square corresponds to a DDI in the ground truth; a green point indicates that no DDI appears in the ground truth; a blue circle corresponds to a DDI in the medications predicted by COGNet; an inverted yellow triangle corresponds to a DDI in the medications predicted by 4SDrug; and a purple cross corresponds to a DDI in the medications predicted by SHAPE. Best viewed in color.
Each visit x_t = [D_t, P_t, M_t], where D_t ⊆ {d_1, d_2, ..., d_|D|} denotes the set of diagnoses appearing in the t-th visit, P_t ⊆ {p_1, p_2, ..., p_|P|} denotes the set of procedures, and M_t ⊆ {m_1, m_2, ..., m_|M|} denotes the set of medications appearing in the t-th visit. |D|, |P|, and |M| indicate the cardinalities of the corresponding code sets. The input consists of the patient's records [x_1, x_2, ..., x_t] and the DDI graph G_d. For a multi-visit patient, the input includes the current diagnosis and procedure codes [D_t, P_t] and the historical records [x_1, x_2, ..., x_{t-1}]. Note that, for a new-visit patient, there are only the current diagnosis and procedure codes [D_1, P_1]. The goal is to train a model that effectively recommends multiple medications by generating a multi-label output ŷ_t ⊆ {m_1, m_2, ..., m_|M|} for this patient.
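The multi-label output ŷ_t over the medication vocabulary is conventionally encoded as a multi-hot vector; a minimal sketch (the function name is hypothetical, not the paper's implementation):

```python
def multi_hot(med_indices, num_meds):
    """Encode a set of medication indices as a multi-hot label vector y_t.

    med_indices: indices of the medications prescribed in visit t.
    num_meds: size of the medication vocabulary |M|.
    """
    y = [0] * num_meds
    for m in med_indices:
        y[m] = 1
    return y

# a visit prescribing medications m_1 and m_3 out of |M| = 5
print(multi_hot({0, 2}, 5))  # -> [1, 0, 1, 0, 0]
```

Each position of the vector corresponds to one medication code, so the recommendation task reduces to multi-label binary classification over this vector.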
• Combined Loss Functions. During training, we noticed that the accuracy and the DDI rate often increase together, mainly due to drug-drug interactions in real-world clinical scenarios. It is therefore important to balance the multi-label classification loss and the DDI loss. Finally, we apply a penalty weight α to the DDI loss during training. The final loss function is defined as:
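A minimal sketch of such a combined objective, assuming a binary cross-entropy term for the multi-label classification loss and an expected pairwise DDI penalty weighted by α (the function names and exact loss forms are illustrative assumptions, not the paper's implementation):

```python
import math

def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over the medication vocabulary."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y_true, y_prob)) / len(y_true)

def ddi_loss(y_prob, ddi_matrix):
    """Expected pairwise DDI mass of the predicted medication probabilities."""
    n = len(y_prob)
    return sum(y_prob[i] * y_prob[j] * ddi_matrix[i][j]
               for i in range(n) for j in range(i + 1, n))

def combined_loss(y_true, y_prob, ddi_matrix, alpha=0.1):
    """L = L_bce + alpha * L_ddi, with alpha the DDI penalty weight."""
    return bce_loss(y_true, y_prob) + alpha * ddi_loss(y_prob, ddi_matrix)
```

With α = 0 the objective reduces to plain multi-label classification; increasing α trades a little accuracy for a lower DDI rate, which is the balance described above.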

TABLE II: PERFORMANCE COMPARISON ON THE MIMIC-III DATASET. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD.

TABLE III: ABLATION STUDY FOR DIFFERENT SHAPE MODULES ON THE MIMIC-III DATASET.