COVID-19-CT-CXR: A Freely Accessible and Weakly Labeled Chest X-Ray and CT Image Collection on COVID-19 From Biomedical Literature

The latest threat to global health is the COVID-19 outbreak. Although there exist large datasets of chest X-rays (CXR) and computed tomography (CT) scans, few COVID-19 image collections are currently available due to patient privacy. At the same time, there is a rapid growth of COVID-19-relevant articles in the biomedical literature, including those that report findings on radiographs. Here, we present COVID-19-CT-CXR, a public database of COVID-19 CXR and CT images, which are automatically extracted from COVID-19-relevant articles from the PubMed Central Open Access (PMC-OA) Subset. We extracted figures, associated captions, and relevant figure descriptions in the article and separated compound figures into subfigures. Because a large portion of figures in COVID-19 articles are not CXR or CT, we designed a deep-learning model to distinguish them from other figure types and to classify them accordingly. The final database includes 1,327 CT and 263 CXR images (as of May 9, 2020) with their relevant text. To demonstrate the utility of COVID-19-CT-CXR, we conducted four case studies. (1) We show that COVID-19-CT-CXR, when used as additional training data, is able to contribute to improved deep-learning (DL) performance for the classification of COVID-19 and non-COVID-19 CT. (2) We collected CT images of influenza, another common infectious respiratory illness that may present similarly to COVID-19, and fine-tuned a baseline deep neural network to distinguish a diagnosis of COVID-19, influenza, or normal or other types of diseases on CT. (3) We fine-tuned an unsupervised one-class classifier from non-COVID-19 CXR and performed anomaly detection to detect COVID-19 CXR. (4) From text-mined captions and figure descriptions, we compared 15 clinical symptoms and 20 clinical findings of COVID-19 versus those of influenza to demonstrate the disease differences in the scientific publications. 
Our database is unique, as the figures are retrieved along with relevant text with fine-grained descriptions, and it can be extended easily in the future. We believe that our work is complementary to existing resources and hope that it will contribute to medical image analysis of the COVID-19 pandemic. The dataset, code, and DL models are publicly available at https://github.com/ncbi-nlp/COVID-19-CT-CXR.


INTRODUCTION
The latest threat to global health is the ongoing outbreak of COVID-19, caused by SARS-CoV-2 [1]. So far, pneumonia appears to be the most frequent and serious manifestation, and major complications, such as acute respiratory distress syndrome (ARDS), can present shortly after the onset of symptoms, contributing to the high mortality rate of COVID-19 [2], [3], [4]. Chest X-rays (CXR) and chest computed tomography (CT) scans are playing a major part in the detection and monitoring of these respiratory manifestations. In some cases, CT scans have shown abnormal findings in patients prior to the development of symptoms and even before the detection of the viral RNA [5], [6], [7].
Given the shortage of specialists with experience in COVID-19 diagnosis, there has been a concerted move toward the adoption of artificial intelligence (AI), particularly deep-learning-based methods, in COVID-19 diagnosis and prognosis, in which well-annotated data always play a critical role [8]. Although there exist large public datasets of CXR [9], [10], [11] and CT [12], there are few collections of COVID-19 images to effectively train a deep neural network [13], [14], [15]. Nevertheless, we have seen a growing number of COVID-19-relevant articles in PubMed [16], [17]. In addition, there is a recent COVID-19 initiative to expand access via the PubMed Central Open Access (PMC-OA) Subset to coronavirus-related publications and associated data (https://www.ncbi.nlm.nih.gov/pmc/about/covid-19-faq/). As a result, more articles (> 10,000 as of May 9, 2020) relevant to the COVID-19 pandemic or prior coronavirus research were added through PMC-OA with a free-reuse license for secondary analysis.
Non-textual components (e.g., figures and tables) provide key information in many scientific documents and are considered in many tasks, including search engine and knowledge base construction [18], [19]. As such, we have recently seen a growing interest in mining figures within scientific documents [20], [21], [22]. In the medical domain, figures are also of topical interest because they often contain graphical images, such as CXR and CT [23], [24]. Extracting CXR and CT from biomedical publications, however, is neither well studied nor well addressed.
For the above reasons, there is an unmet need to construct a COVID-19 image dataset from PMC-OA that allows researchers to freely access the images along with the text that describes them. In this paper, we thus introduce an effective framework to construct a CXR and CT database from PMC-OA and present a public database, termed COVID-19-CT-CXR. In contrast to previous approaches that relied solely on the manual submission of medical images to a repository, in this work, figures are automatically collected by integrating medical imaging and natural-language processing with limited human annotation effort. In addition, figures in this database are paired with text that describes these cases in detail, a feature not found in other such datasets.
The framework consists of three steps. First, we extracted figures, associated captions, and relevant figure descriptions in the PMC-OA article. Such extraction is non-trivial due to the diverse layout and large volume of articles in the PMC-OA subset. Second, we separated compound figures into subfigures, as medical figures often comprise multiple image panels [21], [24]. Third, we classified subfigures into CXR, CT, or others because a large portion of figures in COVID-19 articles are not CXR or CT. To this end, we designed a deep-learning model to distinguish them from other figure types and to classify them accordingly.
We further demonstrate the utility of COVID-19-CT-CXR through a series of case studies. First, using this database as additional training data, we show that existing deep neural networks benefit in the task of COVID-19/non-COVID-19 classification of CT images. Second, we demonstrate that the database can be used to develop a baseline model to distinguish COVID-19, influenza, and other CT, a less-studied topic. Third, we train an unsupervised one-class classifier from non-COVID-19 CXRs and perform anomaly detection to detect COVID-19 CXRs. Fourth, we extract symptoms and clinical findings from the text, using natural-language processing methods. The symptoms and clinical findings not only confirm the results that radiologists have found but also potentially identify other findings that may have been overlooked.
The remainder of the paper is organized as follows. Section 2 presents the material and methods to build the dataset. Section 3 contains the details of the statistics of the dataset, results of the image type classification, and the use cases. Finally, Sections 4 and 5 provide the discussion, conclusions, and recommendations for future work.

Relevant articles are identified and curated with assistance from an automated machine-learning and text-classification algorithm. As of May 9, 2020, there were 5,381 PMC-OA articles in the collection (Table 1). The topics of articles ranged from diagnosis to treatment to case reports.

Text Extraction
In this step, we identify figure captions and the relevant text that refers to the figures. To facilitate the automated processing of full-text articles in PMC-OA, [25] convert PMC articles to BioC format, a data structure in XML for text sharing and processing. Each article in BioC format is encoded in UTF-8, and Unicode characters are converted to strings of ASCII characters. The article also includes section types, figures, tables, and references [26]. We then used the figure number and regular expressions to find where the figure is cross-referenced in the document. Fig. 2 shows an example of a typical biomedical image in the article, "A rapid advice guideline for the diagnosis and treatment of 2019 novel coronavirus (2019-nCoV) infected pneumonia (standard version)" [27]. The example contains CXR, CT, a figure caption, and text that describes the case with rich information, such as fever, symptoms, and clinical findings.
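As an illustration of the cross-reference step, the following is a minimal sketch, assuming that each article has already been split into plain-text passages. The function name `find_figure_mentions` and the regular expression are hypothetical and are not taken from the released code; the pattern covers only common citation forms such as "Fig. 2", "Figs. 2a", and "Figure 2".

```python
import re


def find_figure_mentions(passages, figure_number):
    """Return the passages that cite the given figure number.

    Hypothetical helper for illustration only. It matches "Fig. 2",
    "Fig 2", "Figs. 2a", "Figure 2", and "Figures 2" (case-insensitive).
    The negative lookahead (?!\\d) keeps a search for figure 2 from
    matching "Fig. 21".
    """
    pattern = re.compile(
        r"\bFig(?:ure)?s?\.?\s*" + str(int(figure_number)) + r"(?!\d)[A-Za-z]?",
        re.IGNORECASE,
    )
    return [p for p in passages if pattern.search(p)]


if __name__ == "__main__":
    demo = [
        "Chest CT showed ground-glass opacities (Fig. 2a).",
        "Figure 21 summarizes the laboratory results.",
        "The patient was discharged after two weeks.",
    ]
    print(find_figure_mentions(demo, 2))
```

A pattern this simple would miss ranges and lists such as "Figs. 1 and 2" when searching for figure 2, so a production pipeline would likely need additional rules.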

Subfigure Separation
Most of the figures in the PMC-OA articles are compound figures. A key challenge here is that one figure may have individual subfigures of the same category (e.g., four CT images) or several categories (e.g., one CXR and one CT image placed side by side). For example, Fig. 2 contains a compound figure with three subfigures [27]. Figs. 2a and 2b are CT images, and Fig. 2c is a CXR. Compound figures therefore must be decomposed into subfigures before modality classification. In this study, we used a convolutional neural network developed by [24] to separate compound figures. The model was pretrained on the ImageCLEF Medical dataset with an accuracy of 85.9 percent [28]. We applied the model to the figures obtained in previous steps and filtered out the subfigures with a size smaller than 224 × 224 pixels. We reasoned that subfigures with fewer pixels might be deformed, and most state-of-the-art neural networks in image analysis, such as Inception-v3 [29] and DenseNet [30], require an input size of 224 or larger.
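The size filter described above can be sketched as follows. This is a minimal illustration, assuming that a subfigure is kept only when both its width and its height are at least 224 pixels; the `Subfigure` record and the function name are hypothetical and do not come from the released code.

```python
from typing import List, NamedTuple

MIN_SIDE = 224  # threshold stated in the text


class Subfigure(NamedTuple):
    """Hypothetical record for one separated panel."""
    figure_id: str
    width: int
    height: int


def keep_large_subfigures(subfigures: List[Subfigure],
                          min_side: int = MIN_SIDE) -> List[Subfigure]:
    # Assumption: "smaller than 224 x 224" is read as "either side
    # is shorter than 224 pixels".
    return [s for s in subfigures
            if s.width >= min_side and s.height >= min_side]


if __name__ == "__main__":
    panels = [
        Subfigure("fig2a", 472, 388),
        Subfigure("fig2b", 223, 500),
    ]
    print([s.figure_id for s in keep_large_subfigures(panels)])
```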

Image Modality Classification
A large portion of figures in the PMC-OA articles are not CXR or CT images. To distinguish them from other types of scientific figures, we designed a scientific figure classifier that was fine-tuned on a newly created dataset (https://github.com/ncbi-nlp/COVID-19-CT-CXR). Table 2 shows the breakdown of the figures by their category in the training and test set. This dataset consists of 2,700 figures in three categories: CXR, CT, and Other scientific figure types. A total of 500 CXRs are randomly picked from the NIH Chest X-ray [11], and 500 CT images are randomly picked from DeepLesion [12]. Other scientific figures are randomly picked from DocFigure [31]. The original DocFigure annotated figures of 28 categories, such as Heat map, Bar plots, and Histogram. Here, we combined these categories into one for simplicity of training the classifier. In addition, we curated 1,200 figures from PMC-OA, using the annotation tool developed by [32]. Our framework uses DenseNet121 to classify image types [33]. The weights (or parameters) were pretrained on ImageNet [34]. We replaced the last classification layer with a fully connected layer with a softmax operation that outputs the approximate probability that an input image is a CXR, CT, or other scientific figure type. All images were resized to 224 × 224 pixels. The hyperparameters include a learning rate of 0.0001, a batch size of 16, and 50 training epochs. All experiments were conducted on a server with an NVIDIA V100 128G GPU from the NIH HPC Biowulf cluster (http://hpc.nih.gov). We implemented the framework using the Keras deep-learning library with TensorFlow backend (https://www.tensorflow.org/guide/keras).

Qualification and Statistical Analysis
The performance metrics include the area under the receiver operating characteristic curve (AUC), sensitivity (recall), specificity, precision (positive predictive value), and F1 score. For the classification problem, we chose the label with the highest probability when required in computing the metrics. Each of the models was fine-tuned and tested five times, using the same parameters, training, and testing images each time. The validation set was randomly selected from 10 percent of the training set. Fisher's exact test was used to determine whether there are nonrandom associations between COVID-19 and influenza's symptoms and clinical findings [35]. We conducted the above statistical analyses using numpy, scipy, matplotlib, and scikit-learn in Python.

Table 3 shows the breakdown of the figures by modality. We obtained 1,327 CT images and 263 CXR images, text-mined and labeled as positive for COVID-19, from 1,831 PMC-OA articles. These images have different sizes. The minimum, maximum, and average heights are 224, 2,703, and 387.5 pixels, respectively. The minimum, maximum, and average widths are 224, 1,961, and 472.4 pixels, respectively. For each article, we also include major elements, such as DOI, title, journal, and publication date for reference. Fig. 3A shows the cumulative numbers of articles and figures on a weekly basis. We analyzed the proportional distribution of categories in COVID-19-relevant PMC-OA articles, and articles with figures, CT, and CXR. Fig. 3B shows that the "Case Report" category contains a higher proportion of articles with CXR/CT. Table 4 shows the performance of the model to classify image modality. The macro average F-score is 0.996. The F-score was 0.993 ± 0.004 for CT, 1.000 ± 0.000 for CXR, and 0.998 ± 0.001 for other scientific figure types.
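The per-class metrics, the macro-averaged F-score, and the mean ± standard deviation over repeated runs can be computed as in the sketch below. It is written in plain Python for illustration rather than with scikit-learn, the labels and values in the demo are toy data rather than results from this study, and the use of the sample standard deviation is an assumption because the text does not state which variant was reported.

```python
from statistics import mean, stdev


def argmax_label(probabilities):
    """Pick the label with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)


def per_class_scores(y_true, y_pred, label):
    """One-vs-rest precision, recall (sensitivity), specificity, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return mean([per_class_scores(y_true, y_pred, l)["f1"] for l in labels])


def summarize_runs(values):
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(values), stdev(values)


if __name__ == "__main__":
    truth = ["CT", "CT", "CXR", "Other", "Other", "CXR"]   # toy labels
    preds = ["CT", "Other", "CXR", "Other", "Other", "CXR"]
    print(round(macro_f1(truth, preds, ["CT", "CXR", "Other"]), 3))
```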

Use Cases
To demonstrate the utility of COVID-19-CT-CXR, we conducted four case studies. (1) We combined COVID-19-CT-CXR with previously curated data at https://github.com/UCSD-AI4H/COVID-CT [36] and fine-tuned a deep neural network to perform the classification of COVID-19 and non-COVID-19 CT. (2) We collected CT of influenza, using a similar method, and fine-tuned a deep neural network to distinguish among the diagnoses of COVID-19, influenza, and normal or other types of diseases on CT. (3) We fine-tuned an unsupervised one-class learning model, using only non-COVID-19 CXR to perform anomaly detection, to detect COVID-19 CXR. (4) We extracted clinical symptoms and findings from the text-mined captions and figure descriptions. We then compared their frequencies to those described in articles on influenza, another common infectious respiratory illness that may present similarly to COVID-19.

Classification of COVID-19 and non-COVID-19 on CT
In the context of the COVID-19 pandemic, it is important to separate patients likely to be infected with COVID-19 from other non-COVID-19 patients. As it is time-consuming for specialists both to accumulate experience and to read a large volume of CT scans to diagnose COVID-19, many studies use machine learning to separate COVID-19 patients from non-COVID-19 patients [14], [37], [38], [39], [40]. In this work, we hypothesize that our creation of additional training data from existing articles can improve the performance of the system and reduce the effort of manual image annotation. To test this hypothesis, we compared the performance of deep neural networks fine-tuned on the existing benchmark [36] and COVID-19-CT-CXR (Table 5). For a fair comparison, we added additional training examples only in the training set and used the same test set as described in [14]. In this experiment, DenseNet121 was pretrained on ImageNet, and its last classification layer was replaced with a single neuron with a sigmoid activation that outputs the approximate probability that an input image is COVID-19 or non-COVID-19. The model was then fine-tuned and evaluated on the training and test sets. Other experimental settings are the same as those used for fine-tuning the image modality classifier. Fig. 4 shows that the model significantly outperforms the baseline when PMC-OA CT figures were added for fine-tuning. Specifically, we achieved the highest performance of 0.891 ± 0.012 in AUC, 0.780 ± 0.074 in recall, 0.816 ± 0.053 in precision, and 0.792 ± 0.015 in F-score ( (Table 7).
To obtain the baseline model, we use the same model and experimental settings as described in the "Image modality classification" section. Fig. 5 shows the performance of the deep-learning model by its receiver operating characteristic (ROC) curves. The AUC was 0.855 ± 0.012 for COVID-19 detection and 0.889 ± 0.014 for influenza detection. Table 8 shows more detail for the results. We achieved the highest precision (0.845 ± 0.026) for COVID-19 detection and high recall (0.711 ± 0.053) for influenza detection.

Anomaly Detection of COVID-19 in CXR Using One-Class Learning
Because annotated COVID-19 CXR are lacking for training powerful deep-learning classifiers, unsupervised and semi-supervised approaches are highly desired for automated COVID-19 diagnosis. The presence of COVID-19 can be considered a novel anomaly in CXR for the NIH Chest X-ray dataset, in which no COVID-19 cases are available. In this experiment, we performed anomaly detection [44], [45] to detect COVID-19 CXR. We trained a one-class classifier, using only non-COVID-19 CXR, and used this classifier to distinguish COVID-19 CXR from non-COVID-19 CXR. The non-COVID-19 images were a subset extracted from the NIH Chest X-ray dataset by combining 14 abnormalities and a no-finding category. The detailed numbers of training and testing CXR are shown in Table 9. We adopted the generative adversarial one-class learning approach from [46]. Fig. 6 shows the performance of the unsupervised one-class learning by its ROC curves. Table 10 shows more detail for the results. Our model achieved 0.828 ± 0.019 in AUC, 0.767 ± 0.020 in precision, 0.772 ± 0.017 in recall, and 0.769 ± 0.018 in F-score for COVID-19 anomaly detection.
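For readers unfamiliar with how an AUC is obtained from anomaly scores, the sketch below computes it as the probability that a randomly chosen COVID-19 image receives a higher anomaly score than a randomly chosen non-COVID-19 image, counting ties as one half. It does not reproduce the one-class model of [46]; the function name and the scores in the demo are hypothetical.

```python
def roc_auc(scores_anomalous, scores_normal):
    """Area under the ROC curve from two lists of anomaly scores.

    Uses the pairwise (Mann-Whitney) formulation: the fraction of
    anomalous/normal pairs in which the anomalous sample scores
    higher, with ties counted as 0.5. Runs in O(n * m) time, which
    is adequate for small test sets.
    """
    if not scores_anomalous or not scores_normal:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in scores_anomalous:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(scores_anomalous) * len(scores_normal))


if __name__ == "__main__":
    covid_scores = [0.9, 0.8, 0.4]       # hypothetical anomaly scores
    non_covid_scores = [0.5, 0.3, 0.2]   # hypothetical anomaly scores
    print(roc_auc(covid_scores, non_covid_scores))
```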

Extraction of Clinical Symptoms and Findings Using Text-Mining
In this case, we extracted clinical symptoms or signs from the figure captions and relevant text that describes the case. A total of 15 symptoms or signs were collected from [3] and the CDC website (https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/symptoms.html), including chest pain, constipation, cough, diarrhea, dizziness, dyspnea, fatigue, fever, headache, myalgia, proteinuria, runny nose, sputum production, throat pain, and vomiting. Extracting these symptoms from text is a challenging task because their mentions in the text can be positive or negative. For example, "fever" is negative in the sentence, "She experienced headache and pharyngalgia but no fever on 29 January." To discriminate between positive and negative mentions, we applied our previously developed tool, NegBio, on the figure caption and referred text [47]. In short, NegBio utilizes patterns in universal dependencies to identify the scope of triggers that are indicative of negation; thus, it is highly accurate for detecting negative symptom mentions. Fig. 7A shows the proportion of symptoms for COVID-19 and influenza. The most common symptoms are fever, cough, dyspnea, and myalgia.
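To make the distinction between positive and negative mentions concrete, the following sketch applies a much simpler rule than NegBio: a symptom is marked negative when a negation cue appears earlier in the same clause. This is not NegBio and does not use universal dependencies; the cue list, the clause boundaries, and the function name are assumptions chosen for illustration, and only the first mention of each symptom is examined.

```python
import re

# Assumption: a small, hand-picked cue list (not NegBio's patterns).
NEGATION_CUES = {"no", "not", "without", "denied", "denies"}

# Assumption: these tokens end the scope of an earlier negation cue.
_BOUNDARY = re.compile(r"[.;:]|\b(?:but|however|although)\b")


def classify_mentions(sentence, symptoms):
    """Map each mentioned symptom to 'positive' or 'negative'."""
    lowered = sentence.lower()
    results = {}
    for symptom in symptoms:
        match = re.search(r"\b" + re.escape(symptom.lower()) + r"\b", lowered)
        if match is None:
            continue  # symptom not mentioned in this sentence
        prefix = lowered[:match.start()]
        boundaries = list(_BOUNDARY.finditer(prefix))
        # Keep only the text after the last clause boundary.
        clause = prefix[boundaries[-1].end():] if boundaries else prefix
        tokens = re.findall(r"[a-z]+", clause)
        negated = any(token in NEGATION_CUES for token in tokens)
        results[symptom] = "negative" if negated else "positive"
    return results


if __name__ == "__main__":
    text = ("She experienced headache and pharyngalgia "
            "but no fever on 29 January.")
    print(classify_mentions(text, ["headache", "fever", "cough"]))
```

A window rule of this kind fails on many constructions that a dependency-based approach handles, such as a negation cue followed by a new, affirmative clause introduced by a comma.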
We then extracted the radiographic findings from the figure caption and text. The findings (and their synonyms) are based on 20 common thoracic disease types, which are expanded from NIH Chest X-ray 14 labels [11]. Fig. 7B shows the 20 findings in both COVID-19 and influenza datasets. Both illnesses can result in lung opacity, pneumonia, and consolidation. COVID-19 more likely results in ground-glass opacification (GGO), while influenza more likely results in infiltration than does COVID-19 (Fisher's exact test, p < 0.0001).
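The comparison above relies on Fisher's exact test on a 2 × 2 contingency table (for example, articles with and without a given finding, for COVID-19 versus influenza). The sketch below is a pure-Python version of the two-sided test, written for illustration; the analysis in this study used scipy, and the counts in the demo are hypothetical rather than taken from the dataset.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    The p-value is the total hypergeometric probability of all tables
    with the same margins that are no more likely than the observed
    table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= observed * (1 + 1e-7))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: rows are diseases, columns are
    # "finding mentioned" versus "finding not mentioned".
    print(fisher_exact_two_sided(8, 2, 1, 5))
```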

DISCUSSION
In this abrupt outbreak of SARS-CoV-2, the demand for chest radiographs and CT scans is growing rapidly, but there is a shortage of experienced specialists, radiologists, and researchers. Further, we are still new to this virus and have yet to discover the full radiologic features and prognosis of this disease. The tremendous increase in the number of patients has led to a substantial increase of COVID-19-related PMC-OA articles over the past few months (Fig. 3A), especially in the case report and diagnosis-relevant articles (Fig. 3B). These articles contain rich chest radiographs and CT images that are helpful for scientists and clinicians in describing COVID-19 cases. Thus, it is important to analyze these images and text to construct a large-scale database. By using the quickly growing dataset, AI methods can help to find significant features of COVID-19 and speed up clinical workflows. Among these methods, deep learning is a particularly powerful approach for dealing with the COVID-19 pandemic.
Although deep learning has shown promise in diagnosing/screening COVID-19 using CT, it remains difficult to collect large-scale labeled imaging data, especially in the public domain. In this work, we present a set of repeatable techniques to rapidly build a CT and CXR dataset of COVID-19 from PMC-OA COVID-19-relevant articles. The strength of the study lies in its multidisciplinary integration of medical imaging and natural-language processing. It provides a new way to annotate large-scale medical images required by deep-learning models.
An additional strength includes a highly accurate model for image type classification. As a large portion of figures in the PMC-OA articles are not CXR or CT images, we provided a model to classify these two types from other scientific figure types. Our model achieved both high precision and high recall (Table 4).
To assess the hypothesis that deep neural network fine-tuning on this additional dataset enables us to diagnose COVID-19 with almost no hand-labeled data, we conducted several experiments. First, we showed that this additional data enable significant performance gains to classify COVID-19 versus non-COVID-19 lung infection on CT (Fig. 4 and Supplementary Table 6, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TBDATA.2020.3035935). For our own system, we show that our baseline performance compares favorably to the results in [14]. Then, we added more automatically labeled training data and achieved the highest performance of 0.891 ± 0.012 in AUC. The comparison shows that, with additional data, both precision and recall substantially improve (7.4 and 6.6 percent, respectively). This observation indicates that additional COVID-19 CT helps not only to find more positive cases but also to restrict the positive predictions to those with the highest certainty in the model. Second, we built a baseline system to distinguish COVID-19, influenza, and no-infection CT, which is a more clinically interesting but also more challenging task. We observed that we could achieve high AUCs for both COVID-19 and influenza detection. The recall of COVID-19 detection and the precision of influenza detection, however, are low (0.597 ± 0.030 and 0.609 ± 0.033, respectively).
Although several studies have tackled this problem [43], to the best of our knowledge, there is no publicly available benchmark. The differentiation between COVID-19 and influenza on CXR/CT without associated context is challenging. In the experiment on classification of COVID-19, influenza, and other types of disease on CT, we found that, although many of the CT findings overlapped, "mixed GGO (ground-glass opacity)" was mostly found in the COVID-19 dataset, and "pleural thickening" and "linear opacities" were mostly found in the influenza dataset. It is also worth noting that the images from PMC-OA may not represent the typical pool of real-world influenza pneumonia images, since researchers may report extreme cases instead of typical cases. While our work only scratches the surface of the classification of COVID-19, influenza, and normal or other types of diseases, we hope that it sheds light on the development of generalizable deep-learning models that can assist frontline radiologists.
In addition, we presented a one-class learning model for anomaly detection of COVID-19 in CXR by learning only from non-COVID-19 radiographs. Compared to the CT-based method, the one-class model achieves comparable performance, showing great potential in detecting COVID-19 on CXR. The performance of our model, however, is worse than that of [45], suggesting that this weakly labeled dataset should be used as additional training data obtained without additional annotation cost from existing entries in curated databases.
The unique characteristic of our database is that figures are retrieved along with relevant text that describes these cases in detail. Thus, text mining can be applied to extract additional information that confirms the existing results and potentially identifies other findings that may have been overlooked. As proof of this concept, we extracted clinical symptoms and findings from the text. We found that the most common symptoms of COVID-19 were fever and cough (Fig. 7A), which is consistent with the clinical characteristics in [15]. Other common symptoms include dyspnea (shortness of breath), fatigue, and throat pain. These symptoms are consistent with those reported by the CDC. When comparing the frequencies of the 20 clinical findings to those described in articles on influenza, Fig. 7 shows that both conditions cause lung opacity, pneumonia, and consolidation. Further, GGO appears more frequently for COVID-19, whereas "infiltration" appears more frequently for influenza. This is because radiologists use the term GGO to describe most COVID-19 findings. In addition, the influenza articles are older than the COVID-19 articles, and, according to Fleischner Society recommendations, the use of the term infiltrate remains controversial, and it is recommended that it no longer be used in reports [48].
In terms of limitations, first, the subfigure segmentation model needs to be improved. In this study, we applied a deep-learning model that was pretrained on an ImageCLEF Medical dataset to this task [24]. Although this model is robust to variations in background color and spaces between subfigures, it sometimes fails to recognize similar subfigures that are aligned very closely. Unfortunately, these cases appear more frequently in our study than in others (e.g., several CT images are placed in a grid). Other errors occur when the model incorrectly treated the spine as spaces in the anteroposterior (AP) chest X-ray and split the large figure into two subfigures. In the future, the figure synthesis approach should be applied to augment the training datasets. Another limitation is that this work extracted only the passage that contains the referred figure. Sometimes, the case is not described in this passage. In the future, we plan to text mine the associated case description in the full text. Finally, while a figure is typically copyrighted with the original article and using previously published figures is not a common practice in scholarly publications, it is possible that one image is reused in different papers or reused in one paper for different purposes. In the future, we plan to develop a model to remove duplicated images in the collection.

CONCLUSION
We have developed a framework for rapidly constructing a CXR/CT database from PMC full-text articles. Our database is unique, as figures are retrieved along with relevant text that describes these cases in detail, and it can be extended easily in the future. Hence, the work is complementary to existing resources. Applications of this database show that our creation of additional training data from existing articles improves the system performance on COVID-19 versus non-COVID-19 classification in CT and CXR. We hope that the public dataset can facilitate deep-learning model development, educate medical students and residents, help to evaluate findings reported by radiologists, and provide additional insights for COVID-19 diagnosis. With an ongoing commitment to data sharing, we anticipate adding more CXR and CT images and making them available in the coming months. The code that extracts the text from PMC, segments subfigures, and classifies image modality is openly available at https://github.com/ncbi-nlp/COVID-19-CT-CXR.

ACKNOWLEDGMENTS
This work was supported in part by the Intramural Research Programs of the National Library of Medicine (NLM) and National Institutes of Health (NIH) Clinical Center. This work also was supported by NLM under Grant 4R00LM013001. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). This material is also based upon the work supported by Google Cloud.

Yifan Peng received the PhD degree. He is currently an assistant professor with Weill Cornell Medicine. He was a research fellow with the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH). His main research interests include biomedical and clinical natural language processing and medical image analysis. He has published many papers in top journals and conferences, including Nucleic Acids Research, npj Digital Medicine, Journal of the American Medical Informatics Association, CVPR, and MICCAI. He is also an academic editor of PLoS ONE.

Sungwon Lee received the MD and PhD degrees. She is currently a radiologist and research fellow with the National Institutes of Health (NIH). Her research interests include segmentation and classification of medical imaging, especially chest, body, and musculoskeletal images of CT and MRI.
Yingying Zhu received the PhD degree. She is currently a staff scientist with the Department of Radiology, Clinical Center, National Institutes of Health (NIH). Her main research interests include computer vision, medical image analysis, and machine learning. She has published many papers in top journals and conferences, including IEEE Transactions on Medical Imaging, Medical Image Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, ECCV, CVPR, IPMI, and MICCAI.
Ronald M. Summers received the MD and PhD degrees. He is currently a senior investigator with the NIH. He joined the Diagnostic Radiology Department, NIH Clinical Center, in 1994. He directs the Imaging Biomarkers and Computer-Aided Diagnosis (CAD) Laboratory. His research interests include virtual colonoscopy, CAD, multi-organ multi-atlas registration, and development of large radiologic image databases. His clinical areas of specialty are thoracic and gastrointestinal radiology and body cross-sectional imaging. His current research focuses on developing fully-automated interpretation of abdominal CT scans.
Zhiyong Lu received the PhD degree. He is currently a deputy director for Literature Search at the National Center for Biotechnology Information (NCBI), leading its overall efforts of improving literature search and information access in NCBI's production resources. He is also an NIH senior investigator (early tenure) and directs the Text Mining / Natural Language Processing (NLP) Research Program, NCBI/NLM, where they are developing computational methods and software tools for analyzing and making sense of unstructured text data in biomedical literature and clinical notes towards accelerated discovery and better health.