Reproducibility in Computing Research: An Empirical Study

In computing, research findings are often anecdotally faulted for not being reproducible. Numerous empirical studies have analyzed the reproducibility of a variety of research. Our objective, in this study, is to quantify the current state of reproducibility of research in computing based on prior research, using three reproducibility factors—Method, Data and Experiment—to measure three different degrees of reproducibility. Twenty-five variables traditionally utilized to document reproducibility are identified and grouped into three factors, namely Method, Data and Experiment. These variables describe the extent to which these factors are documented for each paper. Approximately 100 randomly selected research papers from the International Conference on Information Systems series, for the year 2019, are surveyed. Our findings suggest that none of the papers documented all the variables. In fact, the results show that relatively few variables for each factor are documented. Some of the variables vary across different categories of papers, and most papers fail in at least one of the factors. Reproducibility scores decrease with increased documentation requirements. Reproducibility may improve over time, as researchers prioritize reproducibility and utilize methods that ensure reproducibility. Research documentation in computing is remarkably limited, resulting in a dearth of reproducible factors. Future research may study the shifts and trends in reproducibility over time. Meanwhile, researchers and publishers must increase their focus on the reproducibility aspects of their papers. This study contributes to our understanding of the status quo of reproducibility in computing research.


I. INTRODUCTION
While reproducibility is historically accepted as a measure of trustworthy science, in recent years there has been a renewed and urgent focus on this area of research [1]-[3]. Certainly, reproducibility should automatically be a critical consideration of every research paper [4]. Not only does reproducibility allow researchers to build on published results but it also facilitates the review process [5], [6]. Reproducible research is becoming an imperative, ensuring transparency and building trust. In addition, reproducibility supports the sharing of methodologies, optimizing collaboration and the rapid dissemination of research [7]. Recently, however, researchers in various disciplines have raised concerns about the reproducibility of published results [8]-[10]. A 2016 survey in Nature found that many of these scientists across a wide range of disciplines had a personal experience of failing to reproduce a result, and that most scientists believed that science was currently facing a 'significant' reproducibility crisis [11], [12]. Key outlets such as the WSJ [13], the Economist [14] and the Atlantic [15]-[17] have all published extended pieces on reproducibility. Thus, reproducibility is not only a challenge in computing; rather, it is a pervasive challenge across most disciplines. The fields of psychology [18], biology [19], [20], biomedicine [9], neuroscience [21], drug development [22], chemistry [23], climate science [24], economics [25] and education [26], among others, have reported reproducibility problems [20]. A recent study estimated the cost of funding irreproducible research at approximately $28 billion a year in the U.S. alone [27], [28]. A well-known effort to replicate findings from prominent social and cognitive psychology studies showed fewer significant findings and smaller effect sizes than the original studies [18].
The associate editor coordinating the review of this manuscript and approving it for publication was Binit Lukose.
And while reproducibility is considered a fundamental aspect of reliable research, studies show that a substantial number of published research results cannot be reproduced [11], [18], [29]-[35]. This circumstance is particularly true for the papers presented at major conferences and published in top journals. In many cases, even the primary researchers are unable to reproduce their own findings [22], [36]-[38]. In principle, it should be possible to specify a methodology with sufficient detail that anyone can reproduce it exactly, and yet, practically speaking, there are fundamental, technical and social barriers to doing so [38]. The reproducibility problem is more pronounced in computing research perhaps because the computing discipline is multidisciplinary, and the artifacts, both tangible and intangible, are developed and validated in the context of socio-technical approaches to research [39]-[44]. Computing research spans sub-disciplines that include business computing, compilers, embedded and real-time systems, networking, operating systems, user-centered applications and mobile and web applications, among others [45]. In the sub-discipline of software development, advances in research and applications are aided by algorithms, programming languages, tools and models of quality assurance and testing and so forth. But design and coding are subjective processes. Reproducibility is difficult to effectuate when there is no proper documentation of design specifications, pseudocode, prototype, etc. of how the artifacts are developed [45]. Version control is an added challenge. Compounding the problem are big datasets. Meanwhile, the computational methods necessary to process and analyze those datasets have prompted new ways of considering reproducibility [3].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In addition to a substantial lack of reproducibility in computing research, identifying reproducibility problems is itself challenging [2], [3]. This is due to the lack of both frameworks and methods, and the tools necessary to identify reproducibility problems. Further, while there is a lot of buzz about reproducibility, there are very few studies that have actually assessed reproducibility [46]- [48] and there are a scant number of frameworks or models to evaluate reproducibility [49]- [52]. This empirical study attempts to fill many of these gaps. This is a descriptive analytic study that sheds light on the current state of reproducibility in computing by examining papers from a recent conference on information technology. Adapting an existing model in the study of reproducibility in artificial intelligence research and applications [31]- [33], we develop and offer a framework and check list for undertaking reproducibility studies in computing in general. Further, this framework is operationalized with an applied, hands-on check list to evaluate the studies. The usefulness is its potential to be applied to a research paper or a report prior to submission and peer review, or publication. In other words, it provides ex ante support as opposed to other models (e.g., code testing) that are ex post [49], [50], [53]. The model described here can be applied to all aspects of a research paper or publication, namely, data, experiment (or analysis) and method. Lastly, the framework outlined and applied here is not restricted to one or another subdiscipline of computing; rather it can be applied across the board [39], [54]. The framework and check list described here will be useful to both researchers and practitioners in the reproducibility assessments of their work. 
We build on prior work to argue that numerous factors-including those falling under documentational, experimental and methodological categories-prevent a high degree of reproducibility of computing research [55], [56]. Empirical work that studies reproducibility in the different sub-disciplines of computing has been sporadic and ad hoc at best. This study aims to fill that gap by investigating and shedding light on the nature and dimensions of the reproducibility of current research in the computing discipline. We examine papers published as part of the proceedings of a prestigious business computing and information systems conference, applying an adapted reproducibility evaluation framework and methodology from [31] and [32].
The rest of the paper is organized as follows: First, we provide a comprehensive review of reproducibility. We then provide an overview of the reproducibility framework used in this study and follow this section with a description of research methods. We then provide the results and analysis of our study. Finally, we highlight the scope and limitations of our study and offer conclusions.

II. LITERATURE REVIEW
Publications are at the epicenter of academic life, observe [44]. Computing is in a unique position among scientific disciplines because researchers in the discipline typically eschew the publication process and disseminate their cutting-edge research at conferences. Unlike peer-reviewed publications with multiple review layers, conferences utilize an entry process with a single review stage. Thus, conferences have had a profound impact on the way research is conducted by computing researchers and have provided those researchers with a distinct advantage. To be competitive in the academic world, researchers must play the publishing game, which emphasizes numerical metrics of success [44]. The pressure to publish innovative ideas is biased towards bringing preliminary findings to the public arena as quickly as possible and circumventing the thoughtful, if relatively lengthy, peer evaluation and review process that has been the cornerstone of good research. Compounding this situation is the trend that novelty is replacing research grounded in theory [44]. The inevitable outcome is the degradation of research quality. Simultaneously, computer-based models and tools are being used in scientific research at an exponential rate, but reproducibility methods have not kept pace, leading to skepticism about the results generated by computational methods [39], [57]. As a result, a currently popular discourse is the promotion of awareness and policies designed to intervene, such as the contemporary Association for Computing Machinery (ACM) policy on the scrutiny of outputs and systems and badging [58]. The ACM construes research to be reproducible [29] when its findings can be generated by another team utilizing a different dataset. Journals, too, have begun to demand better documentation and, to the extent possible, more openness (e.g., making data available publicly). IEEE has also set up The Ad Hoc Committee on Open Science and Reproducibility.
The goal of the 2020 Ad Hoc Committee on Open Science and Reproducibility ''is to analyze models, practices and experiences in supporting open science and reproducibility within the IEEE Computer Society (CS) and at peer societies and publishers'' [59]. Against this backdrop, many studies have emerged that look at aspects of reproducibility across the different sub-disciplines of computing. For example, in a recent IEEE study approximately 60% of IEEE conferences, magazines and journals have no policies and procedures in place to ensure research reproducibility [60]. In another example, [61] report that fewer than approximately 15% of MobiHoc papers (2000-2005) that utilized simulations (114 out of 151 papers) for MANET analysis were repeatable. [62] verified 134 papers published in the IEEE Transactions on Image Processing and found that only 33% of the papers published the datasets, while only 9% of the papers made available the code needed for reproducibility. Recently, [5] looked at about 600 papers from ACM conferences and journals and identified repeatability weaknesses in approximately 32% of the papers. Their study also found that a few researchers were unwilling to share their code and data. In instances where they were shared, too little information was provided to repeat the experiment. [63] evaluated the computational reproducibility of 204 papers and their ability, as independent researchers, to acquire the resources necessary to reproduce a paper's findings. The authors were able to retrieve the tangible products from 44% of the sample but only able to reproduce the results for 26% [63]. [64] analyzed data from Scopus, which showed that the reproducibility problem was prevalent in several other fields as well. [65] suggested that it was difficult to confirm most results in current conferences. 
Recent studies [5], [66], [67] have also shown that the peer-review process by itself is incapable of ensuring reproducibility, an obvious point given the process is not designed to check for reproducibility. Additionally, according to [68], the ''publish or perish'' mentality is a significant problem: ''Innovative findings produce the rewards of publication, employment and tenure; replicated findings produce a shrug.'' [67] and [69] suggest that in the future reproducible submissions should always be the default and that doing reproducible research will become imperative [6]. To that end, scientists, institutions and funding agencies have been pushing for the development of methodologies and tools that preserve software artifacts. Still, the consensus is that long-term reproducibility remains, in computing research, elusive [70]. This is a problem given that the scientific method depends on reproducibility to back up the development of scientific knowledge. When scientists cannot conduct the same experiment and obtain the same findings as the initial researchers, the event implies the hypothesis is false [71]. Therefore, the failure to reproduce findings affects the very integrity of science [9], [67], [72], [73]. To wit, there are very few empirical studies of reproducibility in computing [74], [75], and the few studies that were done focused on granular methods to test reproducibility such as access to data, code compilation, software quality testing, etc. [39], [76]. Furthermore, not only are there very few studies on reproducibility but there are also even fewer methods to study reproducibility [77]- [79]. Thus, there is a paucity of studies as well as methods making it beneficial to undertake additional studies and develop broader methods. This is a motivation for our study. To reiterate, this study makes a modest attempt to shed light on both aspects. 
In addition, it is only recently that attempts are being made to develop automated tools to assist with reproducibility [77], [79]-[81]. However, nearly all of these are at the development stage [79], [81]. In summary, very few studies exist, as highlighted above, and they have typically focused on ex post reproducibility. This study is different in that it conducts a reproducibility evaluation before a paper is published. To ensure reliability in computing research, steps must be taken to increase the reproducibility of the research [31]-[33], [82]-[85]. In the meantime, the current, and unfortunate, state of reproducibility in computing research must be documented.
Our goal with this study is to assess the current state of reproducibility in empirical computing research. Our chief proposition is that the documentation in computing research is insufficient to reproduce the published findings; that is, current documentation practices at top business, computing, and academic conferences cede much of the published findings to non-reproducibility. We surveyed research papers from the most prestigious information systems conference, namely ICIS, to test the proposition. Our research contributions are multi-fold: (i). we assess the contemporary status of reproducibility in computing research and provide a panoramic overview by conducting an empirical analysis; (ii). we develop a framework and operationalize it with a check list to verify reproducibility in computing papers, and (iii). we investigate the implications of reproducibility for computing research and offer prescriptive recommendations.

A. OVERVIEW OF REPRODUCIBILITY
There is consensus among researchers that empirical results ought to be reproducible but the definition and meaning of reproducibility is not clearly understood [18], [31]- [34].
For this study, we define reproducibility in empirical computing research as: ''the ability of an independent research team to produce the same results using the same research method based on the documentation made by the original research team (adapted from [32]).'' The reproducibility evaluation framework developed by [31], [32] and utilized to analyze reproducibility in artificial intelligence research is the basis for this study. This part of the narrative is largely paraphrased from their seminal work. The key point to emphasize is that a separate group of researchers ought to be able to generate the same findings as the initial researchers primarily using the original documentation. The documentation, therefore, is key to ensuring that the independent team can conduct the exact same research and obtain the same results as the original team [31], [32]. Typical computing research documentation is comprised of three parts: the documentation of the research method that the original research team developed and aims to validate; the data (if any) that is used in the research; and a description of an experiment in text and code form. When the findings of the initial research and those of the reproduced results are similar, one can conclude it is possible to reproduce the initial research.

B. REPRODUCIBILITY DOCUMENTATION
Documentation is the key starting point to reproducibility. To reproduce the results of the research, the documentation must include relevant information and must be specified to a granular level. Researchers must clearly identify what is relevant and how fine-grained the documentation must be to make sure that results can be reproduced using only this information [32]. Following this framework, we also grouped the documentation into three categories: Method, Data and Experiment. The documentation for the research method includes the description of the computing research method as well as its research question [31], [32]. Additionally, data, along with the documentation describing the data and how it can be used, are necessary for reproducibility. Therefore, data engineering and preprocessing are important. The goal is to make available the cleaned exact dataset. Version control is also necessary. Finally, to compare results, the actual output of the research is required [31], [32]. If the research involves conducting an experiment, proper documentation detailing the exact steps involved, including the analysis and results, must be made available [31], [32]. The hardware and software used must be properly specified. While methods and data are required in most research studies, experiments in computing research, more likely dealing with tangible artifacts, are typically more ad hoc. Overall, the extent of documentation in terms of method, data and experiment sits on a continuum of degrees of reproducibility. The 'gold standard' is the ability to share documentation for all three categories in an open and transparent way (e.g., putting everything in a cloud environment) [86], [87]; but the cost of such an infrastructure could be high. Plus, maintenance and updates require ongoing attention.
Following the lead of [31], [32], the documentation factors (method, data and experiment) enable the definition of the three degrees to which the original results can be reproduced. The degrees are quantified into a numerical score as described in the two Gundersen and Kjensmo papers. R1: Experiment reproducible implies the inclusion of all three factors, and by following the document, independent researchers can reproduce the results; R2: Data reproducible includes method and data and implies the research is potentially a data-driven empirical study. Alternative researchers ought to be able to arrive at similar findings using this documentation; and R3: Method reproducible implies that the method alone is documented, and an independent set of researchers may reproduce the results using this documentation. Figure 1 depicts how the three degrees relate to one another and which degree of reproducibility requires what type of documentation.
FIGURE 1. The three degrees of reproducibility (Source: [15], [16]).
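The relationship between the three degrees and the documentation factors they require can be sketched as a small lookup. This is a minimal illustration of the definitions in the text, not code from the study; the names `REQUIRED_FACTORS` and `degrees_supported` are hypothetical.

```python
# Documentation factors each degree of reproducibility requires, as described
# in the text: R1 needs all three factors, R2 needs Method and Data, and R3
# needs Method only. The names below are illustrative, not from the study.
REQUIRED_FACTORS = {
    "R1": ("Method", "Data", "Experiment"),
    "R2": ("Method", "Data"),
    "R3": ("Method",),
}


def degrees_supported(documented_factors):
    """Return the degrees whose required factors are all documented."""
    documented = set(documented_factors)
    return [
        degree
        for degree, required in REQUIRED_FACTORS.items()
        if set(required) <= documented
    ]


# A paper that documents its method and data, but not its experiment,
# satisfies the requirements for R2 and R3 only.
print(degrees_supported({"Method", "Data"}))  # ['R2', 'R3']
```

Under these definitions the degrees are nested: any paper that meets the documentation requirement for R1 also meets it for R2 and R3.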
Drawing from the literature and basing our research squarely on the adaptation of the model developed and tested in [31], [32], our goal, as stated, is to quantify the state of reproducibility of empirical computing research. We mean to show that the documentation of computing research is not of a high-enough quality to reproduce the reported results, and that the current documentation practices at a top business computing and information systems conference do not support the outcome that reported research results will be reproducible.

III. RESEARCH METHODS
Following [32], [55], an observational study in the form of a manual survey of research papers was conducted to generate quantitative data about the state of the documentation quality of business computing research. Each paper was read several times to extract the values for the variables in each factor. The research papers were reviewed, and a set of 25 variables were manually identified. To compare results among papers and groups of papers, we used three reproducibility metrics, R1D, R2D and R3D, to score the documentation quality. As stated, the research method in this study is adapted, with several modifications, from [31], [32]. (For more details regarding the methodology, please refer to those papers.) Using a data-driven approach, visualization and descriptive analytics [88], [89], well-established methods of analysis, were applied to this dataset of papers to gain insight into reproducibility [88], [90]. The emerging field of visual analytics allows us to graphically represent the data and thereby visualize the results to gain insight [90]-[92]. By integrating a proper design with visual techniques, charts and statistics can be generated [93], [94]. Visual analytics help aggregate, process and represent large amounts of data in easy-to-understand charts [90], [92], [94]. The overall objective is to tell the stories through visualization [90], [93]. Compared to other types of analytics, descriptive analytics tends to be more data driven; its focus is on describing the data 'as is' with no preconceived assumptions [91]. Descriptive analytics via visualization eases the understanding of historical and current trends to make meaningful decisions [89], [93], [94].

A. SURVEY
To evaluate the hypothesis, we surveyed a total of 125 papers from the 2019 Association for Information Systems (AIS) proceedings of the International Conference on Information Systems (ICIS 2019) (https://aisel.aisnet.org/icis2019/). The ICIS's own description (https://aisnet.org/page/ICISPage) supports our choosing this set of papers: ''The International Conference on Information Systems (ICIS) is the most prestigious gathering of information systems academics and research-oriented practitioners in the world. Every year its 270 or so papers and panel presentations are selected from over 800 submissions.'' Studying a sample of documents from this conference, wherein papers are chosen after a rigorous review process, was deemed appropriate. Because the number of papers under each topic in ICIS 2019 varies, we randomly selected 5 to 11 papers in each topic to maintain a balance of topics and avoid selection bias. As a result, a total number of 19 topics and 125 papers were reviewed. Of these 125, 100 papers comprised empirical research, and 25 were conceptual. A panel of researchers manually classified the papers into empirical and conceptual research types. After dropping the conceptual papers, researchers proceeded to analyze the reproducibility performance of the 100 empirical papers. Table 1 shows the number of published papers (the population size) and the number of surveyed papers (sample size). The ICIS 2019 identified 26 total topics. During data collection, five of the topics were dropped because there were fewer than five papers on each topic. The remaining 19 topics were aggregated into six major topics, as shown in Table 2.
We also analyzed the papers by paper length (full vs. short) and topic (six topics). Figure 2 shows the breakdown of the papers by topic (six topics) and paper length (full vs. short). 'full' indicates that the article is complete, while 'short' indicates that it is just part of the full article. Short papers typically have a length of about 10 pages; full papers run about 18 pages. There is a 50:50 balance of full and short papers in the 100-paper sample reviewed. Of the 100 empirical papers surveyed, most fall under the topics of analytics, data science and smart systems (27%); business models, digital transformation and innovation (26%); and other topics (21%). The distribution among the other three topics-cybersecurity, privacy and ethics of IS (11%), sustainable and societal impact of IS (8%) as well as human computer interfaces (7%)-is relatively small. Note that regrouping the topics caused an imbalance in the number of papers surveyed. While each of the three dominating topics includes more than three sub-topics defined by ICIS, the other three topics include only one or two sub-topics.

B. FACTORS AND VARIABLES
Adapting the process in [31], [32], we treated the three types of documentation, namely Method, Data and Experiment, as the factors specified by 25 different variables. Sixteen of the variables from prior studies were deemed fit for the study of reproducibility in Information Systems research. An additional 12 IS-domain relevant variables were added, for a total of 25 variables. Table 3 shows the factors, variables and their description.
Unless otherwise specified, each variable in Table 3 was encoded as a 1 or 0, where 1 represents an explicit mention of the variable in the paper, and 0 represents no explicit mention. For example, while reviewing the variable 'Goal', each paper was reviewed manually for an explicit mention of the research goal, such as ''Our research goal is. . .'' or ''The goal of the research is to. . .''. Similarly, all variable codes were manually assessed by each researcher for all papers. The codes for each paper were then compared, and any resulting discrepancies were resolved by a combined re-evaluation of the paper in question until a consensus was reached. In this way, an interrater reliability of 90% was achieved. To reiterate, we used the reproducibility metrics from [31], [32] to quantify whether a paper is R1D, R2D, or R3D reproducible, and to what degree.
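As an illustration of the coding procedure, the sketch below compares two raters' binary codes and reports the share of codes on which they agree. The text does not state how the interrater reliability figure was computed, so simple percent agreement is an assumption here, and the codes shown are hypothetical.

```python
def percent_agreement(codes_a, codes_b):
    """Share of codes on which two raters agree (simple percent agreement).

    Assumption: the study does not say which reliability statistic it used;
    percent agreement is shown only as one plausible, simple choice.
    """
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both raters must code the same non-empty set of items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


# Hypothetical codes for ten (paper, variable) pairs:
# 1 = explicit mention of the variable in the paper, 0 = no explicit mention.
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

print(percent_agreement(rater_1, rater_2))  # 0.9
```

Items on which the raters disagree (here, the fifth code) are the ones that would go to the combined re-evaluation described above.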

IV. RESULTS AND ANALYSIS
The data was analyzed using Python for its data preprocessing, descriptive statistics, and correlation analysis capabilities. Tableau, the business intelligence tool, was used to visualize the reproducibility outcomes. We initially present below the descriptive statistics for the metrics and factors. Table 4 presents the descriptive statistics for the three composite reproducibility metrics. R1D is a composite score that covers Method, Data and Experiment, while R2D covers Method and Data and R3D represents the value of Method only. The mean for R3D (0.6657) is the highest, followed by R2D (0.5634) and R1D (0.4256). These outcomes demonstrate that most papers tend to share the documentation for Method only, rather than for all three (including Data and Experiment). Table 5 below presents the descriptive statistics for the three factors measuring reproducibility. The average of Method (0.6657) is the highest, followed by Data (0.4611) and Experiment (0.15). Again, these outcomes suggest that there is a trend for sharing the methodology, which makes methodology more reproducible. Some papers, though deemed empirical, did not conduct an experiment (e.g., an analysis) involving data, which may at least partially explain why Data and Experiment are less reproducible. In addition, data sharing is still challenging for several reasons, including ownership, confidentiality, copyright and competitive advantage. Finally, the experiments may not be sufficiently standardized. [...] 9 for Data, but only 2 for Experiment. Each paper has at least 5 variables for Method, and more than 50% of the papers have more than 10 variables for Method. Each paper also has at least one Data variable, and about 25% of the papers have three or fewer Data variables. More than half of the papers do not show reproducibility for the Experiment factor. Table 7 presents the descriptive statistics for the variables comprising the factor Method for the 100 empirical papers.
The frequency count indicates the number of papers that explicitly mentioned the variable. For example, the frequency count of 86 for 'Goal' indicates that 86 papers mentioned the research goal. Over 90% of the documentation surveyed mentioned the problem statement (97%), research method (93%) and conclusion (94%). Table 8 presents the descriptive statistics of the sample of 100 empirical papers for the variables making up the Data factor. The frequency count, again, represents the number of papers with the specific variable. All 100 papers surveyed mentioned the source of data, whether primary or secondary. More than half of the documentation surveyed provided the model results (65%) and evaluation criteria (57%). Table 9 presents the descriptive statistics for the two variables comprising the factor Experiment for the 100 empirical papers. Only 7% of the documentation shared the method's source code, and only 23% identified the software used for analysis. Table 10 shows the mean score of the three reproducibility factors in each topic. Cyber-security, Privacy and Ethics of IS papers have the highest average Method score (0.7143), while Analytics, Data Science and Smart Systems papers score highest in Data (0.5062). Papers in Business Models, Digital Transformation and Innovation provide the highest score in Experiment (0.2115). Table 11 shows the mean value of R1D, R2D and R3D by topics. Analytics, Data Science and Smart Systems (0.4439), Business Models, Digital Transformation and Innovation (0.4462) and Other Topics (0.4478) have the highest R1D score. Analytics, Data Science and Smart Systems have the highest R2D score (0.5917) while Cyber-security, Privacy and Ethics of IS have the highest R3D score (0.7143). Figure 3 depicts three diagrams that spider plot the means for the variables in each of the three factors of Method, Data and Experiment for the sample of empirical papers.
Under the Method factor, the problem statement, research method, and conclusion have the highest scores; more than 90 percent of the papers contain these variables. Algorithm, machine learning, and prediction appeared least often. Under the Data factor, data source, evaluation criteria and model results are mentioned most often, and data preprocessing is barely discussed at all. Under the Experiment factor, even though there are only two variables, the frequencies of method source code and software used are both below 30 percent, indicating that most papers do not give sufficient details about the experiments to support reproducibility. Comparing the spider plots reveals that the business computing research papers we examined pay more attention to the Method factors, with many variables scoring above 80. Variables such as problem statement, research method and conclusions, which have scores over 90, are typically given priority in these papers. In contrast, the Experiment variables score below 30, indicating that experiment details are scant or absent. These findings are understandable: it is relatively more difficult to explain the details of software and code than the details of other aspects of the research. Likewise, typical empirical papers in business computing research are more data-driven, and focus on association or correlation rather than on causality, for which experiments are more appropriate.
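The reported means are consistent with simple averaging: the Experiment mean (0.15) equals the average of its two variable frequencies (0.07 and 0.23), and the R1D and R2D means equal equal-weight averages of the factor means. The sketch below checks this arithmetic. The equal weighting is an inference from the reported figures, not a definition taken from the study; the exact metric definitions are given in [31], [32], and the function names are hypothetical.

```python
from statistics import mean


def factor_score(variable_values):
    """Average of a factor's variable values (0/1 codes or sample frequencies)."""
    return mean(variable_values)


def composite_scores(method, data, experiment):
    """Composite metrics as equal-weight averages of factor scores.

    Assumption: equal weights are inferred from the reported means; see
    [31], [32] for the exact definitions used in the original framework.
    """
    return {
        "R1D": mean([method, data, experiment]),  # Method, Data and Experiment
        "R2D": mean([method, data]),              # Method and Data
        "R3D": method,                            # Method only
    }


# Experiment factor: 7% shared source code, 23% identified the software used.
experiment = factor_score([0.07, 0.23])
print(round(experiment, 4))  # 0.15

# Factor means reported in the text for Method and Data.
scores = composite_scores(method=0.6657, data=0.4611, experiment=experiment)
print({name: round(value, 4) for name, value in scores.items()})
# {'R1D': 0.4256, 'R2D': 0.5634, 'R3D': 0.6657}
```

Because each composite adds a lower-scoring factor to the average, the ordering R3D > R2D > R1D follows directly from Method > Data > Experiment.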

B. REPRODUCIBILITY METRICS
The results for the reproducibility metrics appear in Figure 4. These bar charts show the distribution of scores for Method, Data and Experiment; none of them follows a normal distribution. The charts show the mean values for variables for each of the factors described in Table 3. For example, Figure 4 shows papers usually have a better score in the [...] papers have an R1D of 0.2 to 0.5, while a few have an R1D in the range of 0.6 to 0.8. According to the analysis by topic for R1D, Figure 5b shows that papers in analytics, data science and smart systems, as well as business models, digital transformation and innovation, have the highest R1D scores, at over 0.44. As indicated by the composite reproducibility score, reproducibility for R1D is not high. The bar charts for R2D (Figure 5a) show that the highest frequency ranges are from 0.3 to 0.5, and no papers have an R2D below 0.1. This finding shows that reproducibility performance is much higher when Experiment is not included. For R3D, Figure 5a shows that the highest frequency falls in the interval of 0.6 to 0.8, and almost no papers have an R3D below 0.25. In terms of distribution by topic (Figure 5b), papers in analytics, data science and smart systems have the highest average R2D score (over 0.59), and papers in each topic have a mean score of over 0.47. Cyber-security, privacy and ethics score the highest in R3D (0.71), followed by analytics, data science and smart systems papers (0.68). This means that overall, the reproducibility performance of Method is better than that of Data and Experiment. Analytics, data science and smart systems papers also usually produce better reproducibility levels than papers in other topics, although the difference is not significant. Figure 6 shows the analysis by paper length (full vs. short). The scatter plot with trend lines shows blue squares representing short papers and orange crosses representing full papers.
The X- and Y-axes depict the average scores of Method and Data respectively. The chart shows a positive correlation between Data and Method for both paper lengths: if a paper performs well for Data, it is likely to perform well for Method, too (p<0.05). Thus, the Method and Data scores show a significantly positive relationship, with coefficient estimates (0.77 for short vs. 0.57 for full papers) greater than 0. The R-squared for short papers (0.1613) is slightly higher than that for full papers (0.1066), indicating that for short papers a larger share of the variation in Data scores can be explained by the Method scores. Papers that share details on their method tend to share details of their data as well, especially short papers. Figure 7 is a scatter plot that shows the linear association between overall reproducibility R1D (which is the weighted average of Data, Method and Experiment) and the reproducibility of method and data R2D (which is the weighted average of Data and Method). The blue squares represent short papers, and the orange crosses represent full papers, with the size of each marker representing the composite reproducibility R1D score. The chart shows that R1D increases as R2D increases for both types of papers, with R2D significantly accounting for more than 67% (p<0.0001, R² = 0.6722) of the variation in R1D. In other words, overall reproducibility is largely determined by the disclosure in the data and method sections. Compared to short papers (0.74), the R2D of long papers tends to have a stronger impact on R1D, as indicated by a higher coefficient estimate (0.93). Most of the highest-scored papers at the top-right corner are long papers. Therefore, papers with high R2D scores tend to have higher R1D scores, and long papers generally reflect higher reproducibility.
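The trend lines, coefficient estimates and R-squared values discussed above come from simple linear fits of one score on another. As a minimal sketch of that calculation, assuming ordinary least squares with a single predictor, the Python function below returns the slope, intercept and R-squared. The function name `fit_line` and the five sample score pairs are hypothetical and are not taken from the study's dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs with one predictor.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, R-squared equals the squared correlation.
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical Method (x) and Data (y) scores for five papers.
method_scores = [0.50, 0.57, 0.64, 0.71, 0.79]
data_scores = [0.22, 0.44, 0.33, 0.56, 0.44]

slope, intercept, r2 = fit_line(method_scores, data_scores)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}")
```

A positive slope corresponds to the positive association described in the text, and R-squared gives the share of variation in the Data score explained by the Method score. Significance testing (the p-values) is not covered by this sketch.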
Figure 8 is a scatter plot that shows the linear association between method reproducibility R3D (method score only) and overall reproducibility R1D (weighted score of all three). The orange crosses stand for papers with an experiment setup and the blue circles stand for papers without an experiment setup. The size of each point represents the R1D score. Regardless of experiment setup, R1D goes up as R3D goes up. However, for papers with an experiment setup, the relationship is statistically significant (p<0.0001) and R3D can explain 31% of the variation in R1D (R² = 0.3108). On the other hand, papers without an experiment setup do not have a statistically significant relationship between R3D and R1D (p>0.05). Most of the highest-scoring papers at the right corner are papers with experiment setups. Therefore, papers with high R3D (method) scores tend to have high R1D (overall reproducibility) scores, and papers with experiment setups tend to be more reproducible. Figure 9 is a scatter plot that shows the linear association between R2D (the weighted average score of Data and Method) and R3D (score of Method). The blue circles represent papers without data preprocessing, and the orange crosses represent papers with data preprocessing. The chart shows that R2D tends to increase as R3D grows, regardless of whether the data preprocessing is shared or not. The coefficient estimates for both with and without data preprocessing are statistically significant (p<0.05). The R-squared for papers without data preprocessing (0.6621) is higher: about 66% of the variation in R2D can be explained by R3D. The coefficient estimate for papers without data preprocessing (0.78) is also higher, indicating that each unit increase in R3D is associated with a greater increase in R2D. Therefore, papers without data preprocessing can be made more reproducible by having a more rigorous methodology. Figure 10 shows a series of box plots for the six groups of research paper topics analyzed.
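Because the three reproducibility degrees recur throughout these figures, a compact sketch of how they relate to the factor scores may help. The text describes R1D as a weighted average of Method, Data and Experiment, R2D as a weighted average of Method and Data, and R3D as the Method score alone. The weights are not restated in this section, so the equal weights below are an assumption, as are the function names and the example factor scores.

```python
def weighted_mean(scores, weights):
    """Weighted average of scores; weights need not sum to one."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def reproducibility_degrees(method, data, experiment, weights=(1.0, 1.0, 1.0)):
    """Compute R1D, R2D and R3D from the three factor scores (each in [0, 1]).

    `weights` is (Method, Data, Experiment); equal weights are an
    assumption made for this sketch, not a value taken from the paper.
    """
    w_method, w_data, w_experiment = weights
    return {
        "R1D": weighted_mean((method, data, experiment),
                             (w_method, w_data, w_experiment)),
        "R2D": weighted_mean((method, data), (w_method, w_data)),
        "R3D": method,
    }


# Hypothetical paper: 10 of 14 Method, 5 of 9 Data, 1 of 2 Experiment variables.
degrees = reproducibility_degrees(method=10 / 14, data=5 / 9, experiment=1 / 2)
print({name: round(value, 4) for name, value in degrees.items()})
```

Under this formulation, a paper that documents no Experiment variables always has an R1D below its R2D, which matches the observation that scores drop once Experiment is included.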
Papers on topics such as human computer interface, as well as the sustainable and social impact of business computing, have lower average R1D (reproducibility for method, data and experiment) and R2D (reproducibility for method and data) scores. The results imply that, for these papers, data availability is poor and little to no source code or detail on methodology is provided in the research literature. This makes reproducing the experiments harder for some of the papers under these topics. But given the empirical nature of papers in this conference, it is likely most of the research did not require experiments. Figure 11 is a quadrant chart that maps the relationship between the Method and Data scores. The color of each dot represents the average composite R1D score, and the dot size represents the number of papers surveyed for each topic. The trend line shows a positive association between the average Method score and the average Data score. Topics with a higher Method score tend to have a higher Data score; examples include analytics, data science and smart systems. Analytics, data science and smart systems lead in Data, while cyber-security, privacy, and ethics of business computing lead in Method. On the other hand, sustainable and societal impact as well as human computer interface have below-average Method and Data scores, and thereby have a relatively low R1D score. However, business models, digital transformation and innovation are the only topics that tend to share more about data and less about method, while still gaining a high average composite reproducibility score for R1D. Therefore, when publishing their research, researchers should consider sharing more specifics regarding the method and data of their studies to increase reproducibility. Doing so will no doubt enhance the overall quality of the research.
Figure 12 is a quadrant diagram that maps the relationship between Experiment and R1D (Method, Data and Experiment) scores. The color of the dots represents the average composite R1D score, and the size of each dot represents the number of papers surveyed for each topic. The trend line shows a positive association between the average Experiment score and the R1D. Topics such as analytics, data science and smart systems outperformed on both scores. In fact, Other Topics (see Table 2 above) is dominant in R1D, while business models, digital transformation, and innovation are notable in Experiment. Sustainable and societal impact as well as human computer interface remain below average for Experiment and R1D, and thereby are the least reproducible. To increase the overall reproducibility of business computing research, disclosing the experiment process is an important consideration when publishing research findings. Figure 13 depicts a pair of boxplots for the reproducibility metrics R1D (Method, Data and Experiment), R2D (Data and Method) and R3D (Method) for all the papers. Compared to the R1D of short papers (with the mean below 0.4), the R1D of full papers has a higher mean value (above 0.4). The mean values for R2D and R3D are also slightly higher for full papers. The implication here is that full papers tend to have more detailed explanations than short papers and are likely to include more details for Method, Data and Experiment. Hence, full papers are generally more reproducible than short papers. To encourage reproducibility, authors should consider publishing their papers with fuller content. Figure 14 is a bar chart showing the count analysis for Conclusion, grouped by paper length (full vs. short). According to the figure, the number of papers (with and without conclusions) for full and short papers is the same. Almost all the papers, regardless of length, provide a conclusion for the study.
and Experiment (0.11) factors, and therefore are less reproducible. Short papers do not typically elaborate the details of the data or experiments, nor do they provide the tools for analysis. To increase reproducibility, researchers or publishers are encouraged to publish full papers, including details on the data, method and experiments. Conferences may consider reevaluating the option to submit short papers. Figure 16 is a set of bar charts showing the distribution of the absolute scores for Method, Data and Experiment. The absolute score represents the sum of the variables listed under each of these three factors. It is notable that there are, in total, 14 variables for Method, 9 for Data, and only 2 for Experiment. To examine individual papers more closely, we apply a simple rule of thumb. We designate a factor as reproducible when at least half of its variables are present. A paper is Method reproducible when it has seven or more variables, Data reproducible when it has more than four variables, and Experiment reproducible when it has at least one variable. For Method, only 5% (2+3) of the papers are not reproducible, and 70% (18+28+17+5+2) of the papers have more than eight variables. Forty-two percent (15+15+6+5+1) of the papers are Data reproducible. Interestingly, 72% of the papers have neither of the two variables in the Experiment factor, indicating that only 28% of the papers have Experiment reproducibility. This finding could be attributed to the fact that most of the papers in the conference are data-driven, not experiment-driven. There are 25 variables in total representing the reproducibility performance. A paper is defined as reproducible if more than 12 variables across the three factors are present. Sixty-seven percent (12+10+14+10+10+4+4+2+1) of the papers have more than 12 reproducibility variables, and 11% (4+4+2+1) have 18 or more variables. Typically, a majority VOLUME 10, 2022

V. DISCUSSION
Analyzing the reproducibility for the 100 papers, we found that 67 papers, or 67%, are reproducible. As many as 95% of the papers are Method reproducible, 42% are Data reproducible, while only 28% are Experiment reproducible. Many papers in the field of business computing performed well for Method, but further improvements of reproducibility performance can be made for Data and Experiment.
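The percentages above follow from the half-of-the-variables rule of thumb: seven or more of 14 Method variables, more than four of 9 Data variables, at least one of 2 Experiment variables, and more than 12 of 25 variables overall. A minimal Python sketch of that classification is shown below. The names `TOTALS`, `factor_reproducible` and `classify` are hypothetical, and the example counts are illustrative rather than drawn from a surveyed paper.

```python
# Number of variables per factor, as stated in the study.
TOTALS = {"method": 14, "data": 9, "experiment": 2}


def factor_reproducible(present, total):
    """A factor counts as reproducible when at least half its variables are present."""
    return 2 * present >= total


def classify(method, data, experiment):
    """Apply the rule of thumb to one paper's counts of documented variables."""
    counts = {"method": method, "data": data, "experiment": experiment}
    result = {
        factor: factor_reproducible(counts[factor], TOTALS[factor])
        for factor in TOTALS
    }
    # Overall: more than 12 of the 25 variables must be present.
    result["overall"] = sum(counts.values()) > 12
    return result


# Hypothetical paper documenting 9 Method, 5 Data and 1 Experiment variables.
print(classify(method=9, data=5, experiment=1))
# Hypothetical paper at the boundary: 7 Method, 4 Data, 0 Experiment variables.
print(classify(method=7, data=4, experiment=0))
```

Using `2 * present >= total` avoids fractional thresholds: for the nine Data variables it requires five or more, which agrees with the "more than four" wording.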
The findings indicate that full papers generally score higher in all the reproducibility metrics and in all three factors. This outcome stems from the fact that short papers have limited room to provide details on Method, Data and Experiment. To encourage reproducibility, academic researchers should prioritize publishing their research in full, with details explaining their method, data and experiments. Reproducibility varies by topic. Other topics, including healthcare, economics and design science, have high reproducibility. It is also evident that topics such as the sustainable and societal impact of IS as well as human computer interface are the least reproducible and received the lowest scores for all the reproducibility metrics. These topics are emerging, and data availability is limited. It is likely the papers are more case-driven or based on interviews and the like, resulting in qualitative data as sources. In terms of reproducibility by factor, cyber-security, privacy and ethics perform the best in Method; analytics, data science and smart systems lead in Data; while business models, digital transformation and innovation are the topics that lead in Experiment. It must be noted that while we adapted the methodology from [31], [32], there are several key differences. While [31] compares academic papers to industry papers published in the period 2013-2016 and is a panel study, this paper focuses only on academic papers that were published during one year. While their studies focus on research in artificial intelligence, this study focuses on business computing/information systems research. Additionally, this study analyzes the data based on more specific topics and delineates the reproducibility differences among the topics. We also developed numerous additional charts to shed light on this rich dataset.

VI. LIMITATIONS
Our research has a few limitations. First, the sample selected for review was limited because fewer than 30% of the papers published in ICIS 2019 were reviewed. The papers used for this research likely do not fully represent the entire population of papers, thereby impacting generalizability. In addition to the limited number of papers surveyed, this study is a snapshot in time. Future studies could examine conference papers over time and thereby identify trends. Additional limitations include the validity of the data. Although we cross-validated the results, human errors do occur when conducting manual data collection and survey-type analyses. By and large these errors are minimized when multiple teams cross-check individual paper classifications. To reiterate, each paper was read and the reproducibility features were coded by four coders. The four coders were trained in the methodology and checklist template. Any differences in coding were reconciled through discussion and consensus. While the reading itself has elements of subjectivity, this approach is typically used in this type of study. It should also be noted that certain variables, such as algorithm, machine learning, prediction and source code, may not be relevant to the overall theme of the conference and papers. Likewise, considering they are data-driven and associative, certain studies may not require experiments. It can also be argued that the reproducibility score may depend on the research topic itself. Research topics that are data-based and quantitative in nature would likely score better on Method and Data. Furthermore, these papers are mostly data-driven research and not about the design of computer artifacts. Therefore, code is not a prominent issue in this study. This study is also a descriptive analytic study of 'data as is' that determines the relative absence or presence of reproducibility. It is not a predictive study attempting to predict the absence or presence of reproducibility.
In the future, reproducibility models for qualitative research may be developed. This study looked at reproducibility through a documentarian's lens. However, there are other methods that can independently assess reproducibility, or serve to complement similar studies. Finally, one size does not fit all. In the future, more sophisticated frameworks may be developed to suit the conferences and journals of individual disciplines.

VII. CONCLUSION
Using visualization and descriptive analytics, this exploratory study offers a panoramic view of the state of reproducibility of business computing research. The study paints a mixed picture. While 67% of the surveyed papers appear to be reproducible, this outcome indicates there is significant room for improvement in publishing reproducible papers. Across the three factors of Method, Data and Experiment, none of the papers meets all 25 criteria. Data and Method are closely associated with one another, as expected, since data is typically utilized in the analysis process. Experiment falls short, but it must be acknowledged that the topics and nature of the conference do not lend themselves well to experiments. Emerging and sharply focused topics, such as the economics of computing, health information technology, design science and future of work, appear to have better reproducibility than other topics such as sustainable and societal impact, and human computer interfaces. Because the topic affects the reproducibility mode one may apply, further research is warranted to help delineate the differences among topics. Also, research in several of the topics is more slanted towards the conceptual. Additionally, we found that paper length (full vs. short) also matters in terms of reproducibility. Full papers with greater documentation are likely to provide more details about the method, data and experiments (R1D), and they are generally more reproducible. It may seem an obvious suggestion, but we advise conferences to accept only full papers and peer-reviewed papers; this would improve the reproducibility of findings such as those in our study. It is conceivable that the review process would also evolve over time to include more reproducibility-related criteria for evaluation.
From a prescriptive perspective, we offer several recommendations to enhance the reproducibility of computing research. Our framework and checklist are starting points, as they can be applied both to assess reproducibility before a research study is carried out and to evaluate a paper or report arising from the research. A significant benefit is the mitigation of the risk of carrying out a research project only to discover at a later stage that it is not reproducible. We suggest prospective authors ask themselves the questions given in the checklist for their study area. This would be a major departure from the traditional approach of merely making code or data available at journal sites or repositories such as GitHub, validating code post facto, etc. In addition, we suggest that reproducibility analyses be conducted in the context of data governance, ethics, awareness of intellectual property issues, privacy, security, transparency and other issues. There is also an urgent need to continue to build methods, models and tools to conduct studies, both at the paper/project level and at a large-scale macro level, for example, to assess the reproducibility of entire sets of conference papers. These are daunting tasks since we know from the literature review that there is an eclectic group of models and tools across the broader scientific disciplines and the more specific sub-fields of computing, and at the same time, one model may not fit all research situations. We would be remiss if we did not mention the need for additional research into the validation of the reproducibility methods themselves. While studies, including this one, are emerging to examine the presence of reproducibility, there is a dearth of 'how-to' mechanisms. This gap must be addressed.
Though there is increased awareness of the need for reproducibility, better communication of the benefits of rigor in computing research, and of the risks and consequences of a failure to reproduce or replicate research findings, is needed so that the larger benefits of computing and technology research can be harnessed.
Reproducibility in general, and in business computing research specifically, is at a critical stage of development, but increased awareness and advances in reproducibility methods and tools can accelerate the maturing process.