The PETSc Community as Infrastructure

The communities that develop and support open-source scientific software packages are crucial to the utility and success of such packages. Moreover, they form an important part of the human infrastructure that enables scientific progress. This article discusses aspects of the Portable Extensible Toolkit for Scientific Computation community, its organization, and technical approaches that enable community members to help each other efficiently and effectively.

To meet the technological challenges of the 21st century, the simultaneous revolutions in data science and computing architectures need to be mirrored by a revolution in scientific simulation that provides flexible, scalable, multiphysics multiscale capabilities in both traditional and new areas. This simulation technology rests on a foundation of numerical algorithms and software for high-performance computing (HPC). This foundation rises in importance to the level of classical hard infrastructures, but it requires human investment and new ways of organizing the effort for software and algorithm development, support, and maintenance.
Much simulation technology today is developed and supported with a community open-source software paradigm. 1,3,5,9 Many numerically oriented open-source projects, including SciPy, Julia, and Stan, thrive because of their communities; those without a community die out, have only a fringe usership, or are maintained (as orphan software) by other communities.
In addition, an explicit focus on software ecosystems (collections of interdependent products whose development teams have incentives to collaborate to provide aggregate value) is addressing growing HPC complexity. 6 Notable efforts include the xSDK, where community policies are helping with coordination among numerical packages, and E4S, a broader effort addressing functionality across the HPC software stack.
This article presents an explanatory case study of the Portable, Extensible Toolkit for Scientific Computation (PETSc), 2 considering community as infrastructure. PETSc began in the early 1990s at Argonne National Laboratory as a project for research on parallel numerical algorithms. Since then, developers, users, and functionality have grown substantially, driven by continually expanding community needs to fully exploit advances in HPC architectures for next-generation science. The authors have over 160 years of combined experience with PETSc; their training ranges from mathematics to computer science to science and engineering. One author also supports PETSc at a supercomputing center; another maintains the testing and merge request infrastructure. Several authors are liaisons with other software communities.
PETSc comprises software infrastructure (code and tools) plus human infrastructure: the community of people who develop, support, maintain, use, and fund PETSc, their interactions, and their culture. The human infrastructure (people and their interactions as a community, within the broader DOE, HPC, and computational science communities) is foundational and enables the creation of sustainable software infrastructure.
PETSc was not originally purposefully designed to support long-term community software infrastructure. Rather, work on the software inspired the creation of a set of practices to enable a small development team with large ambitions and a long time horizon to develop and support software capable of solving problems of interest to the developers and their collaborators. However, these practices, reviewed in the following, have wider benefits, and certain community properties could serve as a template for long-term software infrastructure:
› committing to continually advancing library capabilities as needed by next-generation science and HPC architectures.
Spread throughout the world, the PETSc community allows the real-time transfer of knowledge across institutions and application fields. Also, community interactions promote algorithmic development, enable state-of-the-art advances, and benefit the scientific community.
We organize this article as follows. In the next section, we discuss the purposes of the PETSc community and the various roles that members play. After that, we introduce several key organizational principles and communication patterns. We then introduce the paradigm of debugging by e-mail, which encapsulates the philosophy and software technologies we use to help each other (regardless of location) use, debug, and improve PETSc. This section shows how technical choices in design and software details can be made specifically to enhance the community experience.

COMMUNITY
This section outlines the myriad purposes of the PETSc software and its roles within the PETSc community. First, of course, the PETSc community (similar to other package communities) is embedded in the DOE, HPC, and broader computational science communities.

Purposes of PETSc
PETSc serves many purposes as a software library that connects research in applied mathematics to usage within applications in science and engineering. These include:
› a research platform targeting innovative algorithmic development;
› a well-supported HPC library;
› a repository of template applications via a wealth of example codes;
› a compendium of algorithms, with an algorithmic management system that provides concrete, scalable implementations of a wide range of methods described in the applied mathematics literature;
› an application development framework;
› a pedagogical tool for training numerical analysts on HPC platforms 4;
› a source of best choice numerical methods in its role as an interface between academic algorithmic development and the needs of users in science and engineering;
› an extensible interface to complementary HPC software, such as SuperLU and hypre.

Roles of PETSc Community Members
Virtually all active PETSc community members are PETSc users; a smaller subset of these, who often began as PETSc users, are also PETSc developers. All PETSc users provide important contributions, including bug reports, bug fixes, improved documentation, and suggestions for new features. Individuals often move between different roles in the PETSc community. There is a "long tail" who contributes less frequently than the most active developers, yet collectively contributes a great deal. Over one hundred people are contributors to the PETSc Git repository; hundreds more communicate through e-mail and GitLab issues each year, and thousands use PETSc directly or via other toolkits.
The community structure is crucial in providing a pathway to increasing involvement for interested users. One way to characterize PETSc community members is along institutional lines.
› Academic-oriented users are students, faculty, and staff focusing on research and development, employed by universities and research laboratories. Students may use PETSc to do homework or develop a paper or thesis code. Students often contribute code back to PETSc, and, as they graduate, bring PETSc to their new institutions. Some students have become PETSc developers.
› Industrial users employ PETSc in their company's research or commercial products. PETSc's 2-clause BSD license eases its commercial use. These users may request support that is unlikely to be funded by research grants, such as support for Microsoft Windows; fortunately, there are avenues for PETSc community members to help with these requests. Industrial users require discretion and confidentiality. They cannot always share their use cases, so we must provide general solutions without details of the specifics. Developing the trust needed for industrial users is a gradual process whose importance must be recognized by both sides.
Another way to categorize PETSc community members is by the goals of their work, as shown in Figure 1.
› Algorithm developers focus on devising and analyzing algorithms and hence may be less concerned about generality and usability. They use PETSc because it provides infrastructure for HPC architectures, allowing them to avoid unnecessary coding. Algorithm developers face the challenge of writing scalable implementations, which, even in PETSc, can be time consuming and involve a steep learning curve. However, such people find that the benefits of using PETSc, including the broad impact of their work on HPC applications, outweigh those of using packages with a gentler learning curve, such as MATLAB.
› Scientific toolkit developers build systems that tackle a subset of PETSc functionality but at a higher level of abstraction, with more specific support for their target class of problems. Such toolkits, including Firedrake, MOOSE, and Deal.II, leverage PETSc capabilities and introduce additional infrastructure.
› Application developers focus on creating a code that addresses one specific simulation. They are often discipline scientists or engineers, who benefit from performance enhancements provided through PETSc composability, where upgrades in algorithms and data structures can occur seamlessly from a user perspective, yet provide significant performance increases.

ORGANIZATION AND COMMUNICATION
We now summarize the organizational and communication patterns in PETSc due to its various purposes and member roles. Besides communicating within funded projects, institutional settings, and events, users engage in the PETSc community through online support mailings, GitLab issues, and Slack channels. Annual PETSc user meetings include tutorials on leveraging library functionality for research while also highlighting users' science achievements made possible by advances in PETSc features.

Engagement When Problems Occur
Support is a crucial aspect of a healthy software community. Members of the PETSc community usually respond to requests within hours, if not minutes. This engagement helps new users feel welcomed and valued, make rapid progress, and gain confidence. Through users' feedback, developers learn what works, what does not, and where improvements are needed. PETSc community members discover new research topics from feature requests and discussions.
However, providing excellent support is a substantial effort, particularly when users encounter difficult bugs or performance issues at scale. User-developer communication on a particular topic can span weeks or even months. PETSc developers need to be patient and consistently engaged with users. One may question whether this practice is sustainable, but it has worked reasonably well. In the last section, we discuss some technical approaches to providing support.

Trust Within the Community
To maintain the vitality of the library, new algorithmic developments must be rapidly integrated, bugs promptly fixed, and awkward constructions removed. These activities require the PETSc community to establish a high level of trust, communicating that the library will be well supported even in the face of rapid evolution, and that code will continue to run with help from the community. Members of the PETSc community have a wide range of professions, backgrounds, and levels of involvement, with individuals often participating in several ways over the years. Engagement is key to disseminating tacit knowledge and developing users' skills and social support so that people can transition to become developers and mentors. The PETSc community has developed a broad base of people with expertise and kindness to reduce and report bugs, mentor newcomers, and contribute in other ways. To help improve the atmosphere, which can be difficult for newcomers, the PETSc community has adopted a code of conduct. e

Community to Community
Application communities often treat PETSc as a software ecosystem instead of a stand-alone package. As a result, they rely on PETSc to manage necessary low-level tools, such as MPI, BLAS/LAPACK, and vendor packages used on accelerators. Application communities also appreciate unified solver interfaces, particularly linear preconditioners, which enable application codes to access third-party libraries, such as MUMPS, SuperLU, and hypre with little effort. Often "technical language" barriers exist between communities. For example, an expert in contact mechanics may describe a solver convergence issue as "we found a PETSc error when contacts occur with a frictionless model." A solvers expert might find it hard to resolve such an issue. Fortunately, other PETSc community members may have domain expertise and can serve as liaisons between communities. Such individuals speak the languages of both communities, understanding both PETSc's capabilities and the needs of their communities. Thus, they can explore, explain, and introduce PETSc features to their communities. Such liaisons help expand PETSc's reach across disciplines and reduce the centralized maintenance burden by addressing many questions directly in their communities while contributing patches, feature requirements, and even serving as testers of software releases.
An important PETSc subcommunity is systems engineers who manage institutions' computational infrastructure. They often are the first to encounter problems that need the attention of PETSc developers. Their expertise can help rapidly debug problems and develop fixes. Package maintainers, for example, for APT, are also a valuable resource, as they track PETSc on particular configurations; they often have excellent suggestions for improvements to PETSc's configuration and installation.

Responding to Change
Communities must respond with innovative and creative solutions to changing circumstances. For numerical software, this includes the continual emergence of new science drivers and techniques, currently data science and artificial intelligence, as well as new hardware architectures. For example, a large shift in HPC is underway with the incorporation of graphics processing units (GPUs) into scientific computing. Major organizations, such as DOE, have responded with, for example, the ECP, where community open-source projects, including PETSc, are aggressively developing innovations in data structures and algorithms for new architectures. 7,11 The PETSc community empowers developers to be creative by providing the autonomy to be innovative while still maintaining guidelines for development f and town squares to organize overall development plans. This approach, along with PETSc's wide variety of contributors, enables a level of agility that might not otherwise occur. This approach also promotes project-specific planning (for example, as needed for work proposed and funded in particular grants) and coordination among development communities overall.
e https://gitlab.com/petsc/petsc/-/blob/main/CODE_OF_CONDUCT.md
f https://petsc.org/release/developers

Enabling Research Collaborations
PETSc's community helps members identify funding opportunities, access expertise, and transition between roles in the project. One of the greatest difficulties in maintaining a coherent software project over decades is providing career paths for contributors. PETSc gives academic contributors a solid foundation for advancement via awards (e.g., SC Gordon Bell prizes and SIAM prizes), the highly cited users' manual, professional recognition, and productive collaborations born from PETSc development, maintenance, and support. In addition, PETSc provides resource sharing from collaborative grants and collaboration opportunities that extend beyond the development group. Sometimes, PETSc affiliation may be more important than departmental affiliation, especially since modern academic departments are often atomized, with little internal collaboration. The community provides strong academic connections for industrial and laboratory members, tangible outputs recognized by future employers, and active participation in the wider computational science community.

Engagement With Funding Institutions
Even the smallest community open-source projects cannot exist without some funding and institutional support. Usually, it is a combination of grants from governmental or nongovernmental agencies, in-house funding within particular institutions, and less formal systems that allow employees to contribute to open-source packages during a portion of their regular employment. Members of the PETSc community are actively engaged with program development, including communicating with program managers at the U.S. Department of Energy, the National Science Foundation, and with institutional management to ensure that support is provided and maintained. This approach enables the community to do more with less.

DEBUGGING BY E-MAIL
In this section, we introduce PETSc's approach to debuggability, which is designed and implemented to improve engagement and support. PETSc's debuggability spans from configuration to code execution, with the salient features highlighted in Figure 2. Our commitment to software support leads us to maintain our own simple, integrated tools for many tasks that conventional software wisdom would dictate should be performed exclusively by full-featured external tools. The combination of internally developed and external tools used in PETSc is unique to its history and "high-end" HPC focus and is not necessarily the best approach for other open-source packages.

Configuration Debugging
HPC software systems have complex execution environments, including great variance in hardware and software. Software developers must expend considerable effort to configure and build their code for different situations. When failures occur, the software developers need to have information available in a usable format to diagnose and fix the problems. Configuration failures are the most common support issues the PETSc community faces. Thus, having a debuggable configuration system with comprehensive logging is critical. Rather than using a standard configuration system, such as GNU Autotools or CMake, PETSc has a bespoke configuration system with extensive checking, written in Python, which logs everything during configuration in a single file, configure.log. When a check fails, it generates clear error messages and a Python stack trace. PETSc users attach configure.log and another file, make.log, generated by make, when they encounter configuration errors. By examining the two files, PETSc developers can quickly determine why the configuration system made specific choices and what went wrong. In addition, since the configuration system is bespoke, the PETSc community can easily add new checking, testing, and logging. By contrast, CMake is notoriously difficult to debug by e-mail because it logs information in various directories and does not log much of its process.
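The single-log idea described above can be illustrated with a minimal Python sketch. This is not PETSc's configuration code: the function names (run_checks, find_c_compiler, find_missing_tool) and the log messages are hypothetical and chosen only for illustration. The sketch runs a sequence of checks, writes every step to one file, and records a clear message plus a Python stack trace when a check fails.

```python
import logging
import shutil
import tempfile
import traceback
from pathlib import Path


def run_checks(checks, log_path):
    """Run each (name, function) check, logging every step to one file.

    Returns the name of the first failing check, or None if all pass.
    """
    logger = logging.getLogger(f"configure-sketch-{log_path}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.FileHandler(log_path, mode="w")
    logger.addHandler(handler)
    try:
        for name, check in checks:
            logger.debug("Running check: %s", name)
            try:
                result = check()
                logger.debug("  result: %r", result)
            except Exception as exc:
                # Record a clear message plus the full Python stack trace.
                logger.error("Check '%s' failed: %s", name, exc)
                logger.error(traceback.format_exc())
                return name
        return None
    finally:
        logger.removeHandler(handler)
        handler.close()


def find_c_compiler():
    # Hypothetical check: look for any of a few common compiler names.
    for candidate in ("cc", "gcc", "clang"):
        path = shutil.which(candidate)
        if path:
            return path
    raise RuntimeError("no C compiler found on PATH")


def find_missing_tool():
    # Deliberately failing check, to show what a failure looks like in the log.
    raise RuntimeError("required tool 'no-such-tool' not found")


if __name__ == "__main__":
    log_file = Path(tempfile.mkdtemp()) / "configure.log"
    failed = run_checks(
        [("C compiler", find_c_compiler), ("missing tool", find_missing_tool)],
        log_file,
    )
    print(f"first failing check: {failed}")
    print(log_file.read_text())
```

Because both successful results and failures land in the same file, a user only needs to attach that one file to an e-mail for a developer to see which choices were made and where the process stopped.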

Runtime Error Debugging
The PETSc library strives to provide descriptive error messages that explain why and where errors have occurred, making it easy for PETSc developers to diagnose by e-mail what went wrong and assist users with fixes. PETSc has extensive code to assist in this regard. See Listing 1, which shows code that adds two vectors with y += alpha*x. Every PETSc function returns a PetscErrorCode, indicating whether the function executed successfully, and if not, what error occurred. In PETSc source code, every function call is error checked, as in lines 12-14. We make errors manifest early rather than later to avoid obscure error messages. The default error handler prints the stack trace leading to the error, including function names, file names, and line numbers. The stack trace is built inside the two macros PetscFunctionBegin and PetscFunctionReturn(0); see lines 3 and 16. PETSc also provides utilities to check the integrity of function parameters; see lines 4-10. All application programming interfaces (APIs) shown here are public; users are encouraged to apply the same strategy in their code.
PETSc also has APIs to assert properties of the code so that useful error messages are generated promptly if the code behaves unexpectedly. For example, once a matrix is preallocated or assembled, one can set a property of the matrix to indicate that in subsequent insertions one will insert only to existing nonzero locations.

Memory Debugging
Memory corruption problems are common; therefore, memory allocations are done through a PETSc-specific API that records information and sets sentinels around the allocations in debug mode. With command-line options, PETSc will initialize the allocated memory with not-a-number (NaN) values; using the uninitialized memory in floating-point operations will generate an appropriate error message. Also, PETSc can check the integrity of the entire heap of PETSc-allocated memory at every allocation. PETSc codes also can output information about memory that has never been freed during the PETSc finalization stage, to detect memory leaks. Because these are lightweight, Valgrind-like features built into the library, their output can be shared with PETSc developers to help understand a code's misbehavior. But, of course, we also recommend using more sophisticated tools, including Valgrind and debuggers.
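The three ideas in this section (sentinels around allocations, NaN initialization, and a leak report at finalization) can be mimicked in a small Python sketch. It is an illustration under stated assumptions, not PETSc's allocator: the class GuardedDoubles, the sentinel pattern, and the helper functions are all hypothetical, and a bytearray stands in for raw memory.

```python
import math
import struct

# Arbitrary 8-byte sentinel pattern chosen for this sketch.
GUARD = b"\xde\xad\xbe\xef" * 2
# Registry of allocations that have not been freed yet.
_live = {}


class GuardedDoubles:
    """Array of doubles with sentinel bytes on both sides, initialized to NaN."""

    def __init__(self, count, label):
        self.count = count
        self.label = label
        nan_bytes = struct.pack("d", math.nan) * count
        self.raw = bytearray(GUARD + nan_bytes + GUARD)
        _live[id(self)] = self

    def _offset(self, index):
        return len(GUARD) + 8 * index

    def set(self, index, value):
        # Deliberately unchecked, like a raw pointer write in C.
        struct.pack_into("d", self.raw, self._offset(index), value)

    def get(self, index):
        return struct.unpack_from("d", self.raw, self._offset(index))[0]

    def is_intact(self):
        """True if neither sentinel has been overwritten."""
        head = self.raw[: len(GUARD)]
        tail = self.raw[-len(GUARD):]
        return head == GUARD and tail == GUARD

    def free(self):
        _live.pop(id(self), None)


def corrupted_allocations():
    """Labels of live allocations whose sentinels were overwritten."""
    return [a.label for a in _live.values() if not a.is_intact()]


def leaked_allocations():
    """Labels of allocations never freed; meant to be called at shutdown."""
    return [a.label for a in _live.values()]


if __name__ == "__main__":
    work = GuardedDoubles(3, "work vector")
    print("uninitialized entry is NaN:", math.isnan(work.get(0)))
    work.set(3, 1.0)  # writes one entry past the end, onto the sentinel
    print("corrupted:", corrupted_allocations())
    print("not yet freed:", leaked_allocations())
    work.free()
```

The output of such checks is plain text, which is what makes it practical to paste into an e-mail or issue.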

Performance Debugging
Another challenging support task, which the PETSc community also handles routinely, is debugging performance problems, particularly for high levels of parallelism. PETSc provides APIs to allow developers and users to set stages in their code and log the performance of events of interest. For example, lines 12 and 14 in Listing 1 are for the VEC_AXPY event, rendering a lightweight, integrated logging system that allows users to quickly gather timings. Listing 2 shows a snippet of the stdout output. From the top, we know that the computation has two stages, labeled as setup and solve. PETSc summarizes the computation and communication statistics of the two stages. Below that, it lists detailed statistics of events (functions) within each stage (only the first stage is shown), including the number of times an event has been called, the time and floating-point operations (flops) an event has spent, the MPI messages and reductions an event has incurred, and other statistics. Interested readers are referred to the PETSc/TAO Users Manual 2 or hints in the log view message itself. Because MPI processes have different statistics in parallel, PETSc shows maxima over all processes and ratios of maxima to minima. This information is useful because whenever we encounter a large ratio in time or flops in the output, we know a load imbalance in the corresponding event might exist. Sometimes, imbalance in one event can distort the timing of other events (for instance, processes might wait for messages from a lagging partner), giving confusing results.
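To make the stage and event bookkeeping and the max-to-min ratio concrete, here is a small Python sketch. It is not PETSc's logging API: EventLog and imbalance are hypothetical names, and the per-process timings in the usage example are made-up values used only to show the arithmetic.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class EventLog:
    """Accumulate call counts and elapsed time per (stage, event)."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.seconds = defaultdict(float)
        self._stage = "main"

    @contextmanager
    def stage(self, name):
        # Events logged inside this block are attributed to the named stage.
        previous, self._stage = self._stage, name
        try:
            yield
        finally:
            self._stage = previous

    @contextmanager
    def event(self, name):
        key = (self._stage, name)
        start = time.perf_counter()
        try:
            yield
        finally:
            self.counts[key] += 1
            self.seconds[key] += time.perf_counter() - start


def imbalance(per_process_values):
    """Return (maximum, max/min ratio) over processes for one event."""
    largest = max(per_process_values)
    smallest = min(per_process_values)
    ratio = largest / smallest if smallest > 0 else float("inf")
    return largest, ratio


if __name__ == "__main__":
    log = EventLog()
    with log.stage("solve"):
        for _ in range(3):
            with log.event("axpy"):
                sum(range(10000))
    for (stage, event), count in log.counts.items():
        print(f"{stage:8s} {event:8s} calls={count} time={log.seconds[(stage, event)]:.3e}")

    # Made-up timings (seconds) for one event on four processes.
    slowest, ratio = imbalance([1.0, 1.1, 4.0, 1.0])
    print(f"max={slowest} ratio={ratio}")
```

A ratio close to 1 suggests the work is evenly distributed; in the made-up timings above, one process takes four times as long as the fastest, which is the kind of signal that points to a load imbalance.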
Having the profiling tools integrated with the numerical algorithms in use, outputting by default to stdout, is crucial because it allows all users to quickly provide information on their usage, independent of what computational systems they may use or which additional analysis or logging tools they have available. This allows PETSc developers to quickly and directly view timings on the user's system and facilitates performance debugging of scalable solvers at "production" scale, by e-mail, where direct reproduction of a user's issue is infeasible.

Algorithm Debugging
PETSc includes an extensive suite of parallel preconditioners, linear solvers, nonlinear solvers, and time integrators. Composable and nested solvers are among the most powerful PETSc features since they facilitate numerical experimentation on novel, complex problems, but keeping track of them can be difficult. PETSc developers must see the detailed solver configurations to spot potential problems. Hence, PETSc provides APIs to display all solver options being used. Listing 3 shows a snippet of a longer output from a nonlinear solver. With indentation reflecting levels of the composite solvers, we can see the nested solvers used and key parameters employed at various levels of the solvers. PETSc provides flexible monitors to be used with solver views, which print the residual or function norm at each iteration of an iterative solver so that users can check the convergence of the solver and compare different algorithms. Listing 4 shows the output of nonlinear and linear solver monitors.
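The two mechanisms just described, an indented view of nested solvers and a per-iteration monitor, can be sketched in a few lines of Python. This is a toy illustration, not PETSc's viewer or monitor API: the dictionary layout, the solver labels, and the functions view and richardson are assumptions made for this example, and the output format does not reproduce PETSc's.

```python
def view(solver, indent=0):
    """Return lines describing a nested solver configuration."""
    pad = "  " * indent
    lines = [f"{pad}{solver['type']}"]
    for key, value in solver.get("options", {}).items():
        lines.append(f"{pad}  {key}={value}")
    for child in solver.get("children", []):
        lines.extend(view(child, indent + 1))
    return lines


def richardson(apply_a, b, x, damping, max_it, rtol, monitor=None):
    """Damped Richardson iteration x += damping * (b - A x).

    Calls monitor(iteration, residual_norm) once per iteration and returns
    (solution, iterations used, final residual norm).
    """
    x = list(x)
    norm0 = None
    for it in range(max_it + 1):
        r = [bi - ai for bi, ai in zip(b, apply_a(x))]
        norm = sum(ri * ri for ri in r) ** 0.5
        if monitor is not None:
            monitor(it, norm)
        if norm0 is None:
            norm0 = norm
        # Stop on relative convergence or when the iteration budget is spent.
        if norm <= rtol * norm0 or it == max_it:
            return x, it, norm
        x = [xi + damping * ri for xi, ri in zip(x, r)]


if __name__ == "__main__":
    # Hypothetical nested configuration: nonlinear solver -> linear solver -> preconditioner.
    config = {
        "type": "nonlinear: newton",
        "options": {"max_it": 50},
        "children": [
            {
                "type": "linear: richardson",
                "options": {"damping": 0.25},
                "children": [{"type": "preconditioner: none"}],
            }
        ],
    }
    print("\n".join(view(config)))

    def apply_a(v):
        # A = diag(2, 4), so the exact solution of A x = (2, 4) is (1, 1).
        return [2.0 * v[0], 4.0 * v[1]]

    def print_monitor(it, norm):
        print(f"{it:3d} residual norm {norm:.6e}")

    richardson(apply_a, [2.0, 4.0], [0.0, 0.0], 0.25, 50, 1e-8, print_monitor)
```

Both outputs are plain text, so a user can paste the configuration and the convergence history into a single message for someone else to inspect.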
In summary, PETSc offers a wide set of complementary options to aid debugging by e-mail with the following common themes: users can enable debugging regardless of their computing environment; errors appear as early as possible; and the output is printed in well-formatted plain text for copy-and-paste or file attachments to e-mails and GitLab issues.

CONCLUSION
The increased prominence of data science and the transition to computing architecture heterogeneity require more, not less, high-quality numerical simulation and analysis software. This software is often created in community open-source environments; the communities are crucial to the utility of such software. We have outlined some aspects of the open-source PETSc community and its collaboration strategies. Most of what was discussed applies to other numerical software communities. We concluded by focusing on mechanisms we use to allow community members to efficiently help one another at a distance using straightforward communication channels. The science and engineering of scientific software communities are only just beginning, and this topic is starting to receive more consideration at institutional levels. By sharing some of the PETSc community approaches, we hope to contribute to the wider scientific computing community as it seeks to improve the software programming process.