The Double-Edged Sword of Diversity: How Diversity, Conflict, and Psychological Safety Impact Software Teams

Team diversity can be seen as a double-edged sword. It brings additional cognitive resources to teams at the risk of increased conflict. Few studies have investigated how different types of diversity impact software teams. This study views diversity through the lens of the categorization-elaboration model (CEM). We investigated how diversity in gender, age, role, and cultural background impacts team effectiveness and conflict, and how these associations are moderated by psychological safety. Our sample consisted of 1,118 participants from 161 teams and was analyzed with Covariance-Based Structural Equation Modeling (CB-SEM). We found a positive effect of age diversity on team effectiveness and gender diversity on relational conflict. Psychological safety contributed directly to effective teamwork and less conflict but did not moderate the diversity-effectiveness link. While our results are consistent with the CEM theory for age and gender diversity, other types of diversity did not yield similar results. We discuss several reasons for this, including curvilinear effects, moderators such as task interdependence, or the presence of a diversity mindset. With this paper, we argue that a dichotomous nature of diversity is oversimplified. Indeed, it is a complex relationship where context plays a pivotal role. A more nuanced understanding of diversity through the lens of theories, such as the CEM, may lead to more effective teamwork.


I. INTRODUCTION
Teams are increasingly important to organizations. This is particularly relevant with the rise of Agile software methodologies, which 80% of organizations now predominantly employ for their software teams [1]. Agile represents a collaborative, iteration-based, and human-oriented approach to product development [29]. Many scholars have attempted to identify the factors and characteristics that influence team effectiveness [33], [99]. One factor that has gained increased attention in recent decades is team diversity [44], including in software engineering specifically [52]. Team diversity is generally defined as heterogeneity in member attributes, such as age, gender, cultural background, tenure, role, or personality traits [49].
While teams can be diverse on many attributes, most studies focus on demographic diversity (e.g., age, gender, cultural background) or informational diversity (e.g., professional role, education, experience). Many researchers have theorized that diversity improves team performance [43], [51], [52]. However, studies have provided mixed support. Investigations of how diversity impacts teams [45], [53], [44], [54], [51], [52] generally show that the effects are not clear-cut, vary by type of diversity, and appear to be moderated by characteristics of the task, the team, and its environment. Diversity may also negatively impact effectiveness through increased conflict between members [44], [45]. Several competing mechanisms and integrated models have been proposed to explain these conflicting results [55], [62], which are discussed in Section II. Specifically for software engineering, Silveira & Prikladnicki [52] and Rodríguez-Pérez, Nadri & Nagappan [94] concluded from literature reviews that our understanding of diversity in such teams still needs to be improved. They found that most studies have only investigated gender diversity [52] and argue that a broader exploration of how diversity impacts software engineering teams can be used to create better teams and better results. A conceptual way to look at this is through the lens of team effectiveness. Hackman [33] defined this as the degree to which the outcomes of a team satisfy the expectations of those they work for, as well as its members. Verwijs & Russo [39] recently operationalized team effectiveness for Agile software teams through stakeholder satisfaction and team morale.
C. Verwijs is with The Liberators, The Netherlands. D. Russo is with the Department of Computer Science, Aalborg University, Denmark. Corresponding author. Email: daniel.russo@cs.aau.dk
Manuscript received Month 01, 2023; revised ....
Studies have yet to explore how diversity affects software teams and their effectiveness. A more comprehensive examination is vital to understand how to design more effective teams and achieve better results. Hence, our research question (RQ) is:
RQ: How does diversity in software teams impact their effectiveness?
To answer our research question, we performed a quantitative cross-sectional study with 1,118 team members representing 161 software teams. Covariance-Based Structural Equation Modeling (CB-SEM, or SEM for short) was used to test how four types of diversity (gender, age, cultural background, and role) and one social moderator (psychological safety) interact to impact team effectiveness and conflict in teams. Only age diversity was positively associated with team effectiveness. Concerning relational conflict, only gender diversity showed a significant positive association. A replication package is also openly available on Zenodo to support secondary studies.
The rest of the paper is structured as follows. In Section II, we review related work on team diversity and how it impacts team outcomes. Subsequently, we clarify the research gap this study intends to address and develop relevant hypotheses in Section II-D. Section III clarifies how we use quantitative methods and a survey study to test our hypotheses. The study results are reported in Section IV, followed by a comprehensive discussion of the results and their implications in Section V. Finally, we conclude our paper by outlining future research opportunities in Section VI.

II. RELATED WORK
Scholars from several disciplines have shown mixed results regarding how diversity impacts team effectiveness. We note that most of these studies have investigated team performance. While team effectiveness is often used interchangeably with team performance in the literature, including by Hackman [99], they are not necessarily the same. Some definitions put more emphasis on speed and volume of output [101], whereas others emphasize creativity [46] or learning [103]. However, most definitions share that teams are able to produce high-quality outcomes.
Tshetshema & Chan conclude from a review of 35 studies that "a negative relationship between [demographic diversity] and team performance is inferred as the most reported result" [45, p. 9]. However, they note that investigations of individual dimensions of diversity often show a positive effect on team performance, particularly gender and age. The complex relationship between diversity and performance is also recognized by Patrício & Franco [44]. They argue from a review of 80 studies that diversity has a dual impact on performance: a positive one through expanded perspectives and a negative one through increased conflict. Bowers, Pharmer & Salas [53] performed a meta-analysis of 13 empirical studies and found the effects of team diversity on team performance to be dependent on task complexity and difficulty instead. Their results suggest that teams that perform tasks of low complexity may benefit more from homogeneity, whereas teams that perform complex tasks benefit from higher diversity. Another meta-analysis of 30 empirical studies by Horwitz & Horwitz [54] found no significant effect of demographic diversity (age, gender, or cultural background) on team performance but did find a significant moderate effect of role diversity.
We now turn to investigations of individual dimensions of diversity in teams commonly studied by scholars.

A. Diversity dimensions
For age diversity, Tshetshema & Chan [45] found a positive effect on team performance in a review of empirical studies. However, a meta-analysis of 74 empirical studies by Schneid et al. [58] did not show a significant relationship, although modest differences occurred as a result of moderators like task complexity and team. Pesch, Bouncken & Kraus [57] attribute the positive effect of age diversity primarily to differences in tenure and work experience rather than age itself. They also note that this diversity is likely to increase tension and conflict in teams as members have to reconcile more diverse perspectives on completing tasks.
Cultural diversity is defined as heterogeneity in shared beliefs, norms, and values [59]. It is often operationalized through surface-level ethnic or national diversity [45]. Tshetshema & Chan [45] inventoried studies that investigated the link between cultural diversity and team performance and inferred that a positive relationship is the most reported result. However, the relationship appears curvilinear: moderate cultural diversity is beneficial, but too little or too much adversely affects team performance [59].
Scholars define gender diversity as heterogeneity in the gender of team members. Most studies suggest a positive relationship with team performance [45], [43]. Nevertheless, too much diversity may lead to increased conflict, particularly for complex tasks and high interdependence. Thus, Haas & Hartmut [60] argue that gender diversity should be avoided in such environments.
Role diversity is another dimension of diversity in teams that is frequently studied. It represents the heterogeneity in the functional disciplines and roles members bring to a team [61]. Agile software methodologies in particular emphasize the need for role diversity in teams in order to solve complex problems [50], [29]. Empirical studies have shown mixed results, with some demonstrating positive effects and others negative [62]. Pelled, Eisenhardt & Xin [61] found that role diversity increases conflict due to the integration of diverse perspectives, which positively influences task performance. Homberg & Bui [63] found no significant effect of role diversity on the performance of management teams in a meta-analysis of 53 empirical studies. Instead, they attribute the mixed findings to publication bias, where studies that do not report significant effects are published far less often than those that do. Horwitz & Horwitz [54] did find a modest effect on the quality of the work produced by teams in another meta-analysis, though not on performance.
The empirical link between diversity and team performance appears to be complicated. Several moderators have been found to strengthen the positive impact or dampen the negative impact, such as an inclusive team climate [46], task complexity and difficulty [53], [54], psychological safety [14], management support [47], or time [48].

B. Diversity in software teams
The importance of team diversity has also been recognized for software teams specifically [94], [50]. The assumption is that diversity allows for a richer exploration of shared problems due to the availability of more perspectives [51], [52]. This is particularly relevant to the complex problem-solving in software teams, which requires creativity and the application of diverse skill sets [64]. Several studies have investigated whether this assumption holds up in practice. Lee & Xia [41] used a mixed-methods approach to investigate how role diversity and team autonomy influence the ability of (Agile) software teams to deliver on budget, on time, and within scope. They found a significant positive effect of diversity in a survey study of 399 software projects and follow-up case studies. However, they found that role diversity improves the quality of solutions emerging from problem-solving in teams, but not speed. They also found evidence for the dual impact of diversity, where diversity also increases conflict. Melo et al. [42] performed a multiple-case study of software teams in three large Brazilian software companies. Their results suggest that teams are more productive when there is diversity in the experience that members bring to the team. Another study by Russo & Stol [43] surveyed 483 software engineers to investigate how personality and gender influence the productivity of software teams. Their results show that men and women typically bring different positive and negative traits to teams, and they argue that this explains some part of why mixed-gender teams perform better. Rodríguez-Pérez, Nadri & Nagappan [94] conclude from a literature review that gender differences between developers contribute significantly to how they solve problems, debug issues, and work with others. The authors also note that gender diversity is most frequently studied, but much less is known about how other types of informational and demographic diversity affect software teams. A similar conclusion is reached by Silveira & Prikladnicki [52] in a review of the literature on diversity in (Agile) software teams. Thus, both groups of authors call for more research to guide decision-making on how to design better teams and generate better results.

C. Theories and moderators of the diversity-effectiveness link
Two mechanisms have been proposed by which diversity influences team effectiveness [55]. The similarity-attraction paradigm [49] derives from social psychology and social categorization to argue that similarity between members increases mutual attraction, integration, and communication, which in turn improves effectiveness. Diversity of members, on the other hand, results in more conflict and misunderstandings as people categorize themselves into different subgroups. Jehn [113], [78] conceptualizes such conflicts as "relationship conflict" and distinguishes them from "task conflict". Where task conflict involves disagreement on how to proceed with the work at hand, relational conflict emerges as interpersonal friction from differences in values, political preferences, personal tastes, and interpersonal style. While relational conflict clearly negatively affects team effectiveness [78], [114], some level of task conflict is often thought to be useful as it encourages deeper information processing [115]. However, a meta-analysis by De Dreu & Weingart [116] does not support that distinction and suggests that even low-level task conflict is detrimental to team effectiveness in most cases.
Another mechanism that explains how diversity influences team effectiveness is cognitive resource diversity theory, which derives from cognitive psychology. It treats teams as information processors where individuals process information and then elaborate and integrate it as a team [87]. In this conceptualization, diversity allows teams to bring varied cognitive resources to bear when information is processed individually and elaborated as a team, which allows a richer exploration of shared challenges.
Thus, the two mechanisms offer conflicting predictions about how diversity will impact team effectiveness. The former expects relational conflict to increase and effectiveness to decrease, whereas the latter expects effectiveness to increase. However, the evidence mentioned above does not consistently support one or the other. The focus of academic inquiry has therefore shifted toward identifying potential moderators that allow both mechanisms to be integrated [62], [67], [54], [55].
One potential group of moderators concerns task characteristics, like complexity and interdependence [55]. In this view, homogeneity benefits low-complexity tasks with few interdependencies, whereas heterogeneity benefits more complex tasks with many interdependencies. This is primarily consistent with findings from meta-analyses of the diversity-effectiveness relationship [53], [58]. However, other studies have found both positive and negative effects of task interdependence on the relationship between diversity and effectiveness [62].
Another potential moderator is psychological safety. Edmondson [14, p. 9] defines it as "a shared belief held by team members that the team is safe for interpersonal risk-taking". Several studies have already shown that psychological safety contributes to more effective teamwork in software teams [39], [5], [91], [81]. However, psychological safety is also likely to moderate the relationship between diversity and team effectiveness. Diegmann & Rosenkranz [79] theorize that psychological safety makes teams more resilient against the disruptive effect of high diversity, such as increased conflict, by providing a safe environment for members to elaborate task information. Similarly, Roberge & Van Dick [80] expect that psychological safety also interacts with the salience of a collective identity. Diversity only contributes to higher team effectiveness when members feel safe and identify strongly with their team.
To date, few studies have empirically investigated the role of psychological safety as a moderator of the diversity-effectiveness association. Singh, Winkel & Selvarajan [82] found that employee performance was higher among members of diverse teams that also exhibited high psychological safety. However, this study was limited to one organization and only considered racial diversity. Furthermore, Kirkman et al. [85] found that Communities of Practice (CoP) performed better when diversity was paired with high psychological safety. Virtual teams also experience fewer drawbacks from diversity when they can elaborate information in psychologically safe environments [86].
Van Knippenberg, De Dreu & Homan [62] have proposed the categorization-elaboration model (CEM) to integrate the double-edged nature of diversity in teams and potential moderators. The CEM is the most comprehensive model of work group diversity and its moderators at the time of writing and has received broad empirical support [83], [13], [67], [74]. It distinguishes between moderators related to the task, like difficulty, complexity, and efficacy, and moderators related to the team and the social processes in it, like trust and commitment. Both groups of moderators influence the ability of teams to leverage the informational advantage offered through diversity, though in different ways. In the case of task moderators, complex and challenging tasks are more likely to elicit extensive information processing in members [54], [56], which is consistent with cognitive resource diversity theory. The motivation of teams for their task has also been shown to positively moderate the effects of diversity on information processing [72]. Another potential task moderator is task interdependence, which is generally defined as the degree to which the completion of tasks requires collaboration by team members [66]. Teams with low interdependence see less interaction and thus experience fewer opportunities to leverage the benefit of diverse cognitive resources. However, empirical studies have found both positive and negative effects of task interdependence on the relationship between demographic and role diversity and team effectiveness [67]. This suggests that the effect is either not linear or subject to other moderators.
At the same time, the CEM also proposes a mechanism by which diversity can harm teamwork. As members grow less similar and bring different perspectives to teamwork, effectiveness diminishes when the social context of a team encourages social categorization into subgroups and elicits negative inter-group biases and identity threat [67], [49]. This loss of social integration creates more potential for relational conflict, impairs the ability of teams to elaborate information effectively, and thereby reduces their effectiveness. However, social moderators like trust and psychological safety allow team members to integrate more effectively, bring diverse perspectives and information processing together, and elaborate on them, which is consistent with the similarity-attraction paradigm.
A strength of this integrated approach is that it may explain the conflicting results found in the literature. The different mechanisms behind both groups of moderators independently strengthen or diminish the ability of teams to leverage diversity and can work in concert or in opposition. Thus, the CEM broadens the discourse around team diversity from a one-dimensional approach where it is either a risk or an asset to one where it can be both simultaneously. Finally, the CEM has clear, practical implications for diversity management that aim to reduce in-group bias, strengthen social moderators, and match diversity with the nature of the task [62].

D. Research Gap & Hypotheses
This study aims to address two related research gaps. First, we want to answer the call by Silveira & Prikladnicki [52] and Rodríguez-Pérez, Nadri & Nagappan [94] for more investigations into how diversity affects software teams, beyond gender diversity alone. A more comprehensive examination is vital to understand how to design more effective teams and achieve better results. Second, we want to investigate diversity in software teams through the lens of the CEM theory and its opposing mechanisms.
To answer our research question, we will now develop seven hypotheses we aim to test in this study. Our first hypothesis is that diversity contributes to the effectiveness of software teams. Because such teams collaborate on complex and interdependent tasks [64], [84], [50], they should benefit from the expanded cognitive resources allowed by heterogeneity in gender, age, cultural background, and role. This reflects one mechanism by which diversity influences team effectiveness and is in accordance with both cognitive resource diversity theory and the CEM that integrates it.
Hypothesis 1 (H1). Software teams are more effective when they are more diverse in gender (H1a), age (H1b), cultural background (H1c), and role (H1d).
Our second hypothesis concerns the second and opposing mechanism of diversity. That is, we expect that increased diversity also results in more relational conflict in teams. This hypothesis reflects a core consequence of the similarity-attraction paradigm and the CEM that integrates it.
Hypothesis 2 (H2). Software teams experience more relational conflict when they are more diverse in gender (H2a), age (H2b), cultural background (H2c), and role (H2d).
Furthermore, we hypothesize that the increased relational conflict, in turn, negatively impacts the effectiveness of teams. This is consistent with the outcome expected by the similarity-attraction paradigm and the CEM that integrates it.
Hypothesis 3 (H3). Relational conflict reduces the effectiveness of software teams.
Following existing literature [39], [5], [14], [36], we expect that psychological safety is a critical factor in enabling team effectiveness through four different processes. The first involves a direct effect where psychological safety makes teams more effective by creating more opportunities to openly elaborate information, reconcile conflicting viewpoints, and find creative solutions [39], [5].
Hypothesis 4 (H4). Psychological safety increases the effectiveness of software teams.
In the second process, psychological safety decreases relational conflict in teams by providing more opportunities to air grievances and discuss the tension between members.
Hypothesis 5 (H5). Psychological safety reduces the amount of relational conflict in software teams.
Concerning diversity, we expect that psychological safety is a social moderator of the association between diversity and team effectiveness. Consistent with the CEM and Diegmann & Rosenkranz [79], and as our third and fourth processes respectively, we anticipate that psychological safety creates an environment where diverse teams can more effectively elaborate task-related information and experience less relational conflict than less diverse teams.
Hypothesis 6 (H6). The relationship between diversity in gender (H6a), age (H6b), cultural background (H6c), and role (H6d) on the one hand and team effectiveness on the other is moderated by psychological safety.
Hypothesis 7 (H7). The relationship between diversity in gender (H7a), age (H7b), cultural background (H7c), and role (H7d) on the one hand and relational conflict on the other is moderated by psychological safety.
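To make the notion of moderation in H6 and H7 concrete, a moderated relationship is commonly written with an interaction term. The expression below is only an illustrative sketch: the symbols are assumptions introduced here, not notation from this study, and the structural equation model that was actually estimated may be specified differently.

```latex
E_j = \beta_0 + \beta_1 D_j + \beta_2 S_j + \beta_3 \, (D_j \times S_j) + \varepsilon_j
```

Here E_j denotes the effectiveness of team j, D_j one of its diversity scores, and S_j its psychological safety. Under this reading, H6 corresponds to a non-zero interaction coefficient β_3, and H7 follows by substituting relational conflict for E_j.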

III. RESEARCH DESIGN
We conducted a sample study of software teams to answer our research question. We used Covariance-Based Structural Equation Modeling (CB-SEM) to test our hypotheses (as visualized in Figure 1). This section discusses the sample (Sec. III-A), measurement instruments (Sec. III-B), and method of analysis (Sec. III-C).

A. Participants
We collected our data through a customized online survey that was embedded in a larger survey that was part of an online tool for Agile teams¹. A pilot study was first performed between July and September 2021 to identify improvements for the questionnaire. A total of 256 teams participated. Two modifications were made. First, a scale for task-related conflict was removed because it was statistically indistinguishable from relational conflict in a Confirmatory Factor Analysis (CFA). Other studies have also reported this [116]. Second, the worker councils of several participating organizations objected to an item that asked participants to identify their gender, despite assurances that answers would remain anonymous. The item was therefore replaced with a team-level indication of gender diversity to overcome this objection (see Section III-B for more detail).
Data collection for the primary study was then performed between September 2021 and January 2022. A mix of purposive and respondent-driven non-probabilistic sampling strategies [117] was used. While probabilistic strategies increase the likelihood of representative samples, they require knowledge of the true distribution of parameters in the population, which is typically not feasible for software engineering studies [43]. Thus, we aimed to recruit experienced respondents by embedding the survey in a tool that is already used by many (Agile) software teams and promoted it across channels commonly visited by individuals interested in Agile and software development, including industry forums, blog articles, podcasts, and videos by influencers². To reduce sampling bias, we enabled respondents to also invite the members of their team to participate. This was encouraged by offering teams a report with anonymized team-level results for their team, along with helpful feedback. We were able to anonymously aggregate individual participants into teams as follows. Upon completion of a survey for a new team, each participant received a shareable link with a unique team key (a GUID) and instructions on how to invite the rest of their team. This participant then distributed that link through a channel of their choosing (e.g., email, Slack). We grouped all individuals by team key in the analyses.
In total, 1,827 members from 733 distinct software teams completed the survey in that period. Because the survey is public and accessible to anyone, we cannot properly calculate a response rate. Scholars have emphasized that public surveys are more susceptible to careless responses, so we applied several strategies outlined in the literature [118] to reduce potential biases. First, we emphasized the anonymous nature of our data collection. Second, we encouraged honest answers by providing teams with a detailed team-level profile and relevant feedback for their team upon completion. Third, we removed 118 participants whose response patterns suggested careless responses. These were participants who went through the survey too quickly to realistically read and answer the questions; removing them helps preserve the robustness of the dataset. We followed a pragmatic cutoff described by Meade & Craig [118] and removed all participants with a response time below the 5th percentile (6.87 minutes). We also removed participants who answered fewer than half of the questions (< 20), which made their responses unsuitable for analysis. Finally, we retained only those teams (161) with at least 4 participating members to ensure a meaningful diversity measurement. The composition of our sample is shown in Table I.
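The screening steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields (`team_key`, `minutes`, `answered`) and the sample data are hypothetical, while the three thresholds are the ones reported in the text.

```python
from collections import defaultdict

MIN_MINUTES = 6.87   # 5th-percentile response time reported in the text
MIN_ANSWERED = 20    # fewer than 20 answered questions counts as incomplete
MIN_TEAM_SIZE = 4    # smallest team retained for diversity measurement


def clean_sample(responses):
    """Filter individual responses, then keep only sufficiently large teams.

    `responses` is a list of dicts with the hypothetical keys
    "team_key", "minutes", and "answered".
    """
    teams = defaultdict(list)
    for response in responses:
        too_fast = response["minutes"] < MIN_MINUTES
        incomplete = response["answered"] < MIN_ANSWERED
        if not (too_fast or incomplete):
            # Group the surviving responses by their shared team key.
            teams[response["team_key"]].append(response)
    return {key: members for key, members in teams.items()
            if len(members) >= MIN_TEAM_SIZE}


# Made-up data: team "A" loses one rushed response, team "B" one incomplete one.
responses = (
    [{"team_key": "A", "minutes": 12.0, "answered": 40} for _ in range(4)]
    + [{"team_key": "A", "minutes": 3.0, "answered": 40}]
    + [{"team_key": "B", "minutes": 10.0, "answered": 40} for _ in range(3)]
    + [{"team_key": "B", "minutes": 10.0, "answered": 10}]
)
retained = clean_sample(responses)
```

In this sketch the team-size filter runs after the individual-level filters, so a team can drop out because too few valid responses remain, which matches the order in which the steps are described.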
Several variables in our model were measured at the individual level and aggregated to a higher (team) level in our analyses. Such aggregation is only reasonable when sufficient variance exists at the group level. The Intraclass Correlation Coefficients (ICC1 and ICC2) provide a measure for the proportion of group-level variance [19]. We calculated this according to the procedure outlined by Van Mierlo et al. [11] for all variables where individual-level measures were aggregated to the team level: psychological safety, relational conflict, team morale, and stakeholder satisfaction. Between 24% and 32% of the observed variation was attributable to the group level and indicated differences between teams rather than variation within teams (ICC1, p < .001). ICC1 values as low as 10% can already indicate group-level relationships that do not emerge from individual-level analyses and warrant group-level analyses [9], [8], [19]. The group mean reliability (ICC2) indicates the agreement among members of each group and is typically interpreted as a reliability measure [7], [10]. Values closer to 1 indicate higher reliability, while those closer to 0 suggest lower reliability. For our study, it ranged between .68 and .77 and was deemed satisfactory. Given that values above .60 are generally considered satisfactory in many research contexts [7], [10], our observed range underscores a significant level of agreement among team members on our various measures. This means that the ratings provided by individuals within the same team are relatively consistent, further justifying our aggregation approach. Taken together, the ICC1 and ICC2 both indicate robust group-level dynamics and warrant group-level analyses and aggregation.
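For readers unfamiliar with these coefficients, the sketch below computes ICC1 and ICC2 from a one-way analysis of variance, which is a common textbook formulation. It is an assumption that this matches the procedure of Van Mierlo et al. [11] in every detail; in particular, using the mean team size for unequal teams is a simplification, and the scores are made up for illustration.

```python
def icc1_icc2(groups):
    """Return (ICC1, ICC2) for a list of per-team lists of individual scores."""
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-team and within-team sums of squares from a one-way ANOVA.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    k = n_total / n_groups  # mean team size (simplification for unequal teams)

    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between
    return icc1, icc2


# Hypothetical psychological-safety scores for three teams of four members.
scores = [[4, 5, 4, 5], [2, 3, 2, 3], [3, 4, 3, 4]]
icc1, icc2 = icc1_icc2(scores)
```

ICC1 grows as teams differ more from one another relative to the spread within teams, while ICC2 additionally depends on team size: larger teams yield more reliable team means for the same ICC1.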
We analyzed patterns in missing data at the individual level. Unless data is missing completely at random (MCAR), any patterns in missing data may bias the results of multivariate analysis [19]. For this, we calculated Little's MCAR test. This is a Chi-Square test that compares the observed patterns of missing data with the patterns that would be expected from a process that results in random missing data [19]. Our test showed that data was not missing completely at random (χ² = 13,200.799, df = 9,853, p < 0.001) [19]. A closer inspection of patterns in the missing data revealed that the percentage of missing data was below 2% for most items but slightly higher (up to 6.6%) for four items that measured aspects relating to stakeholders. Overall, missing data remained below the recommended threshold of 10% [19]. No data was missing after we aggregated to the team level, so no imputation was performed.
We conducted an in-depth examination of missing data patterns at the individual level. It is well-established in the literature that when data is not missing completely at random (MCAR), the presence of discernible missing data patterns can introduce potential bias into the results of multivariate analyses, as elucidated by Hair et al. [19]. We employed Little's MCAR test, a Chi-Square statistical procedure that juxtaposes the observed patterns of missing data against the patterns one would anticipate under the assumption of random data omissions [19]. Our examination of the data revealed a significant departure from the MCAR assumption, as evidenced by the test results (χ² = 1,013.462, df = 778, p < 0.001) [19]. Further analyses of the missing data patterns unveiled that the majority of items displayed missing data percentages lower than 3%. However, we observed slightly higher rates of up to 6.6% for the four items related to stakeholder aspects. Nevertheless, the percentage of missing data remained within the confines of the recommended threshold of 10% [19]. Because we aggregated data to the team level, at which no data was missing, we did not employ imputation procedures.
Finally, we performed a post hoc power analysis using G*Power [6], version 3.1.9. We determined that the sample size allows us to correctly capture medium effects (f = .15) with a statistical power of 96% (1 − β = .96). In other words, given our sample, the probability of correctly rejecting the null hypothesis when a medium effect is present is 96%. We are therefore confident that our sample is large enough to provide a reliable outcome.

B. Measurements
Age, gender, role, and cultural diversity: To assess the impact of diversity on team effectiveness, we identified three dimensions of demographic diversity that are commonly studied (age, cultural background, and gender) and one informational dimension (role). The questions and the available categories are shown in Table 1 in the Supplementary Materials.
For role diversity, participants were asked to categorize their work into the most applicable software team role (e.g., developer, tester, designer, infrastructure) or an "Other" category.
Cultural diversity is often operationalized through ethnicity as a proxy for differences in value systems, norms, and beliefs. We chose against this for two reasons. The first was that several worker unions of participating organizations objected to such measures during a pilot. Moreover, the European General Data Protection Regulation (GDPR) considers ethnicity as "particularly sensitive" personal data and generally prohibits its collection [104]. The second reason is that ethnicity is a surface-level variable that does not imply cultural diversity. Indeed, a study by Desmet, Ortuño-Ortín & Wacziarg [105] in 76 countries showed that people with different ethnicities who live in the same country are more similar in value systems than people with the same ethnicities who live in different countries. Thus, we opted for another way to assess cultural diversity.
Participants were asked to identify the world region where they had lived the longest (e.g., Western Europe, North America, South-East Asia). This reflects a more dynamic understanding of cultural diversity than ethnicity or place of birth because it recognizes cultural mobility [106] and cultural exposure in identity formation [107]. This assumes that the region where one has lived the longest most substantially shapes the value system one brings to a team.
Because our study is aimed at team-level diversity, we aggregated individual indicators for age, role, and cultural background to a team-level Gini-Simpson coefficient. This coefficient is a statistical indicator of the diversity of the members in a sample, ranging between 0 (no diversity) and 1 (maximum diversity) [38].
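As an illustration of this aggregation step, the Gini-Simpson coefficient can be computed as one minus the sum of the squared category proportions within a team. The following is a minimal sketch and not the code used in the study; the function name and the two example teams are hypothetical.

```python
from collections import Counter


def gini_simpson(categories):
    """Gini-Simpson coefficient: 1 - sum(p_i ** 2) over category proportions."""
    n = len(categories)
    if n == 0:
        raise ValueError("a team needs at least one member")
    counts = Counter(categories)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical teams, described by the role each member reported.
homogeneous_team = ["developer"] * 5
mixed_team = ["developer", "developer", "tester", "designer"]

print(gini_simpson(homogeneous_team))  # all members share one category
print(gini_simpson(mixed_team))        # three categories, unevenly filled
```

With a finite number of categories k, the coefficient cannot exceed 1 − 1/k, so it approaches 1 only as the number of equally represented categories grows.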
Gender diversity was measured directly at the team level. Similarly to cultural diversity, several worker councils objected to questions that measured gender at the individual level during our pilot study. We also observed that many participants left the question unanswered. To prevent this measure from becoming an obstacle to participation, we instead asked the initiating participant of each team to indicate the gender distribution at the team level (% women and men). We recognize there are more genders. However, we had to take this shortcut to obtain a reliable statistical analysis.
Team Effectiveness was operationalized similarly to Verwijs & Russo [39]. Team effectiveness is often defined as "the degree to which a team meets the expectations of the quality of the outcome" [33]. In this sense, stakeholder satisfaction is the evaluation of team outcomes from the external perspective of stakeholders (e.g., clients, customers, and users), whereas team morale is the evaluation of team outcomes from the internal perspective of team members. This is conceptually similar to how team effectiveness is defined in the "Team Diagnostic Survey (TDS)" [37]. For team morale, we used 3 items from the "Utrecht Work Engagement Scale" (UWES) [40] that were modified for use in teams by Van Boxmeer et al. [28]. For stakeholder satisfaction, we used a 4-item scale developed by the authors for another study [39]. Both measures are self-reported. Reliability analysis (Cronbach's alpha) showed that Team Morale (α = .910) and Stakeholder Satisfaction (α = .832) were consistently measured across participants.
Relational conflict was operationalized by adapting three items from a scale developed by Jehn et al. [78] to measure relationship conflict. Such conflicts represent interpersonal incompatibilities between team members that "typically includes tension, animosity, and annoyance among members within a group" [78, p. 258]. The items were adapted for use in teams by the authors. The reliability of measurements across participants was high (α = .892).
Psychological Safety was operationalized by adapting three items from the "Inquiry & Dialogue" scale that was developed by Marsick & Watkins [35] as part of the Dimensions of the Learning Organization Questionnaire (DLOQ). The items were adapted for use in teams by the authors. The reliability of measurements across participants was high (α = .791).
Control Variables: We included two items from the social responsibility scale (SDRS5) [16] to control for socially desirable answers and to control for common method bias [15]. Four categorical items were included to control for contextual variables that might influence both our independent and our dependent variables: team size, organization size, product type, and organization sector. Their categories are shown in Table I.
The measurement reliability of our scales is summarized in Table II.
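For readers unfamiliar with the reliability coefficient reported for these scales, Cronbach's alpha relates the variance of the individual items to the variance of the summed scale score. The sketch below uses the standard formula; the function name and the response data are hypothetical and do not come from the study.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items, each a list of scores from the same respondents."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total scale score per respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_variances = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Hypothetical responses: three items answered by five respondents.
item_scores = [
    [4, 5, 3, 6, 2],
    [5, 5, 3, 7, 2],
    [4, 6, 2, 6, 3],
]
print(round(cronbach_alpha(item_scores), 3))
```

Alpha rises toward 1 when respondents who score high on one item also score high on the others, which is what "consistently measured" refers to above.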

C. Analysis
We employed Structural Equation Modeling (SEM) with the AMOS software package [20] to analyze the data. A strength of SEM is that it is an inherently confirmatory approach that combines multiple linear regressions and confirmatory factor analysis (CFA) with Maximum Likelihood estimation (ML) to produce more consistent and less biased estimates than those derived through Ordinary Least Squares (OLS), which is typically used in multiple regression and ANOVA [19]. Furthermore, SEM allows researchers to simultaneously test both the structural part of a theory (the relationships between independent and dependent variables) and the measurement model (the inclusion of multiple indicators to measure latent factors) [19], [22], [12]. This is particularly useful for psychometric scales that use multiple questions to operationalize an underlying construct, as we do in this study.
In SEM, the statistical model is evaluated through several "Goodness of Fit" indices and the statistical significance and effect size of individual paths. The aim is to arrive at a model as parsimonious as possible while providing a good fit (and thus explanatory power). We discuss the fit indices in section III-D.
Next, we tested our data for the statistical assumptions required for Structural Equation Modeling. First, we assessed normality by comparing our independent, dependent, and control variables against the thresholds for kurtosis (< 3) and skew (< 2) recommended in the literature [17]. This was satisfactory for all variables except cultural diversity, whose distribution was strongly leptokurtic. This means that only a few teams showed some heterogeneity in cultural diversity, whereas most were completely homogeneous. Although statistical transformations can re-normalize such distributions, this also inevitably complicates their interpretation, especially the comparison with other effects in a model [19].
Our measure for gender diversity was not continuous but ordinal (no diversity, some diversity, or high diversity). We treated this variable as continuous in our analyses because such a model is more parsimonious than one that treats it as categorical [98]. It also simplifies the interpretation and retains more information than a model where the ordinal variable is treated as categorical. However, this requires that the relationship between the dependent variable and the ordinal independent variable is linear and that each step is approximately evenly spaced [98]. The relationship was linear, but the steps were not evenly spaced (.33 and .18 for the two steps, respectively). However, this modest violation did not warrant using a less parsimonious model with categorical dummy variables for gender diversity instead of a single variable.
We assessed homoscedasticity by inspecting the scatter plots for all pairs of independent and dependent variables for inconsistent patterns but found none. Finally, multicollinearity was assessed by entering all independent variables one by one into a linear regression [18]. The Variance Inflation Factor (VIF) remained below the critical threshold of 10 [19] for all measures, ranging between 1.01 and 2.75.
Using a single method, like a questionnaire, introduces the potential for a systematic response bias where the method itself influences answers [23]. To control for such common method bias, the recommended approach in current literature is using a marker variable that is theoretically unrelated to other factors in the model [15]. We included two items from the social responsibility scale (SDRS5) [16] and found a small but significant unevenly distributed response bias. Following recommendations in the literature, we retained the marker variable "social desirability" in our causal model to control for common method bias [15].
We created a full latent variable model containing both the measurement and structural models. The measurement model defines relationships between indicator variables (survey items) and underlying first-order latent factors and effectively acts as a CFA-model [92]. The structural model defines the hypothesized relations between latent variables and is a regression model. This approach makes the results less prone to convergence issues because of low indicator reliability and offers more degrees of freedom to the analysis compared to a non-latent model [21]. We began by assessing the measurement model following the approach outlined in literature [22], [12], [19]. Psychological safety, relational conflict, team morale, and stakeholder satisfaction were entered as first-order latent factors, with their respective survey items as indicator variables. Once the measurement model exhibited a good fit (see section III-D), we added the structural part of the model.
In the structural part of the model, we created a second-order latent factor to reflect the composite nature of "team effectiveness". The first-order latent factors for team morale and stakeholder satisfaction were modeled as indicators, similar to Verwijs & Russo [39]. We calculated interaction terms by multiplying each team's standardized factor score for psychological safety with their standardized scores for each diversity indicator (age, gender, role, cultural background) [93]. The diversity indicators and the interaction terms were entered into the model as exogenous variables. The exogenous variables for psychological safety, the diversity indicators, and their interaction terms were allowed to co-vary. No covariances were allowed between endogenous variables as our model predicted specific paths between them.
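The construction of an interaction term from standardized scores can be sketched as follows. This is an illustrative simplification: it z-scores plain team-level values, whereas the study used factor scores estimated from the latent model. The function names and the three example teams are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert raw scores to z-scores (mean 0, sample standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def interaction_term(moderator, predictor):
    """Product of the z-scored moderator and the z-scored predictor, per team."""
    z_moderator = standardize(moderator)
    z_predictor = standardize(predictor)
    return [a * b for a, b in zip(z_moderator, z_predictor)]


# Hypothetical team-level scores for three teams.
psychological_safety = [5.0, 6.0, 7.0]
age_diversity = [0.2, 0.4, 0.6]
print(interaction_term(psychological_safety, age_diversity))
```

Standardizing before multiplying centers both variables, which reduces the correlation between the product term and its two components.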

D. Model fit evaluations
We assessed reliability, convergent validity, and discriminant validity for the resulting measurement model before testing for the model fit. The individual steps involved in the model-fitting process are in Table 2 in the Appendix. Discriminant validity was assessed by analyzing the heterotrait-monotrait ratio of correlations (HTMT) with a third-party plugin in AMOS [24] and following the approach outlined in literature [19], [25]. This ratio of between-trait correlations to within-trait correlations should remain below R = .90 to indicate good discriminant validity from other constructs in different settings. This was the case for all measures. We assessed convergent validity by inspecting composite reliability (CR) and average variance extracted (AVE). The AVE remained above the rule of thumb of > .50 [19] for all factors, ranging between .621 and .890. The CR was equal to or above the threshold of .70 [19] for all scales.
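Both convergent validity statistics can be derived from the standardized factor loadings of a scale. The sketch below uses the commonly cited formulas and assumes uncorrelated error terms; the function names and the loadings are hypothetical and are not the values estimated in this study.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings of one factor."""
    return sum(loading ** 2 for loading in loadings) / len(loadings)


def composite_reliability(loadings):
    """CR: squared sum of loadings over itself plus the summed error variances."""
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1 - loading ** 2 for loading in loadings)
    return squared_sum / (squared_sum + error_variance)


# Hypothetical standardized loadings for a three-item factor.
loadings = [0.80, 0.75, 0.85]
print(round(average_variance_extracted(loadings), 3))
print(round(composite_reliability(loadings), 3))
```

A factor passes the rules of thumb cited above when the first value exceeds .50 and the second is at least .70.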
We then proceeded with the fitting procedure. We investigated local fit by inspecting the residual covariance matrix. A standardized residual covariance is considered large when it exceeds 2.58 [22]. This indicates that an item does not sufficiently measure (only) its intended factor. One item from Stakeholder Satisfaction (StakeholderSatisfaction3) showed poor local fit, and we removed it.
The overall goodness of fit was evaluated with indices recommended by recent literature [92], [22], [19]: the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), the Standardized Root Mean Square Residual (SRMR), and the Tucker-Lewis Index (TLI). The CFI [95] offers a similar test to CMIN/df but with consideration of the sample size, and its reliable properties have made it the most commonly used index today [19]. A cut-off value of .95 or higher is generally considered to indicate good fit [22], [19], [26]. The RMSEA by Steiger & Lind [96] also provides an index that considers sample size but adds to this a parsimony adjustment that leads it to favor the simplest model out of potential models with the same explanatory power [92]. A value below .05 is generally considered to indicate a good fit [22], [19]. We also follow the advice to report the confidence interval rather than only the point estimate [97].
The Standardized Root Mean Square Residual (SRMR) calculates a standardized mean of all the differences (residuals) between each observed covariance and the hypothesized covariance between variables [19]. A value below .08 is indicative of a good fit. We also inspected local fit by looking at the standardized residuals between pairs of variables, with values beyond 2.58 as a cut-off value for poor local fit [22]. Finally, we report and test the Tucker-Lewis Index (TLI). This is another incremental fit index, like the CFI, that compares the relative improvement of the hypothesized model over a model where all variables are uncorrelated. Hair et al. [19] consider a value of .97 or above sufficient to conclude a good model fit. In addition to overall model fit, we also evaluated our model on the percentage of variance that is explained in team effectiveness by all other variables in the model. The measurement model fitted our data well (χ²(79) = 127.650; TLI = .973; CFI = .980; RMSEA = .062; SRMR = .0516). A Confirmatory Factor Analysis (CFA) is reported in the Appendix (Table 3) that shows that all items loaded primarily on their intended factors, except for the item PsychologicalSafety1. This item also loaded negatively on the factor for Relational Conflict. The cumulative Eigenvalues of 5 factors explain 78% of the total observed variance, which is well beyond the recommended threshold of 60% [19].
We then tested the path model for the effects we predicted from our theory. Our hypothesized theoretical model fits the data well on each fit index, as described in Table III: χ²(129) = 156.282; TLI = .981; CFI = .988; RMSEA = .036; SRMR = .051. The predictors in our model explain 40.7% of the variance in the latent factor representing team effectiveness. For studies in the social sciences, values above 26% are considered large [31]. All steps of the fitting process are listed in Table 2 of the Appendix.
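The reported RMSEA values can be related to the chi-square statistics with the usual point-estimate formula. The sketch below assumes that the relevant sample size is the 161 teams and that the denominator uses N − 1; some software uses N instead, which makes no difference here at three decimals. The function name is hypothetical.

```python
from math import sqrt


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


N_TEAMS = 161  # assumption: the team-level sample size is the relevant N

print(round(rmsea(127.650, 79, N_TEAMS), 3))   # measurement model
print(round(rmsea(156.282, 129, N_TEAMS), 3))  # hypothesized path model
```

Under these assumptions the sketch yields .062 and .036, in line with the values reported for the measurement model and the hypothesized model. The division by the degrees of freedom is the parsimony adjustment: for the same excess of chi-square over df, a model with more degrees of freedom receives a lower RMSEA.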
To enhance the generalizability of our model, we extended our analysis to include four control variables: team size, organization size, sector, and product type. The aim of this test was to verify that the inclusion of these control variables did not meaningfully alter the path coefficients or their significance [92], [22]. The resulting alternative model, which integrated these control variables, demonstrated lower fit compared to the original model (χ²(155) = 210.603; TLI = .954; CFI = .976; RMSEA = .047; SRMR = .0608). This means that the uncontrolled model is more parsimonious with similar or better predictive power. Crucially, the inclusion of these variables did not alter the relationships within our model in a meaningful way. This suggests that our findings are robust and generalizable, and unaffected by the team-level and organization-level factors we controlled for.

IV. RESULTS
We now turn to the results and hypothesis testing. The means, standard deviations, and Pearson correlations of all variables are reported in Table IV. Significant effects are also visualized in Figure 2. Following recommendations in statistical literature [22], [92], we used a bootstrapping procedure with 2,000 samples and 95% bias-corrected confidence intervals to more accurately estimate parameters and their p-values for direct effects, factor loadings, and the hypothesized indirect effects. This resulted in a standardized, bias-corrected estimate (β) for each path, along with a p-value to test whether the null hypothesis can be rejected. β represents the change in the predicted variable in standard deviation units for a one standard deviation change in the predictor variable while holding all other variables in the model constant. These estimates can be used to compare the strength of an effect with other effects in the same model [22].
The parameter estimates relevant to our hypotheses are reported in Table V.
Our results allowed us to reject the null hypotheses for 4 out of 19 (sub)hypotheses. The primary hypothesis of this study is that diversity makes software teams more effective because it broadens the cognitive resources available for information processing (H1a-d). Our results support this only partially, as only age diversity significantly contributes to team effectiveness (H1b, β = .213, p < .05). Software teams seem slightly more effective when there is greater age heterogeneity. In contrast, heterogeneity in gender, cultural background, or role does not appear relevant to team effectiveness (H1a,c,d).
We also hypothesized that relational conflict in teams would increase as heterogeneity increases and members become less similar. This is also partially true, as only gender diversity significantly contributes to relational conflict (H2a, β = .161, p < .01). Thus, there appears to be more conflict as teams grow more heterogeneous in gender. There is no discernible effect of diversity in age, cultural background, or functional role on conflict (H2b,c,d).
Contrary to our expectations, we did not find a significant effect between relational conflict and team effectiveness (H3). Teams that experience more relational conflict do not seem to be more or less effective than teams that experience less conflict. However, the results show a strong positive effect of psychological safety on team effectiveness (H4, β = .660, p < .01). Teams that experience more psychological safety are more effective in that they have reported more satisfied stakeholders and higher team morale. Psychological safety also strongly decreases the amount of relational conflict reported by teams (H5, β = −.636, p < .01).
Finally, we hypothesized that psychological safety moderates the strength by which diversity contributes to team effectiveness and relational conflict. However, none of the interactions were significant (H6a-d; H7a-d).

V. DISCUSSION
This study investigated how diversity in age, role, cultural background, and gender influences the effectiveness of software teams. A total of 1,118 respondents from 161 software teams participated in our study. Overall, our results provide mixed support for both the benefits and the risks of member heterogeneity in teams. A summary of our findings is provided in Table VI.
According to the categorization-elaboration model (CEM) [62] and cognitive resource diversity theory, we hypothesized that software teams benefit from diversity as it expands the cognitive resources available for information processing. However, only age diversity improves team effectiveness directly. In other words, teams are more effective when their members vary in age. This is probably a proxy for differences in tenure and experience that encourage innovation and creativity [102]. However, generational differences in work values have also been found to be relevant [100]. Either way, this is in support of cognitive resource diversity theory and its prediction that diversity expands cognitive resources that teams have access to.
Our findings are consistent with the conclusions from a recent review of the literature by Tshetshema & Chan [45], and a meta-analysis of 74 studies by Schneid et al. [58], particularly for complex tasks. However, another meta-analysis of 35 studies by Horwitz & Horwitz [54] found no positive impact of demographic diversity (age, gender, race). So our results are more nuanced than the overall positive effect of team diversity that is reported by Lee & Xia [41] for software teams. We also did not find a positive effect of gender diversity or cultural diversity, whereas others did [45], [43]. All in all, the association between demographic diversity and team effectiveness is more complicated than the direct, positive effects we hypothesized.
In addition to demographic diversity, we also investigated how role diversity improves team effectiveness. Agile software methodologies in particular emphasize this type of diversity as an important characteristic of autonomous teams [65], [50]. In line with cognitive resource diversity theory, role diversity allows teams to leverage more perspectives and broader informational resources to resolve complex problems [62], [55]. When members bring more functional roles to their work together (e.g., analyst, tester, developer, designer), their shared mental models will be richer than when all members hold the same role (e.g., developer) [53], [3]. However, we did not find evidence for this. Teams with high role diversity were not more or less effective than teams with lower role diversity. This is partially consistent with extant literature. Homberg & Bui [63] found no evidence for a link between role diversity and team effectiveness in a meta-analysis of other empirical studies. Horwitz & Horwitz [54] also did not find an effect on team performance, although they did find one on the quality of work done by teams.
Diversity in teams is often considered a double-edged sword in the literature on diversity [55]. The CEM proposes that diversity can also harm team effectiveness through the similarity-attraction paradigm [49]. As members grow less similar and bring different perspectives to teamwork, there is more potential for tension and conflict. This decreases the ability of teams to elaborate information effectively and reduces their effectiveness. Concerning the first assertion, our results show that gender diversity does increase relational conflict but not other kinds of diversity. This finding is consistent with some studies [60], but not others [43], [45]. Regarding the second assertion, we failed to find any impact of relational conflict on team effectiveness. So while it appears true that gender diversity increases relational conflict in teams to some extent, we cannot conclude that this also harms team effectiveness (i.e., the double-edged sword).
The CEM attempts to reconcile the conflicting results by drawing attention to social- and task-related moderators that shape how diversity impacts team effectiveness. We investigated one social moderator frequently associated with diversity, relational conflict, and team effectiveness: psychological safety. We hypothesized that a psychologically safe environment would make it easier for diverse teams to elaborate on task information effectively. Although psychological safety reduced relational conflict and improved team effectiveness, we could not reject the null hypotheses for psychological safety as a moderator of the diversity-effectiveness link. In summary, our results show some benefits of diversity (age) on team effectiveness and some risks of diversity through relational conflict (gender). Psychological safety also reduces relational conflict and increases team effectiveness, but we found no evidence for a moderating role in the diversity-effectiveness link or the diversity-conflict link.

A. Alternative explanations
The mixed evidence suggests that there are factors at work that moderate or mediate the effects of diversity on effectiveness and conflict. Diversity alone does not make teams more effective because it broadens cognitive resources, just as it does not inherently and consistently create conflict because members are less similar.
This study investigated psychological safety as one potential social moderator of the diversity-effectiveness link. Our mixed results suggest that other moderators are at play. One example of this is task interdependence. A core element of Agile software methodologies is that teams work together on complex tasks [2], [50], [29]. Collective elaboration of task-related information and the pooling of skills to accomplish tasks is also a common thread in the definition of teamwork [33], [88]. Without task interdependence, the two mechanisms of diversity diminish. Because there is less collective elaboration, the benefits of the broadened cognitive resources that are offered by diversity diminish. Furthermore, a major source of conflict between members is removed because they spend much less time together processing information. Members may have more "skin in the game" when they feel they depend on others in their team to be successful. Paradoxically, this may surface as a higher degree of relational conflict than teams with very low interdependence. In this sense, psychological safety is likely only relevant as a moderator of the diversity-effectiveness link in teams with high task interdependence but not low task interdependence. Future studies can investigate if the effects of diversity and psychological safety are indeed more pronounced when controlling for task interdependence.
Another explanation may be that the effect of diversity on team effectiveness is not linear. Several authors [68], [69] have argued for curvilinear models where diversity contributes to effectiveness only when it is moderate (inverted U) or when it is either low or high (upright U). Which model applies varies by diversity type. For example, Dahlin, Weingart & Hinds [56] found that educational diversity contributed to team performance when it was either low or high (upright U) but found the opposite for national diversity (inverted U). Richard et al. [70] found that management teams with moderate gender diversity performed better than teams with low or high diversity, but only in high-risk settings (inverted U). However, diversity in terms of age, gender, or function may contribute to learning behavior in teams more strongly when diversity is low or high but not moderate (upright U) [71]. So while there is some support for the curvilinear effects of diversity, the relationship is complex. To further complicate matters, the shape of the relationship may also be moderated by the expectations that teams themselves have of the benefits of diversity [67]. We performed a post hoc test to assess whether a curvilinear relationship between dimensions of diversity and team effectiveness better fitted the data. This was not the case. A quadratic regression model was not significant for any of the diversity dimensions: age (R² = .004, F(2, 158) = .321, p = .726), gender (R² = .021, F(2, 158) = 1.695, p = .187), culture (R² = .008, F(2, 158) = .664, p = .516), and role (R² = .000, F(2, 158) = .025, p = .975). Thus, the possibility of a curvilinear relationship rather than a linear one does not appear to explain the lack of results in this study.
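The degrees of freedom and F statistics of these quadratic models follow from the number of predictors (a linear and a squared term), the 161 teams, and the explained variance. The sketch below recovers the overall F statistic from R²; the function name is hypothetical, and because it starts from R² values rounded to three decimals, it reproduces the reported F values only approximately.

```python
def f_from_r_squared(r_squared, n_predictors, n_observations):
    """Overall F statistic of a regression model, recovered from its R^2."""
    df_model = n_predictors
    df_residual = n_observations - n_predictors - 1
    f_value = (r_squared / df_model) / ((1 - r_squared) / df_residual)
    return f_value, df_model, df_residual


# Gender diversity: R^2 = .021 with a linear and a squared term, 161 teams.
f_value, df1, df2 = f_from_r_squared(0.021, 2, 161)
print(f"F({df1}, {df2}) = {f_value:.3f}")
```

With two predictors and 161 teams the residual degrees of freedom are 161 − 2 − 1 = 158, which matches the F(2, 158) reported for every dimension.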
We often assume that diversity in age, gender, function, and cultural background inherently leads to a different understanding of the task and potential solutions. This is both the strength and the weakness of diverse teams. In the day-to-day practice of teams, such differences in understanding may also lead to conflict if members need to adequately express their views and integrate them with other members into a synthesized solution. In addition to the task-related and social moderators mentioned above, it is reasonable to expect that communication and conflict navigation skills are also highly relevant, as well as the presence of an environment where such different understandings can be elaborated effectively. Few studies have investigated such moderators, particularly for software teams [52]. Furthermore, this ties into team members' beliefs about diversity, how to deal with it, and whether or not it benefits teamwork. Van Knippenberg et al. [74], [67] call this a "Diversity Mind-Set". Several studies have shown that teams and organizations can better leverage diversity when they recognize it as a strength and have learned how to appreciate and deal with the resulting informational diversity [75], [70], [67].
For practitioners, it is important to notice that our results are broadly consistent with existing research, showing that team diversity is not unequivocally beneficial or harmful. Although we found a positive effect of age diversity, the effects of other types of diversity appear to be more conditional on moderating factors. Several factors have been proposed to date, like the autonomy that teams have [41], task difficulty [53], psychological safety [14], team climate [46] and the beliefs that teams have about diversity [67]. This suggests that context is just as important as diversity alone. The practical implications of our findings are summarized in Table VI.

B. Limitations
In the following section, we discuss the threats to the validity of our sample study.
Internal validity: Internal validity refers to the confidence with which changes in the dependent variables can be attributed to the independent variables and not other uncontrolled factors [76]. We employed several strategies to maximize internal validity. First, we recognized that online questionnaires are prone to bias and self-selection as a result of their voluntary (non-probabilistic) nature. We counteracted this by embedding our questions in a tool that is regularly used by software teams to self-diagnose their process and identify improvements. Team members were invited by people in their organization to participate. Second, we thoroughly cleaned the dataset of careless responses to prevent them from influencing the results. Third, we did not inform the participants of our specific research questions to prevent them from answering in a socially desirable manner. We also controlled for social desirability in participants' responses, as well as common method bias introduced when a single method is used to collect data. Finally, we tested an alternative model to rule out that variation in team size, organization sector, organization size, or product type instead explained our findings, which was not the case.
Despite our safeguards, there may still be confounding variables that we were unable to control for. This is particularly relevant to the operationalization of team effectiveness, which is based on self-reported scores on team morale and the perceived satisfaction of stakeholders. Mathieu et al. [77] recognize that such affect-based measures may suffer from a "halo effect". Future studies could ask stakeholders to rate their satisfaction with team outcomes directly. This does not entirely rule out a halo effect but is conceptually closer to what matters to organizations. Future studies could also find more objective measures for team effectiveness.
Construct validity: Construct validity refers to the degree to which the measures used in a study measure their intended constructs [76]. We adapted items from established scales to measure psychological safety [30], team effectiveness [39], relational conflict [78] and social desirability [16]. A confirmatory factor analysis (CFA) showed that all items loaded primarily on their intended scales (see Table 3 in the Appendix). A heterotrait-monotrait (HTMT) analysis confirmed discriminant validity for all measures. The reliability for all measures exceeded the cutoff recommended in the literature (CR ≥ .70 [19]), except social desirability. Thus, we are confident that we reliably measured the intended constructs.
A limitation of our measure for team effectiveness is that it only addressed (self-reported) stakeholder satisfaction and team morale. Although both are reasonable and relevant aspects of team effectiveness and are commonly used in team research [33], effectiveness is a multi-faceted construct [73].
Our measure for role diversity captured the role for 88% of participants. 7.2% picked the "Other" category. While this probably reflects a long tail of more niche roles, a more sensitive measure with more than the seven roles currently provided would have slightly increased the resolution for role diversity.
While our sample consisted of software teams, we note that all practiced Agile methodologies. This was because the survey was advertised primarily on channels in the Agile community. Due to the focus on close collaboration and high task interdependence through shared goals in Agile methodologies, it is possible that the association between diversity, psychological safety, and relational conflict is different in other kinds of software teams. However, Agile methodologies have become so prevalent that most teams use them in one form or another [1].
As with every sample study, the way we operationalized diversity influenced our results. This is particularly relevant for cultural diversity. We observed very little diversity in teams on this variable. Our operationalization assumed that the region where participants lived the longest most strongly influenced their value systems. However, since we used large regions (e.g., Western Europe and Africa), the resolution of this measure may simply have been too low. Future studies that wish to use the same operationalization would do well to expand the categories, perhaps even to individual countries. Another issue is that our operationalization is yet another proxy for cultural values. Other proxies such as ethnicity, race, and place of birth have been shown to be unreliable measures of actual cultural diversity [108]. A more direct measure of (cultural) value systems could have resulted in more robust differences. For example, House et al. [109] identified nine cultural dimensions based on quantitative data from 62 countries, which partially overlap with earlier work by Hofstede [110]. These include dimensions that appear relevant to teamwork, such as "Power distance", "Assertiveness", "Performance orientation", and "Uncertainty avoidance". Several studies suggest that diversity on such dimensions impacts teamwork [111], [112]. Thus, such traits offer a more promising operationalization of cultural diversity in future research.
Finally, we could not directly ask participants for their gender due to privacy concerns. It was therefore not possible to calculate a Gini index as we did for the other diversity measures. The resulting measure was ordinal instead of continuous, limiting the resolution of our analysis for this variable. Future studies would do well to use a more continuous measure of gender diversity.
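As an illustration of the kind of index referred to above, the sketch below computes a Gini-type diversity score for a categorical attribute within one team. It assumes the categorical form of the index (also known as the Gini-Simpson or Blau index, one minus the sum of squared category proportions). The exact formula used in the study is not stated in this section, and the role labels are hypothetical.

```python
from collections import Counter


def gini_diversity(categories):
    """Gini-Simpson (Blau) index: 1 - sum(p_k^2) over category proportions.

    Returns 0.0 for a fully homogeneous team and approaches 1 - 1/K
    when members are spread evenly over K categories.
    """
    n = len(categories)
    if n == 0:
        raise ValueError("a team needs at least one member")
    counts = Counter(categories)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical teams (illustration only).
homogeneous_team = ["developer", "developer", "developer", "developer"]
mixed_team = ["developer", "developer", "tester", "product owner"]

homogeneous_score = gini_diversity(homogeneous_team)
mixed_score = gini_diversity(mixed_team)
```

An index of this kind needs each member's category. That requirement explains why it cannot be computed when only a coarser, team-level indication of an attribute is available.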

Diversity & relational conflict
Gender diversity was positively associated with relational conflict in software teams (β = .161, p < .01). However, diversity in role, age, or cultural background was not. In turn, relational conflict did not significantly affect team effectiveness.
When teams grow more diverse, members' different perspectives may lead to more conflict and friction. This appears particularly relevant to gender diversity. Such negative consequences of diversity may be counteracted when teams learn to see their diversity as a strength and recognize that different perspectives can be reconciled through open dialogue and elaboration.

Psychological safety & team effectiveness
Psychological safety was positively associated with team effectiveness (β = .660, p < .01) and negatively associated with relational conflict (β = −.636, p < .01).
Teams that operate in environments where members can openly and safely elaborate information are more effective than other teams, regardless of their diversity. They also experience much less relational conflict. Organizations would do well to develop the skills, support structures, and management styles that foster psychological safety in and around teams.

Psychological safety as a moderator
Psychological safety did not significantly moderate the association between diversity and team effectiveness, nor between diversity and relational conflict.
Psychological safety is paramount, but it does not appear to strengthen the cognitive benefits of team diversity, nor does it appear to buffer against its negative consequences.

Conclusion validity
Conclusion validity assesses the extent to which the conclusions about the relationships between variables are reasonable based on the results [89]. We used Structural Equation Modeling to test the entire model simultaneously [92], [22]. The resulting model fits the data well on all fit indices recommended by statistical literature and explains a substantial amount of variance in the dependent variables. Our sample was also large enough to identify medium effects (f = .15) with a statistical power of 96%.
We published team-level data and syntax files to Zenodo for reproducibility.
External validity
Finally, external validity concerns the extent to which the results actually represent the broader population [90]. First, we assess the ecological validity of our results to be high. Our questionnaire was integrated into a more general tool that Agile software teams use to improve their processes. Participants were invited by people in their organization, usually Scrum Masters. Thus, the data is more likely to reflect realistic teams than a stand-alone questionnaire or an experimental design.
We acknowledge the potential limitations in the representativeness of our sample. Given the survey's public nature, participation was voluntary, which may have introduced a self-selection bias. It remains possible that the participating teams differed significantly from non-participants in terms of the variables studied or their interrelationships. To mitigate such biases, we implemented multiple bias-reducing strategies. We extensively promoted the survey across diverse online platforms frequented by team coaches, such as Agile coaches and Scrum Masters, as well as developers. We underscored the importance of anonymity, assuring participants that their responses would remain confidential and not be shared with their respective teams or organizations. As an incentive, we offered teams a comprehensive profile, complete with actionable feedback. Our sample composition, as detailed in Table I, indicates participation from a diverse array of teams. These teams vary in experience, geographical location, and organizational type. The broad spectrum of scores across different measures further reinforces the diversity of our sample. The substantial sample size, combined with the aggregation of individual responses to team-level summaries, diminishes the variability arising from non-systematic individual biases.

VI. CONCLUSION
A common thread in software methodologies, such as Agile, is their emphasis on teams as the primary units where complex work is performed. So it is not surprising that much research has focused on what makes such teams more effective (e.g., [39], [5], [4], [42], [27]). Although diversity is increasingly investigated in the broader literature on teams, scholarly knowledge on how it impacts software teams is still limited [52], [94]. Such understanding can better equip organizations and teams to leverage diversity more effectively or learn when and how diversity is beneficial. What seems clear about diversity is that while it brings more extensive cognitive resources to teams, it can also bring more conflict as members become less similar [55]. Several models have been proposed to explain this "double-edged sword" of diversity, with the categorization-elaboration model (CEM) [62] as the most comprehensive one.
In this study, we explored how diversity impacts the effectiveness of software teams through the lens of the CEM theory. Our sample consisted of 1,118 team members representing 161 (Agile) software teams. Our results show that age diversity contributes to more effective teamwork, whereas diversity in gender, role, or cultural background does not. This may reflect the value of having more varied levels of experience in teams. Furthermore, the CEM also predicts a negative effect of diversity through social categorization and identity threat, which can surface through increased conflict. While our results support this effect, we only found evidence for gender diversity. Finally, the CEM predicts that task-related and social moderators influence the impact of diversity. One such moderator that is frequently studied is psychological safety [14]. While our results show that it contributes to more effective teamwork and less conflict in teams, it did not moderate the link between diversity on the one hand and effectiveness and conflict on the other. Thus, the presence of psychological safety in a team does not in itself allow teams to leverage their diversity better. Despite the strong focus on role diversity and cross-functional teamwork in software methodologies [2], [50], we found no apparent effect on team effectiveness. So while our results are broadly consistent with the CEM for age and gender diversity, it is surprising that heterogeneity in role or cultural background did not produce similar effects. One moderator that may be particularly relevant here for future research is task interdependence. Teams vary broadly in the degree to which members actually (need to) work together on tasks and, thus, in the opportunities that arise to leverage the broader cognitive resources of diverse teams.
This study has several implications for future studies of how diversity impacts the effectiveness of software teams. First, the role of task-related and social moderators should be investigated more thoroughly. The categorization-elaboration model [62] provides a valuable framework for such research because it integrates the opposing mechanisms of diversity proposed by cognitive resource diversity theory and the similarity-attraction paradigm. From a practical viewpoint, such research can also drive the development of training and methods that help teams and organizations leverage their diversity on all sorts of dimensions, not limited to gender, age, cultural background, and functional role. Second, more attention should be paid to the beliefs that teams have about diversity and its effects. Such a "Diversity Mind-Set" [67] can act as a powerful moderator by making teams aware of their diversity and how it can expand their experience as a team. Third, future studies can investigate the role of task interdependence as a moderator of the relationship between psychological safety and relational conflict. Finally, future research should investigate broader definitions of performance and effectiveness. In this study, we mainly focused on stakeholder satisfaction and team morale. Since effectiveness is a multi-faceted construct [73], we likely missed aspects that are affected by diversity in teams, like speed, quality, or innovativeness.

Fig. 1. Theoretical model and hypotheses. Sub-hypotheses are grouped, and control variables are omitted to retain visual clarity.

TABLE II
SCALES USED IN THE SURVEY STUDY, ALONG WITH ATTRIBUTION, NUMBER OF ITEMS, AND RELIABILITY (CRONBACH'S ALPHA) BASED ON RESPONDENT-LEVEL RESPONSE DATA (N = 1,118)

TABLE IV
MEANS, STANDARD DEVIATIONS, SKEWNESS, KURTOSIS AND CORRELATIONS (PEARSON) FOR CONTINUOUS VARIABLES. CORRELATIONS MARKED WITH * ARE SIGNIFICANT AT p < 0.05

Fig. 2. Standardized path coefficients for the model (**: p < .01, *: p < .05). The dotted lines represent non-significant results. Indicator items and non-significant paths for sub-hypotheses are omitted to improve readability. A detailed overview of the individual hypotheses is reported in Table V.

TABLE V
PARAMETER ESTIMATES, CONFIDENCE INTERVALS, STANDARD ERRORS, STANDARDIZED COEFFICIENTS FOR DIRECT EFFECTS, INTERACTION TERMS AND INDIRECT EFFECTS FOR HYPOTHESES, AND FACTOR LOADINGS. SIGNIFICANT EFFECTS ARE MARKED WITH **: p < .01, *: p < 0.05

TABLE VI
SUMMARY OF KEY FINDINGS & IMPLICATIONS FOR PRACTICE

Based on existing theory, we developed a Structural Equation Model for how diversity and psychological safety interact to impact team effectiveness and relational conflict. The model fitted the data well (χ²(129) = 156.282; TLI = .981; CFI = .988; RMSEA = .036; SRMR = .051).
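As a plausibility check on the fit statistics reported above, the sketch below recomputes RMSEA from the chi-square value and its degrees of freedom using the common point-estimate formula. It assumes that the team-level sample (N = 161) is the sample size behind the model and that the N − 1 convention applies. Some software divides by N instead, which gives the same value after rounding here.

```python
import math


def rmsea(chi_square, degrees_of_freedom, sample_size):
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (N - 1)))."""
    if degrees_of_freedom <= 0 or sample_size <= 1:
        raise ValueError("df must be positive and N must exceed 1")
    excess = max(chi_square - degrees_of_freedom, 0.0)
    return math.sqrt(excess / (degrees_of_freedom * (sample_size - 1)))


# Values reported in the text; N = 161 teams is an assumption about
# which sample size the model was fitted on.
reported_rmsea = rmsea(156.282, 129, 161)
rounded_rmsea = round(reported_rmsea, 3)
```

Under these assumptions the recomputed value rounds to .036, which matches the reported RMSEA.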