The Effect of Code Smells on the Relationship Between Design Patterns and Defects

The relationship between design patterns and defects was investigated in the literature, but with mixed results. While the majority of studies found the presence of patterns to be positively correlated with defects, other works reported the opposite conclusions. This may suggest that contextual factors affect this relationship. In this study we analyze the role of code smells as a confounding variable in the relationship between design patterns and defects in Java classes. To investigate this, we applied statistical tests to capture the difference in the impact on defects between pattern classes with/without code smells in 10 Java systems from the PROMISE dataset, with respect to 13 design patterns and 10 code smells. The presence of code smells in patterns appears to be a valid factor affecting defect-proneness. Pattern classes with smells attract more defects than non-smelly pattern classes, and in most systems smelly design patterns are positively correlated with the presence of defects, while non-smelly patterns tend to have no impact, or a slightly negative impact on the presence of defects. As a result, the presence of code smells in design pattern classes appears to be a contextual factor affecting the defect-proneness of the subject code.


I. INTRODUCTION
Design patterns are generic object-oriented solutions to frequently occurring design problems. They were first introduced in the Gang of Four's (GOF) book [18], which promoted design patterns as implementations of good practices that can improve re-usability, maintainability, and understandability.
Over the years, the relationship between design patterns and measurable quality attributes has attracted the interest of many researchers. While several studies found patterns to be positively correlated with various code characteristics, such as maintainability [42], defects [52] or the absence of code smells [53], other studies reported the negative impact of patterns on software maintainability [54] and code evolution [28]. Additionally, several other studies were inconclusive or attributed the observed effects to the nature of the patterns, to the way they are used, or to other contextual factors [6], [39]. Those findings not only conjecture that the relationship between design patterns and code quality attributes is not direct, but they also suggest that various contextual factors may have played an important role in this relationship. In the current study, we investigate whether the The associate editor coordinating the review of this manuscript and approving it for publication was Taehong Kim .
presence of code smells could be a contextual factor in the relationship between patterns and defects.
Code smells [17] are surface-level indicators that may correspond to deeper problems in software systems. The relationship between smells and code quality attributes drew researchers' attention and several studies were conducted to investigate this relationship. Some of these studies reported the negative impact of smells on change/fault proneness [31], [41], defect proneness [11] and maintainability [55]. On the other hand, other studies found that the detrimental effect of smells on various quality characteristics is limited or even positive, e.g., defect proneness [30], changeability [38] and maintenance effort [47].
As in the case of design patterns, the relationship between smells and quality attributes is not decisively defined, which constitutes another reason for conducting this study. Moreover, besides the importance of researching the context in which the defects may exist, another motivation drove our study; the phenomena in question, i.e., patterns and smells, are different in nature: while design patterns are intentionally used to achieve specific design objectives, code smells are inadvertent side effects of software development.
Defects, colloquially called bugs, are conditions in software products which do not meet the requirements or the customers' expectations. Defects often arise because of an incomprehensive understanding of the code, a fault interpretation of the business requirements, programmers' mistakes or due to frequent code changes [2]. The process of detecting and removing software defects is an important step in guaranteeing the fulfilment of end user satisfaction [20] and reducing the economic liability associated with releasing flawed software products [26]. Furthermore, building an efficient defect prediction model has raised increased interest from researchers in their quest to learn from the previous defects and to use this knowledge to predict future ones [15]. Defects were also heavily investigated in the literature, with several studies considering defects as a dependent variable affected by various independent variables, such as patterns [6], [21], [52], bad design [57] and the presence of code smells [30]. Defects should not be confused with code smells. They refer to two distinct quality characteristics: defect to reliability and smells to maintainability. In our study we consider them separately.
This article focuses on the impact of smelly or non-smelly design patterns as independent variables on defects, considered as binary variables (i.e., present/absent) or quantitative variables (i.e., the number of defects). We seek to determine the confounding effect of smells on code that contains design patterns, in terms of the resulting defects.
The paper is structured as follows. In Sec. II we present the literature review on the relationship between patterns and smells, and their separate effect on defect-proneness. Next, in Sec. III, we describe the design of the experiment and formulate the research questions. The results are presented in Sec. IV, then in Sec. V we discuss and interpret them. Finally, in Sec. VII we provide conclusions and propose possible directions for future investigation.

II. RELATED WORK
While the relationship between design patterns and code smells is a relatively new topic, both smells and patterns have been separate subjects of research for a long time. Below we summarize the major findings concerning their relationship with each other and their separate links with defects.

A. DESIGN PATTERNS AND DEFECTS
A number of studies investigated the relationship between design patterns and defects. However, the results were mixed and sometimes contradictory. Vokáč [52] observed the evolution of a commercial product for three years in an attempt to compare the defect ratios in classes with selected design patterns, namely Observer, Singleton, Factory and Template Method, with other classes. They concluded that both Observer and Singleton are more defect-prone than other patterns due to the larger code structures they contain. On the other hand, the Factory pattern displayed a lower number of defects than other classes, and the results for Template Method were inconclusive. Moreover, Gatrell and Counsell [19] examined a subset of a large commercial software system written in C and found that classes implementing design patterns are more fault-prone than the non-pattern classes. In particular, the authors found some patterns, namely Adaptor, Template Method and Singleton to be more defect-prone than others.
Elish and Mohammed [13] found no difference in the fault density between classes participating in the creational or behavioural patterns and classes without patterns, while structural patterns appeared to have a lower fault density than other classes. Furthermore, Onarcan and Fu [39] presented a study of 26 open source software projects, showing that there is a little correlation between the number of pattern instances in those projects and the number of defects. They also concluded that individual design patterns may have either positive or negative impact on defect-proneness.
Additionally, Aversano et al. [6] reported an empirical study involving three open source systems and concluded that the defect density of pattern classes is higher if the implementation of those patterns includes crosscutting concerns. The study also asserted that the relationship between design patterns and defects varies depending on the type and the nature of the pattern.

B. CODE SMELLS AND DEFECTS
The relationship between code smells and defects has also attracted the attention of researchers. Li et al. [30] conducted an experiment to investigate, at a class level, the correlation between smells and defects. They reported that the presence of some code smells, e.g., God Class and Shotgun Surgery, is positively correlated with defects proneness, while there is no correlation for other smells, such as Data Class and Feature Envy. Furthermore, Olbrich et al. [37] monitored the development of three open source systems for 7-10 years and concluded that God and Brain classes tend to change more frequently and have more defects than other classes. However, after the values have been adjusted to the class size, the findings were reversed.
Bán and Ferenc [7] also addressed the effect of code smells on defects. Their study analyzed several systems from the PROMISE dataset and their aggregated results showed a positive correlation between the studied code smells and defects. Furthermore, Jaafar et al. [25] studied three systems, Azureus, Eclipse and JHotDraw, and reported that the majority of classes affected by code smells tend to be more fault-prone than the smell-free classes. Another large scale empirical investigation was performed by Palomba et al. [41]. The results show that smelly classes have a higher fault-proneness than non-smelly classes. A similar conclusion was reported by Nascimento and Sant'Anna [35].
On the other hand, Hall et al. [23] argued that the presence of code smells in some circumstances may indeed indicate a fault-prone code. However, the effect of those smells on the defects is rather minor. The authors also suggested that refactoring smelly classes is unlikely to reduce the number of defects in the affected code. A similar observation was reported by D'Ambros et al. [11] who concluded that none of the studied smells could be considered more harmful with respect to software defects. VOLUME 9, 2021 Tufano et al. [51] addressed the issue from a different perspective; they investigated the reasons behind the introduction of smells, and to this end they studied the change history of 200 open source projects, concluding that in many cases the refactoring and bug-fixing activities lead to the introduction of smells.
Finally, Cairo et al. [8] performed an SLR aimed at the analysis of the impact of code smells on defects. They concluded that the correlation between smells and defects is weaker than had been conjectured, and that no individual code smell could be identified as positively correlated with increased defect frequency.

C. DESIGN PATTERNS AND CODE SMELLS
In our previous study [53], we investigated the relationship between design patterns and code smells at a class level and the results indicated that pattern classes are less frequently affected by smells. The study also found that the strength of the relationship varies among different patterns and smells.
A study by Cardoso and Figueiredo [9] found two pattern-smell links: Command with God Class and Template Method with Code Duplication. The study also presented a possible explanation of those co-occurrences and offered guidelines on how to eliminate the smells from these patterns. Additionally, a recent study by Alfadel et al. [4] found that design pattern classes are less smell-prone than other classes. However, Command pattern classes are associated with God Class, Blob and External Duplication smells.
Furthermore, Sousa et al. [49] investigated if the use of patterns is negatively correlated with occurrences of code smells, and examined their collocations in the same classes. They concluded that the use of patterns does not prevent code smells. The study also reported that some patterns, such as Composite, Factory Method and Singleton, are intrinsically modular and they produce code of higher quality, while other patterns, such as Adapter, Proxy and State/Strategy, are linked with a higher frequency of code smells.
Finally, Sousa et al. presented a systematic mapping study on the relationship between patterns and smells [48]. They identified 16 papers and concluded that the misuse of certain patterns is the main cause of their co-occurrence with code smells. The paper also found that the Command pattern is correlated with the greatest number of smells.

III. EXPERIMENTAL DESIGN A. RESEARCH QUESTIONS
In the study we consider three research questions that examine the defect-proneness of pattern classes, depending on the presence or absence of code smells in them.
1) RQ1 What is the impact of code smells on the presence/absence of defects in classes involved in design patterns? 2) RQ2 What is the impact of code smells on the defect distribution (number of defects) in classes involved in design patterns?
3) RQ3 What is the effect of code smells on the relationship between specific design patterns and defects?

B. VARIABLES
The definitions of variables we use in the analysis differ, depending on the construct employed in a specific research question.

1) INDEPENDENT VARIABLES (IVs)
For RQ1 and RQ2, we consider sets of classes C, with respect to their involvement in any design pattern or if they are affected by any code smells; this produces four values of the IV: DP, nDP, SDP and nSDP; In RQ3 the definitions are different: we consider classes involved in specific patterns that are affected by any smell. That also produces four values of IVs, but separately for each DP. For example, for Decorator we have four values of the IV: DP Decorator -all classes involved in Decorator, nDP -classes not involved in any pattern, SDP Decoratorsmelly classes involved in Decorator, and nSDP Decoratornon-smelly classes involved in Decorator.

2) DEPENDENT VARIABLES (DVs)
For RQ1 the DV is a binary value {0, 1} indicating the presence or absence of defects in a class.
For RQ2 the DV is the count of defects in the class. For RQ3 we use both definitions of DVs. For RQ1 and RQ2 we consider each system separately. All subject systems are developed by communities, with their own habits, standards and routines, which cannot be directly identified. As such, they should be considered latent variables that could bias the results. For that reason, we decided to report the results separately for each system. However, for RQ3, due to the small size of data samples, we merged the systems into a single dataset.

C. NOTATION
To describe the sets that define the variables, we use the following notation: • ALL: All analyzed classes • DP: Classes involved in one or more design pattern, in any role; • nDP: Classes not involved in any design pattern; • SDP: Classes simultaneously involved in design pattern(s) and affected by code smell(s); • nSDP: Classes involved in design pattern(s) and not affected by code smell(s); • SnDP: Classes not involved in design pattern(s) and affected by code smell(s); • nSnDP: Classes not involved in design pattern(s) and not affected by code smell(s); • DEF: Classes with at least one defect; • DEF-DP: Classes involved in design pattern(s) and with at least one defect; • DEF-nDP: Classes not involved in design pattern(s) and with at least one defect;

D. SUBJECT SYSTEMS
We performed our analysis on 10 small-and medium-size Java systems coming from PROMISE [1], one of the largest public repositories of empirical software data. In this study we use one of the datasets that provides information about defects. The original dataset includes 14 open source java systems: Ant, Camel, Ckjm, Forrest, Ivy, JEdit, Log4J, Lucene, PBeans, Poi, Synapse, Velocity, Xalan and Xerces. We decided to exclude four systems: Ckjm, PBeans, Synapse and Forrest, due to the negligible number of patterns (< 5) in them.
The list of systems used in the study is presented in Table 1.

E. DESIGN PATTERNS
To identify the design patterns we used a pattern-detection tool written by Tsantalis. 1 This tool uses the Similarity Scoring Approach (SSA), which calculates the similarity between subject code and graphs representing canonical patterns.
If the score exceeds a defined threshold value, the pattern is positively identified [50]. Specifically, we used the most recent version of the tool, v4.12. It was verified against several Java systems with a 1 https://users.encs.concordia.ca/~nikolaos/ pattern_detection.html reported precision of 100% and a recall of 66.7-100 % [3], which makes its performance comparable to other approaches that use exact/inexact graph matching like Discovery Matrix (DP-Miner) [12], the sub-patterns approach [56], or metricsbased approaches, e.g., MAISA [40] and FUJABA [36].
Although the tool detects only selected GoF patterns, they cover all three categories in the GoF taxonomy: creational, structural and behavioral [18]. In Table 2 we present the list of the analyzed design patterns.
To validate the effectiveness of the automatic pattern detection, we manually verified a random sample of ca. 400 pattern classes, which constitute ca. 10% of the cases identified by the tool. The review reported no false positives.

F. CODE SMELLS
Initially, code smells were mostly subject to human intuition and experience [17]. Today virtually all flaws can be detected automatically, based on various properties of the subject code or other artifacts. An SLR presented by Guéhéneuc and Albin-Amiot [22], based on 60 studies investigating various methods of smell detection, identified only four cases in which no automated approach was used.
In this article, for detecting code smells we used inCode, a proprietary Eclipse plugin that detects the smells based on the static code analysis. The tool employs an approach  called detection strategies [34], which relies on Boolean expressions composed of selected code metrics and respective thresholds. As in any metrics-based approach, the chosen threshold values play an important role in the accuracy of the detected instances. For this study, we used the default settings of inCode, following the recommendation by Lanza and Marinescu [29].
inCode has several advantages. First, the approach it implements for smell detection is commonly used in both academia and industry; additionally, the detection strategies were found to be fairly accurate in detecting smells (≈ 70%, according to [34]), and a comparative study by Arcelli Fontana et al. [16] found inFusion, a commercial version of inCode that employs the same detection rules, to report the lowest number of false positives among four analyzed smell detectors.
In this study, we consider nine code smells (see Table 3) that represent four out of five smell categories in the taxonomy proposed by Mäntylä [33]: Bloaters, Couplers, Dispensables, and Change preventers.

G. DEFECTS
The defect dataset which we analyze in the study was acquired from the PROMISE repository. Data about defects in PROMISE has been collected by the Buginfo tool. It evaluates every commit in the repository of the analyzed system. The tool labels the commit as a bug fix if it solves an issue reported as a bug in the bug tracking system. For each analyzed project, the bug fixes commenting guidelines were  discovered and formalized as regular expressions. Buginfo compares the regular expressions with the comment associated with the commit. If the comment matches the regular expression, Buginfo reports detecting a defect and increases the defect count for every class modified in the commit [27].
The PROMISE dataset has been validated and used several times in different research papers, e.g., concerning bug prediction [14], [15].

H. IDENTIFICATION OF DATASETS
In the study we analyzed classes with combinations of various properties.
First, we collected the classes in the ALL dataset, by identifying the fully qualified class names inside specific system releases within the source PROMISE dataset. Then, based on the results of the design pattern and code smell detection, described in III-E and III-F, we identified DP and S datasets.   Since the code smells can be attributed to classes or methods, while patterns involve classes, we had to adjust the granularity of the datasets to the class level by re-assigning the method-level smells to the enclosing classes.
In the next step we identified intersections of the sets to produce SDP, SnDP, nSDP and nSnDP. The resulting datasets were disjoint and complete, i.e., each class was reported exactly once in all datasets, and no class was omitted. However, defects in the PROMISE dataset are assigned to files, which can contain one or more classes. To address the issue of several classes being included in a single file, but assigned to different datasets, we manually verified and excluded such cases.
The replication package is available at https:// zenodo.org/record/3992730#.X2NZamgzY2x  In this section we check if design pattern classes are associated with the presence or absence (considered as a binary value) of defects in them. For this purpose, we used the Odds Ratio (OR) test [46] to find the associations between the presence of patterns as an exposure and the presence of defects as an outcome. The OR function is specified as follows: OR = a/c b/d where: • a = number of exposed cases • b = number of exposed non-cases • c = number of unexposed cases • d = number of unexposed non-cases The result of the OR test is interpreted as follows: • OR = 1 The exposure does not affect the odds of the outcome, • OR > 1 The exposure is associated with higher odds of the outcome, VOLUME 9, 2021   Table 12 together with Hedges' g results.
• OR < 1 The exposure is associated with lower odds of the outcome. Next, we use Fisher's exact test (FET) to determine if the OR results are significant [44]. The results are presented in Table 5 and the extracted association rules are summarized in Table 6.
The extracted rules indicate that patterns are positively associated with the presence of defects. This is in line with the results of other studies, e.g., [19], and in contradiction to the common understanding of patterns in terms of their positive impact on code quality [43]; on the other hand, they reinforce our conjectures concerning contextual factors that may play a role in this association. Our next step is to investigate the role of code smells as a confounding factor. Table 7 presents the results from the OR and FET tests and the extracted rules are listed in Table 8. For this question we are interested whether the presence of patterns in classes is associated with a higher/lower number of defects, and what the effect of smells as a contextual factor in this relationship is. To answer this, we first tested the normality of distributions in the subject datasets. In Table 9 we present the descriptive statistics of the datasets and in Table 10 we report the results of the Shapiro-Wilk test [45]. They show that both pattern and non-pattern values are not normally distributed.
Since all values are not normally distributed, we used a nonparametric Wilcoxon-Mann test (WMW) [32] to verify if two populations have different medians of defect distributions. In this section we verify the hypothesis that pattern and non-pattern classes do not differ with respect to defects. Analogously, similar hypotheses are used when comparing any other two groups using WMW throughout this study. 1) H0: DP = nDP w.r.t. defects 2) Ha: DP = nDP w.r.t. defects 3) Ha1: DP < nDP w.r.t. defects 4) Ha2: DP > nDP w.r.t. defects The results of the WMW test are presented in Table 11. As follows from the results, DP > nDP for 9 out of the 10 analyzed systems, and the only remaining case is inconclusive. We repeated the same steps to measure the effect of smells on these results. In Table 12 we summarize the WMW test results between the smelly-and non-smelly pattern classes, and between the classes in those two groups and the non-pattern classes. We also present the extracted rules from this analysis in Table 13.
To assess the significance of our extracted rules, we performed an effect size analysis to measure the mean difference between the different groups. We used the Hedges' g [24] test with a corresponding 95% confidence interval (CI). Hedges' g provides a measure of the effect size weighted by the relative size of each sample. The results are interpreted according to Cohen's d conventions [10]: The results and interpretation are reported in Tables 11  and 13.

C. RQ3. WHAT IS THE EFFECT OF CODE SMELLS ON THE RELATIONSHIP BETWEEN SPECIFIC DESIGN PATTERNS AND DEFECTS? 1) THE BINARY RELATIONSHIP
In this section we investigate the binary relationship between individual patterns and defects and how the presence of smells impacts this relationship. As the number of specific patterns is too low in each system, we merged the datasets. In Table 14 we present the descriptive data on the resultant dataset.
Next, we performed again the OR test for each pattern and applied FET to measure the significance of the results (see Table 15).
From the results we can conclude that the classes involved in Adapter, Decorator, Proxy and State are more defect-prone than non-pattern classes. The Visitor pattern appears to be associated with the absence of defects, but due to a large p-value we consider the result to be uncertain.
We followed these steps by measuring the effect of smells on these extracted associations and for that we repeated the same tests taking into consideration the effect of smells (see Table 16). The results suggest that the positive association between Adapter, State, and Visitor patterns with defects exists only if they are affected by smells.

2) THE DISTRIBUTION OF DEFECTS
In this section we investigate whether specific patterns attract more or fewer defects than non-pattern classes, and how the introduction of smells affects this distribution.
In Table 17, we present the results of the WMW test that compare the distribution of defects for specific patterns with the distribution of defects in non-pattern classes.
According to the results, the Adapter, Decorator, Prototype, Proxy and State patterns are more defect-prone than non-pattern classes, while the results for the other patterns are inconclusive. We also performed a Hedges' g test and the results show that the extracted rules have a different significance depending on the pattern type. They report that none of the extracted rules has a large effect size and that the significance varies between medium, small or negligible, depending on the type of the pattern.
To investigate the effect of smells on the previous rules, we performed a similar analysis for the classes with code smells. In Table 18 we present the results of the WMW test that compares the defects distribution in the smelly and non-smelly classes involved in patterns, with a distribution of defects in non-pattern classes. Table 19 reports the extracted rules from the WMW test and the significance of those extracted rules based on yhe results of the Hedges' g test.

A. RQ1: WHAT IS THE IMPACT OF CODE SMELLS ON THE PRESENCE/ABSENCE OF DEFECTS IN CLASSES INVOLVED IN DESIGN PATTERNS?
If we consider the binary relationship between patterns and defects (i.e., defective and defect-free classes) the results show that in five of the analyzed systems (Ant-1.7, JEdit-4.2, Camel-1.6, Ivy-2.0 and Velocity-1.6), the presence of patterns is positively associated with the presence of defects. For the other five systems we could not find any significant rules. It is also important to point that no rule that contradicts the extracted rules was identified, so in no system could we relate patterns to the absence of defects. After the introduction of the effect of smells to the analysis, we found that three out of five systems that were found to have a positive relationship between patterns and defects exhibit this relationship only if the patterns are affected by smells (JEdit-4.2, Camel-1.6, Ivy-2.0 and Velocity-1.6), and for the other two (Ant-1.7 and Camel-1.6) the relationship exists regardless of the presence of smells in the patterns.
In the remaining five systems, for which we initially could not reject the null hypothesis, patterns in Poi-3.0 were found to be positively related with defects only if they were smelly; in Xerces-2.0 the effect was even more evident: smelly patterns have been positively associated with defects, while the association for non-smelly patterns was negative.
By including code smells we did not only extract more rules, but we also found that in the majority of systems, patterns are positively associated with defects only when they are smelly. On the other hand, the results for non-smelly patterns are mixed, so while they are positively associated with defects in a couple of systems, they also have negative or no relationship with defects in other systems. This observation provides a possible explanation for the mixed [52] or small VOLUME 9, 2021   relationship [39] reported in the literature between design patterns and defect-proneness. The presence of code smells appears to be a factor that interacts with design patterns and has a decisive impact on defects in the subject code by amplifying the previously existing defect-proneness. Consequently, it has a practical consequence for software developers. The intense use of patterns can lead to their interactions and the proliferation of cross-cutting effects [6], resulting in some types of code smells. That, in turn, could effectively diminish or revert the expected advantages of applying design patterns, even if the pattern classes attract fewer smells than the non-pattern ones [53]. The results for nine out of ten of the analyzed systems indicate that design pattern classes are linked with a higher number of defects than the non-pattern classes. Only in the case of Xalan-2.7 no significant rules were identified. The effect size analysis reported that the mean difference between the smelly and non-smelly patterns in terms of defects is between [0.2-0.5] of standard deviation, which entails that the significance of those extracted rules are either small or medium, depending on the system. By introducing information about code smells into the analysis, we obtained new insights into those results. First, in the majority of systems (seven out of ten), smelly patterns have a higher number of defects than the non-smelly patterns, and no system produced contradictory results. The effect size for the majority of those extracted rules is large, indicating that the difference between the two groups is of a large significance.
With regard to the effect of smells on the relationship between pattern vs. non-pattern classes with defects, the extracted rules were difficult to interpret, because while smelly patterns are associated with more defects than non-pattern classes in eight systems, the non-smelly patterns also have more defects than non-pattern classes in six systems. Those results initially suggested that the smelliness  of a pattern is not a valid contextual factor for analyzing defect-proneness. A thorough analysis of the extracted rules shows that the extracted rules for the relationship between smelly patterns and defects are stronger than those which show the relationship between non-smelly patterns and defects. All the rules extracted for the smelly patterns are significant at α = 0.01, while only two rules are significant at the same level for the non-smelly patterns. The effect size analysis also strengthens this conclusion, since for the smelly pattern rules the mean difference is large enough to be of a practical significance, while the effect size for the non-smelly pattern rules is either small or even negligible.
To have a more comprehensive understanding of the results and to isolate the effect of smells, we again performed a WMW test to compare the smelly vs. non-smelly patterns with smelly vs. non-smelly non-pattern classes. The results are summarized in Table 20.
The results confirm our initial observations that smelly patterns, in the majority of systems, have a higher defect distribution than the non-pattern classes, regardless of whether those classes are smelly or not. On the other hand, the comparison of non-smelly patterns with smelly and non-pattern classes showed that they have similar defect distribution in the majority of the systems and that non-smelly patterns have a higher defect distribution only in case of Camel, while they have a lower number of defects in case of Xalan and Xerces. This may suggest that the effect of smells and patterns on defects is cumulative and the results of comparing data that belong to only one group (smells or patterns) could be attributed to other contextual factors. It is also worth mentioning that the effect size analysis of the relationship between nSDPvs.SnDP shows that the significance of the mean difference between those two groups is small or even negligible. The results also suggest that classes which are not participating in a pattern and are not affected by smells tend to attract fewer defects than pattern classes.
Finally, the results of the effect size analysis strengthen our conclusion, as they show the large significance of the extracted rules related to smelly design patterns, while they demonstrate the small significance of the rules related to the non-smelly patterns.

C. RQ3: WHAT IS THE EFFECT OF CODE SMELLS ON THE RELATIONSHIP BETWEEN SPECIFIC DESIGN PATTERNS AND DEFECTS?
In the subsequent sections we discuss the binary and the cumulative relationships between specific patterns and defects, and the effect of smells on these relationships.

1) THE BINARY RELATIONSHIP
We are interested in analyzing the binary relationship between specific patterns and defects, and in describing how the presence of smells affects this relationship. However, the dataset has a very small number of instances for some patterns, e.g., Chain of Responsibility (3 instances, all of them are smelly), Bridge (18 instances, only three of them are smelly), Observer and Prototype (14 instances each, only 2 instances in each case are smelly). Because of the small sample size, FET reported the insignificance of the extracted rules even if they were supported in 100% of cases. Nevertheless, the detailed analysis reported some significant associations such as the Adapter, Decorator, Proxy and State patterns are positively related with the presence of defects, while the presence of the Visitor pattern is associated with the absence VOLUME 9, 2021 of defects. The Visitor case contradicts our findings reported in Sec. IV-A and contradicts our findings for all other patterns in this section.
The case of Visitor is unique: among 38 instances, 12 of them are defective and 26 are defect-free. All of them come from a single system, velocity-1.6, and are located in a single package org.apache.velocity.runtime.parser. node. All instances represent objects that visit and parse a specific type of a node. As the amount of the code inside the Visitor pattern is minimal, and its logic is clear and simple, no defect was reported for the majority of those instances. In the remaining classes, defects were cosmetic or related to special cases which have not been covered.
After introducing the effect of smells to our analysis, the results showed that in the case of Adapter, State and Visitor, smelly patterns are positively associated with the presence of defects. However, the Decorator and Proxy patterns require the investigation to be replicated on a larger dataset: although the presence of smells in the pattern classes was associated with defects in 100% of the cases, the small number of smelly classes (1 for Decorator and 3 for Proxy) invalidated the FET results. Furthermore, we found that the non-smelly Visitor classes are associated with the absence of smells. These results are consistent with our findings reported in Sec. IV-A.
On the other hand, for both the Decorator and Proxy patterns, the results suggest that smell-free pattern classes are also associated with the presence of defects, which compels us to conduct further investigation. For the other patterns, the small number of detected instances prevented us from extracting any rules.
For the Decorator pattern, the majority of defective non-smelly instances belong to two systems, Xalan-2.7 and Camel-1.6. In both systems the evolution of the Decorator pattern scattered its functionalities into many small objects representing crosscutting concerns, which in turn became hard to comprehend and maintain, and as a consequence, produced defects.
With regards to the Proxy pattern, there are 23 defective non-smelly instances, and 16 of them belong to Xalan-2.7. Almost all the Visitor instances in this system belong to a single package org.apache.xalan.xsltc.compiler. Those instances parse specific types of instructions before passing the parsed segments to a converter object. Those instances extend a single parent class, Instruction, and their tight coupling with this shared parent causes them to also share the same defects.

2) THE DISTRIBUTION OF DEFECTS
The results of the detailed analysis are consistent with our findings reported in IV-B. They indicate that the Adapter, Decorator, Prototype, Proxy and State patterns are linked with more defects than non-pattern classes. While we could not extract any rules for other patterns, no extracted rules in any of the patterns contradicted our findings. The introduction of smells into our analysis resulted in the observation that, in case of the Adapter, Bridge, Observer, Singleton, State and Visitor patterns, smelly patterns attract more defects than non-smelly patterns. For all other patterns, we could not extract any rules that contradict our findings.
With respect to the effect of smells on the relationship between patterns and defects we found that for the majority of pattern types (Adapter, Bridge, Observer, Prototype, Proxy, Singleton, State and Visitor), the smelly design patterns attracted more defects than non-pattern classes. The effect size analysis reported that the mean difference between those smelly patterns and non-pattern classes is greater than the 0.8 of standard deviation, which indicates the large significance of those extracted rules. For the non-smelly patterns, we found that only in the case of Adapter, Decorator, Proxy and State did the non-smelly patterns attract more smells than the non-pattern classes and those extracted rules have a small significance. Moreover, in case of the Visitor pattern, we concluded that non-smelly patterns have fewer defects than non-pattern classes.
The defect proneness of specific smells and patterns has also been studied in the literature. Our findings partially confirm the results reported by Aversano et al. [6], Vokáč [52] and Onarcan and Fu [39] with respect to the Singleton and Observer patterns and their positive association with defects. Our results show that for both of them the presence of smells additionally amplified the defect proneness of the affected classes. For other patterns, the results differ, which may also indicate the confounding role of code smells addressed in our work.

VI. THREATS TO VALIDITY
A. CONSTRUCT VALIDITY Due to limitations related to the granularity of the PROMISE dataset, our analysis had to be performed at the class level. This inevitably affected the results, as some code smells have been reassigned from methods to classes.
The other issue concerns the detection tools used to detect both patterns and smells. Although those tools were chosen for their high precision, recall and accuracy compared to other detection tools [3], [16], we have not performed cross-validation with other tools, just manual verification for a sample of detected patterns. Our results are as accurate as those tools are.
Finally, in our analysis we consider the presence of smells as a binary variable, regardless of their types and number. Research on relationships among various code smells shows that collocated smells seem to have more detrimental impact on quality than individual smells [5]. This issue could be further elaborated in replication studies.

B. EXTERNAL VALIDITY
Our exploratory analysis was performed on 10 small-and medium-size Java systems, and our findings should be interpreted accordingly. Furthermore, for some patterns we have only a small number of detected instances, which limits extrapolating the results beyond the scope of the study.

C. CONCLUSION VALIDITY
Most of our findings were driven and validated statistically, so our conclusion is highly related with the accuracy of the statistical tests which we used. The non-normal distribution dictated that we use non-parametric tests which have a lower statistical power than parametric tests. Additionally, the size of the subject projects makes us cautious when evaluating the impact of the results.

VII. CONCLUSION
In the paper we investigated the links between design patterns and defects, and how the presence/absence of smells affects these relationships. Our analysis included 10 smalland medium-size Java systems. The findings suggest that pattern classes are associated with more defects than non-pattern classes, and that smells could be considered as a contextual factor in this relationship since smelly pattern classes attract more defects than both non-smelly pattern and non-pattern classes.
The paper findings are three-fold: • Investigating the binary relationship between patterns and defects showed that patterns are positively associated with the presence of defects, thus validating the results reported in previous studies. However, by including the presence of smells as a confounding variable in this relationship, our results indicate that only smelly patterns have a unanimously positive association with defects, while non-smelly patterns delivered mixed results.
• Our results show that pattern classes have a greater number of defects than non-pattern classes. Introducing the effect of smells into the analysis reveals that smelly classes attract more defects than non-smelly classes, and that both smelly-and non-smelly pattern classes have, in a different rate, a higher defect distribution than non-pattern classes. The findings also suggest that the relationship between smelly patterns and defects is more significant than the relationship between non-smelly patterns and defects.
• The relationship between specific patterns and defects varies, both in terms of the binary and quantitative relationships. This variation still holds true if smells are introduced into this relationship. Nevertheless, there are some common findings between all the patterns. For example, our analysis did not reveal a pattern that attracts a lower number of defects than non-pattern classes. In contrast, the Adapter, Decorator, Prototype, Proxy and State patterns tend to have a higher defect distribution than non-pattern classes. The introduction of smells into the analysis showed that the majority of smelly pattern classes attract more defects than non-pattern classes, and that non-smelly patterns attract more or fewer defects, depending on their type. For example, the non-smelly Adapter, Decorator, Proxy and State classes attract more defects than the non-pattern classes. On the other hand, smelly Visitor classes are linked with a lower defect distribution than non-pattern classes. What is also noticeable in our results is that no non-smelly pattern attracts a higher defect number than a smelly pattern. On the contrary, smelly Adapter, Bridge, Observer, Singleton, State and Visitor classes tend to have more defects than non-smelly pattern classes. Furthermore, the binary association between different patterns and defects also varies between patterns. The Adapter, Decorator, Proxy and State patterns are associated with the presence of defects, while the Visitor pattern is associated with the absence of defects. Taking into consideration the effect of smells on the previous findings showed that the majority of smelly patterns tend to have positive associations with defects, but with a different confidence and significance. The results, albeit preliminary, can inspire and foster further research on the contextual factors that affect defectproneness, changeability and other important software properties. Understanding their role may help in isolating their individual impact and the interactions they play a role in.
Our findings can have an impact on the development of practice. Design patterns promote good practices. However, if pattern classes are affected by code smells, the advantages of patterns could be challenged by defects resulting from their interaction with smells. Therefore, we conclude that preventing and removing code smells may reduce the defect-proneness of the code, so we advise programmers to take this possibility into account.
In the future, we plan to replicate this experiment on a larger scale and to investigate the individual effect of specific smells on the relationship between patterns and defects.
TAREK ALKHAEIR received the master's degree in computing from the Poznań University of Technology, Poznań, Poland, and the master's degree in web science from Syrian Virtual University, Syria. He is currently pursuing the Ph.D. degree with the Poznań University of Technology. He is currently a Software Engineer and a Researcher. He also works as a Senior Software Engineer with Roche Poland. His research interests include code quality, design patterns, and code smells.
BARTOSZ WALTER received the Ph.D. degree from the Poznań University of Technology, Poland. He is currently affiliated to his Alma Mater, where he is also an Assistant Professor, and also works as a Senior Researcher with the Poznań Supercomputing and Networking Center, Poznań, Poland. His research interests include software maintenance and evolution, as well as several aspects of object-oriented design and software testing. VOLUME 9, 2021