1. Introduction
Detecting defects, adopting corrective measures and providing preventive solutions are essential activities of software development. When done in a coherent and methodical way, this not only improves the reliability of the software but also helps in reducing the costs of development and further enhancement [1]. However, many factors associated with software development make defects inevitable. Thousands of lines of code, written by a team of developers, are highly susceptible to defects. The use of third-party source code, such as functions, subroutines and libraries, also adds to defect vulnerability. Moreover, existing code subjected to several modifications and enhancements to meet new criteria and/or to enable new functionality also has a high possibility of defect occurrence [2]. Nevertheless, a significant reduction in defects can be accomplished with the aid of defect detection solutions. While conventional approaches may favor a critical analysis of the code by segmentation, the choice of an advanced programming language, improved developer training, etc., there exist alternative automated ways [3]. One such approach is the use of metrics and machine learning techniques to build predictive models that help identify defects and fault prone modules with a certain level of confidence [4-8].
On the premise that machine learning algorithms suit this need, we attempt to identify defects in web applications using an object oriented design metrics suite [7,9,10]. We adopt 14 of the most popular machine learning techniques and apply the methodology to three releases of the Apache Click and four releases of the Apache Rave web application projects. The results are evaluated using the area under the curve (AUC) obtained from receiver operating characteristic (ROC) analysis [11].
In general, machine learning models have been used extensively across various disciplines with varying degrees of success. It has become obvious that the choice of metrics is crucial, and an optimized metric set not only provides faster results but also better accuracy and reliability. To this end, we have used the filter methodology [12,13] based on correlation statistics, as implemented in Weka 3.7 [14]. We note that the correlation based feature selection technique has been widely used across various disciplines [15-17] and is therefore broadly accepted.
Primarily, the present work emphasizes the statistical analysis of the metric distributions across the various releases and uses rule and ensemble based machine learning techniques to identify fault prone classes in the Apache datasets. Importantly, we also find that the metrics identified using correlation based feature selection render better defect prediction models.
Beyond this, minimal information exists on how a particular machine learning algorithm depends on the nature and distribution of the chosen metrics data. This is partly evident from the variations in the predictions of various machine learning algorithms on a given dataset, and of a particular algorithm on statistically different datasets. Likewise, we note that the varying degree of performance could well depend on the choice of metrics as well. From these perspectives, we have considered machine learning techniques that are based on parametric, non-parametric and ensemble algorithms.
The remainder of the paper is organized as follows. In Section 2 we summarize the related work on defect detection using predictive techniques, which forms the motivation for the current work. In Section 3 we outline the research background, with details of the independent and dependent variables, the selection of applications, the dataset collection procedure and description, the machine learning techniques and their performance indicators. Section 4 details the algorithm and methodology. Section 5 presents the results with discussion, and in Section 6 we examine the threats to the validity of the approach. Finally, in Section 7 we summarize our work and state future directions.
2. Literature Review
A wide range of statistical and machine learning models exists to predict defective modules in a given software system. Statistical techniques such as univariate and multivariate logistic regression (LR), and machine learning techniques such as artificial neural networks (ANN), support vector machines (SVM), Bayesian networks (BN) and many more have been proposed [18-20]. The correlation between software metrics and fault-proneness has also been studied using many models [7,10,21]. Arisholm et al. [22] compared variants of decision tree (DT) techniques with neural networks, SVM and LR on a Java telecom system and found the DT based technique (C4.5) to yield better results. Consistent with the earlier findings of Lessmann [4], the authors suggest that the choice of the classification algorithm for fault proneness is seldom important. We note that the work of Lessmann [4] was based on the traditional McCabe [23,24] and Halstead [25] metrics and used analysis of variance (ANOVA) for the statistical comparison of classification models. Earlier, in their review of software fault prediction studies, Catal and Diri [26] emphasized the need for more studies using class-level metrics and machine learning algorithms. Their work also emphasized that fault proneness prediction studies provide more useful information when based on public datasets.
Table 1. Object-oriented metrics used in the study
De Carvalho et al. [18], using the multi-objective particle swarm optimization (MOPSO-N) technique [27,28] with six C&K design metrics (see Table 1 for definitions), found that RFC, WMC, CBO and LCOM are the important object oriented metrics for indicating faults in a class. The results were compared with seven other machine learning methods using the Wilcoxon test [29]. The authors observed that the results generated with the MOPSO-N technique were on par with the ANN and BN techniques, and that the SVM algorithm yielded the lowest performance. On the other hand, Singh et al. [30], using a similar set of object oriented metrics, found SVM to be a rather robust technique for fault prediction. Nevertheless, a consensus that emerged from both works was that the NOC metric could not be considered a reliable feature for fault prediction. A similar conclusion on the relevance of the NOC metric was also drawn by Gyimothy et al. [5] and Olague et al. [31]. We also note that the irrelevance of the NOC metric for fault proneness was established by univariate analysis [31] and not by any feature selection method.
Further, Catal et al. [19] used the NASA KC1 dataset to analyze the artificial immune recognition system (AIRS) and a Bayesian approach for fault prediction. Although the authors conducted no statistical significance tests, they selected the features using the popular correlation based feature selection method. The most salient finding was that CBO was identified as an important metric for fault prediction. On the other hand, the study by Pai and Dugan [32] showed that apart from CBO, SLOC, WMC and RFC were also equally significant, and that neither DIT nor NOC was significant. The significance of LCOM, however, appeared to be model dependent.
Kanmani et al. [33] compared ANN techniques with statistical techniques on a software system written in the Java language. The findings of the study revealed that neural network based fault prediction models perform better than statistical techniques. Azar and Vybihal [34] found the ant colony optimization (ACO) technique to be better than both the decision tree (C4.5) and random guessing techniques using C&K metrics; the Wilcoxon test was used for comparison. Di Martino et al. [35] configured SVM with a genetic algorithm for the prediction of faulty classes on the basis of object oriented metrics and compared the results with the optimization of SVM using grid search. Their results showed that the genetic algorithm yielded better results for the configuration of SVM parameters.
Okutan and Yildiz [36] used Bayesian networks to evaluate the relationship between C&K metrics and defect proneness. They found that NOC and DIT are not effective metrics for defect prediction, whereas LOC, CBO, RFC and WMC play an important role in identifying fault-prone classes. Zhou et al. [37,38] utilized the C&K design metrics of the NASA dataset to establish their relation with fault-prone classes when fault severity is taken into account. Their findings indicated that the design metrics were able to predict low severity faults better than high severity faults in fault-prone classes. D'Ambros et al. [8] evaluated various defect prediction approaches across different systems. However, the authors expressed the need for more detailed case studies on different datasets, as external validity in defect prediction was found to be difficult to establish. Bowes et al. [39] introduced mutation-aware fault prediction models using LR, RF, NB and J48 and indicated that the best performance is obtained using a combination of both static and dynamic mutation metrics; the performance of the classification models was measured using the Matthews correlation coefficient (MCC). In a recent work, Malhotra and Raje [40] investigated the Android dataset to predict defective classes using object oriented metrics. Their findings showed Ce, LOC, LCOM3, CAM and DAM to be significant predictors, and the naïve Bayes algorithm was identified as an important machine learning algorithm.
Thus, the general consensus appears to be that no generalization can be derived regarding the choice of machine learning algorithms or the choice of features, whether by feature selection techniques or by univariate analysis, for fault proneness. It therefore becomes essential to perform more investigations on varying datasets, both public and private. To the best of our knowledge, none of the above studies have been conducted on widely used web application frameworks such as Apache using algorithms that span statistical, rule-based and ensemble machine learning techniques. In this study, we analyze the relationship between object oriented metrics and machine learning techniques using web applications. The performance of 14 machine learning techniques (see Table 2) has been assessed and compared for defect prediction in classes of web applications. Statistical tests have been performed to obtain the statistically significant differences among the machine learning techniques on the various releases of the Apache Click and Rave datasets.
3. Research Background
3.1 Independent Variables
The independent variables of this study are the object oriented design metrics suite computed on each Java file of the project using the defect collection and reporting system (DCRS) [41], which has been developed in the Java programming language at Delhi Technological University. The metrics used in the study are listed in Table 1.
3.2 Dependent Variable
The dependent variable analyzed in this study is the fault proneness of a class. It represents the likelihood of defects in a class after the release of the software. Identifying the classes that are likely to be defective helps in the efficient allocation of constrained resources during testing.
3.3 Selection of Applications
As mentioned earlier, we focus on identifying fault prone classes of web applications. In order to develop reliable predictive models, one needs multiple versions of an application with a moderate number of classes. The study uses the Apache Click and Apache Rave open source projects developed under the Apache Software Foundation (ASF) process. The ASF projects reliably link Git commits to closed bugs in the issue tracker, resulting in high quality data for building defect prediction models. Apache Click and Apache Rave are large web projects developed in Java, with more than three hundred classes in each release and with at least three releases. Apache Click is a J2EE web application framework providing an easy to learn, client style programming model. Apache Rave aims to provide a social mash-up engine to support web widgets for the internet as well as intranets, and is in its early development phase.
3.4 Feature Selection
Feature selection is the process of selecting the most discriminatory features out of the available ones [42] and is considered a crucial step in machine learning problems. While all features may appear important for accurate and precise predictions, using them indiscriminately is generally an inappropriate methodology that yields poor outcomes. For instance, a large feature set makes the problem computationally cumbersome. Moreover, a raw collection of features may lead to information redundancy and increase the complexity of the prediction models. For the entire process of machine learning aided prediction, cost effectiveness demands an optimization effort in data acquisition and processing before the data are subjected to the prediction models. It is now well known that correlated features in the input dataset not only lead to ambiguous predictions, but also affect the generalization capability of the machine learning algorithms.
In general, there exist two mechanisms for feature selection: the wrapper and the filter based methods. While wrappers use the classifier at hand to select the feature subset, filter methods optimize the features independently of the classifier. The filter methods, being independent of the classifiers, either use probability based distance approaches such as the Bhattacharyya distance [43], the Chernoff distance [44] and the Patrick-Fisher distance [45], or the correlation based approach [12,13]. The choice of feature selection method, however, depends on the problem at hand. It has been discussed previously that since correlation based feature selection (CFS) makes use of all the training data at once, it can give better results than a wrapper on small datasets [12,13]. In other words, a feature selection method that would render high reliability in detecting defects in web applications should have the following characteristics: (i) it should not only scale, but also lead to high predictability for a large number of web applications, (ii) it should be independent of an explicit class labeling, (iii) since classification of the web metrics is not the goal, the feature selection process should not assume the use of a specific classifier, and (iv) it should have the best performance among the methods satisfying the above conditions.
Since the present study of fault proneness prediction relies on the Apache dataset and uses a variety of machine learning algorithms, we adopt the filter method. Further, we note that a comparative study of 32 feature selection methods on defect prediction performance has been carried out by Xu et al. [46] using feature ranking, wrapper based and filter based feature evaluation techniques. The authors found that CFS unequivocally yields the best performance.
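To make the CFS criterion concrete, the following minimal Python sketch computes Hall's merit heuristic for a candidate subset of metrics and performs a greedy forward search. It is illustrative only: it uses Pearson correlation on a numeric 0/1 fault label, whereas the Weka 3.7 implementation used in this study discretizes the data and uses symmetric uncertainty together with a best-first search; the DataFrame and column names are hypothetical.

```python
import numpy as np
import pandas as pd

def cfs_merit(df: pd.DataFrame, features: list, label: str = "FAULTY") -> float:
    """Merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is
    the mean feature-class correlation and r_ff the mean feature-feature
    correlation (Hall's CFS heuristic). The label is assumed to be numeric 0/1."""
    k = len(features)
    y = df[label].astype(float)
    # mean absolute correlation between each feature and the class label
    r_cf = np.mean([abs(df[f].corr(y)) for f in features])
    if k > 1:
        corr = df[features].corr().abs().values
        r_ff = (corr.sum() - k) / (k * (k - 1))   # exclude the diagonal entries
    else:
        r_ff = 0.0
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(df, candidates, label="FAULTY"):
    """Greedy forward search: keep adding the metric that most improves the merit."""
    selected = []
    while candidates:
        scored = [(cfs_merit(df, selected + [c], label), c) for c in candidates]
        best_merit, best = max(scored)
        if selected and best_merit <= cfs_merit(df, selected, label):
            break
        selected.append(best)
        candidates = [c for c in candidates if c != best]
    return selected
```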
3.5 Performance Indicators
A variety of performance indicators, such as the confusion matrix, gain and lift charts, the Kolmogorov-Smirnov chart, the Gini coefficient, the concordant-discordant ratio, ROC analysis, root mean squared error, etc., have been used to evaluate the predictive capability of models developed using machine learning techniques. In general, a defect dataset has a disproportionate ratio of faulty and non-faulty classes and is therefore imbalanced in nature. The ROC curve is the performance measure commonly used to deal with the imbalanced nature of the dataset. It plots the percentage of correctly predicted faulty classes (sensitivity) on the y-axis against one minus the percentage of correctly predicted non-faulty classes (1-specificity) on the x-axis. The optimal cutoff point that maximizes both sensitivity and specificity is determined from the ROC curve. The comparative performance of each machine learning technique is evaluated using ROC curves.
The AUC is the area under the ROC curve and its value lies between zero and one. It is a combined measure of sensitivity and specificity and is used to assess the accuracy of the predicted models. The higher the value of AUC, the better the predictive capability of the model. The AUC is insensitive to the effects of noise and imbalanced data; hence it is advantageous to use AUC for the performance evaluation of predictive models.
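As an illustration of how the AUC and the sensitivity/specificity-maximizing cutoff can be obtained, a minimal sketch using scikit-learn is given below (the study itself uses Weka). Here y_true and y_score are assumed to be the actual 0/1 labels and the predicted fault probabilities, and the cutoff corresponds to Youden's J statistic.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roc_summary(y_true, y_score):
    """Return the AUC, the cutoff maximizing sensitivity + specificity,
    and the sensitivity/specificity achieved at that cutoff."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)
    j = tpr - fpr                 # Youden's J = sensitivity + specificity - 1
    best = np.argmax(j)
    return auc, thresholds[best], tpr[best], 1.0 - fpr[best]

# Toy example:
# auc, cutoff, sens, spec = roc_summary([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```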
3.6 Validation Methods
A practical estimate of the accuracy of a model is obtained by applying it to data other than those from which it was built. Therefore, we performed 10-fold cross-validation of the models. Each dataset is randomly divided into 10 equal subsets. Each time, one of the 10 subsets is used as the test set and the other 9 subsets form the training set. The process is repeated 10 times and the results from all the folds are combined to produce the model result [47].
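A minimal sketch of this validation scheme, assuming scikit-learn rather than the Weka tool used in the study, pools the out-of-fold predicted probabilities from the 10 folds and reports a single AUC; X and y denote the metrics matrix and the binary fault label.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def ten_fold_auc(model, X, y):
    """10-fold cross-validation: each fold is held out once as the test set,
    the out-of-fold probabilities are pooled, and one AUC is reported."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

# e.g. ten_fold_auc(RandomForestClassifier(random_state=0), X, y)
```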
3.7 Machine Learning Techniques
We have used machine learning techniques for building the prediction models. A set of feature vectors (the object oriented metrics described in Section 3.1) and the corresponding labels (faulty or non-faulty) are used as the training set to build the fault prediction model. The model is then applied to a different set of feature vectors called the testing set. The predicted faulty or non-faulty labels for the testing set are compared with the real labels to compute the performance indicators explained in Section 3.5.
Table 2. Machine learning techniques used in the study
The performance of machine learning techniques depends on the properties of the data to be classified. Table 2 summarizes the machine learning techniques used. The experiments are conducted with the Weka 3.7 tool, and the predictive models are built using the machine learning techniques with their default parameter settings. In contrast, a few recent studies [35,48,49] have emphasized the importance of parameter tuning using heuristic techniques such as genetic algorithms and differential evolution, arguing that such tuning can provide better prediction results [35,48,49]. Nevertheless, it has also been stated by Fu et al. [48] that parameter tuning must be repeated for any change in the data, and that different tuning algorithms result in different optimized parameter values. Therefore, parameter tuning eventually leads the defect prediction model to fall short of universality. It may also be noted that the tuned parameter technique mentioned in [48] is likely to overstate the results if the goals are improperly defined. Moreover, Arcuri and Fraser [50] have shown, using search based techniques, that parameter tuning has very sensitive effects on the external validity of the results. Thus, given that parameter tuning addresses the defect detection problem on a very local scale, we adopt the default parameters supplied by the Weka suite of programs, so as to achieve wider applicability, reproducibility, inter-dataset comparison and generality for web applications.
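For illustration, the sketch below instantiates a representative subset of the techniques of Table 2 with default parameters; scikit-learn classifiers and their defaults stand in here for the Weka 3.7 implementations actually used (e.g. J48, SMO, VP), so the concrete parameter values differ.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier

# Default-parameter models, analogous in spirit to running the Weka classifiers
# without any tuning; each constructor is left at its library defaults.
models = {
    "LR": LogisticRegression(),
    "MLP": MLPClassifier(),
    "NB": GaussianNB(),
    "DT (J48-like)": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "AB": AdaBoostClassifier(),
    "Bagging": BaggingClassifier(),
}

# e.g. compute a 10-fold cross-validated AUC for each model, as sketched in Section 3.6
```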
3.8 Statistical Testing
The statistical difference between the various machine learning techniques is computed using the Friedman test [51]. It is a non-parametric test used to rank a set of techniques over multiple datasets. The Friedman test is based on two hypotheses:
Null Hypothesis (H0): There is no significant difference between the performances of the compared techniques.
Alternative Hypothesis (H1): There exists a significant difference between the performances of the compared techniques.
The Friedman measure is defined as

$$\chi^2_F = \frac{12n}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right],$$

where R_j is the average rank of the j-th technique (j = 1, 2, ..., k), n is the number of datasets and k is the number of compared techniques. The Friedman measure follows a χ² distribution with (k-1) degrees of freedom. If the value of the Friedman measure falls in the critical region (obtained from the χ² distribution with a specified level of significance, i.e., 0.01 or 0.05, and (k-1) degrees of freedom), then the null hypothesis is rejected and it is concluded that there is a difference between the performances of the compared techniques; otherwise the null hypothesis is accepted. If the null hypothesis is rejected after applying the Friedman test, we perform post-hoc analysis using the Nemenyi test [52]. It is a non-parametric test that performs pairwise comparisons of the differences in performance of the techniques. The critical difference (CD) is calculated as

$$CD = q_{\alpha} \sqrt{\frac{k(k+1)}{6n}},$$

where q_α is the critical value of the studentized range statistic for k techniques at significance level α. SPSS version 16 for Windows (SPSS Inc., Chicago, IL, USA) is used for applying the Friedman and Nemenyi tests.
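The sketch below, assuming an AUC matrix with one row per dataset and one column per technique, computes the Friedman statistic with SciPy and the Nemenyi CD from the expression above; the q_α value must be supplied from a studentized range table for the chosen k and α (the study itself uses SPSS for these tests).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_nemenyi(auc: np.ndarray, q_alpha: float):
    """auc: rows = datasets (n), columns = techniques (k).
    Returns the Friedman chi-square, its p-value, the average rank of each
    technique (rank 1 = best AUC), and the Nemenyi critical difference."""
    n, k = auc.shape
    chi2, p = friedmanchisquare(*[auc[:, j] for j in range(k)])
    ranks = np.vstack([rankdata(-auc[i]) for i in range(n)])  # rank within each dataset
    avg_ranks = ranks.mean(axis=0)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
    return chi2, p, avg_ranks, cd

# e.g. friedman_nemenyi(auc_matrix, q_alpha=...)  # q_alpha from a q-table for k and alpha
```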
3.9 Data Description
The class-defect characteristics of the three releases of Apache Click and the four releases of Apache Rave web applications are provided in Tables 3 and 4, respectively. For each version, the tables list the number of classes, the size, the number of faults, the percentage of faulty classes and the name of the software along with the release pair (a release and its immediate successor) over which the faults were fixed.
Table 3. Apache Click dataset characteristics
Table 4. Apache Rave dataset characteristics
3.10 Data Collection Method
In order to collect data points for each software project, the source code of the different releases of the Apache Click and Rave applications, developed in the Java language, was obtained from the GitHub repositories https://github.com/apache/click and https://github.com/apache/rave, respectively. The faults were collected from the defect logs using DCRS [41], which mines the change logs of two predetermined consecutive releases of the software. In this study, defects incurred between the immediately preceding release and the subsequent one are considered. The collected faults are then mapped to the classes in the source code. We also computed a binary variable named "FAULTY", which is true ("YES") if the total number of faults in the class is non-zero, and false ("NO") otherwise.
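A minimal pandas sketch of this labeling step is shown below; the file and column names are hypothetical, since the exact format of the DCRS export is not specified here.

```python
import pandas as pd

# Hypothetical DCRS-style export: one row per Java class with its metrics and
# the number of faults fixed in the immediately following release.
df = pd.read_csv("click_2.2.0_metrics.csv")   # assumed file name

# FAULTY = "YES" if the class has at least one fault, otherwise "NO"
df["FAULTY"] = (df["fault_count"] > 0).map({True: "YES", False: "NO"})

print(df["FAULTY"].value_counts(normalize=True))  # faulty class percentage
```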
4. Research Methodology
In this section, we elaborate on the approach used in this work to predict the fault proneness of a class using object oriented metrics. The following steps (depicted in Fig. 1) are incorporated in our approach for model prediction:
• The change logs maintained in the software repositories of the selected projects are analyzed.
• The object oriented metrics and fault data are extracted from the reports using the DCRS module.
• The faults are associated with the corresponding classes of the software module.
• The fault prediction models are built by applying various machine learning techniques in order to conduct an extensive empirical study for prediction of faulty classes.
• The models are validated using the 10-fold cross-validation method.
• The proposed models are evaluated using appropriate performance evaluation measures.
Fig. 1. Schematic representation of the research methodology adopted in this work.
4.1 Research Questions
We investigate the following research questions:
• RQ.1: Which object oriented metrics serve as good indicators of faults in a class?
• RQ.2: What is the overall performance of the statistical and machine learning techniques for the prediction of fault prone classes on Apache Click and Apache Rave datasets?
• RQ.3: Which is the best predictive technique for identifying fault prone classes?
• RQ.4: Which pair of machine learning techniques is significantly different from one another for prediction of fault prone classes in web applications?
4.2 Descriptive Statistics
The maximum (max), minimum (min) and mean values of each object oriented metric for the selected versions of the Apache Click and Apache Rave projects are shown in Tables 5 and 6, respectively. We attempt to make a qualitative inference on the nature and impact of the object oriented metrics from the data shown in Tables 5 and 6. In general, a high value of WMC is anticipated to yield more faults [53]. It may be noted that there are no well defined WMC threshold values for fault prediction. However, it is evident from Tables 5 and 6 that, with regard to the WMC metric, Apache Click is anticipated to have lower fault proneness than Apache Rave. For Apache Click and Apache Rave, the WMC data span the ranges 0–95 and 0–142, respectively, although with comparable mean values.
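The min/max/mean summaries of Tables 5 and 6 can be reproduced from the collected data with a short pandas aggregation, as sketched below under the same hypothetical file and column names used earlier.

```python
import pandas as pd

df = pd.read_csv("click_2.2.0_metrics.csv")   # hypothetical DCRS export
metrics = ["wmc", "dit", "noc", "cbo", "rfc", "lcom", "lcom3", "npm", "dam", "cam"]

# One row per metric with its min, max and mean, as in Tables 5 and 6
summary = df[metrics].agg(["min", "max", "mean"]).T.round(2)
print(summary)
```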
The values associated with DIT are found to be less than the recommended value of 5 [54]. A high DIT is anticipated to increase faults. From the dataset, we find a maximum (minimum) DIT of 3 (2) for Apache Click, while for Apache Rave it is determined to be 4. These values empirically suggest that the DIT metric may not be particularly detrimental in this case study. Apart from DIT, which measures the depth of inheritance, an important and closely associated metric is NOC, which measures the breadth of the class hierarchy. The dataset shows that Apache Click has a larger maximum NOC (11) than Apache Rave (3). In general, a high NOC is found to indicate fewer faults.
In the C&K metric suite, the number of classes to which a class is coupled is measured by the CBO metric. A high CBO is undesirable, as excessive coupling between classes prevents reuse. From Tables 5 and 6, we find the maximum value of CBO for Apache Click to be 13, while for Apache Rave it is 9 for versions 0.12–0.13 and 0.16–0.17, and 7 for the 0.20.1–0.21.1 release. A high value of 27 is found for the latest version, 0.22.1–0.23, suggesting that it is highly fault prone with regard to the CBO metric. However, comparing with the data shown in Tables 3 and 4, we find that the faulty class percentage of Apache Rave version 0.20.1–0.21.1 is the highest (96.26%), which contrasts with the empirical expectation.
Table 5. Statistical description of the Apache Click dataset
Table 6. Statistical description of the Apache Rave dataset
Studies also reveal that the number of public methods (NPM) effectively serves as a good indicator for fault prediction. From the work of Shah et al. [55], it has been found that NPM plays a significant role for medium and large software systems categorized by size. Our dataset shows that the maximum NPM varies between 87 and 91 across the Apache Click versions, and is 124 across the Apache Rave versions. These high values of NPM suggest that the respective classes may be split for optimal performance [55]. Among the other metrics proposed by Bansiya and Davis [9], DAM and CAM also serve as good indicators of fault proneness. For DAM, which lies in the range [0, 1], a high value is generally desired. We find that the average value of DAM for the latest two versions of both Apache Click and Apache Rave is approximately 0.6 or above. Similarly, the statistical mean of CAM is determined to be 0.6 or above for both Apache Click and Apache Rave; note that the preferred value of CAM is close to 1.
A few other metrics also indicate that Apache Rave is relatively more fault prone than Apache Click. For example, the maximum RFC, which represents the response for a class, is found to be 143 for Apache Rave, against 94–98 for Apache Click. In general, classes with high RFC are complex to read, test and debug. Although no threshold value allows a quantitative judgment of fault proneness with respect to the RFC metric [53], in the present case the high RFC values associated with Apache Rave certainly indicate its instability with respect to the Apache Click application.
LCOM is yet another metric that helps in determining fault proneness. Based on the nature and applicability of the object oriented suite, four variants of LCOM have been proposed; here we emphasize LCOM and LCOM3. Following Tables 5 and 6, we find LCOM to be as high as 4000 or more for Apache Click, and more than 10000 for Apache Rave. However, when one considers the average value, LCOM is higher for Apache Click (approximately 130) than for Apache Rave. The latter shows an increase in the mean value from (LCOM)mean = 85 for version 0.12–0.13 to (LCOM)mean = 114 for version 0.22.1–0.23. Here also, a high LCOM indicates greater fault proneness. However, it may be noted that the validity of LCOM as an indicator metric for fault proneness has been criticized previously [32]. For instance, it has been argued that classes which use data generated by their own properties are likely to show high LCOM values; such situations are certainly not problematic. A workaround was to redefine the LCOM metric, which originally was based on method-data interaction. The expression to calculate LCOM3 is

$$LCOM3 = \frac{m - \frac{1}{a}\sum_{A} m_A}{m - 1},$$

where m and a are the number of procedures (methods) and variables (attributes) in a class, and mA is the number of methods that access an attribute A; mA is summed over all attributes of a given class. It is seen that LCOM3, for both Apache Click and Apache Rave, varies between 0 and 2. LCOM3 = 0 corresponds to the case where each method accesses all variables, indicating the highest possible cohesion, while LCOM3 = 1 is suggestive of a lack of cohesion of methods.
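A minimal sketch of this computation is given below; the mapping from methods to the attributes they access is assumed to be available (e.g., extracted by a metrics tool), since parsing Java source is outside the scope of this illustration.

```python
def lcom3(method_attribute_access: dict[str, set[str]], attributes: set[str]) -> float:
    """LCOM3 = (m - (1/a) * sum_A mA) / (m - 1), where m = number of methods,
    a = number of attributes, and mA = number of methods accessing attribute A."""
    m = len(method_attribute_access)
    a = len(attributes)
    if m <= 1 or a == 0:
        return 0.0  # convention assumed here: degenerate cases treated as fully cohesive
    mA_sum = sum(sum(1 for accessed in method_attribute_access.values() if attr in accessed)
                 for attr in attributes)
    return (m - mA_sum / a) / (m - 1)

# Example: two methods, each accessing both attributes -> LCOM3 = 0 (highest cohesion)
# lcom3({"getX": {"x", "y"}, "setX": {"x", "y"}}, {"x", "y"})
```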
5. Analysis and Results
In this section, we present the results of the empirical comparison of the machine learning techniques in terms of the AUC. The classifier models have been developed using the independent variables described in Section 3.1, selected through the CFS technique to obtain better results. Table 7 presents the relevant metrics found for each release of the Apache Click and Apache Rave datasets after applying the CFS technique. The results show that LCOM3, WMC, NPM and DAM were the most commonly selected object oriented metrics over the various releases of the Apache Click and Apache Rave datasets.
As discussed earlier, the machine learning classifiers were empirically evaluated using the AUC, which is capable of dealing with noise and unbalanced data [4]. Table 8 lists the 10-fold cross-validation results of the 14 machine learning techniques on the three releases of Apache Click and the four releases of Apache Rave. For each version, the machine learning technique yielding the relatively better AUC values is highlighted in bold. The results show that the models developed using the MLP, LR, Bagging and AB techniques have an AUC greater than 0.6 for most of the releases of the Apache dataset. That the statistical and ensemble based methods perform well in fault proneness prediction has also been emphasized by Ghotra et al. [56], whose findings were demonstrated on the NASA and PROMISE corpus datasets. Overall, this level of accuracy is also consistent with the findings of Menzies et al. [6], which report that defect predictors are useful for identifying fault prone modules.
We note that the various machine learning techniques predict the fault proneness of the Apache Click versions quite well. The poor fault proneness prediction for the intermediate versions of Apache Rave is attributed to the limited number of features selected by the CFS scheme. Note that only NPM and WMC, respectively, were found prominent by CFS for Apache Rave versions 0.16–0.17 and 0.20.1–0.21.1. The selection of only one feature for these intermediate versions of Apache Rave suggests a strong correlation between the features, which is problematic and harder to judge.
As evident from the results listed in Table 8, the relative performance differences of the machine learning algorithms are small across the various versions of the Apache dataset. In order to verify that the observed performance differences between the predictive models are not random, we apply the Friedman test. The null hypothesis for the Friedman test states that all machine learning classifiers are equivalent and hence their ranks should be equal. The Friedman test resulted in a χ² value of 38.13 and an FF value of 4.32 for the 14 machine learning algorithms (k = 14) on the seven Apache datasets (N = 7). For a two-tailed test at the 0.05 level of significance, the critical value of F with (k-1) and (k-1)(N-1) degrees of freedom is 1.848. Thus, the null hypothesis is rejected. The average rank of each machine learning classifier is provided in Table 9. It suggests that MLP is the best technique for the development of fault prediction models for the Apache dataset. Our findings are consistent with the work of Gyimothy et al. [5]. The models developed using rule based algorithms, such as DT and J48, were found to perform relatively poorly.
Table 9. Friedman test results of the 14 machine learning techniques
In general, our findings corroborate those of D'Ambros et al. [8], where the authors, using a regression model on Apache Lucene, found that LR applied to the C&K metric set gives an AUC value of 0.721. Consistently, our LR analysis on the Apache Click dataset gives an average AUC of 0.734, while for Apache Rave the average AUC across the four versions is 0.617. However, our results span 14 machine learning techniques, of which MLP yields the best performance. That the network based MLP is best suited for fault prediction has also been emphasized by Malhotra and Raje [40].
Next, we proceed with the Nemenyi post-hoc test to detect the fault prediction classifiers that differ significantly. As mentioned above, the Nemenyi post-hoc test compares all pairs of classifiers and checks which models' performances differ significantly, i.e., exceed the CD. The Nemenyi test CD came out to be 5.353 at the 0.05 level of significance. The results of the pairwise comparisons of the 14 machine learning techniques are shown in Table 10, with the values that exceed the CD highlighted in bold.
Table 10. Nemenyi post-hoc test results of the 14 machine learning techniques
The results of the Nemenyi test show that, out of the 14 machine learning techniques used in the study, the performance of J48 is significantly poorer than that of LR, MLP, AB, LB and Bagging. We also find significantly poorer performance of VP, SMO and DT compared with LR, MLP, AB and LB. Therefore, we identify that the statistical, MLP and ensemble based approaches performed significantly better than machine learning algorithms such as J48, VP, SMO and DT. However, we find that the experimental data are not sufficient to reach any conclusion regarding the RandTree, REPTree, RF, NB and BN algorithms.
6. Threats to Validity
It is important to be aware of the threats to the validity of the results obtained from an empirical study in software engineering. The results cannot be generalized, as they depend on a large number of project and environment specific context variables. In this study, we have analyzed seven releases of web applications with 14 machine learning techniques. One possible source of bias is the data used in the study. The data have been collected using the DCRS tool and are placed on the web for replication and comparison with other experiments. The set of object oriented metrics selected for this study is based on previous experiments [5,31,40]; other researchers may select a different collection of metrics for their studies.
The selection of applications for the study considered the number of classes and the size of the code, criteria which may differ for other researchers. The selection of classifiers is another possible source of bias. We have considered 14 machine learning techniques, and there are still others that could have been studied. Our selection is guided by the aim of finding a meaningful balance between established techniques and novel approaches. We believe that the most important representatives of the different domains (statistics, machine learning, and so forth) are included.
7. Conclusion and Future Directions
The underlying objective of this research is to comprehensively compare the performance of 14 machine learning techniques for fault prediction in web applications, namely the Apache Click and Apache Rave projects, using object oriented metrics. Before predicting defect proneness with the various machine learning algorithms, which are parametric, non-parametric and ensemble based, an independent basis set of metrics was first refined using the correlation based feature selection method. The models were thereafter validated using 10-fold cross-validation and evaluated using the AUC performance measure. The main findings of the work are summarized below:
1) The LCOM3, WMC, NPM and DAM object oriented metrics are found to be the significant predictors selected by CFS over the three and four releases of the Apache Click and Apache Rave datasets, respectively.
2) With AUC values greater than 0.6, the work affirms the overall predictive ability of the MLP, LR, Bagging and AB techniques for fault prediction.
3) Following the Friedman test results, MLP appears to be the best qualified technique for fault prediction on the Apache Click and Apache Rave datasets. Furthermore, the post hoc Nemenyi test validates significant pairwise differences between the performance of MLP and that of several other machine learning techniques.
Hence, we conclude that the machine learning models developed for fault prediction in this work can be successfully used for identifying faults in subsequent releases of the Apache web application dataset. It is anticipated that these models could also be applied to other projects of a similar nature. In order to derive universality in defect prediction across various datasets, we plan to carry out model predictions using search based techniques in different language environments in the future. A selection and detailed investigation of inter-project training data for cross-project validation is also proposed.