Article Information
Corresponding Author: Joon-Min Gil**, jmgil@cu.ac.kr
JongHyuk Lee*, Dept. of Artificial Intelligence and Big Data Engineering, Daegu Catholic University, Gyeongsan, Korea, jonghyuk@cu.ac.kr
Mihye Kim**, School of Computer Software, Daegu Catholic University, Gyeongsan, Korea, mihyekim@cu.ac.kr
Daehak Kim*, Dept. of Artificial Intelligence and Big Data Engineering, Daegu Catholic University, Gyeongsan, Korea, dhkim@cu.ac.kr
Joon-Min Gil**, School of Computer Software, Daegu Catholic University, Gyeongsan, Korea, jmgil@cu.ac.kr
Received: October 1 2018
Revision received: February 8 2019
Revision received: July 10 2019
Revision received: October 21 2020
Accepted: December 22 2020
Published (Print): June 30 2021
Published (Electronic): June 30 2021
1. Introduction
In recent years, the field of data analytics in education has attracted increasing attention with the rapid growth of educational data and the spread of big data technology. With the increasing use of online and software-based learning tools, the amounts and types of educational data have grown greatly, and new methods are required to analyze them. Thus, educational data mining [1] and learning analytics [2] are being expanded to explore large-scale data generated by universities and intelligent tutoring systems to better understand students; data mining, machine learning, and statistics are all employed to this end.
The prediction and prevention of student dropout are very important in terms of continuing education. In other words, we seek not only to identify students at risk of leaving school but also to understand why dropout occurs, which will aid the design of future educational policies. Although many educators have statistically explored relationships between student lifestyle factors and dropout rates, they have focused on ensuring that students graduate rather than on accurately predicting dropout. As a result, there are many studies [3-12] on the features affecting student dropout and on the evaluation of predictive models. However, new features, and the prediction performance achieved when they are incorporated into the models, still require study. In this paper, we identify new features that help prevent students from dropping out and evaluate models that predict dropout candidates so that appropriate measures can be taken.
Data mining extracts important patterns or knowledge from large amounts of data and is used in retailing, finance, telecommunications, education, fraud detection, stock market analysis, and text mining. Here, to predict and prevent student dropout, we developed and evaluated predictive models using various data-mining methods, including logistic regression (LR), a decision tree (DT), a naïve Bayes (NB) method, and a multilayer perceptron (MP). In particular, we focused on features that exert major influences on dropout; we used analysis of variance to this end and created an optimized model via feature selection. As a result, we found that engagement in extracurricular activities is an important feature for preventing students from dropping out. Among the four data-mining methods, the MP performed best in terms of performance metrics such as the F-score and the area under the curve.
The remainder of the paper is organized as follows: in Section 2, we present related work on dropout prevention; in Section 3, we introduce the architecture that we used for model creation and evaluation. Section 4 describes the four methods used to generate predictive models, and Section 5 describes the evaluation of the models. Finally, Section 6 contains our conclusions and future plans.
2. Related Work
In data mining, classification is the problem of determining which category a new observation belongs to. Classification is generally regarded as supervised learning using predefined classes. On the other hand, clustering corresponds to unsupervised learning, which groups objects without prior knowledge of classes. Classification and clustering are data-mining techniques by which data are grouped into several classes, and they have many applications: identifying tumors [13], fingerprinting large numbers of people [14], evaluating employee performance [15], exploring student dropout [16-19], detecting phishing [20], and detecting DoS attacks [21]. Several studies have focused on the causes and features of dropout in terms of prevention, construction of predictive dropout models, and development of dropout prevention systems.
Hoff et al. [3] considered various variables affecting dropout, discussed procedures and tools for the prevention of dropout, and showed examples of early warning systems termed EWIMS (Early Warning Intervention and Monitoring System) [4], DEWS (Dropout Early Warning System) [5], and NDPC-SD (National Dropout Prevention Center for Students with Disabilities) [6]. The variables used to identify dropouts and trigger preventative procedures were attendance, behavior, course performance, race, ethnicity, socioeconomic status, disability status, grade retention, school climate, engagement, and mobility. In this paper, we discovered new variables, such as extracurricular activities, to identify dropouts. Yukselturk et al. [7] investigated the applicability of data-mining techniques (k-nearest neighbors, DT, NB, and neural networks) in predicting dropout among online students. Although the differences did not attain statistical significance, the k-nearest neighbors and DT classifiers were somewhat more sensitive than the other models. Manhaes et al. [8] developed an architecture using educational data mining (EDM) techniques (an NB model, an MP, a support vector machine, and a DT) to predict student dropout. The use of time-varying data aided the prediction of student achievement. The true-positive rate of the NB model was the highest among the four techniques. Guarin et al. [9] evaluated data-mining models featuring an NB approach and a DT to predict dropout among students with low academic achievement. Dropout prediction performance improved when academic data were included. Omoto et al. [10] reviewed institutional research (IR) and Fujitsu trends. IR involves the measurement of many activities via on-campus data collection and analysis, planning of appropriate measures, and implementation and verification of management improvements, student support, and higher-quality educational techniques [10]. Fujitsu developed several statistical methods for the analysis of trends in quality improvement and dropout prevention. Support vector machines were used in data mining. Kuznar and Gams [11] developed the Metis system for the prediction of student dropout and prevention of associated negative consequences. The Metis system uses machine learning algorithms to analyze data from school information systems, identifies students who are likely to drop out, and triggers appropriate action from educational experts. Costa et al. [12] compared the effectiveness of four data-mining techniques (an NB method, a DT, a neural network, and a support vector machine) in predicting the likelihood that students would leave the introductory programming curriculum at a Brazilian public university. The support vector machine performed better than the other techniques.
This study is similar to the related work in that we evaluated predictive models using data-mining methods, as shown in Table 1, and we chose methods generally used for classification in that work. However, our study differs in that we used new features, such as extracurricular activities, and generated a model after selecting the optimal features via LR.
Data-mining methods used for identifying dropouts
3. System Architecture
Our dropout prevention system architecture is divided into five layers (collection, storage, processing, analysis, and visualization), as shown in Fig. 1.
Collection layer: In this layer, data are collected from various systems inside and outside the organization. Our collector uses various interfaces (e.g., REST, HTTPS, SFTP) to connect to databases and files within the school affairs, learning, and library systems, as well as to logs and documents.
Storage layer: In this layer, the collected data are held permanently or temporarily in distributed storage. The HDFS and NoSQL approaches are used to permanently store large files and large amounts of messaging data, respectively. Personal data are anonymized.
Process layer: In this layer, the data are formalized and normalized to render them suitable for analysis. For example, the learning management system processes unstructured log files to generate structured data, such as the numbers and times of logins per student.
Analysis layer: In this layer, new patterns in large datasets are sought and interpreted to provide novel insights. For example, a predictive model is created to detect students who are likely to drop out.
Visualization layer: In this layer, the big data results are presented in an easy-to-understand manner. For example, new student information is entered into the predictive model, and the predicted dropout risk of particular students is displayed.
Fig. 2 illustrates the model creation and deployment process. Here, we selected significant features with the aid of the chi-squared test and analysis of variance (ANOVA) and generated models using four data-mining methods. ANOVA [22] is a statistical method for comparing two or more groups; in this paper, we use it to find features that distinguish the dropout group. This screening is necessary to improve prediction performance and reduce learning time. When model performance exceeded the desired threshold, the best model was deployed, and new student data were input to identify students who might drop out.
Model creation and deployment.
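To make the feature-screening step concrete, the following is a minimal sketch in Python using scikit-learn's ANOVA F-test rather than the Spark pipeline described in Section 5; the DataFrame, its column names, and the records are hypothetical.

```python
# Minimal sketch of ANOVA-based feature screening (hypothetical data
# and column names; scikit-learn rather than the paper's Spark pipeline).
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

students = pd.DataFrame({
    "gpa":           [3.2, 1.8, 2.5, 3.9, 1.2, 2.9],
    "semester":      [4, 2, 6, 8, 1, 3],
    "club_activity": [1, 0, 1, 1, 1, 0],
    "dropout":       [0, 1, 0, 0, 1, 1],  # 1 = dropped out
})

X = students.drop(columns="dropout")
y = students["dropout"]

# The ANOVA F-test scores each feature against the dropout label;
# SelectKBest keeps the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
for name, score in zip(X.columns, selector.scores_):
    print(f"{name}: F = {score:.2f}")
print("selected:", list(X.columns[selector.get_support()]))
```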
4. Classification Methods
We predicted student dropout by analyzing existing data to create a predictive model and then entering new student data into this model. Such analysis creates a classifier that predicts whether dropout occurs (e.g., yes or no). We used LR, a DT, an NB method, and an MP to this end. This section briefly describes each method with examples based on student data; the next section presents experiments in which actual student data are used across a variety of cases.
4.1 Logistic Regression
LR is a well-known classification method for deriving relationships between dependent and independent variables. It resembles linear regression (which explains a dependent variable as a linear combination of independent variables) but differs in that LR uses categorical (discrete) dependent variables rather than numerical (continuous) ones. For example, assuming that student dropout can be categorized into two states (yes or no) by grade point average (GPA), the independent variable is the GPA and the dependent variable is the dropout. When we analyzed the relationship between the GPA and the probability of dropout (Fig. 3), that probability decreased slowly at low GPAs, dropped rapidly over a middle range, and then decreased gradually again at high GPAs. Thus, we performed LR analysis employing the logit (or log-odds ratio) to describe the curve mathematically. The logistic function derived from the logit is:
[TeX:] $$p(X)=\frac{1}{1+e^{-\left(\beta_{0}+\beta_{1} x_{1}+\cdots+\beta_{i} x_{i}\right)}}$$
where the [TeX:] $$\beta$$ values are regression coefficients [TeX:] $$\left(\beta_{0}, \ldots, \beta_{i}\right)$$ and the [TeX:] $$x$$ values are independent variables [TeX:] $$\left(x_{1}, \ldots, x_{i}\right).$$
The odds ratios generated by LR can be used to determine the extent to which independent variables affect dependent variables. Generally, LR analysis is used when the dependent variable falls into two categories; multinomial LR is employed when there are more than two categories, and ordinal LR is used when the dependent variable is sequential.
The relationship between the GPA and the probability of dropout.
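As an illustration, the following minimal sketch fits a logistic curve of dropout probability against GPA, as in Fig. 3; the (GPA, dropout) pairs are made up and are not the paper's data.

```python
# Minimal sketch of LR on (GPA, dropout) pairs (illustrative data only).
import numpy as np
from sklearn.linear_model import LogisticRegression

gpa = np.array([0.5, 1.0, 1.4, 1.9, 2.3, 2.8, 3.2, 3.6, 4.0]).reshape(-1, 1)
dropped_out = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])  # 1 = dropped out

model = LogisticRegression(max_iter=1000).fit(gpa, dropped_out)

# Evaluate the fitted logistic function p(x) = 1 / (1 + exp(-(b0 + b1*x))).
b0, b1 = model.intercept_[0], model.coef_[0][0]
for g in (1.0, 2.0, 3.0):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * g)))
    print(f"GPA {g:.1f}: P(dropout) = {p:.2f}")

# exp(b1) is the odds ratio: how the odds of dropout change per
# one-point increase in GPA.
print("odds ratio per GPA point:", np.exp(b1))
```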
4.2 Decision Tree
A DT is a predictive model linking attributes (independent variables) to a class label (the dependent variable). In a DT, an internal node (that is not a leaf node) is used to test the value of an attribute, and this information is used to branch to another internal or leaf node, ultimately determining a class. A DT learns by induction, employing training data with class labels analyzed with the aid of algorithms such as ID3, C4.5, or CART. The algorithms differ in terms of the attribute selection methods (e.g., information gain, gain ratio, and Gini index) used to identify criteria that divide the training data well. A DT is constructed as follows.
First, an appropriate split criterion and a stopping rule are defined, depending on the purpose of the analysis and the data structure. For a classification tree, the split criterion emphasizes making the child nodes purer than the parent node; records in a high-purity node are more likely to belong to the same category. The split criterion varies depending on the type of dependent variable. For example, when the dependent variable is discrete, the p-value of the chi-squared test, the Gini index, or the entropy index is used; when the dependent variable is continuous, the F statistic or variance reduction is used. Here, we employed the Gini index because the dependent variable (dropout) is discrete. The Gini index is based on a binary split of all attributes. For a discrete-valued attribute, the subset with the lowest Gini index is selected; for continuous attributes, all possible split values are considered. The independent variable with the smallest Gini index forms a branch of the DT. The Gini index is obtained as follows:
[TeX:] $$\operatorname{Gini}(D)=1-\sum_{i=1}^{m} p_{i}^{2}$$
where [TeX:] $$D$$ is the dataset, [TeX:] $$m$$ is the number of classes, and [TeX:] $$p_{i}$$ is the probability that a tuple in [TeX:] $$D$$ belongs to the i-th class. For example, if the two categories are in a ratio of 0.8:0.2, the Gini index is [TeX:] $$1-\left(0.8^{2}+0.2^{2}\right)=0.32.$$ Fig. 4 shows an example of a DT considering only the GPA as an independent variable.
The tree is pruned because non-standard learning data (noise and outliers) may be present and overfitting is possible. The pruned tree is smaller, simpler, and easier to interpret than the original tree.
The DT is evaluated against the test data using a cross-validation method.
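For concreteness, the following minimal sketch reproduces the Gini arithmetic above and fits a small Gini-based tree with scikit-learn; the two features and the records are hypothetical.

```python
# Minimal sketch: Gini index and a Gini-based DT (illustrative data).
from sklearn.tree import DecisionTreeClassifier, export_text

def gini(p):
    """Gini(D) = 1 - sum_i p_i^2 for class proportions p."""
    return 1.0 - sum(pi ** 2 for pi in p)

print(gini([0.8, 0.2]))  # 0.32, matching the worked example above

# Hypothetical features: [GPA, semesters completed]; 1 = dropped out.
X = [[1.2, 1], [1.8, 2], [2.1, 3], [2.9, 4], [3.4, 5], [3.8, 6]]
y = [1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["gpa", "semesters"]))
```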
4.3 The NB Approach
The NB approach is based on Bayes’ theorem:
[TeX:] $$P(H \mid X)=\frac{P(X \mid H) P(H)}{P(X)}$$
where H and X are events, [TeX:] $$P(H \mid X)$$ is the posterior probability of H given that X is true, P(H) is the prior probability of H, [TeX:] $$P(X \mid H)$$ is the likelihood (the probability of X given that H is true), and P(X) is the prior probability of X. The NB approach assumes that the value of any variable is independent of the values of the other variables (i.e., there is no dependency among features). For example, the independent variables that affect student dropout (e.g., the number of terms or the GPA) are assumed to be mutually independent; each is assumed to contribute independently to the probability that a student will drop out. An NB classifier is generated as follows:
If a tuple X with a class label (a dependent variable indicating whether or not dropout occurs) is composed of n attributes (independent variables), that tuple is expressed as a vector [TeX:] $$X=\left(x_{1}, x_{2}, \ldots\right.\left.x_{n}\right).$$ We need to find the class label with maximum [TeX:] $$P\left(X \mid C_{i}\right) P\left(C_{i}\right)(i=1,2)$$ for two labels: [TeX:] $$C_{1}$$ (enrolled students) and [TeX:] $$\mathrm{C}_{2}$$(expelled students).
When the elements of the vector X are continuous, a continuous-valued attribute is assumed to follow a Gaussian distribution with mean [TeX:] $$\mu$$ and standard deviation [TeX:] $$\sigma,$$ whose probability density is:
[TeX:] $$g(x, \mu, \sigma)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(x-\mu)^{2}}{2 \sigma^{2}}}$$
Therefore, [TeX:] $$P\left(x_{k} \mid C_{i}\right)$$ can be written as:
[TeX:] $$P\left(x_{k} \mid C_{i}\right)=g\left(x_{k}, \mu_{C_{i}}, \sigma_{C_{i}}\right)$$
where [TeX:] $$\mu_{C_{i}}$$ and [TeX:] $$\sigma_{C_{i}}$$ are the mean and standard deviation of attribute [TeX:] $$x_{k}$$ for the tuples of class [TeX:] $$C_{i}.$$ The tuple X is assigned the class label [TeX:] $$C_{1}$$ when it satisfies the following condition:
[TeX:] $$P\left(X \mid C_{1}\right) P\left(C_{1}\right)>P\left(X \mid C_{2}\right) P\left(C_{2}\right)$$
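A minimal sketch of such a classifier follows, using scikit-learn's GaussianNB, which estimates the per-class means and standard deviations used in g(x_k, μ, σ); the two attributes and all records are hypothetical.

```python
# Minimal sketch of Gaussian NB classification (illustrative data).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical attributes: [GPA, semester]; class 1 = expelled (dropout).
X = np.array([[1.1, 1], [1.7, 2], [2.2, 2], [3.0, 5], [3.5, 6], [3.9, 7]])
y = np.array([1, 1, 1, 0, 0, 0])

nb = GaussianNB().fit(X, y)

# GaussianNB stores a per-class mean for each attribute, i.e., the
# mu_{C_i} used in P(x_k | C_i) = g(x_k, mu_{C_i}, sigma_{C_i}).
print("per-class means:\n", nb.theta_)

# A new student gets the class with the larger P(X | C_i) P(C_i).
new_student = np.array([[1.5, 2]])
print("P(enrolled), P(dropout):", nb.predict_proba(new_student))
print("predicted class:", nb.predict(new_student))
```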
4.4 The MP
An MP models the learning behavior of neurons in the human brain. Several hidden layers lie between an input and an output layer, enabling the identification of data that are not linearly separable. Recently, artificial neural networks featuring MPs have been termed deep neural networks, and training such networks is referred to as “deep learning.” The value of an independent variable is input at an input layer node; usually, the numbers of independent variables and input nodes are identical. The outputs of the input layer are used as inputs to the hidden layer nodes. To calculate the outputs of a hidden layer, a weighted sum of inputs is computed and passed to an activation function that may be linear, exponential, or sigmoid. Here, we used a sigmoid logistic function. The MP is constructed as follows:
Initialization is performed by assigning connection weights to arbitrary values, calculating the inputs to each layer for a set of learning data (e.g., GPA, terms, etc.), and, finally, calculating outputs employing the activation function.
After comparing the outputs to the expected values, the connection weight is adjusted via backpropagation to ensure that outputs lie within the error limit.
This procedure is repeated using other learning data and terminated when the differences between the outputs and the target values are within the acceptable error range.
Here, we used an input layer in which the numbers of nodes and independent variables were equal; the network thus comprised 14 input nodes, two hidden layers, and one output layer node (Fig. 5).
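A minimal sketch of this architecture follows, using scikit-learn's MLPClassifier with a sigmoid activation; the hidden-layer widths are assumptions (the paper states only that there were two hidden layers), and the training data are random stand-ins.

```python
# Minimal sketch of the MP: 14 inputs, two hidden layers, one output.
# Hidden-layer widths are assumed; training data are random stand-ins.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))            # 14 independent variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in dropout label

mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10),  # two hidden layers (widths assumed)
    activation="logistic",        # sigmoid activation, as in the paper
    max_iter=2000,
    random_state=0,
).fit(X, y)

print("training accuracy:", mlp.score(X, y))
```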
Although the MP usually shows better predictive performance than LR, the extent to which each input affects the output is difficult to determine. Therefore, we used LR to identify variables that should be considered preferentially. In addition, among the four methods, we selected the one with the best actual predictive performance.
5. Experiments
5.1 Experimental Environment
We implemented a collector based on Sqoop (https://sqoop.apache.org) for the collection layer in our system architecture. Using the collector, we collected data from our university system and stored it in HBase (https://hbase.apache.org) for the storage layer. The data size is several hundred megabytes, and the number of samples is about fourteen thousand; the two categories, enrolled and expelled students, are in a ratio of 8:2. Table 2 shows the variables affecting dropout in our experiment. We selected thirteen independent variables out of a total of 155 variables. Next, we used Spark (https://spark.apache.org) for the process layer to clean the dependent and independent variables. The cleaned data were randomly divided into training and test sets at a ratio of 7:3. Finally, using Spark for the analysis layer, we created and evaluated the LR, DT, NB, and MP models.
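The following minimal sketch shows the 7:3 split on a synthetic stand-in for the dataset (about fourteen thousand samples with an 8:2 class ratio); scikit-learn is used here for self-containment, although our pipeline performed this step in Spark.

```python
# Minimal sketch of the 7:3 train/test split (synthetic stand-in data;
# the paper's pipeline performed this step in Spark).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(14000, 13))           # ~14,000 students, 13 features
y = (rng.random(14000) < 0.2).astype(int)  # ~8:2 enrolled:expelled ratio

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y  # preserve class ratio
)
print(len(X_train), len(X_test))  # 9800 4200
```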
Variables and data cleaning
As shown in Table 2, we used the independent variables from five cases to explore how the evaluations changed according to the characteristics and numbers of independent variables employed for dropout prediction.
Case #1: GPA
Case #2: Case #1 + {age, semester, sex}
Case #3: Case #2 + {engagement in club activities}
Case #4: Case #3 + {nationality, parental address, extracurricular activity score, number of volunteer activities, number of surveys completed evaluating satisfaction with extracurricular activities, number of surveys completed evaluating satisfaction with the department, number of consultations, and engagement in freshman camp activities}
Case #5: Case #3 + {extracurricular activity score, number of surveys completed evaluating satisfaction with extracurricular activities, number of surveys completed evaluating satisfaction with the department, number of consultations, and engagement in freshman camp activities}
Case #5 was derived from Case #4 (i.e., the full feature set) by excluding the independent variables (i.e., nationality, parental address, and number of volunteer activities) that did not significantly affect dropout, as revealed by ANOVA (Table 3). ANOVA indicates whether including an independent variable improves model performance, based on the difference between the null and residual deviance. We found that engagement in extracurricular activities significantly reduced dropout.
5.2 Experimental Results
We used the accuracy, precision, recall, F-score, and area under the ROC curve (AUC) to evaluate the four models:
[TeX:] $$\text {Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \quad \text {Precision}=\frac{TP}{TP+FP}, \quad \text {Recall}=\frac{TP}{TP+FN}, \quad \text {F-score}=\frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision}+\text {Recall}}$$
In particular, we emphasized the F-score and AUC because the class ratio of the experimental data is imbalanced.
To facilitate comprehension of the above equations, Table 4 shows the confusion matrix. The true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are defined as follows.
TP: The model predicted that students dropped out, and they did in fact drop out.
TN: The model predicted that students did not drop out, and this was in fact the case.
FP: The model predicted that students dropped out, but they did not drop out (type I error).
FN: The model predicted that students would not drop out, but they did drop out (type II error).
A predictive model that produces only TPs and TNs matches every ground truth and has an accuracy of 1. In this sense, precision measures the extent of type I error, and recall measures the extent of type II error.
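As a worked illustration of these definitions, the following sketch computes the five metrics with scikit-learn on hypothetical predictions and scores.

```python
# Minimal sketch of the five evaluation metrics (hypothetical labels;
# 1 = dropped out). Two TPs, one FP, one FN, six TNs.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred  = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.2, 0.6, 0.1, 0.3]

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/all = 0.8
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print("recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
print("F-score  :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```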
Fig. 6 compares the four methods in terms of accuracy. The NB model was less accurate than the other models; the application of the MP method to Case #5 yielded the greatest accuracy (i.e., 0.95). The accuracies of the LR and MP methods increased as more independent variables were added, but the accuracies of the DT and NB methods did not.
Comparison of the four methods in terms of accuracy.
Fig. 7 compares the four methods in terms of precision. The NB method was less precise than the other methods; the application of the DT method to Case #5 yielded the greatest precision (i.e., 0.91). Thus, the use of the NB method resulted in more type I errors than did the use of other methods. The precisions of the LR, DT, and MP methods increased as more independent variables were added, whereas the precision of the NB method did not. A weakness of the NB method is that dependence between independent variables degrades its predictive performance; in this experiment, dependencies between independent variables such as (age, semester) and (engagement in club activities, extracurricular activity score) appear to have hurt the performance of the NB method.
Comparison of the four methods in terms of precision.
Fig. 8 compares the four methods in terms of recall. The degree of recall was lower for the NB method than for the other methods for Cases #1, #2, #3, and #4; however, the degree of recall for the NB method for Case #5 was the highest (i.e., 0.92). Thus, type II errors created using the NB method were significantly reduced by optimizing variable selection (i.e., from Case #4 to Case #5). The degrees of recall of the LR, NB, and MP methods increased as more independent variables were added, whereas the degree of recall of the DT method did not.
Comparison of the four methods in terms of recall.
Fig. 9 compares the four methods in terms of the F-score. The F-score was lower for the NB method than for the other methods. As shown in Fig. 9, the highest F-score was 0.87, obtained by using the MP method to analyze Case #5. Fig. 10 compares the four methods in terms of AUCs. As shown in Fig. 10, the highest AUC was 0.98, obtained when the MP method was used to analyze Case #5. Thus, the predictive model generated by analyzing the independent variables of Case #5 via the MP method showed the best performance. However, the results of the MP method do not differ much from those of the LR method. The LR method is similar to a one-layer neural network and divides the pattern space linearly into two regions, whereas the MP method, the two-layer neural network used in this paper, divides the pattern space into convex regions and is therefore theoretically more expressive than the LR method. We leave the question of whether the MP method can yield substantially better results than the LR method to future studies.
Comparison of the four methods in terms of the F-score.
Comparison of the four methods in terms of the AUC.
6. Conclusions and Future Work
Here, we used LR, a DT, an NB model, and an MP to create predictive models that might provide information for the prevention of student dropout. The predictive model built with the MP method on independent variables selected with the aid of variance analysis showed the best performance (an F-score of 0.87 and an AUC of 0.98).
We will improve the performance of the MP model and apply the optimized model to our school management system to better prevent dropout. We will counsel students who are at risk (as revealed by data analysis), and establish a data-driven campus management plan embracing student guidance, the living environment, and campus activities.
Acknowledgement
This work was supported by research grants from Daegu Catholic University in 2017.