1. Introduction
Many university students struggle with academic pressure, interpersonal relationships, and uncertainty about their future, which manifest as mental health problems, study anxiety, and poor study habits, among others. These issues affect students' growth and development [1, 2].
There is a trend of using machine-learning algorithms for analysis and prediction in the context of early warnings for students in higher education. In addition to the support vector machine (SVM) algorithm, other machine learning algorithms such as random forest and backpropagation neural networks have been applied in this field [2–5]. However, in practical applications, the results of different algorithms may vary, and there are some limitations in existing research [6, 7].
The motivation in this study was to address the problems faced by college students, whether in terms of academic, mental health, or social issues [8], by using a combination of SVM and genetic optimization algorithms to improve the accuracy of student warnings. Unlike previous studies, this method combines multiple results to predict the status of students more comprehensively. The unique feature of this study is that it not only focuses on academic and mental health issues but also on social issues, to help better support the students' development.
2. Related Studies
2.1 Support Vector Machine Principle
In this study, we chose the most widely used radial basis kernel function, which is expressed as

[TeX:] $$K\left(x_i, x_j\right)=\exp \left(-g\left\|x_i-x_j\right\|^2\right),$$

where g is the kernel parameter, representing the width of the radial basis kernel function, and C is the penalty factor of the SVM. Therefore, to improve the accuracy of the machine learning warning model in this study, these parameters should be chosen appropriately.
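To make the role of the kernel parameter g concrete, the following is a minimal sketch of the radial basis kernel computation; the function name `rbf_kernel` is illustrative, not from the paper.

```python
import math

def rbf_kernel(x, z, g):
    """Radial basis kernel K(x, z) = exp(-g * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-g * sq_dist)
```

Larger g narrows the kernel's region of influence: identical inputs always give 1.0, and the value decays toward 0 as the points move apart or as g grows.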
2.2 Support Vector Machine Optimization Model based on Genetic Algorithm
In this study, genetic algorithms were used to select suitable parameters, whereby the parameter selection process was optimized to build a more accurate machine learning warning classification model. By combining traditional SVMs and genetic algorithms, this study proposes an early warning learning model based on improved SVMs, with the objective function as follows:

[TeX:] $$\min \frac{1}{l} \sum_{i=1}^{l}\left(y_i-f\left(x_i\right)\right)^2,$$

where l represents the number of samples, [TeX:] $$y_i$$ represents the actual value, and [TeX:] $$f\left(x_i\right)$$ represents the predicted value.
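Reading the objective as a mean squared error over the l samples, it can be sketched as follows; the function name `warning_objective` is an illustrative assumption.

```python
def warning_objective(y_true, y_pred):
    """Mean squared error over l samples: (1/l) * sum_i (y_i - f(x_i))^2."""
    l = len(y_true)
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / l
```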
2.3 Existing Research
According to existing studies [9, 10], supervised learning is the most commonly used data mining technique for problems involving the classification of mental health problems. The most commonly used algorithm is the SVM, followed by decision trees and neural networks. All three models have a high degree of accuracy, in excess of 70%, and good generalization ability, which helps prevent overfitting.
After testing, the genetic algorithm-optimized support vector machine (GA-SVM) performed best in the training and testing scenarios on the same dataset. The results showed that genetic algorithms can effectively search the hyperparameter space of an SVM and determine the optimal hyperparameter configuration, thereby improving the performance of the classifier. In addition, the random nature of genetic algorithms enables them to escape local optima, making them more likely to find the global optimum and enhance the generalization ability of the model.
3. GA-SVM Early Warning Forecasting for Students in Higher Education
Given the abstract, nonlinear, and categorical nature of the student warning problem and the small sample size, an SVM algorithm was used to solve the classification problem. SVMs have many advantages, such as being relatively insensitive to sample size and not being prone to overfitting. Therefore, this study used an SVM algorithm for model training.
3.1 Characteristics of Early Warning Models for Students in Higher Education
This study relied on publicly available open-source datasets of student mental health, and a combination of extensive literature research, student interviews and teacher recommendations was used to screen for the characteristics of learning crises. After grouping these characteristics into mental health, academic, and social components, the characteristics of these three components were combined.
In accordance with the principle of “considering the causes, capturing the key elements, reducing the cost of prediction, and facilitating problem solving,” each indicator element was refined to an easily measurable level. This downgrading process made the final indicator element more operational, laying the foundation for subsequent information collection. After several screenings, 16 key characteristics of crisis generation that affect learning were extracted, as shown in Table 1. The data for these indicators can be easily obtained from open-source datasets.
Table 1. Features of a student warning model
Fig. 1. Implementation flowchart of the GA-SVM student warning prediction model.
3.2 Learning Early Warning Model based on GA-SVM
According to actual needs, the input and output variables of the model (the presence of a student warning) were determined, and the data of 4,700 current junior students were exported through each system, processed, and merged. The optimal hyperparameters were confirmed using a genetic algorithm, and the SVM was trained using the results of the best search. The model was trained on the training set; its reliability was verified using the validation set; and finally, the test set (20% of the data) was used for comparisons with other models. The implementation of the GA-SVM predictive student-warning model is shown in Fig. 1.
3.2.1 Data selection and pre-processing
Input and output variables: The 16 impact factors mentioned above were chosen as input variables, with yes and no constituting the output variables. Yes means that the student has an early warning situation; that is, they may have psychological, academic, or social problems. No implies that there are no such problems. In the dataset, records that lacked factors, such as null sleep quality, were deleted. After data removal, the cumulative grade point average (CGPA) parameters were classified as follows: CGPA above 3.0 as A; between 2.5 and 3.0 as B; between 2.0 and 2.5 as C; and below 2.0 as D. In this way, the characteristics were converted into numeric types. Finally, all datasets were subjected to one-hot encoding, which converted all categorical variables into vector form.
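The CGPA banding and one-hot encoding described above can be sketched as follows. The function names and the handling of exact boundary values (e.g. a CGPA of exactly 3.0) are assumptions, as the paper does not specify them.

```python
def cgpa_band(cgpa):
    # Banding per the paper: >3.0 -> A, 2.5-3.0 -> B, 2.0-2.5 -> C, <2.0 -> D.
    # Boundary values are assigned to the lower band here (an assumption).
    if cgpa > 3.0:
        return "A"
    if cgpa > 2.5:
        return "B"
    if cgpa > 2.0:
        return "C"
    return "D"

def one_hot(value, categories):
    # Convert a categorical value into a 0/1 indicator vector.
    return [1 if value == c else 0 for c in categories]
```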
Training and test data selection: To run the model more easily and ensure its accuracy, the samples were first screened to remove those with unusable or missing data. Subsequently, through cross-validation, 80% of the data were used as the training set (of which 20% were selected as the validation set), and the remaining 20% were used as the test set, as shown in Fig. 2. This helped improve the reliability of the model.
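The split above can be sketched as a simple shuffled partition; the function name and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random

def split_samples(samples, seed=42):
    """80/20 train/test split; 20% of the training portion held out for validation."""
    rng = random.Random(seed)
    data = list(samples)
    rng.shuffle(data)
    n_test = int(len(data) * 0.2)
    test, rest = data[:n_test], data[n_test:]
    n_val = int(len(rest) * 0.2)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

For 100 samples this yields 64 training, 16 validation, and 20 test samples.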
3.2.2 Model parameter settings
Fitness function: In our implementation of the genetic algorithm, the fitness function was used to evaluate the performance of each individual. Specifically, we employed a fitness function based on classification accuracy; that is, the fitness of each individual is calculated from its accuracy on the classification task:

[TeX:] $$\text {fitness}=\frac{N_{\text {correct}}}{N_{\text {total}}},$$

where [TeX:] $$N_{\text {correct}}$$ is the number of correctly classified samples and [TeX:] $$N_{\text {total}}$$ is the total number of samples. This ensures that individuals with higher fitness perform better on the classification task.
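Accuracy-as-fitness amounts to counting correct predictions, as in this minimal sketch (function name is illustrative):

```python
def fitness(y_true, y_pred):
    """Classification accuracy used as the GA fitness value."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```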
Encoding: In our genetic algorithm, each individual was encoded using a real-number encoding method. This means that the chromosome of each individual consists of a sequence of real numbers, each representing a different hyperparameter, such as the penalty parameter C and the kernel function parameter γ of the SVM. This encoding method not only enhances the precision of the parameter search but also makes the algorithm more flexible and efficient.
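Under this encoding, a population can be initialized by sampling each gene uniformly within its bounds; the function name and parameter bounds below are illustrative assumptions.

```python
import random

def init_population(pop_size, bounds, seed=0):
    """Real-coded chromosomes, e.g. bounds = [(C_min, C_max), (gamma_min, gamma_max)]."""
    rng = random.Random(seed)
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
```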
Parameters of the crossover operation: This study adopted a uniform crossover strategy. In this process, two parent individuals are selected, and genes from the parents are randomly exchanged to generate new offspring individuals. We set the crossover probability to 0.6, meaning that each gene had a 60% chance of being selected from one parent and a 40% chance from the other. This method has the advantage of maintaining diversity within the population while also promoting the transfer of useful traits.
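The uniform crossover described above can be sketched gene by gene; the function name and signature are illustrative.

```python
import random

def uniform_crossover(parent1, parent2, p_first=0.6, rng=None):
    """Each gene is taken from parent1 with probability p_first, else from parent2."""
    rng = rng or random.Random()
    return [a if rng.random() < p_first else b
            for a, b in zip(parent1, parent2)]
```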
3.2.3 Model prediction performance evaluation criteria
The performance of the model was evaluated using accuracy, precision, recall, and F1-score. These metrics better reflect the performance of a classification model because they account for false positives and false negatives in addition to overall classification accuracy.
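All four metrics derive from the confusion-matrix counts, as in this self-contained sketch (function name is illustrative):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 computed from TP/FP/FN/TN counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```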
4. Experimental Cases
4.1 Analysis of Forecast Results
In this study, Python was used to optimize the SVM with the genetic algorithm, which was run over several iterations to obtain the best parameters. This process is illustrated in Fig. 3.
Fig. 3. Fitness curve for genetic algorithm optimization.
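The overall optimization loop can be sketched as a minimal real-coded genetic algorithm. Everything here is an illustrative assumption: the paper does not give its selection or mutation scheme, and the toy fitness function below stands in for the cross-validated SVM accuracy that the real model would use.

```python
import random

def genetic_search(fitness_fn, bounds, pop_size=20, generations=30,
                   crossover_p=0.6, mutation_p=0.1, seed=0):
    """Minimal real-coded GA: tournament selection, uniform crossover, resample mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]

    def pick():
        # Binary tournament: keep the fitter of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if fitness_fn(a) >= fitness_fn(b) else b

    best = max(pop, key=fitness_fn)
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < crossover_p:
                # Uniform crossover: each gene drawn from either parent.
                child = [x if rng.random() < 0.5 else y for x, y in zip(p1, p2)]
            else:
                child = p1[:]
            # Mutation: resample a gene uniformly within its bounds.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_p:
                    child[i] = rng.uniform(lo, hi)
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness_fn)
    return best

# Toy surrogate for cross-validated SVM accuracy: peaks at C = 10, gamma = 0.5.
def toy_fitness(chromosome):
    c, gamma = chromosome
    return -((c - 10.0) ** 2) - ((gamma - 0.5) ** 2)

best = genetic_search(toy_fitness, [(0.0, 100.0), (0.0, 1.0)])
```

In the real model, `fitness_fn` would train an SVM with the chromosome's (C, γ) and return its validation accuracy.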
For binary classification problems, the confusion matrix provides a more intuitive picture of the model's classification performance. In the confusion matrix diagram, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are represented by different colors or patterns. Through confusion matrix plots, the model's performance in each category can be visualized in relation to error type, and the plot comparing predicted and true values provides a visual representation of the predictions. The results are shown in Figs. 4 and 5.
Fig. 4. GA-SVM confusion matrix plot.
Fig. 5. GA-SVM receiver operating characteristic curve.
4.2 Verifying the Validity of the Prediction Results
For a fair and effective comparison, this study employed commonly used hyperparameter settings to train the random forest, multilayer perceptron (MLP), extreme gradient boosting (XGBoost), decision tree, and k-nearest neighbors (KNN) models. That is, the hyperparameters of these models were not specifically optimized, ensuring that the comparison results more accurately reflect the performance of each model under a standard configuration. This approach not only ensures the simplicity and reproducibility of the experimental design but also provides a balanced benchmark for assessing the effectiveness of the genetically optimized SVM learning early warning model in comparison with other standard machine learning methods. All models were trained on the same data samples and evaluated on the same test set, and compared in terms of accuracy, precision, recall, and F1-score. The scores of all models are shown in Table 2, and the model comparisons are shown in Figs. 6 and 7.
Table 2. Performance of different models
Fig. 6. Comparison of different models: (a) accuracy, (b) precision, (c) recall, and (d) F1-score.
Compared with the other models, the genetic algorithm-optimized SVM predicted students' learning status best, followed by XGBoost, while KNN performed worst. Although the small sample size may introduce some uncertainty and error into the prediction accuracy, the predictions of the genetic algorithm-optimized SVM provide a useful reference for judging students' learning status, based on the above comparisons.
Fig. 7. Performance comparison of models.
5. Concluding Remarks
In summary:
1) Using genetic algorithms can effectively solve the hyperparameter selection problem and provide randomness.
2) The comparison of the different prediction models shows that the SVM learning warning model based on genetic algorithm optimization has higher prediction accuracy and smaller error. Thus, it can be used to determine the learning status of students, which has certain application value in predicting student status.
3) This study further emphasizes the significance of optimization algorithms, particularly in the context of hyperparameter selection. Our findings demonstrate that the use of genetic algorithms not only addresses the challenges associated with selecting appropriate hyperparameters but also introduces an element of randomness that is crucial in navigating the complex landscape of parameter tuning. This approach has proven to be particularly effective in enhancing the performance and reliability of machine learning models.
In addition, future research may focus on the following areas.
· How to identify student problems more precisely, rather than dichotomizing them into yes and no.
· How to choose more general factors for prediction so that the model can be generalized.
· How to develop more efficient optimization algorithms to reduce optimization time. Genetic algorithms require a long optimization time, sacrificing time in exchange for performance.