1. Introduction
As a pillar industry, the automobile industry is a powerful driving force for the development of the national economy. In recent years, with the prosperity of the automobile industry, the automobile credit business has also developed rapidly. It is estimated that the penetration rate of the automobile credit market reached 43% in 2018 [1], and that automobile credit customers are mainly the post-90s generation, showing a clear trend toward younger borrowers. However, with the development of the automobile credit business, the problem of default in automobile credit has gradually emerged.
Regarding the causes of automobile credit default, domestic and foreign scholars mainly focus on two aspects. From the perspective of the external economic environment, Wack [2] points out that the more subprime mortgage borrowers there are, the more credit defaults there will be. From the perspective of loan-related characteristics, Lim and Yeok [3] single out the term of service, the existing relationship with banks, the available guarantors, and the interest rate as factors that may be related to credit default. From the perspective of automobile sales price, Liu and Xu [4] state that different price ranges have a significant impact on loan default. From the borrower's point of view, Li and Ren [5] state that the logistic model has the highest accuracy in predicting default. Based on a large-sample data model, Shu and Yang [6] likewise point out that the logistic model has the highest accuracy in predicting default. Liu [7] constructed a practical default discrimination model by combining random forest and logistic regression. Walks [8] built an ordinary least squares (OLS) model to study consumer credit and household income.
Using data on commercial automobile credit in India, Agrawal et al. [9] set up a binary logistic regression prediction model. Overall, domestic and foreign scholars have done relatively little research on prediction models for automobile credit default, and the models used are fairly uniform, mostly logistic regression. However, many models are available for prediction: binary classifiers based on machine learning and deep learning [10], Bayesian classifiers designed to improve the accuracy of default probability estimation [11], default prediction models based on support vector machine (SVM) theory [12], topic prediction models based on random forest theory [13], and data block category prediction models based on k-nearest neighbor (KNN) theory [14]. Among the many methods for building prediction models, SVM can fit the data well, and random forest can analyze feature importance. Therefore, in this paper, we use random forest to select the model features, and the data after dimension reduction are fed into the SVM prediction model for analysis. Furthermore, this paper compares the predictive results of SVM with those of random forest, KNN, logistic regression, decision tree, and artificial neural network (ANN) classifiers.
This paper mainly studies prediction models for automobile credit default. First, data preprocessing is carried out to delete features that are not related to the label variable. The data are then processed by feature engineering, and the importance of the features is analyzed by random forest feature selection. Second, the data are randomly divided into groups of equal size, and SVM, random forest, KNN, logistic, decision tree, and ANN prediction models are applied to the grouped data; the mean values of the performance indicators are obtained for each model, and the prediction accuracy of the six models is compared. Finally, the classifier with the best prediction performance is used to help auto finance companies decide whether to grant a loan in light of possible customer default.
2. Basic Algorithms
2.1 Support Vector Machine
SVM is a supervised learning method based on statistical learning theory, proposed by Vapnik [15]. Unlike the traditional logistic regression method, SVM can process high-dimensional data and alleviate the curse of dimensionality. At the same time, SVM avoids falling into local minima. SVM has good robustness and has been successfully applied in pattern recognition, signal processing, and image analysis. SVM classifies high-dimensional data by the kernel method. The basic idea is to map the data samples from the original feature space into a higher-dimensional feature space through a nonlinear function; the original feature space is shown in Fig. 1.
Fig. 2 shows the distribution of the features projected into the high-dimensional space. An optimal classification hyperplane is then sought in the high-dimensional feature space, so that the nonlinearly separable problem is transformed into a linearly separable one.
By mapping the original data samples into the high-dimensional space through the kernel function, SVM finds an optimal classification hyperplane [TeX:] $$w^{T} \Phi(x)+b=0,$$ as shown in Fig. 3.
The original feature space.
The higher dimensional feature space H.
Classification hyperplane of support vector machine.
The upper boundary of the margin is [TeX:] $$w^{T} \Phi(x)+b=1,$$ and the sample points in the region above it are positive; the lower boundary is [TeX:] $$w^{T} \Phi(x)+b=-1,$$ and the sample points in the region below it are negative. The maximum margin between the two boundaries is [TeX:] $$\gamma=\frac{2}{\|w\|}.$$ For convenience of calculation, the classification problem of SVM can be reduced to a constrained optimization problem.
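In the standard dual formulation, which involves exactly the inner products discussed next, the problem over the Lagrange multipliers [TeX:] $$\alpha_{i}$$ for training samples [TeX:] $$\left(x_{i}, y_{i}\right), i=1, \ldots, N,$$ is:
[TeX:] $$\max _{\alpha} \sum_{i=1}^{N} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} \Phi\left(x_{i}\right)^{T} \Phi\left(x_{j}\right), \quad \text { s.t. } \sum_{i=1}^{N} \alpha_{i} y_{i}=0, \quad \alpha_{i} \geq 0 .$$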
Since the above formula involves the calculation of [TeX:] $$\Phi\left(x_{i}\right)^{T} \Phi\left(x_{j}\right),$$ that is, the inner product of the mapped samples after the original data are projected into the high-dimensional feature space, and the dimension of that feature space is very high, a kernel function is introduced to replace the inner product operation of the mapping function.
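In its standard form, the kernel is defined as [TeX:] $$K\left(x_{i}, x_{j}\right)=\Phi\left(x_{i}\right)^{T} \Phi\left(x_{j}\right).$$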
Substituting this kernel function into the dual problem yields the kernelized formulation.
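In its standard form it reads:
[TeX:] $$\max _{\alpha} \sum_{i=1}^{N} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} K\left(x_{i}, x_{j}\right), \quad \text { s.t. } \sum_{i=1}^{N} \alpha_{i} y_{i}=0, \quad \alpha_{i} \geq 0 .$$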
The optimal solution can be obtained by solving the above formula. In practice, different kernels lead to SVM classifiers with different accuracy, so the model must be built several times and the kernel chosen according to the actual situation. The kernels commonly used in research are listed below; their standard forms are given after the list.
- Linear kernel
- Polynomial kernel
- RBF kernel
- Sigmoid kernel
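Their standard textbook forms, with kernel parameters c, d, σ, κ, and θ, are:
[TeX:] $$K\left(x_{i}, x_{j}\right)=x_{i}^{T} x_{j} \quad(\text { linear }),$$
[TeX:] $$K\left(x_{i}, x_{j}\right)=\left(x_{i}^{T} x_{j}+c\right)^{d} \quad(\text { polynomial }),$$
[TeX:] $$K\left(x_{i}, x_{j}\right)=\exp \left(-\frac{\left\|x_{i}-x_{j}\right\|^{2}}{2 \sigma^{2}}\right) \quad(\mathrm{RBF}),$$
[TeX:] $$K\left(x_{i}, x_{j}\right)=\tanh \left(\kappa x_{i}^{T} x_{j}+\theta\right) \quad(\text { sigmoid }).$$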
When these four kernels are applied in different fields, the prediction accuracy of the SVM classifier differs, and the choice of kernel depends on the actual situation.
2.2 Random Forest
Random forest is a machine learning method that combines multiple decision trees for prediction. Its output category is determined by the mode of the output categories of the individual decision trees. The random forest algorithm proceeds as follows:
(1) From the original training samples, the bootstrap method is used to randomly draw s samples with replacement, forming a new training sample set from which s classification decision trees are constructed. The remaining samples that are not drawn are called out-of-bag data and are used for prediction and evaluation.
(2) Assuming that the total number of sample features is n, m features [TeX:] $$(m<n)$$ are randomly selected at the splitting nodes of each decision tree, and the information contained in each feature is calculated. Then the optimal feature is selected among the m features for splitting.
(3) Each decision tree is grown to its full depth without pruning.
(4) All decision trees are combined to form a random forest, and the predicted results are output by the random forest.
Random forests are widely used in pattern recognition [16], feature selection [17,18], and classification prediction [19]. They can process high-dimensional data, run quickly, and do not require a separate feature selection step during training. At the end of training, the importance of each feature is given directly, and the features can be ranked accordingly, which makes random forest a good method for feature selection.
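As an illustrative sketch (not the authors' implementation), a random forest can be trained and its feature importances read out with scikit-learn roughly as follows; the synthetic data and parameter values are assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data as a stand-in; the real credit data are introduced in Section 3.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print("out-of-bag score:", rf.oob_score_)          # evaluation on the out-of-bag data
order = np.argsort(rf.feature_importances_)[::-1]  # features ranked by importance
print("feature ranking (most important first):", order)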
2.3 K-Nearest Neighbor
KNN is a basic machine learning method in data mining. It does not need model training and parameter estimation. Compared with other machine learning methods, the KNN classification algorithm is simple and easy to implement. It has three basic elements: the selection of K, the distance calculation, and the classification rule. The core idea is that, among the K most similar samples, if most of the samples belong to a certain category, then the sample to be predicted also belongs to that category; in short, "the minority is subordinate to the majority." The steps of the KNN classification algorithm are as follows:
(1) Normalize the data and carry out any other preprocessing.
(2) Calculate the distance between the training samples and the test sample. Common distance measures include the Manhattan distance and the Euclidean distance; the Euclidean distance is used most often.
(3) Sort by distance and select the class labels of the K training samples nearest to the test sample.
(4) Calculate the frequency of each category among the K class labels, and take the category with the highest frequency as the category of the test sample.
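As a hedged sketch (not the authors' code), the steps above map onto scikit-learn's KNeighborsClassifier roughly as follows; the value K=5, the Euclidean metric, and the synthetic data are assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Step (1): normalization; steps (2)-(4) (distance, ranking, majority vote) are handled inside KNN.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(scaler.transform(X_train), y_train)
print("test accuracy:", knn.score(scaler.transform(X_test), y_test))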
2.4 Logistic
The logistic model evolved from linear regression. It can be used not only for classification prediction but also for regression analysis. In most cases, the logistic model is used for binary prediction. The sigmoid function, the loss function, and the stochastic gradient algorithm are the core components of the logistic regression model. The sigmoid function maps a variable to a value between 0 and 1. The loss function measures the gap between the predicted value and the true value; the smaller the gap, the better the model performs. The stochastic gradient algorithm is used to optimize the loss function. Each of these three parts plays its own role. When the logistic model deals with a binary classification problem, if the output value is close to 0, the sample is predicted to belong to the first category; if the value is close to 1, the sample is predicted to belong to the second category.
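In their standard forms, the sigmoid function and the corresponding cross-entropy loss over N samples with true labels [TeX:] $$y_{i}$$ and predictions [TeX:] $$\hat{y}_{i}$$ are:
[TeX:] $$\sigma(z)=\frac{1}{1+e^{-z}}, \quad L=-\frac{1}{N} \sum_{i=1}^{N}\left[y_{i} \log \hat{y}_{i}+\left(1-y_{i}\right) \log \left(1-\hat{y}_{i}\right)\right].$$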
2.5 Decision Tree
Decision tree is a common machine learning classification method, and its generation algorithms include ID3, C4.5, and CART. Different algorithms follow different principles when choosing split nodes. Generating a complete decision tree involves determining the root node, selecting the split nodes, and fixing the final leaf nodes. If overfitting occurs, the decision tree needs to be pruned. In addition, the decision tree is the basis of the random forest classifier; it can be used not only for binary classification but also for regression analysis.
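As a hedged sketch (not the authors' implementation), a CART-style tree with cost-complexity pruning can be fitted in scikit-learn as shown below; the ccp_alpha value is an illustrative assumption, and the split from the KNN sketch above is reused.

from sklearn.tree import DecisionTreeClassifier

# Reusing X_train, X_test, y_train, y_test from the KNN sketch above.
# ccp_alpha > 0 applies cost-complexity pruning to the grown tree to limit overfitting.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.001, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))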
2.6 Artificial Neural Network
ANN has long been a research hotspot in the field of artificial intelligence. It imitates human thinking in order to learn from data and carry out analysis. An artificial neural network consists of three parts: an input layer, a middle (hidden) layer, and an output layer. In the model, a data sample enters the input layer, is transformed through weighted combinations with different weights, passes through the middle layer, and finally produces a result at the output layer. With different weights and activation functions, the output of the model may differ greatly. In recent years, with the development of artificial neural networks, they have been widely used in different fields, such as image recognition, intelligent machines, and risk prediction.
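A hedged sketch of such a three-part network using scikit-learn's MLPClassifier; the hidden-layer size and activation below are illustrative assumptions, and the split from the KNN sketch above is reused.

from sklearn.neural_network import MLPClassifier

# Reusing X_train, X_test, y_train, y_test from the KNN sketch above.
# Input layer -> one hidden layer of 32 units with ReLU activation -> output layer.
ann = MLPClassifier(hidden_layer_sizes=(32,), activation="relu", max_iter=500, random_state=0)
ann.fit(X_train, y_train)
print("test accuracy:", ann.score(X_test, y_test))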
3. Example Application
3.1 Data Selection and Preprocessing
The experimental data come from the Kaggle machine learning database on automobile credit default (https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction#test.csv). The total number of samples is 225,490, of which 48,967 are positive (default) and 176,523 are negative (no default). Each sample contains a total of 41 features, including the label variable. The feature Date.of.Birth is the customer's date of birth; since the exact date is not informative on its own, only the customer's month of birth is recorded. The feature Employment.Type has many missing values, which are deleted because this has little effect on the results. In Employment.Type, "salaried" is coded as 0 and "self_employed" is coded as 1. DisbursalDate is a date-form feature; to simplify feature processing, only the year is recorded. The feature perform_cns.score.description is a multi-class feature with 20 values. Before coding, a (very low risk), b (very low risk), c (very low risk), d (very low risk), e (low risk), f (low risk), and g (low risk) are recorded as low risk; h (medium risk) and i (medium risk) are recorded as medium risk; j (high risk), k (high risk), l (very high risk), and m (very high risk) are recorded as high risk; no bureau history available, not scored (no activity seen on the customer (inactive)), not scored (not enough info available on the customer), not scored (only a guarantor), not scored (sufficient history not available), not scored (more than 50 active accounts found), and not scored (no updates available in last 36 months) are recorded as not scored. These four categories are then coded. The attribute values of the features average.acct.age and credit.history.length are date strings; the number of years is extracted and converted into months, with a year counted as 24 months, so that, for example, "4yrs 8mon" is recorded as 54 months. The feature loan_default is the default result, where 0 represents no default and 1 represents default. The remaining data are recorded as-is. In this experiment, there are 7,661 missing values in the original data samples, which are deleted without much influence on the results.
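A hedged sketch of this preprocessing in pandas is given below; the file name, exact column spellings, category strings, and date format follow the Kaggle dataset as described in the text and are assumptions, not the authors' code.

import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name from the Kaggle dataset

# Employment.Type: drop rows with missing values, then binary-encode (salaried -> 0).
df = df.dropna(subset=["Employment.Type"])
df["Employment.Type"] = (df["Employment.Type"] == "Self employed").astype(int)

# DisbursalDate: keep only the year (day-first date format assumed).
df["DisbursalDate"] = pd.to_datetime(df["DisbursalDate"], dayfirst=True).dt.year

# Collapse PERFORM_CNS.SCORE.DESCRIPTION into the four risk groups, then one-hot encode.
def risk_group(desc):
    d = str(desc).lower()
    if "low risk" in d:
        return "low"
    if "medium risk" in d:
        return "medium"
    if "high risk" in d:
        return "high"
    return "not_scored"

df["risk_group"] = df["PERFORM_CNS.SCORE.DESCRIPTION"].apply(risk_group)
df = pd.get_dummies(df, columns=["risk_group"])

# "Xyrs Ymon" strings to months, counting a year as 24 months as described in the text.
def to_months(s):
    yrs, mon = s.split()
    return int(yrs.replace("yrs", "")) * 24 + int(mon.replace("mon", ""))

for col in ["AVERAGE.ACCT.AGE", "CREDIT.HISTORY.LENGTH"]:
    df[col] = df[col].apply(to_months)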
3.2 Feature Selection
Among the 41 features (including the label feature), the feature uniqueID is not used as an input variable because it obviously has nothing to do with automobile credit default. The features branch_id, supplier_id, manufacturer_id, Current_pincode_id, state_ID, and Employee_code_ID each take more than ten attribute values. Because these six features have little influence on the results, they are also excluded as input variables to avoid the curse of dimensionality. In data preprocessing, the feature perform_cns.score.description is re-coded, and this feature with twenty attribute values is replaced by four encoded attribute values when imported into the prediction model; simply put, one feature is expanded into four features at input time. Therefore, before further feature selection, there are 36 features in the sample data (excluding the label feature). Among the many feature selection methods, random forest can process high-dimensional data; it can not only calculate the importance of each individual feature variable (excluding the label variable) but also rank the features by importance. Therefore, the 36 feature variables are screened by a random forest classifier. The results of random forest feature selection are shown in Fig. 4.
Random forest feature importance ranking.
Fig. 4 shows that the importance of the most important feature exceeds 0.16, and that nearly half of the features have an importance below 0.01. During the experiment, including or excluding the features whose importance is below 0.01 had little effect on the performance of the model, which means that these low-importance features contribute little to the prediction performance. Therefore, to improve the efficiency of model operation, not all 36 features are used as input variables; the experiment keeps only the features whose importance is greater than 0.01. Finally, 17 features are chosen as input variables and 1 feature as the output variable; their definitions and importance are given in Table 1.
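A hedged sketch of this importance-based screening, continuing from the preprocessing sketch above (df and the label column loan_default are assumed from that sketch):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Only numeric columns are used; identifiers and unencoded strings are assumed dropped earlier.
X = df.drop(columns=["loan_default"]).select_dtypes("number")
y = df["loan_default"]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Keep only the features whose importance exceeds 0.01.
importances = pd.Series(rf.feature_importances_, index=X.columns)
selected = importances[importances > 0.01].sort_values(ascending=False)
X_selected = X[selected.index]
print(selected)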
From the feature importances in Table 1, it can be seen that disbursed amount, asset cost, and DisbursalDate have a large impact on loan_default, and that the importance of each of these three indicators exceeds 0.01. This suggests that paying attention to these three aspects of borrowers can help auto finance institutions reduce credit losses.
Definition of characteristic variables
3.3 Standardization and Data Partition
During the study, standardization is used to improve the efficiency of the prediction models and reduce losses. Standardization converts all data to approximately zero mean and unit variance.
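In its standard form, the transformation is [TeX:] $$x^{*}=\frac{x-\mu}{\sigma},$$ where [TeX:] $$\mu$$ is the mean and [TeX:] $$\sigma$$ the standard deviation of the feature.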
In this experiment, the attribute values of pri.current.balance, pri.sanctioned.amount, pri.disbursed.amount, and primary.instal.amt differ greatly in magnitude, as shown in Fig. 5.
As can be seen from Fig. 5, the attribute values of pri.disbursed.amount range from about 0 to 1 billion, and standardizing the attribute values of these four features greatly reduces their ranges, as shown in Fig. 6.
From the comparison of these two figures, we can see that even though pri.disbursed.amount has the largest range of attribute values, its range is greatly reduced by standardization. Moreover, standardizing the data is also conducive to building the KNN, logistic, ANN, and SVM models. In this experiment, because the characteristic attribute values of the sample data differ greatly, feeding them directly into the models would not only introduce errors but might also affect the prediction accuracy. Therefore, to reduce these effects, the sample data are standardized.
The number of sample data points selected in this study is 225,490. To improve the accuracy of the models, the sample data are shuffled row by row, and the shuffled samples are randomly divided into groups of equal size. Each group is then split into a training set and a test set; the training set accounts for 60% of the samples and the test set for 40%. Finally, the predicted results of the groups are recorded, and their average is used to evaluate the classification and prediction performance of each model.
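A hedged sketch of this grouping and 60/40 split (ten groups, as used in Section 3.4; X_selected and y are assumed from the feature-selection sketch above):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
idx = rng.permutation(len(X_selected))      # shuffle the rows
groups = np.array_split(idx, 10)            # ten random groups of (nearly) equal size

splits = []
for g in groups:
    Xg, yg = X_selected.iloc[g], y.iloc[g]
    splits.append(train_test_split(Xg, yg, test_size=0.4, random_state=0))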
Data before standardization.
Data after standardization.
3.4 Results and Analysis
We use PyCharm 2019.1 to build the SVM, random forest, KNN, logistic, decision tree, and ANN models. To verify whether these six classification algorithms can successfully predict automobile credit default, the training and test samples are fed into the six models, and the corresponding performance indexes are calculated: accuracy (Acc), specificity (Spe), recall, f1_score, and AUC. Their calculation formulas are given below.
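In their standard forms, with the confusion-matrix counts defined after the formulas:
[TeX:] $$\mathrm{Acc}=\frac{T P+T N}{T P+T N+F P+F N}, \quad \mathrm{Spe}=\frac{T N}{T N+F P}, \quad \text { recall }=r=\frac{T P}{T P+F N},$$
[TeX:] $$p=\frac{T P}{T P+F P}, \quad \text { f1\_score }=\frac{2 p r}{p+r},$$
and AUC is the area under the receiver operating characteristic (ROC) curve,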
where,
TP: the number of samples that are actually positive examples and also predicted positive examples;
TN: the number of samples that are actually negative examples and also predicted negative examples;
FP: the number of samples that are actually negative examples but predicted positive examples;
FN: the number of samples that are actually positive examples but predicted negative examples;
p: the abbreviation of precision;
r: the abbreviation of recall.
The above indexes are used to evaluate the prediction performance of SVM with different kernel functions, random forest, KNN, decision tree, logistic, and ANN. The predicted results of the three kernel-function variants of SVM are shown in Fig. 7.
Prediction results of the three kernel functions.
As can be seen from Fig. 7, no matter which of the three kernel functions the SVM uses, the accuracy and specificity are almost the same, which shows that the choice among these three kernel functions is not an important factor affecting the prediction performance of the SVM. We also compare the overall performance of SVM with the other prediction algorithms, as shown in Table 2.
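A hedged sketch of this kernel comparison with scikit-learn; the particular set of three kernels, the default SVC parameters, and one group's 60/40 split (X_train, X_test, y_train, y_test) from the sketch in Section 3.3 are assumptions.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score

scaler = StandardScaler().fit(X_train)
Xtr, Xte = scaler.transform(X_train), scaler.transform(X_test)

for kernel in ["linear", "rbf", "sigmoid"]:   # assumed to be the three kernels compared
    clf = SVC(kernel=kernel, probability=True, random_state=0)
    clf.fit(Xtr, y_train)
    pred = clf.predict(Xte)
    proba = clf.predict_proba(Xte)[:, 1]
    print(kernel,
          "Acc=%.3f" % accuracy_score(y_test, pred),
          "recall=%.3f" % recall_score(y_test, pred),
          "f1=%.3f" % f1_score(y_test, pred),
          "AUC=%.3f" % roc_auc_score(y_test, proba))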
The prediction results of these classification algorithms show that the accuracy of the decision tree is the highest, reaching 0.79. The recall, f1_score, and AUC of SVM, ANN, and logistic are the same, which means that the prediction performance of these three classification algorithms is almost identical in this application. After comprehensive comparison, however, the prediction performance of SVM is found to be slightly better. Because the data in this paper are imbalanced, oversampling is applied to balance the samples in an attempt to further improve model performance. The results are shown in Table 3.
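One common way to implement the upsampling described above (not necessarily the authors' exact procedure) is random oversampling of the minority (default) class with sklearn.utils.resample; X_train and y_train are assumed from the earlier split.

import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis=1)
minority = train[train["loan_default"] == 1]
majority = train[train["loan_default"] == 0]

# Duplicate minority rows with replacement until both classes are the same size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)

X_bal = balanced.drop(columns=["loan_default"])
y_bal = balanced["loan_default"]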
Prediction results of six classification algorithms (before over sampling)
Prediction results of six classification algorithms (after over sampling)
Operational results of grouped and ungrouped samples for different models
It can be seen from Table 3 that after oversampling the Acc, recall, and f1_score of SVM are the best, but its AUC is the lowest of the six prediction algorithms, which indicates that the classification effect of SVM after oversampling is not ideal. Comparing Tables 2 and 3, we find that the results on the data without oversampling are better, and that the performance of SVM is slightly better. In general, when data imbalance is encountered, various methods, or combinations of algorithms, are used to address it. The results of this paper show that, to a certain extent, prediction can be carried out directly without imbalance handling, and the results are better.
In the experiment, we find that grouping the data can improve the efficiency of the models. As is well known, when dealing with a large amount of data, the operational efficiency of SVM becomes very low. There are 233,154 samples in this experiment, so we randomly divide the samples into 10 groups and feed the ten groups of data into the six models. The results are shown in Table 4.
As Table 4 shows, after grouping, the operational efficiency of the five prediction models other than ANN is improved. The most obvious improvement is for SVM: when the samples are not grouped, the model runs for 8 hours, 4 minutes, and 54.9 seconds; after grouping, it runs for 11 minutes and 42.8 seconds. This shows that, when dealing with large samples, random grouping can be used to improve the operational efficiency of the model.
Random grouping only cuts a large sample into small samples, which improves the efficiency of the models but cannot improve their performance. We use Levene's test to check whether the variances of these 10 groups of data differ. The experimental results give a statistic of 0.900225 and a p-value of 0.523900. Since the p-value is greater than 0.05, there is no significant difference in the variances of the ten groups. To verify that there is also no difference in the means of the grouped data, we perform pairwise t-tests on the ten groups. The details are shown in Table 5.
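A hedged sketch of these checks with SciPy; which variable the authors actually tested is not stated, so groups_data is assumed to be a list of ten arrays holding a numeric column of each random group.

from itertools import combinations
from scipy import stats

# Levene's test for equality of variances across the ten groups.
stat, p = stats.levene(*groups_data)
print("Levene statistic=%.6f, p-value=%.6f" % (stat, p))

# Pairwise t-tests on the group means.
for i, j in combinations(range(len(groups_data)), 2):
    t, p_ij = stats.ttest_ind(groups_data[i], groups_data[j])
    print("group %d vs group %d: p=%.4f" % (i + 1, j + 1, p_ij))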
From Table 5, it can be seen that the p-value for every pair of groups is greater than 0.05, which shows that there is no significant difference in the means of the ten randomly grouped datasets. This also indirectly confirms that random grouping only improves operational efficiency and brings little improvement in other respects.
t-test results of every two groups of data
4. Conclusion
Based on SVM theory, this paper constructs prediction models with three kinds of kernel functions in order to explore which kernel yields better prediction results on the automobile credit data. The results show that the prediction results are the same when the kernel function is linear or RBF (radial basis function). The average of the predicted results over the three kernels is then calculated and compared with the prediction results of random forest, KNN, logistic, decision tree, and ANN. The test results preliminarily verify that these six algorithms can be applied to predict automobile credit default, which is helpful for automobile financial institutions in evaluating the default risk of loans. During the experiments, we used oversampling to address the problem of data imbalance, and the results show that model performance did not improve, which suggests that, to some extent, the data imbalance does not need to be explicitly handled here. At the same time, we find that random grouping can shorten the running time of the SVM, logistic, decision tree, random forest, and KNN models; the most obvious efficiency improvement is for SVM, which suggests that random grouping is worthwhile when using SVM to process large samples in the future. Moreover, this research finds that existing models for automobile credit default prediction are relatively simple, and it remains to be explored whether better prediction models can be constructed for this application scenario. In practice, the influencing factors of consumer credit default are more complex, and different automobile financial institutions establish different credit mechanisms, so the most representative features should be used to build targeted prediction models for the different scenarios.
Acknowledgement
This study is funded by the Sichuan Provincial Education Department project (No. 2018ZB0460).