## Ying Chen and Ruirui Zhang

No | Feature | Definition | Importance |
---|---|---|---|
1 | date.of.birth | date of birth of the customer | 0.062916 |
2 | Employment.Type | employment type of the customer (salaried/self-employed) | 0.014981 |
3 | disbursed_amount | amount of loan disbursed | 0.169733 |
4 | asset_cost | cost of the asset | 0.175458 |
5 | ltv | loan to value of the asset | 0.091125 |
6 | DisbursalDate | date of disbursement | 0.102625 |
7 | perform_cns.score | bureau score | 0.047888 |
8 | pri.active.accts | count of active loans taken by the customer at the time of disbursement | 0.013023 |
9 | pri.no.of.accts | count of total loans taken by the customer at the time of disbursement | 0.024398 |
10 | pri.current.balance | total principal outstanding amount of the active loans at the time of disbursement | 0.038587 |
11 | pri.sanctioned.amount | total amount that was sanctioned for all the loans at the time of disbursement | 0.036265 |
12 | pri.disbursed.amount | total amount that was disbursed for all the loans at the time of disbursement | 0.035350 |
13 | primary.instal.amt | EMI amount of the primary loan | 0.036934 |
14 | new.accts.in.last.six.months | new loans taken by the customer in the last 6 months before the disbursement | 0.010882 |
15 | average.acct.age | average loan tenure | 0.060287 |
16 | credit.history.length | time since first loan | 0.063644 |
17 | no.of_inquiries | number of credit inquiries made by the customer | 0.015904 |
18 | loan_default | - | - |

During the study, standardization is used to improve the efficiency of the prediction model and reduce its error. Standardization converts each feature to approximately zero mean and unit variance:

z = (x − μ) / σ

where μ is the mean of the feature and σ is its standard deviation.

In this experiment, the attribute values of pri.current.balance, pri.sanctioned.amount, pri.disbursed.amount, and primary.instal.amt span very different ranges, as shown in Fig. 5.

As can be seen from Fig. 5, the attribute values of pri.disbursed.amount range from about 0 to 1 billion; standardizing these four features greatly reduces their data ranges, as shown in Fig. 6.

Comparing the two figures shows that although pri.disbursed.amount has the widest range of attribute values, standardization greatly reduces the range of every feature. Standardized data are also conducive to building the KNN, logistic regression, ANN, and SVM models. Because the feature values of the sample data differ greatly in magnitude, feeding them into a model directly would introduce errors and could lower prediction accuracy; to reduce these effects, the sample data are therefore standardized before modeling.
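The standardization step above can be sketched as follows. This is a minimal illustration using scikit-learn's `StandardScaler`; the DataFrame here is a synthetic stand-in for the loan data (the paper does not publish its preprocessing code), with column names taken from Table 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the four monetary features; real values
# span very different ranges (pri.disbursed.amount up to ~1e9).
df = pd.DataFrame({
    "pri.current.balance":   rng.uniform(0, 1e7, 1000),
    "pri.sanctioned.amount": rng.uniform(0, 1e8, 1000),
    "pri.disbursed.amount":  rng.uniform(0, 1e9, 1000),
    "primary.instal.amt":    rng.uniform(0, 1e5, 1000),
})

scaler = StandardScaler()          # z = (x - mean) / std, per column
scaled = scaler.fit_transform(df)

# After scaling, every column has mean ~0 and variance ~1.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Scaling each feature to a comparable range is what lets distance- and margin-based models such as KNN and SVM weigh the features fairly.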

The number of samples selected in this study is 225,490. To improve model accuracy, the rows of the sample data are shuffled and then randomly divided into groups of equal size. Each group is split into a training set (60% of the group) and a test set (40%). Finally, the predictions of every group are recorded and averaged to evaluate the classification and prediction performance of each model.
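The sampling scheme above can be sketched as follows: shuffle the rows, cut them into equal-sized groups, and split each group 60/40 into training and test sets. The feature matrix `X` and label vector `y` are synthetic stand-ins for the 225,490-row loan data, and the group count of 10 is taken from the grouping experiment later in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))          # toy feature matrix
y = rng.integers(0, 2, size=1000)       # toy default labels

# Shuffle the row order, then cut into 10 equal groups.
order = rng.permutation(len(X))
groups = np.array_split(order, 10)

splits = []
for idx in groups:
    # 60% training / 40% test within each group.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[idx], y[idx], test_size=0.4, random_state=0)
    splits.append((X_tr, X_te, y_tr, y_te))

print(len(splits), splits[0][0].shape, splits[0][1].shape)
```

Each group then yields one set of predictions, and the per-group metrics are averaged as described above.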

We use PyCharm 2019.1 software to build the SVM, random forest, KNN, logistic regression, decision tree, and ANN models. To verify whether these six classification algorithms can successfully predict automobile credit default, the training and test samples are imported into the six models, and the corresponding performance indexes are calculated: accuracy (Acc), specificity (Spe), recall, f1_score, and AUC. Their calculation formulas are:

Acc = (TP + TN) / (TP + TN + FP + FN)

Spe = TN / (TN + FP)

recall = TP / (TP + FN)

f1_score = 2pr / (p + r)

where,

TP: the number of samples that are actually positive and predicted positive;

TN: the number of samples that are actually negative and predicted negative;

FP: the number of samples that are actually negative but predicted positive;

FN: the number of samples that are actually positive but predicted negative;

p: precision, p = TP / (TP + FP);

r: recall, as defined above.
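The indexes above can be computed directly from the confusion-matrix counts. This is a generic sketch on made-up labels (not the paper's data), cross-checked against scikit-learn's built-in metric functions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score,
                             roc_auc_score, confusion_matrix)

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])   # actual labels
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])   # predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc  = (tp + tn) / (tp + tn + fp + fn)   # accuracy
spe  = tn / (tn + fp)                    # specificity
rec  = tp / (tp + fn)                    # recall
prec = tp / (tp + fp)                    # precision
f1   = 2 * prec * rec / (prec + rec)     # f1_score

# The manual formulas agree with scikit-learn's implementations.
assert acc == accuracy_score(y_true, y_pred)
assert rec == recall_score(y_true, y_pred)
print(acc, spe, rec, f1, roc_auc_score(y_true, y_pred))
```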

The above indexes are used to evaluate the prediction performance of SVM with different kernel functions, random forest, KNN, decision tree, logistic, and ANN. The predicted results of three kernel function algorithms of SVM are shown in Fig. 7.

As can be seen from Fig. 7, whichever of the three kernel functions the SVM uses, its accuracy and specificity are almost identical, which shows that the choice of kernel is not an important factor in the SVM's prediction performance here. We then compare the overall performance of SVM with the other prediction algorithms, as shown in Table 2.

The prediction results of these classification algorithms show that the decision tree achieves the highest accuracy, 0.79. SVM, ANN, and logistic regression have identical recall, f1_score, and AUC, which means the prediction performance of these three algorithms is almost the same in this application. After a comprehensive comparison, however, the prediction performance of SVM is found to be slightly better. Because the data in this paper are imbalanced, oversampling is applied to balance the samples in an attempt to further improve model performance. The results are shown in Table 3.
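The oversampling step can be sketched as below: duplicate minority-class rows (sampling with replacement) until both classes are equally represented. The paper does not name its tool, so this sketch uses `sklearn.utils.resample` on a small synthetic imbalanced set.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)        # imbalanced: 80 vs 20

# Upsample the minority class to match the majority class size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=80, random_state=0)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))                # balanced: [80 80]
```

Note that duplicating minority rows adds no new information, which is consistent with the paper's finding that oversampling did not improve performance here.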

Table 2. Prediction performance of the six classification algorithms.

Classification method | Acc | recall | f1_score | AUC |
---|---|---|---|---|
Overall performance of SVM | 0.78 | 1 | 0.88 | 0.78 |
Random Forest | 0.78 | 0.96 | 0.87 | 0.77 |
KNN | 0.74 | 0.92 | 0.85 | 0.74 |
Decision Tree | 0.79 | 0.77 | 0.78 | 0.66 |
Logistic | 0.69 | 1 | 0.88 | 0.78 |
ANN | 0.69 | 1 | 0.88 | 0.78 |

Table 3. Prediction performance after oversampling.

Classification method | Acc | recall | f1_score | AUC |
---|---|---|---|---|
Overall performance of SVM | 0.78 | 0.78 | 0.69 | 0.50 |
Random Forest | 0.66 | 0.66 | 0.65 | 0.72 |
KNN | 0.64 | 0.64 | 0.63 | 0.69 |
Decision Tree | 0.65 | 0.64 | 0.65 | 0.71 |
Logistic | 0.57 | 0.57 | 0.57 | 0.61 |
ANN | 0.59 | 0.59 | 0.59 | 0.63 |

Table 4. Operation time of the models.

Model | Ungrouped samples | Grouped samples |
---|---|---|
SVM | 8h4m54.90s | 7m43.91s |
Random Forest | 14.8s | 10.3s |
Decision Tree | 9.6s | 3.5s |
KNN | 4m11.8s | 1m26.0s |
ANN | 1m58.2s | 2m59.1s |
Logistic | 7.98s | 3.00s |

It can be seen from Table 3 that after oversampling, SVM obtains the best Acc, recall, and f1_score, but the lowest AUC of the six prediction algorithms, which indicates that its classification effect is not ideal. Comparing Tables 2 and 3 shows that the results without oversampling are better, and that the performance of SVM is slightly better overall. In general, when data imbalance is encountered, various resampling methods or algorithm ensembles are used to address it. The results of this paper show that, to a certain extent, prediction can be carried out directly without imbalance processing, with better results.

In the experiment, we find that grouping the data can improve the efficiency of the models. It is well known that the operational efficiency of SVM becomes very low on large amounts of data. There are 233,154 samples in this experiment, so we randomly divide them into 10 groups and feed the ten groups of data into the six models. The results are shown in Table 4.

In Table 4, we can see that after grouping, the operational efficiency of all the prediction models except ANN improves. The improvement is most obvious for SVM: without grouping, the model runs for 8 hours 4 minutes 54.9 seconds; after grouping, it runs for 11 minutes 42.8 seconds. This shows that when dealing with large sample data, random grouping can be used to improve the operational efficiency of the model.
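The speed-up can be illustrated with a toy timing comparison: kernel-SVM training scales super-linearly in the number of samples, so fitting ten SVMs on tenths of the data is cheaper than fitting one SVM on all of it. The data sizes below are toy values, not the paper's 233,154 rows.

```python
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=3000) > 0).astype(int)

t0 = time.perf_counter()
SVC(kernel="rbf").fit(X, y)                 # one model, all samples
full_time = time.perf_counter() - t0

t0 = time.perf_counter()
for idx in np.array_split(rng.permutation(len(X)), 10):
    SVC(kernel="rbf").fit(X[idx], y[idx])   # ten models, 300 rows each
grouped_time = time.perf_counter() - t0

print(f"full: {full_time:.3f}s  grouped: {grouped_time:.3f}s")
```

Roughly, with quadratic training cost, ten fits of n/10 samples cost about one tenth of a single fit of n samples, which matches the direction of the Table 4 results.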

Random grouping only cuts large samples into small ones, which improves the efficiency of the model but cannot improve its performance. We use Levene's test to verify whether the variances of the ten groups of data differ. The experimental results give a statistic of 0.900225 and a p-value of 0.523900; since the p-value is greater than 0.05, there is no significant difference in the variances of the ten groups. To verify that the group means also do not differ after grouping, we run pairwise t-tests on the ten groups. The details are shown in Table 5.

From Table 5, it can be seen that the p-value of every pair of groups is greater than 0.05, which shows that there is no significant difference in the means of the ten randomly formed groups. This indirectly confirms that random grouping improves only operational efficiency and little else.
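These statistical checks can be sketched with `scipy.stats`: Levene's test for equal variances across the ten groups, then pairwise two-sample t-tests on their means. The groups here are synthetic draws from a single distribution, mimicking random grouping of one dataset.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Ten "groups" drawn from the same distribution, as random grouping yields.
groups = [rng.normal(loc=0.0, scale=1.0, size=200) for _ in range(10)]

# Levene's test: H0 = all group variances are equal.
stat, p = stats.levene(*groups)
print(f"Levene: stat={stat:.6f}, p={p:.6f}")

# Pairwise t-tests: H0 = the two group means are equal.
for (i, a), (j, b) in combinations(enumerate(groups, start=1), 2):
    t, pt = stats.ttest_ind(a, b)
    if pt < 0.05:                     # report only significant pairs
        print(f"group{i} & group{j}: t={t:.6f}, p={pt:.6f}")
```

With groups drawn from one distribution, p-values above 0.05 are expected in almost every comparison, mirroring Table 5.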

Table 5. Pairwise t-test results for the ten groups.

Pair | Stat | p-value | Pair | Stat | p-value |
---|---|---|---|---|---|
group1 & group2 | -1.131921 | 0.257674 | group3 & group10 | 0.423476 | 0.671950 |
group1 & group3 | -0.618440 | 0.536289 | group4 & group5 | -0.593595 | 0.552786 |
group1 & group4 | -0.515539 | 0.606179 | group4 & group6 | -1.298408 | 0.194154 |
group1 & group5 | -1.109136 | 0.267378 | group4 & group7 | 0.068639 | 0.945278 |
group1 & group6 | -1.813955 | 0.069691 | group4 & group8 | -0.991861 | 0.321271 |
group1 & group7 | -0.446900 | 0.654949 | group4 & group9 | 0.458171 | 0.646832 |
group1 & group8 | -1.507405 | 0.131714 | group4 & group10 | 0.320576 | 0.748533 |
group1 & group9 | -0.057368 | 0.954252 | group5 & group6 | -0.704809 | 0.480933 |
group1 & group10 | -0.137595 | 0.845423 | group5 & group7 | 0.662233 | 0.507825 |
group2 & group3 | 0.513479 | 0.607619 | group5 & group8 | -0.398265 | 0.690437 |
group2 & group4 | 0.616380 | 0.537647 | group5 & group9 | 1.051767 | 0.292912 |
group2 & group5 | 0.022785 | 0.981822 | group5 & group10 | 0.914171 | 0.360632 |
group2 & group6 | -0.682024 | 0.495227 | group6 & group7 | 1.367047 | 0.171617 |
group2 & group7 | 0.685019 | 0.493336 | group6 & group8 | 0.306544 | 0.759192 |
group2 & group8 | -0.375479 | 0.707306 | group6 & group9 | 1.756585 | 0.078995 |
group2 & group9 | 1.074553 | 0.282581 | group6 & group10 | 1.618988 | 0.105457 |
group2 & group10 | 0.936957 | 0.348786 | group7 & group8 | -1.060500 | 0.288923 |
group3 & group4 | 0.102900 | 0.918042 | group7 & group9 | 0.389532 | 0.696884 |
group3 & group5 | -0.490694 | 0.623645 | group7 & group10 | 0.251937 | 0.801091 |
group3 & group6 | -1.195506 | 0.231896 | group8 & group9 | 1.450036 | 0.147055 |
group3 & group7 | 0.171539 | 0.863801 | group8 & group10 | 1.312439 | 0.189379 |
group3 & group8 | -0.888960 | 0.374029 | group9 & group10 | -0.137595 | 0.890561 |
group3 & group9 | 0.561072 | 0.574752 | - | - | - |

** Significant difference at the 0.05 level.

Based on SVM theory, this paper constructs prediction models with three kinds of kernel functions to explore which kernel yields better results on the automobile credit data. The results show that the linear and RBF (radial basis function) kernels produce the same predictions. The average of the predicted results of the three kernels is then calculated and compared with the prediction results of random forest, KNN, logistic regression, decision tree, and ANN. The test results preliminarily verify that these six algorithms can be applied to predict automobile credit default, which helps automobile financial institutions evaluate the default risk of loans.

During the experiments, we used oversampling to address the data imbalance, but the results show that model performance did not improve, indicating that, to some extent, the imbalance does not need to be treated specially. At the same time, we find that random grouping shortens the running time of five models: SVM, logistic regression, decision tree, random forest, and KNN. The improvement is most obvious for SVM, which suggests that random grouping is a practical way to apply SVM to large sample data in the future.

Moreover, the research finds that the automobile credit default prediction models used here are relatively simple, and whether better prediction models can be constructed in this application scenario remains to be explored. In practice, the influencing factors of consumer credit default are more complex, and different automobile financial institutions establish different credit mechanisms; targeted prediction models should therefore be built from the most representative features in each scenario.

She received B.S., M.S., and Ph.D. degrees from the School of Computer Science, Sichuan University, in 2004, 2007, and 2012, respectively. She is a lecturer at the School of Business, Sichuan Agricultural University, China. Her current research interests include network security, wireless sensor networks, intrusion detection, and artificial immune systems.
