## Girija Attigeri*, Manohara Pai M. M.**, and Radhika M. Pai***

Sl. No. | Attribute No. | IG
---|---|---
1 | 632 | 0.003358
2 | 532 | 0.002738
3 | 14 | 0.002684
4 | 758 | 0.002513
5 | 7 | 0.002493
6 | 143 | 0.002458
7 | 593 | 0.002402
8 | 278 | 0.002398
9 | 69 | 0.002395
10 | 557 | 0.002383
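As an illustration of how IG scores like those in the table can be computed, the following is a minimal sketch of information gain for a discrete feature. It uses toy data rather than the loan dataset (the attribute numbers above refer to columns of that dataset):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy data: f1 is perfectly informative about y, f2 carries no information.
y  = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
print(information_gain(f1, y))  # 1.0
print(information_gain(f2, y))  # 0.0
```

Ranking all features by this score and keeping the top ten reproduces the kind of ordering shown in the table.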

Fig. 4 shows that LR delivers consistent performance compared with the SVM. However, the SVM trend line shows declining performance as the number of attributes increases. The SVM kernel function transforms the data into a higher-dimensional space so that the data becomes linearly separable for the classification task. When the data is high dimensional with many features, clear linear separability is difficult to achieve, and hence the performance of the SVM declines.

The ROC graphs obtained for the SVM are depicted in Fig. 5. They show that the SVM performs well for the subset with 60 features, but not for the other subsets. Hence, for the SVM, the subset with 60 features is the optimal subset. The ROC graphs obtained for LR are shown in Fig. 6. They show consistent performance across all the subsets, except for one subset, which exhibits variation in specificity and sensitivity. Hence, the SVM shows good performance when the top 60 features are considered, while LR presents stable performance for all the subsets.

The time complexity of the feature selection algorithm is [TeX:] $$\Theta\left(n^{2}\right)$$. The complexity of building the LR model is [TeX:] $$\Theta(mnp)$$, where m is the number of rows, n is the number of features, and p is the number of iterations until convergence. The time complexity of model building for the SVM [32] is [TeX:] $$\mathrm{O}\left((\mathrm{mn})^{3}\right)$$. Hence, as the number of features decreases, the training time also decreases, at least linearly. It is also observed empirically that the time taken is very high when all the features are considered, and it decreases significantly when the features are reduced. From the experiments conducted, it can be inferred that reducing the dimension does not degrade the performance of the classification algorithms when an optimal subset of features is considered, while yielding an exponential improvement in computational time, as shown in Fig. 7.
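The quadratic cost of the feature selection step arises because greedy selection under a submodular merit function re-scans the remaining features at every step. A minimal sketch, assuming a hypothetical `merit` scoring function (here a toy set-coverage score, not the paper's CFS merit):

```python
def greedy_select(features, merit, k):
    """Greedy maximization of a monotone submodular merit function.

    Each of the k steps scans all remaining features, so the number of
    merit evaluations is O(k * n) -- quadratic in n when k grows with n.
    """
    selected, remaining = [], list(features)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda f: merit(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy submodular objective: how many distinct "concepts" a subset covers.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
merit = lambda s: len(set().union(*(cover[f] for f in s)))
print(greedy_select(cover, merit, 2))  # ['a', 'b']
```

For monotone submodular objectives, this greedy scheme carries the classical (1 - 1/e) approximation guarantee [18].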

The experiment is repeated to identify the optimal subset leading to significant improvement in the accuracy of the model. The results in Table 2 show that the top 60 optimal features provide 91% accuracy for both the LR and SVM algorithms, compared with considering all the features. To carry out a comparative analysis with existing standard feature selection methods, the least absolute shrinkage and selection operator (LASSO) [33, 34] and the feature extraction algorithm principal component analysis (PCA) were implemented on the loan dataset. LASSO works on the principle of fitting a linear equation that best predicts the class variable; in the process, the features that receive larger coefficients can be considered features of high importance. To reduce the dimension of the feature set, PCA transforms the given feature set onto another feature space using the eigenvalues and eigenvectors of the mean-adjusted input feature set. After the transformation, it is therefore difficult to know which component maps to which feature, so the importance of the variables or features in the original dataset cannot be analyzed. CFS-SO, in contrast, allows the importance of individual features with respect to the predictor variable to be analyzed effectively.

Table 2.

Feature selection algorithm | Classification algorithm | 10 | 20 | 30 | 40 | 50 | 60 | All
---|---|---|---|---|---|---|---|---
PCA | SVM | 0.40 | 0.40 | 0.56 | 0.67 | 0.912 | 0.921 | 0.908
PCA | LR | 0.068 | 0.08 | 0.87 | 0.82 | 0.913 | 0.919 | 0.808
LASSO | SVM | 0.50 | 0.59 | 0.59 | 0.63 | 0.82 | 0.86 | 0.04
LASSO | LR | 0.64 | 0.634 | 0.64 | 0.64 | 0.84 | 0.81 | 0.90
CFS-SO | SVM | 0.25 | 0.0028 | 0.27 | 0.411 | 0.41 | 0.91 | 0.04
CFS-SO | LR | 0.911 | 0.91 | 0.91 | 0.91 | 0.909 | 0.90 | 0.90

With PCA, both the SVM and LR perform well when more than 40 principal components are used. With CFS-SO, LR shows better performance overall, whereas the SVM performs well only for the top 60 features. With LASSO, LR outperforms the SVM. It can be inferred from both CFS-SO and LASSO that, for the considered dataset, linear correlation with the predictor variable suits well and hence LR shows better performance. Comparing CFS-SO and LASSO, it can be observed that CFS-SO performs better.
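The PCA baseline described above can be sketched as follows. This is a minimal NumPy illustration on random data, not the loan dataset; note that each principal component mixes all original features, which is why per-feature importance is lost after the transform:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

Xc = X - X.mean(axis=0)                 # mean-adjust the input features
cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
components = eigvecs[:, order[:2]]      # keep the top-2 principal components
X_reduced = Xc @ components             # project onto the new feature space

print(X_reduced.shape)  # (100, 2)
```

Because `X_reduced` lives in the transformed space, there is no direct mapping back to individual input columns, unlike CFS-SO or LASSO, which select original features.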

The best accuracies obtained for LR and the SVM after applying PCA, LASSO, and CFS-SO are shown in Fig. 8. The accuracies obtained without applying feature selection, i.e., considering all the features, are shown in Fig. 9. These figures indicate the importance of CFS-SO feature selection for improving the accuracy of the prediction models.

Prediction models for financial health play a key role in controlling financial irregularities in a country’s economy. It is important to employ such prediction models by leveraging all the data available in the financial domain. Important predictions include financial failure conditions, fraudulent activities, bankruptcy, and NPA. The data for such models is not readily usable, as it contains noisy and redundant features; hence, the data needs to be pre-processed and a representative subset of the data should be prepared. The present work focuses on understanding the suitability of the correlation-based method using submodular optimization for the selection of features on voluminous data. First, the data is preprocessed by handling null values and converting categorical data to numerical data. Then, the right subset of features is identified, which aids in the predictive analysis of bad loans. The performance of the prediction algorithm is used as the evaluation metric for choosing the right subset. The experimental results show that subsets with an optimal number of features do not degrade the performance of the classification models, while reducing the computational time exponentially. The performance comparison of the classification models with the CFS-SO, LASSO, and PCA algorithms indicates that models with CFS-SO perform better than those with LASSO, and CFS-SO can be chosen over PCA to retain the original input features.

Big data technology is used to demonstrate the relevance of the approach for the problem addressed. Big data preprocessing for the improvement of predictive modelling is an essential step towards financial data analytics for NPA and fraud prediction. Future work can comprise building suitable predictive models for fraud detection using data pre-processed with the proposed approach.

She is currently an Assistant Professor (Selection Grade) in the Department of Information and Communication Technology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India. She received her B.E. and M.Tech. degrees from Visvesvaraya Technological University, Karnataka, India. She has 12 years of experience in teaching and research.

He is a Professor and Associate Director of Research and Consultancy at the Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal. He has 26 years of experience in research, academics, and industry. He received his Ph.D. from the University of Mysore, Karnataka, India. His research interests span big data analytics, wireless sensor networks, the Internet of Things, cloud computing, and intelligent transportation systems. He has publications in reputed international conferences and journals. He holds six patents and has authored two books. He has supervised four Ph.D. students and more than 80 postgraduate students. He was a visiting professor of ESIGELEC-IRSEEM at the University of Rouen, France. He is the investigator for several projects funded by the Government of India and by various industries. He is an IEEE Senior Member and Chair of the IEEE Mangalore Subsection.

She is a Professor in the Department of Information and Communication Technology at the Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal. She has about 25 years of experience in research, academics, and industry. She received her Ph.D. from the National Institute of Technology, Karnataka, India. Her major research interests are big data analytics, database systems, data mining and warehousing, and operating systems. She has publications in reputed international conferences and journals. She has received grants from the Government of India.

- [1] T. Seth and V. Chaudhary, "Big data in finance," in *Big Data: Algorithms, Analytics, and Applications*. Boca Raton, FL: CRC Press, 2015, pp. 329-356.
- [2] I. Taleb, R. Dssouli, and M. A. Serhani, "Big data pre-processing: a quality framework," in *Proceedings of 2015 IEEE International Congress on Big Data*, New York, NY, 2015, pp. 191-198.
- [3] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, "Feature selection: a data perspective," *ACM Computing Surveys*, vol. 50, no. 6, 2018. doi: 10.1145/3136625
- [4] B. Arguello, "A survey of feature selection methods: algorithms and software," PhD dissertation, University of Texas at Austin, TX, 2015.
- [5] A. Krause, "SFO: a toolbox for submodular function optimization," *Journal of Machine Learning Research*, vol. 11, pp. 1141-1144, 2010.
- [6] M. A. Fattah, "A novel statistical feature selection approach for text categorization," *Journal of Information Processing Systems*, vol. 13, no. 5, pp. 1397-1409, 2017. doi: 10.3745/JIPS.02.0076
- [7] K. Kira and L. A. Rendell, "A practical approach to feature selection," in *Machine Learning Proceedings 1992*. St. Louis, MO: Elsevier, 1992, pp. 249-256.
- [8] S. Fallahpour, E. N. Lakvan, and M. H. Zadeh, "Using an ensemble classifier based on sequential floating forward selection for financial distress prediction problem," *Journal of Retailing and Consumer Services*, vol. 34, pp. 159-167, 2017. doi: 10.1016/j.jretconser.2016.10.002
- [9] E. Wright, Q. Hao, K. Rasheed, and Y. Liu, 2018; https://arxiv.org/abs/1803.06615
- [10] S. D. Kim, "A feature selection technique based on distributional differences," *Journal of Information Processing Systems*, vol. 2, no. 1, pp. 23-27, 2006.
- [11] S. Maldonado, J. Perez, and C. Bravo, "Cost-based feature selection for support vector machines: an application in credit scoring," *European Journal of Operational Research*, vol. 261, no. 2, pp. 656-665, 2017. doi: 10.1016/j.ejor.2017.02.037
- [12] A. Krause and V. Cevher, "Submodular dictionary selection for sparse representation," in *Proceedings of the 27th International Conference on Machine Learning (ICML)*, Haifa, Israel, 2010, pp. 567-574.
- [13] Y. Bar, I. Diamant, L. Wolf, S. Lieberman, E. Konen, and H. Greenspan, "Chest pathology identification using deep feature selection with non-medical training," *Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization*, vol. 6, no. 3, pp. 259-263, 2018. doi: 10.1080/21681163.2016.1138324
- [14] R. Iyer, S. Jegelka, and J. Bilmes, "Fast semidifferential-based submodular function optimization," in *Proceedings of the 30th International Conference on Machine Learning (ICML)*, Atlanta, GA, 2013, pp. 855-863.
- [15] K. Wei, Y. Liu, K. Kirchhoff, and J. Bilmes, "Using document summarization techniques for speech data subset selection," in *Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, Atlanta, GA, 2013, pp. 721-726.
- [16] A. Krause and C. Guestrin, "A note on the budgeted maximization of submodular functions," Carnegie Mellon University, Technical Report No. CMU-CALD-05-103, 2005.
- [17] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in *Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, Washington, DC, 2003, pp. 137-146.
- [18] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions - I," *Mathematical Programming*, vol. 14, no. 1, pp. 265-294, 1978. doi: 10.1007/BF01588971
- [19] M. A. Hall, "Correlation-based feature selection for machine learning," PhD dissertation, The University of Waikato, Hamilton, New Zealand, 1999.
- [20] A. Pouramirarsalani, M. Khalilian, and A. Nikravanshalmani, "Fraud detection in E-banking by using the hybrid feature selection and evolutionary algorithms," *International Journal of Computer Science and Network Security*, vol. 17, no. 8, pp. 271-279, 2017.
- [21] Y. Wang, W. Ke, and X. Tao, "A feature selection method for large-scale network traffic classification based on spark," *Information*, vol. 7, no. 6, 2016. doi: 10.3390/info7010006
- [22] H. D. Gangurde, "Feature selection using clustering approach for big data," *International Journal of Computer Applications*, vol. 2014, no. 4, pp. 1-3, 2014.
- [23] P. Sarlin, "Data and dimension reduction for visual financial performance analysis," *Information Visualization*, vol. 14, no. 2, pp. 148-167, 2015. doi: 10.1177/1473871613504102
- [24] H. S. Bhat and D. Zaelit, "Forecasting retained earnings of privately held companies with PCA and L1 regression," *Applied Stochastic Models in Business and Industry*, vol. 30, no. 3, pp. 271-293, 2014. doi: 10.1002/asmb.1972
- [25] I. Pisica, G. Taylor, and L. Lipan, "Feature selection filter for classification of power system operating states," *Computers & Mathematics with Applications*, vol. 66, no. 10, pp. 1795-1807, 2013. doi: 10.1016/j.camwa.2013.05.033
- [26] H. Liu and H. Motoda, *Feature Selection for Knowledge Discovery and Data Mining*. New York, NY: Springer Science+Business Media, 2012.
- [27] M. Dash, "Feature selection via set cover," in *Proceedings of the 1997 IEEE Knowledge and Data Engineering Exchange Workshop*, Newport Beach, CA, 1997, pp. 165-171.
- [28] A. Arauzo-Azofra, J. M. Benitez, and J. L. Castro, "A feature set measure based on relief," in *Proceedings of the 5th International Conference on Recent Advances in Soft Computing*, Nottingham, UK, 2004, pp. 104-109.
- [29] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, et al., "MLlib: machine learning in Apache Spark," *The Journal of Machine Learning Research*, vol. 17, pp. 1-7, 2016.
- [30] K. Noyes, 2015; https://www.infoworld.com/article/3014440/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
- [31] P. Paakkonen and D. Pakkala, "Reference architecture and classification of technologies, products and services for big data systems," *Big Data Research*, vol. 2, no. 4, pp. 166-186, 2015. doi: 10.1016/j.bdr.2015.01.001
- [32] A. Abdiansah and R. Wardoyo, "Time complexity analysis of support vector machines (SVM) in LibSVM," *International Journal of Computer Applications*, vol. 128, no. 3, pp. 28-34, 2015. doi: 10.5120/ijca2015906480
- [33] J. Giersdorf and M. Conzelmann, 2017; https://www.ni.tu-berlin.de/fileadmin/fg215/teaching/nnproject/Lasso_Project.pdf
- [34] V. Fonti and E. Belitser, VU Amsterdam Research Paper in Business Analytics, 2017; https://beta.vu.nl/nl/Images/werkstuk-fonti_tcm235-836234.pdf