## Hyun-il Lim

CPU | RAM | Operating system | Language | Benchmark software set
---|---|---|---|---
Core i7-4790 | 16 GB | MS Windows 7 | Python, Scikit-learn | Java software: Jakarta ORO, ANTLR

Table 2 shows the specifications of the benchmark software used in this experiment. We used independent benchmark software sets for training and evaluation. The Java application ANTLR [27] served as the training dataset for building linear regression models from the interlaced features of the original training data. The test dataset for evaluating the proposed approach was generated from Jakarta ORO [28]. To reflect similar versions of software in the classification task, we used the Smokescreen obfuscator to generate similar but modified versions of the original benchmark software. Smokescreen renames identifiers and modifies the control flow structures and instruction patterns of a program, transforming the original into a similar but obfuscated version that is difficult to understand. In other words, the Smokescreen obfuscator can transform an original program into similar versions with modified structures. The training and test sets contained 13,689 and 2,500 data items, respectively. From the benchmark software, code vectors were generated and compared pairwise to produce the input data for the proposed multiple linear regression model.

Table 2. Specifications of the benchmark software

 | Training data | Test data
---|---|---
Software data | ANTLR 3.5.2 | Jakarta ORO 2.0.8
Number of class files | 117 | 50
Maximum number of bytecodes | 1,646 | 923
Average number of bytecodes | 172 | 144
Number of datasets | 13,689 | 2,500
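The pairwise comparison of code vectors described above can be sketched as follows. This is a minimal illustration: the vector sizes, contents, and the absolute-difference comparison are assumptions for the sketch, not the paper's exact feature definition.

```python
import numpy as np

# Hypothetical code vectors: each row counts bytecode features for one
# class file (shapes and values are illustrative assumptions).
rng = np.random.default_rng(0)
original = rng.integers(0, 10, size=(5, 8))              # 5 class files, 8 features
obfuscated = original + rng.integers(0, 2, size=(5, 8))  # similar, slightly modified versions

X, y = [], []
for i in range(len(original)):
    for j in range(len(obfuscated)):
        # Compare each pair of code vectors element-wise; a pair of a
        # class file and its own obfuscated version is labeled similar (1).
        X.append(np.abs(original[i] - obfuscated[j]))
        y.append(1 if i == j else 0)

X, y = np.array(X), np.array(y)
print(X.shape, y.shape)  # 25 pairs from the 5 x 5 comparisons
```

Each row of `X` is then one input item for the regression model, with `y` as the similarity label.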

Table 3 shows the experimental results for the combined interlaced data model compared with the conventional linear regression model presented in [14]. Three strands of interlaced data were used, and the total number of test datasets was 2,500. The proposed model combined three models trained on the interlaced feature data. A classification result was counted as correct if the model distinguished similar and dissimilar software correctly. The conventional linear regression model classified 2,252 of the 2,500 test data items correctly, for an overall classification accuracy of 90.08%. The proposed model classified 2,356 of 2,500 correctly, for a classification accuracy of 94.24%, noticeably higher than that of the conventional model. Based on these results, we conclude that the proposed approach is more effective than conventional linear regression for classifying similar software.

Table 3. Classification results of the conventional linear regression model and the proposed model

 | Conventional linear regression model [14] | Proposed model using interlaced data
---|---|---
Number of interlaced datasets | 1 | 3
Total number of classifications | 2,500 | 2,500
Number of correct classifications | 2,252 | 2,356
Overall accuracy (%) | 90.08 | 94.24
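The evaluation of a combined three-strand model can be sketched as below on synthetic stand-in data. The feature construction, the 0.5 decision threshold, and the averaging rule for combining the strand outputs are assumptions of this sketch; the paper's real inputs come from Java bytecode comparisons.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the code-vector comparison data (an assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 9))
w = rng.normal(size=9)
y = (X @ w > 0).astype(float)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# One linear regression model per interlaced strand: strand k uses
# every 3rd feature starting at column k. The strand outputs are
# averaged and thresholded to obtain the combined classification.
models = [LinearRegression().fit(X_tr[:, k::3], y_tr) for k in range(3)]
combined = np.mean([m.predict(X_te[:, k::3]) for k, m in enumerate(models)],
                   axis=0)
pred = (combined > 0.5).astype(float)
correct = int((pred == y_te).sum())
print(f"{correct} correct out of {len(y_te)} "
      f"({correct / len(y_te):.2%} accuracy)")
```

Counting correct classifications over the test items in this way mirrors how the accuracies in Table 3 are reported (e.g., 2,356 of 2,500 for the proposed model).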

From the experimental results, the classification accuracy of the proposed approach was about four percentage points higher than that of the conventional method. The numbers of false positives and false negatives also decreased. This enhanced performance was to be expected: the conventional method produces a single classification model trained on the original training data, so noise values within the training data may lead the model to misclassify similar software. The advantage of the method proposed in this paper is that it distributes the effects of feature values, including noise values, on the classification results by separating the original data into several interlaced datasets. Because the interlaced datasets are generated by alternately extracting features from the original training data, each feature's effect on the classification results is confined to the particular linear regression model to which it belongs. Combining multiple linear regression models trained on interlaced data can therefore produce more reliable results in classifying similar software.
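The alternate extraction of features and its noise-localizing effect can be illustrated in a few lines. The feature count of six is an illustrative assumption; the point is that each feature column lands in exactly one strand's model.

```python
# Minimal illustration of interlacing: strand k takes every 3rd feature
# starting at column k, so each model sees a disjoint feature subset.
n_features = 6
strands = [list(range(k, n_features, 3)) for k in range(3)]
print(strands)    # [[0, 3], [1, 4], [2, 5]]

noisy_column = 4  # hypothetical noisy feature
affected = [k for k, cols in enumerate(strands) if noisy_column in cols]
print(f"noisy column {noisy_column} affects only strand(s) {affected}")
```

Because the strands are disjoint, a noisy feature can only distort the one model it belongs to, while the other strand models remain unaffected.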

In the experiments, the proposed linear regression model with interlaced data improved the accuracy of classifying similar software. This confirms that separating the effects of the feature values of the training data helps improve classification accuracy in the linear regression approach. More generally, the proposed method of interlacing data can improve the accuracy of linear regression whenever the training data contain noisy feature values. Applying multiple models with interlaced data is expected to enable the design of improved linear regression models that reduce estimation errors in classification problems.

In this section, existing methods for software similarity analysis are discussed. Table 4 summarizes the related work. According to the approach used to analyze software similarity, the methods are classified into five groups. Source code analysis methods are optimized for source code comparison through neural networks [21,22], k-NN [23], or token strings [29]; they can be applied only in environments where the source code of the software is available. A software birthmark refers to inherent characteristics of software that can be used to distinguish different programs. This approach analyzes various characteristics of software, such as runtime API call sequences [1], the whole program path [4], static information of Java software [5], k-grams of opcodes [30], or dynamic opcode n-grams [31]. The extracted birthmarks express the inherent characteristics of the software and are compared to distinguish different programs. This approach requires designing specific code analysis algorithms to extract the birthmark data of interest. Similarly, the value-based approach focuses on specific values in static or dynamic environments, such as features of API calls [6], the semantics of basic blocks [32], critical runtime values [33], or program logic [34]. Machine learning-based methods apply techniques such as linear regression [14], support vector machines [17], and neural networks [20,24] to analyze software similarity.

This approach classifies similar software by learning from previously known training data, so the quality of the training data and the design of the machine learning model are important. The proposed approach is a variation of machine learning with linear regression: it improves on existing methods by applying multiple models built from interlaced input data. Interlacing the data localizes the effects of noise values that may be contained in the training data, and integrating multiple models improves the reliability of the analytical results. Software similarity analysis is useful in various fields of computer science, such as detecting illegal code reuse, identifying similar algorithms or software, and detecting malware.

Table 4. Related work on software similarity analysis

 | Source code analysis | Software birthmark | Value-based | Machine learning | Proposed approach
---|---|---|---|---|---
Analysis approach | Static | Static [5,29] or dynamic [1,4,30] | Static [31] or dynamic [6,32,33] | Static | Static
Target software | Source code | Binary code | Binary code | Binary code | Binary code
Comparison approach | Neural net [21,22], k-NN [23], token [29] | API call [1], whole program path [4], static info. [5], k-gram [30], dynamic opcode [31] | API call [6], basic block [32], critical runtime value [33], program logic [34] | Linear regression [14], SVM [17], neural net [20,24] | Linear regression of interlaced data
Requirement | Applicable to source code only | Design of code analysis algorithm | Design of code analysis algorithm specific to the structure of binary code | Design of machine learning model and training data | Design of multiple linear regression models on interlaced data
Applicability | Detection of source code plagiarism | Comparison of binary code through analyzed code info. | Comparison of binary code through static or runtime data or values | Classification of similar software through machine learning | Classification of similar software through integration of linear regression
Advantage | Optimized for source code comparison | Efficient comparison of binary code | Optimized for specific values or data | Comparison through machine learning | Improved comparison accuracy with multiple models

Integrating three independent models can improve reliability and accuracy because the results can be calibrated across models even when one model's prediction is wrong. On the other hand, several issues must be considered when applying the proposed method. Although the accuracy and reliability of the classification results improve, the design of the proposed model is more complicated than that of the conventional model. Designing multiple linear regression models requires several processing steps: interlacing the data, training and generating the independent models, integrating the individual models, and finally predicting the results. Because this more complex procedure adds overhead to training and prediction, the proposed model may be disadvantageous in performance-critical environments such as real-time processing.

Overfitting is one of the main concerns when applying machine learning to real-world prediction problems, and properly controlling it is important when designing machine learning models. Because the proposed model integrates multiple independent models for classifying similar software, overfitting may appear in different forms depending on the features of each model's interlaced data. Therefore, detecting and controlling overfitting requires more attention than in existing single-model approaches. In future work, we plan to study how to detect overfitting in the individual models and how to reduce its occurrence in multiple linear regression models.
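One simple way to watch for per-strand overfitting of the kind discussed above is to compare each strand model's accuracy on training versus held-out data. This check, the synthetic data, and the 0.5 threshold are assumptions of this sketch, not a method from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
y = (X[:, 0] - X[:, 4] > 0).astype(float)
X_tr, X_va, y_tr, y_va = X[:80], X[80:], y[:80], y[80:]

def acc(model, Xs, ys):
    # Threshold the regression output at 0.5 to obtain a class label.
    return float(((model.predict(Xs) > 0.5) == ys).mean())

gaps = []
for k in range(3):
    m = LinearRegression().fit(X_tr[:, k::3], y_tr)
    # A large train/validation gap in one strand flags that strand's
    # model as a likely overfitting candidate.
    gap = acc(m, X_tr[:, k::3], y_tr) - acc(m, X_va[:, k::3], y_va)
    gaps.append(gap)
    print(f"strand {k}: train-validation accuracy gap = {gap:+.2f}")
```

Checking the strands individually matters here because, as noted above, overfitting can appear in one strand model without showing up in the combined result.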

The importance of software in information systems has been continuously increasing in recent years. In view of this increasingly important role, ongoing effort is needed to better understand the characteristics of software. Such efforts will be helpful for improving productivity and safety in software development. In recent software analysis studies, machine learning approaches have become widely used. To accurately train models and improve accuracy in applying machine learning, it is important to reflect the characteristics of data in the training process.

Linear regression is widely used in estimation problems that can be solved by modeling a linear relationship between input and output data. The conventional linear regression model can be used to classify similar software by training with data representing the software features. In this paper, we proposed an approach to this kind of machine learning that involves applying multiple linear regression models generated by interlacing data, which is expected to improve classification accuracy for similar software. We presented a method of interlacing data to generate multiple linear regression models, including the design of the combined linear regression model derived from the interlaced data. We then conducted experiments to evaluate the proposed approach as compared to conventional linear regression models for classifying similar software, and the experimental results show that the proposed method can indeed classify similar software more accurately than conventional linear regression. The proposed approach is expected to be an effective method in linear regression contexts for improving the accuracy of results. The application of multiple models with interlaced data is anticipated to reduce estimation errors in classification problems appropriate for linear regression.

He received his B.S., M.S., and Ph.D. degrees in computer science from Korea Advanced Institute of Science and Technology (KAIST), Korea, in 1995, 1997, and 2009, respectively. He is currently a professor in the School of Computer Science and Engineering, Kyungnam University. His current research interests include software security, software analysis, machine learning, and program analysis.

- [1] H. Tamada, K. Okamoto, M. Nakamura, A. Monden, and K. Matsumoto, "Dynamic software birthmarks to detect the theft of Windows applications," in Proceedings of the International Symposium on Future Software Technology (ISFST), Xian, China, 2004.
- [2] S. Cesare, "Software similarity and classification," Ph.D. dissertation, Deakin University, Geelong, Australia, 2013.
- [3] H. Park, H. I. Lim, S. Choi, and T. Han, "Detecting common modules in Java packages based on static object trace birthmark," The Computer Journal, vol. 54, no. 1, pp. 108-124, 2011.
- [4] G. Myles and C. Collberg, in Information Security. Heidelberg, Germany: Springer, 2004, pp. 404-415.
- [5] H. Tamada, M. Nakamura, A. Monden, and K. I. Matsumoto, "Java birthmarks: detecting the software theft," IEICE Transactions on Information and Systems, vol. 88, no. 9, pp. 2148-2158, 2005.
- [6] M. Alazab, R. Layton, S. Venkataraman, and P. Watters, "Malware detection based on structural and behavioural features of API calls," in Proceedings of the 1st International Cyber Resilience Conference, Perth, Australia, 2010, pp. 1-10.
- [7] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012.
- [8] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge, UK: Cambridge University Press, 2014.
- [9] P. Domingos, "A few useful things to know about machine learning," Communications of the ACM, vol. 55, no. 10, pp. 78-87, 2012.
- [10] D. T. Ramotsoela, G. P. Hancke, and A. M. Abu-Mahfouz, "Attack detection in water distribution systems using machine learning," Human-centric Computing and Information Sciences, vol. 9, article no. 13, 2019.
- [11] D. H. Kwon, J. B. Kim, J. S. Heo, C. M. Kim, and Y. H. Han, "Time series classification of cryptocurrency price trend based on a recurrent LSTM neural network," Journal of Information Processing Systems, vol. 15, no. 3, pp. 694-706, 2019.
- [12] M. J. J. Ghrabat, G. Ma, I. Y. Maolood, S. S. Alresheedi, and Z. A. Abduljabbar, "An effective image retrieval based on optimized genetic algorithm utilized a novel SVM-based convolutional neural network classifier," Human-centric Computing and Information Sciences, vol. 9, article no. 31, 2019.
- [13] C. Cicceri, F. De Vita, D. Bruneo, G. Merlino, and A. Puliafito, "A deep learning approach for pressure ulcer prevention using wearable computing," Human-centric Computing and Information Sciences, vol. 10, article no. 5, 2020.
- [14] H. I. Lim, "A linear regression approach to modeling software characteristics for classifying similar software," in Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, 2019, pp. 942-943.
- [15] W. Liu, P. Wang, Y. Meng, C. Zhao, and Z. Zhang, "Cloud spot instance price prediction using kNN regression," Human-centric Computing and Information Sciences, vol. 10, article no. 34, 2020.
- [16] W. Li, X. Li, M. Yao, J. Jiang, and Q. Jin, "Personalized fitting recommendation based on support vector regression," Human-centric Computing and Information Sciences, vol. 5, article no. 21, 2015.
- [17] H. I. Lim, "Design of similar software classification model through support vector machine," Journal of Digital Contents Society, vol. 21, no. 3, pp. 569-577, 2020.
- [18] M. J. Ding, S. Z. Zhang, H. D. Zhong, Y. H. Wu, and L. B. Zhang, "A prediction model of the sum of container based on combined BP neural network and SVM," Journal of Information Processing Systems, vol. 15, no. 2, pp. 305-319, 2019.
- [19] M. Zouina and B. Outtaj, "A novel lightweight URL phishing detection system using SVM and similarity index," Human-centric Computing and Information Sciences, vol. 7, article no. 17, 2017.
- [20] N. Shalev and N. Partush, "Binary similarity detection using machine learning," in Proceedings of the 13th Workshop on Programming Languages and Analysis for Security, Toronto, Canada, 2018, pp. 42-47.
- [21] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, "Deep learning code fragments for code clone detection," in Proceedings of the 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), Singapore, 2016, pp. 87-98.
- [22] D. Heres, Master's thesis, Utrecht University, Utrecht, Netherlands, 2017.
- [23] U. Bandara and G. Wijayarathna, "A machine learning based tool for source code plagiarism detection," International Journal of Machine Learning and Computing, vol. 1, no. 4, pp. 337-343, 2011.
- [24] N. Marastoni, R. Giacobazzi, and M. Dalla Preda, "A deep learning approach to program similarity," in Proceedings of the 1st International Workshop on Machine Learning and Software Engineering Symbiosis, Montpellier, France, 2018, pp. 26-35.
- [25] Python programming language [Online]. Available: https://www.python.org/
- [26] Scikit-learn: machine learning in Python [Online]. Available: http://scikit-learn.org/stable/index.html
- [27] ANTLR (ANother Tool for Language Recognition) [Online]. Available: https://www.antlr.org/
- [28] The Apache Jakarta Project [Online]. Available: https://jakarta.apache.org/oro/
- [29] L. Prechelt, G. Malpohl, and M. Philippsen, "Finding plagiarisms among a set of programs with JPlag," Journal of Universal Computer Science, vol. 8, no. 11, pp. 1016-1038, 2002.
- [30] G. Myles and C. Collberg, "K-gram based software birthmarks," in Proceedings of the 2005 ACM Symposium on Applied Computing, Santa Fe, NM, 2005, pp. 314-318.
- [31] B. Lu, F. Liu, X. Ge, B. Liu, and X. Luo, "A software birthmark based on dynamic opcode n-gram," in Proceedings of the International Conference on Semantic Computing (ICSC), Irvine, CA, 2007, pp. 37-44.
- [32] L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, "Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection," IEEE Transactions on Software Engineering, vol. 43, no. 12, pp. 1157-1177, 2017.
- [33] Y. C. Jhi, X. Jia, X. Wang, S. Zhu, P. Liu, and D. Wu, "Program characterization using runtime values and its application to software plagiarism detection," IEEE Transactions on Software Engineering, vol. 41, no. 9, pp. 925-943, 2015.
- [34] F. Zhang, D. Wu, P. Liu, and S. Zhu, "Program logic based software plagiarism detection," in Proceedings of the 2014 IEEE 25th International Symposium on Software Reliability Engineering, Naples, Italy, 2014, pp. 66-77.