Article Information
Corresponding Author: Jinge Xing, xingjinge@neau.edu.cn
Changjian Zhou, Department of Modern Educational Technology, Northeast Agricultural University, Harbin, China, zhouchangjian@neau.edu.cn
Yutong Zhang, College of electrical & information, Northeast Agricultural University, Harbin, China, zhangyutong@neau.edu.c
Yunfu Liang, Department of Modern Educational Technology, Northeast Agricultural University, Harbin, China, yfliang@neau.edu.cn
Jinge Xing, Department of Modern Educational Technology, Northeast Agricultural University, Harbin, China, xingjinge@neau.edu.cn
Received: February 16 2022
Revision received: June 21 2022
Revision received: August 18 2022
Accepted: September 9 2022
Published (Print): February 28 2025
Published (Electronic): February 28 2025
1. Introduction
As the Internet continues to provide unparalleled convenience, addressing cybersecurity issues has become increasingly urgent, and many countries now treat cybersecurity as a matter of national security. Since the symptoms of intrusion are mainly visible on webpages, the majority of cybersecurity incidents manifest through webpages, and protecting webpages from tampering is an important responsibility of cybersecurity personnel. Recently, with the wide use of machine learning approaches by both defenders and attackers, and especially the rapid adoption of deep convolutional neural networks (deep CNNs), deep architecture models have pushed automated cyber-threat detection to a new level. However, both traditional machine learning methods and deep learning methods for network security require a large amount of training data. When building a specific application case, such as for colleges or hospitals, it is hard to obtain enough data to train a classifier.
Recently, machine learning-based cybersecurity techniques have been further improved for proactive defense and detection of cyber-attacks, and these studies have achieved excellent performance in different tasks. Machine learning techniques can automatically extract valuable features from massive datasets and make decisions based on them [1]. Given massive training data, machine learning-based cybersecurity methods can obtain satisfactory results and make it possible to detect attack variants. These methods fall mainly into the following two categories.
(1) Content-based webpage tamper-resistant detection methods. These methods concentrate on the file content and information of the website, which requires reading and writing files continuously. He et al. [2] proposed a unified modeling language-based cybersecurity framework for connected and autonomous vehicles (CAVs). They designed two classifier models based on naive Bayes and decision tree, respectively. For identifying each type of communication-based attack, the decision tree model is more appropriate since it requires a shorter runtime. Jaber and Rehman [3] focused on Internet pay-per-use systems and proposed an intrusion detection system based on a fuzzy k-means clustering algorithm, which detects anomalies with a low false positive rate and high detection accuracy compared with existing mechanisms. Kumar et al. [4] analyzed the security of social networks in depth and found that a great number of users were unaware of privacy concerns. The authors investigated the evolution of online social networks, discussed various security models using machine learning and deep learning methods, and finally gave a better solution for protecting personal privacy.
(2) Low-level feature-based webpage tamper-resistant detection methods. Al-Eidi et al. [5] proposed an automated method for detecting covert temporal channels using image processing. This method can automatically detect the malicious part of a covert channel and reduce the quality-of-service degradation caused by blocking the entire traffic in the hidden channel. It achieves detection accuracy and covert traffic accuracy of 95.83% and 97.83%, respectively. Nowroozi et al. [6] surveyed adversarial image forensics using machine learning models, with the aim of enhancing the robustness of machine learning binary detectors in various adversarial scenarios. Sarker et al. [7] proposed an intrusion detection decision tree security model that ranks the security features by importance and builds the decision tree from the most important features. Experimental results showed that this model outperforms the existing models. Dehghani et al. [8] proposed a method for detecting false data injection attacks. Factors and wavelet features were adjusted and extracted to define the input indexes for deep learning, and the resulting cyber-protection method was shown to achieve high accuracy. Yavuz et al. [9] proposed a deep learning-based method for detecting Internet of Things (IoT) routing attacks. Using high-fidelity attack data generated by the Cooja IoT simulator for IoT networks of 10 to 1,000 nodes, they designed a highly scalable detection methodology whose accuracy and precision are satisfactory in the IoT cybersecurity area. Ko et al. [10] analyzed malware-facilitated DDoS attack vectors, noting the infection of five new devices per minute, and proposed a stacked self-organizing map method based on deep learning. Yuan et al. [11] proposed a byte-level malware classification method based on Markov images, which builds a byte transfer probability matrix and feeds it into deep architecture models. It achieved accuracies of 99.264% and 97.364% on the Drebin and Microsoft datasets.
Most of the above works achieve high performance in specific areas. However, content-based webpage tamper-resistant detection methods need frequent read and write operations [12,13], which wastes time and cannot guarantee timeliness. Low-level feature-based methods pay more attention to traffic characteristics, which requires extensive network security expertise; unfortunately, professionals with such expertise are in short supply [14]. Since network environments change quickly, traditional machine learning methods have difficulty adapting to the variety of attacks [15]. Deep learning methods may be effective tools for protecting information systems from attacks; however, because hacker attacks constantly evolve, preventing information systems from being invaded remains a great challenge for cybersecurity researchers [16,17].
To address the frequent read and write requirements of traditional webpage tamper-resistant detection methods, this work takes a different approach and proposes an intrusion detection algorithm, named RAE-SVM, that combines a deep residual auto-encoder with a support vector machine (SVM). RAE-SVM detects webpage anomalies from image features using deep and shallow learning, without requiring large training data. In addition, RAE-SVM requires less professional network security knowledge to achieve high detection accuracy. The main contributions of this work are as follows.
· A novel residual attention based auto-encoder and SVM combined approach for webpage tamper-resistant detection is proposed, which takes advantage of the residual network and SVM.
· The proposed model only needs a small amount of training data to get excellent performance.
· The residual attention block is proposed to adjust the weight value of residual connection adaptively.
The rest of this article is organized as follows. The related works are stated in Section 2. Section 3 discusses the proposed method. The experiment result analysis and discussion are detailed in Section 4. Section 5 provides the conclusions.
2. Related Works
In this study, two different machine learning methods are combined for the webpage tamper-resistant detection task: a deep learning method is used for feature extraction, and a shallow learning method, SVM, is used for feature classification. After comparing various deep learning methods, this work adopts a deep residual autoencoder for feature extraction and a support vector machine for classification. A brief review of the two branches is given as follows.
2.1 Deep Residual Autoencoder
The auto-encoder is a classic artificial neural network that consists of an encoder unit and a decoder unit [18]. Consider [TeX:] $$X=\left\{x_1, x_2, \ldots, x_n\right\}$$ as the input feature space and Y as the feature representation space. The autoencoder aims to find a mapping function f that minimizes the loss between X and Y. To improve the encoding capability, a state-of-the-art deep residual network is employed in the encoder unit, which takes advantage of its strong feature expression ability and makes the encoded features more representative. Deep residual autoencoders have achieved strong feature representation and dimensionality reduction in various tasks [19].
2.2 Support Vector Machine
SVM is one of the most powerful and robust approaches when training data is limited [20]; it aims to find a decision hyperplane, as indicated in Fig. 1.
After extracting the features of the captured screenshot images, we pass them into the SVM classifier to calculate the maximum value of d subject to the classification constraint in Eq. (1). Since the SVM principle is well known, the detailed derivation is not repeated in this paper.
where y is the label and x is the input feature. The parameter d is calculated in Eq. (2):
where d represents the distance from the vector points to the hyperplane. The Lagrangian function is introduced as shown in Eq. (3):
where the factor [TeX:] $$a_i \geq 0,$$ and the optimal approach is shown in Eq. (4):
Thus, the parameter d is maximized while [TeX:] $$\frac{1}{2}\|w\|^2$$ is minimized.
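For reference, the textbook hard-margin SVM formulation that matches the symbols described above can be written as follows. This is the standard form and only an assumption about how Eqs. (1) to (4) are stated; the exact notation of the original equations may differ, and the bias term b is introduced here for illustration because the text does not name it.

```latex
\begin{align}
% Classification constraint for labels y_i \in \{-1, +1\}
y_i\left(w^{T}x_i + b\right) &\geq 1, \quad i = 1,\ldots,n \\
% Distance from a point x to the hyperplane w^T x + b = 0
d &= \frac{\left|w^{T}x + b\right|}{\|w\|} \\
% Lagrangian with multipliers a_i \geq 0
L(w, b, a) &= \frac{1}{2}\|w\|^{2}
  - \sum_{i=1}^{n} a_i\left[y_i\left(w^{T}x_i + b\right) - 1\right] \\
% Stationarity conditions of L with respect to w and b
w &= \sum_{i=1}^{n} a_i y_i x_i, \qquad \sum_{i=1}^{n} a_i y_i = 0
\end{align}
```

For the support vectors the constraint holds with equality, so their distance is d = 1/‖w‖; maximizing d is therefore equivalent to minimizing ½‖w‖², which is the statement made above.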
The SVM classifier calculates the maximum value of d and identifies the webpage images that do not fall into the baseline category; we regard these webpages as tampered.
To reduce the amount of computation, kernel functions such as the sigmoid, ANOVA, Gaussian, and linear kernels are employed in various downstream tasks. The Gaussian kernel is the most widely used because it maps finite-dimensional data to a high-dimensional space.
Fig. 1. The diagram of support vector machine.
3. Proposed Method
Although there are many variants and novel ways of protecting information systems from intrusion, countless attacks still occur every day. In this paper, we take another route: drawing inspiration from the perspective of network user behavior, we propose a method that combines deep and shallow learning for webpage tamper-resistant detection.
3.1 Model Architecture
To improve the feature representation ability, a robust deep learning architecture for feature extraction is necessary. In this work, an architecture combining a residual attention auto-encoder with an SVM is proposed for feature learning and extraction, as shown in Fig. 2. The resulting approach, RAE-SVM, is used to detect anomalous webpages. First, web crawler tools grab screenshots of all webpages within the preset domain name and index them by domain name. Then the model determines whether the domain name appears for the first time and checks whether the page is abnormal. If the domain name is new and the page is not abnormal, the screenshot is passed to the classifier for feature extraction, and the extracted features are used as the baseline features. If the domain name has appeared before, the screenshot is passed to the classifier for prediction. Once the prediction indicates that the webpage is abnormal, an alarm is sent automatically.
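The baseline-then-predict workflow described above can be sketched as plain control flow. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `process_screenshot` and the callables `extract_features`, `is_abnormal`, `passes_initial_check`, and `send_alarm` are hypothetical placeholders for the crawler output, the residual autoencoder, the SVM prediction, the first-time check, and the alerting channel.

```python
from typing import Callable, Dict, List, Sequence

Features = List[float]


def process_screenshot(
    domain: str,
    screenshot: Sequence[float],
    baselines: Dict[str, Features],
    extract_features: Callable[[Sequence[float]], Features],
    is_abnormal: Callable[[Features, Features], bool],
    passes_initial_check: Callable[[Sequence[float]], bool],
    send_alarm: Callable[[str], None],
) -> str:
    """Handle one crawled screenshot and report which branch was taken."""
    if domain not in baselines:
        # First appearance of this domain: only a page that passes the
        # (hypothetical) initial check may become the baseline.
        if not passes_initial_check(screenshot):
            send_alarm(domain)
            return "alarm"
        baselines[domain] = extract_features(screenshot)
        return "baseline"

    # Domain seen before: compare the new features with the stored baseline.
    features = extract_features(screenshot)
    if is_abnormal(baselines[domain], features):
        send_alarm(domain)
        return "alarm"
    return "normal"
```

In a real deployment the comparison step would be the trained SVM rather than a pairwise check; the sketch only shows where the baseline is stored and when the alarm fires.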
Compared with other deep learning methods, the deep residual autoencoder considers the relationships between features and eliminates features that are irrelevant or redundant for visual attention. This gives it strong feature representation while preserving the spatial locality of the features. The method also largely resolves the network degradation and vanishing gradient problems, allowing it to maintain excellent performance in the feature representation stage.
Fig. 2. The architecture of RAE-SVM.
3.2 CNN and Pooling Block
The CNN and pooling block performs dimension reduction and feature fusion. A convolution layer with Conv(1×1) is employed for feature integration, and Pooling is introduced for dimensionality reduction. In this work, we propose a novel pooling method, as shown in Eq. (5):
where Pooling denotes the proposed pooling operation and Concate denotes concatenating the max pooling and average pooling outputs along the channel dimension.
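The pooling operation described above, concatenating max pooling and average pooling along the channel dimension, can be illustrated with a small pure-Python sketch. The 2×2 window with stride 2, the channel-first nested-list layout, and the names `concat_pooling` and `_pool2x2` are assumptions made for illustration; the text does not specify the window size or the data layout.

```python
from typing import Callable, List

Channel = List[List[float]]   # one H x W feature map
FeatureMap = List[Channel]    # C x H x W, channel first


def _pool2x2(channel: Channel, reduce: Callable[[List[float]], float]) -> Channel:
    """Apply a 2x2, stride-2 window and reduce each window to one value."""
    rows, cols = len(channel), len(channel[0])
    return [
        [
            reduce([channel[r][c], channel[r][c + 1],
                    channel[r + 1][c], channel[r + 1][c + 1]])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def concat_pooling(x: FeatureMap) -> FeatureMap:
    """Concatenate max-pooled and average-pooled maps on the channel axis."""
    max_maps = [_pool2x2(ch, max) for ch in x]
    avg_maps = [_pool2x2(ch, lambda w: sum(w) / len(w)) for ch in x]
    # C input channels become 2C output channels; H and W are halved.
    return max_maps + avg_maps
```

Concatenation doubles the channel count while halving the spatial size, so both the strongest response and the average response of each window are kept.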
3.3 Attention Block
The architecture of attention based deep residual autoencoder is illustrated in Fig. 3. The attention block in RAE-SVM model is defined as Eq. (6):
where * represents the convolution operation, I denotes the output of the L-th layer, [TeX:] $$W \in[0,1]$$ denotes the weight matrix, [TeX:] $$f(\cdot)$$ is the residual connection operation, and [TeX:] $$r_L$$ denotes the residual block features of the L-th layer. This architecture works as an attention-based feature monitor that suppresses redundant features and increases the weight of valuable features. The decoding approach is shown in Eq. (7):
where T is the matrix transposition unit.
Fig. 3. The architecture of attention based deep residual autoencoder.
4. Experiment Analysis and Discussion
4.1 Experimental Environment
This study employs a high-performance computing (HPC) platform running the CentOS 8 Linux operating system, with 2×NVIDIA 2080 Ti graphics processing units (GPUs) to accelerate computation. In this work, 1,291 screenshots were captured from the second-level sites of more than 50 websites within the domain name of Northeast Agricultural University. Part of the captured webpage images are shown in Fig. 4(a). The tampered web images were produced manually by adding random images, and part of them are shown in Fig. 4(b). The distribution of positive and negative samples is visualized by t-Distributed Stochastic Neighbor Embedding (t-SNE) in Fig. 5(a), where the blue dots denote the normal web images and the red ones denote the tampered web images. It is evident that the normal and tampered webpage images are hard to distinguish from each other in the raw data.
Fig. 4. (a) Part of the captured webpage images and (b) part of the tampered web images.
Fig. 5. (a) Original training data visualization and (b) visualization of the extracted features.
4.2 Training Details
In this work, 1,291 positive instances and 2,000 negative instances were collected. The raw data were split into three subsets: 60% for training, 20% for validation, and 20% for testing. All sample images are resized to 512×512 pixels, the batch size is set to 8, and RMSProp and binary cross-entropy are adopted as the optimizer and loss function, respectively. Two activation functions, Sigmoid and ReLU, are used in this work, as shown in Eqs. (8) and (9):
where t is the original prediction probability value. The Sigmoid function is adopted in the fully connected layer to map variables into [0,1], and the ReLU function is used in the residual blocks to alleviate vanishing gradients.
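The two activation functions and the binary cross-entropy loss named above have standard definitions, sketched below in plain Python. These are the usual textbook forms, assumed to correspond to Eqs. (8) and (9); the clipping constant `eps` and the function names are illustrative choices rather than details from the paper.

```python
import math
from typing import Sequence


def sigmoid(t: float) -> float:
    """Map a real value into (0, 1); written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def relu(t: float) -> float:
    """Pass positive values through unchanged and clamp negatives to zero."""
    return max(0.0, t)


def binary_cross_entropy(
    y_true: Sequence[int], y_prob: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)
```

Because ReLU has a constant gradient of 1 for positive inputs, stacking it inside residual blocks avoids the shrinking gradients that a saturating function such as Sigmoid would introduce.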
In this work, the webpage image features are extracted before being passed into the SVM classifier. Based on comparative analysis, we take the codes of the deep residual autoencoder as the sample features. The distribution of the extracted features is shown by t-SNE in Fig. 5(b). It can be seen that the features show a clustering trend, but it is less than ideal. Therefore, the SVM classifier, with its powerful classification ability, is necessary. The radial basis function (RBF) kernel in the SVM maps the features from a low-dimensional to a high-dimensional space, as detailed in Eq. (10):
where K is the kernel function, σ is the constant parameter and x is the vector.
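As an illustration of the kernel described above, the following sketch evaluates the commonly used Gaussian RBF form K(x, z) = exp(−‖x − z‖² / (2σ²)). This parameterization is an assumption consistent with the symbols K, σ, and x named in the text; the function name `rbf_kernel` and the default `sigma=1.0` are illustrative.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], sigma: float = 1.0) -> float:
    """Gaussian RBF similarity between two feature vectors of equal length."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

The kernel equals 1 for identical vectors and decays toward 0 as they move apart; a smaller σ makes the decay sharper, so the resulting decision boundary follows the training features more closely.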
4.3 Result Analysis and Discussion
4.3.1 Evaluation metrics
To verify the effect of the proposed RAE-SVM model more comprehensively, it is compared with classical models and with recently published results. The selected evaluation metrics are shown in Eqs. (11)–(14):
Precision measures the prediction results: it is the proportion of samples predicted as positive that are actually positive.
Recall measures the samples themselves: it is the proportion of actual positive samples that are predicted correctly.
F1-score is an important indicator for evaluating binary classification, which takes both precision and recall into account.
Accuracy is a common indicator computed over all samples, and it is one of the important indicators for comprehensive evaluation of the model.
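The four metrics above are conventionally computed from the confusion-matrix counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below uses these standard definitions, which are assumed to match Eqs. (11) to (14); the function name and the convention of returning 0.0 for an empty denominator are illustrative choices.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, F1-score, and accuracy from raw counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

For example, with hypothetical counts TP = 6, FP = 2, TN = 10, and FN = 2, precision, recall, and F1-score are all 0.75, and accuracy is 0.8.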
4.3.2 Ablation study and comparison
To verify the superiority of the proposed method and its basic components, this work adopts classic models such as ResNet-50, SVM, k-nearest neighbor (KNN), AlexNet, and DenseNet-121 for comparison. In addition, recently published webpage tamper-resistant detection approaches, namely PCA+SVM [21], Autoencoder & SVM [22], SnapCatch [5], IntruDTree [7], and the method of Dehghani et al. [8], are also included. The metrics of all methods are shown in Table 1.
Table 1 shows that the KNN method has the lowest accuracy compared with the deep and combined models, mainly because deep learning models generally achieve better performance than traditional machine learning models. The classic deep learning models, deep CNN and ResNet-50, achieved 68% and 71% accuracy, respectively, which is lower than expected. The main reason is that the training samples are limited and the negative samples are highly varied, which prevents the deep classifier from learning a strong feature representation. The accuracy of the PCA+SVM method is 3% lower than that of the autoencoder & SVM approach. The proposed method performs best, as expected, which also shows that the deep residual auto-encoder has a great advantage in feature extraction.
4.3.3 Discussion
The proposed method breaks with conventional thinking in cybersecurity and detects network intrusion from the perspective of high-level image semantics, which makes it an effective supplement to traditional cybersecurity methods. Compared with the other methods, the experimental results show that the proposed approach achieves better performance.
The deep autoencoder and the deep residual auto-encoder differ mainly in two respects. First, the deep residual autoencoder constructs a short residual connection between the input layer and the code layer, which limits gradient vanishing during optimization and yields good training results, improving the image feature representation ability. Second, the decoding evaluation standard is different: the purpose of an autoencoder is to restore the image, while the task of the proposed deep residual autoencoder is to find features that identify abnormal images. In addition, the proposed method has strong learning capacity on small samples; it combines the advantages of the deep residual network, the deep autoencoder, and SVM, and shows strong feature representation and classification ability. However, when a webpage contains an image carousel or color-changing effects, the proposed method might produce false negatives, which needs to be improved in future work.
5. Conclusion
As the cybersecurity situation grows increasingly severe, and despite the variety of cybersecurity measures and devices, cybersecurity staff are required to be on duty day and night, which greatly increases labor costs. Network security staff urgently need an unattended system, and this work is committed to addressing this need. We analyzed deep and shallow learning models and modified the model architecture for this purpose. A feature extraction method based on a deep residual autoencoder was proposed and combined with SVM to detect tampered webpage images. Experimental results showed that the accuracy of the proposed RAE-SVM method reaches 95%, which is satisfactory performance and provides a novel approach to cybersecurity based on machine learning and computer vision.