1. Introduction
At present, network communication technology and the Internet of Things (IoT) are developing rapidly, and a variety of intelligent devices have emerged one after another, followed by a sharp increase in the amount of data [1]. Distributed cloud computing is used to process large amounts of data. This method mainly transmits all data collected by various intelligent devices to the cloud for deep processing. Its characteristic is that tens of thousands of data points can be processed in a short time. However, with the rapid growth of data volume, it is difficult to meet the requirements of Internet communication by relying solely on data processing methods based on cloud computing. Therefore, moving data processing from the cloud to the edge has gradually become popular, and the processing mode of edge computing has come into being.
The edge computing mode can provide computing services at the data source side without transmitting massive data to the cloud, which can not only improve the user experience but also reduce the pressure on cloud services [2]. Edge terminal devices can complete computing tasks that would otherwise need to be completed on an edge or cloud server. Distributing model training tasks to the edge terminal devices can bring many advantages. On the one hand, deep learning (DL) technology requires massive data as its basis, and the quantity of data directly determines the final performance of the DL model. Deploying model training directly at the data source avoids the transmission of original data and reduces both the transmission cost of equipment and the cloud storage cost. On the other hand, the massive data in edge terminal devices involve user privacy (such as personal photos, track information, and app usage records); transmitting them to the cloud for centralized training would cause serious data security problems.
Owing to their special network structure, edge network nodes are vulnerable to illegal intrusion threats; thus, certain security measures must be taken to deal with illegal intrusion [3]. Abnormal network traffic detection is an effective security defense strategy. By deploying an appropriate detection system in the edge network node, it can actively detect abnormal intrusion data in the network connection data passing through the edge network node and send an alarm in time to ensure the safe operation of the edge computing network system. Therefore, research on abnormal detection models and algorithms suitable for edge network nodes to detect illegal activity has great significance.
Owing to the limited computing power of edge devices, large-scale data training tasks cannot be performed. A DL model trained offline can be embedded in an edge device to detect abnormal traffic [4]. When abnormal results are detected, the edge device can record logs or send detection results to the cloud. Without considering the data transmission delay of the edge device, this study focused on improving detection performance. Thus, a combined framework was designed based on the transformer architecture, which consists of four layers: data preprocessing, data encoding, deep feature extraction, and abnormal traffic classification.
Section 2 describes the relevant work in network traffic intrusion detection, summarizes the problems of these methods, and elaborates on the study motivation. Section 3 introduces the overall system architecture and designed composite network model. In Section 4, the reliability of the designed model is verified through well-designed and rich experiments, and the advantages of the proposed model are verified by comparison with several newer models. In Section 5, the experimental implications are summarized, along with the shortcomings of the study and prospects for the next step of the work.
2. Related Works
With the successful application of machine learning (ML) and DL in other fields, network security has begun to use ML and DL models to achieve intelligent detection and improve network intrusion detection performance.
2.1 ML-based Methods
Traditional network security defense methods primarily include firewalls, antivirus software, and network security hardware products that detect intrusion traffic or virus programs through pattern matching. However, these methods have poor defense against new attack behaviors. To this end, the industry has introduced ML algorithms to improve the detection of various types of attacks by learning intrusion traffic features.
For example, Hassan et al. [5] proposed a random forest-based method that has certain advantages in the multi-classification problem of unbalanced data. Salo et al. [6] proposed a network intrusion detection model that integrates information gain and ML. These methods have certain advantages; however, they do not significantly improve the detection rates of various attacks.
Zhang et al. [7] comprehensively considered the three dimensions of time, space, and content in data and proposed a multi-dimensional feature fusion and overlay integration mechanism. The classification accuracy was effectively improved by integrating multiple decision trees.
ML-based methods can analyze the surface features of data and achieve autonomous detection through feature learning. However, ML methods have the following three limitations [8]: overreliance on robust features, limited high-dimensional feature processing capability, and weak ability to extract dynamic features.
2.2 DL-based Methods
Compared with ML-based methods, DL-based methods can mine deeper data features, improve detection efficiency, reduce false positives, and help identify potential security threats in computer network systems [9].
To protect IoT systems from ransomware attacks, Al-Hawawreh and Sitnikova [10] proposed a detection model based on stacked variational autoencoders. Liu et al. [11] proposed a bidirectional generative adversarial network (BiGAN)-based method for industrial control systems. However, these methods typically yield unsatisfactory results when addressing high-dimensional data.
To better process high-dimensional data, Li et al. [12] utilized DL autoencoders combined with coefficient penalties and reconstruction losses in the encoding layer to extract high-dimensional data features and then used extreme learning machines to quickly and effectively classify the extracted features. The authors of [13] used a preprocessing method combining dimensionality reduction technology and feature engineering to generate meaningful features and proposed two DL-based detection methods, achieving good detection results. In [14], principal component analysis (PCA) was used to simplify data characteristics based on data dimension and time series characteristics, and a stacked gated recurrent unit (GRU) detection model based on transfer learning was used to conduct intrusion detection on the simplified characteristics. Although this method can achieve good results in simple network traffic attack detection, it cannot solve complex malicious botnet problems. For the complex malicious botnet problem, Liaqat et al. [15] designed a composite framework integrating a convolutional neural network, a deep neural network, and long short-term memory (CNN-DNN-LSTM), which can detect complex malicious botnets in a timely and effective manner in a medical IoT environment. However, this method does not consider the limitations of general convolution.
Saharkhizan et al. [16] proposed using LSTM to learn the dependency relationships between temporal data, using an ensemble of LSTMs as detectors and combining the outputs of the detectors through a decision tree for effective classification. However, it is not easy to resolve the trade-off among the "curse of dimensionality," accuracy, and other important performance indicators. To address this issue, Mushtaq et al. [17] proposed a hybrid framework (BiLSTM-AE) that combined a bidirectional LSTM (BiLSTM) and a deep autoencoder (AE). The best features were obtained using AE and then classified into normal and abnormal samples using BiLSTM. Although this method considers spatiotemporal and deep features, it ignores the correlations between network traffic. To address this issue, Li et al. [18] designed a new detection model with a three-layer packet flow. This model can simultaneously consider the spatiotemporal feature correlation both in and between network flows, thereby improving the network traffic classification performance.
Although these methods have achieved high detection rates, most of them do not attach great importance to the problem of dataset imbalance [19]. Fernando and Tsokos [20] proposed an intrusion detection model using a dynamic weighted loss function, and Liu et al. [21] proposed an intrusion detection model using difficult set sampling. These two models have certain effects on alleviating the class imbalance problem, but they cannot be used to solve the problem of minority class data generation. Zhang and Liu [22] combined DL and statistical concepts to solve the problem of few samples and proposed a fusion model for the IoT based on an improved conditional variational autoencoder (ICVAE) and the borderline synthetic minority oversampling technique (BSM), called ICVAE-BSM, which has achieved significant results in solving class imbalance problems. However, the fusion model failed to effectively extract the interdependence and long-dependency features between network traffic, which affected the overall performance of the model to some extent.
2.3 Study Motivation
Based on the above methods, it can be concluded that most existing network intrusion detection methods have the following problems:
1) Trade-off between "curse of dimensionality," accuracy, and other important performance indicators;
2) Correlation problem between network traffic;
3) Long dependency feature extraction problem;
4) Deep feature extraction problem;
5) Class imbalance problem.
Based on these issues, a new detection model using a transformer is proposed. Firstly, Tomek Links [23], the SMOTE algorithm [24], and the Wasserstein Generative Adversarial Network (WGAN) [25] are used to preprocess the data to solve the class imbalance problem. Secondly, a transformer [26] is used to encode the data to extract the correlation between network traffic. Finally, a network model that integrates a bidirectional gated recurrent unit (BiGRU) [27] and DNN [28] is proposed, which can avoid the "curse of dimensionality" and simultaneously mine local and global features of the data. Long-dependency temporal features are extracted using BiGRU, while deep-level features are extracted by DNN.
3. Abnormal Detection Model using TEBGD Network
3.1 Framework of TEBGD
The proposed framework is shown in Fig. 1.
The detection method is mainly divided into two stages. The first stage checks user consistency, mainly by checking the identities of known users. The historical behavior data of all legitimate users are used to train a multi-class classifier, which is then used for the supervised classification and identification of users, including normal and abnormal legitimate users. The second stage is anomaly detection, which is primarily used to detect whether a single legitimate user is behaving abnormally. The edge server is responsible for monitoring the abnormal behavior of the users.
Fig. 1. Framework of abnormal detection model using TEBGD.
3.2 Transformer-Encoder-BiGRU-DNN (TEBGD) Model Architecture
Traditional intrusion detection algorithms can only detect attacks at the current time and cannot handle attacks with a long duration, which results in a forgetting phenomenon in the iterative learning process. The Transformer-Encoder and BiGRU address this problem and achieve satisfactory results in time-series processing. However, a more in-depth analysis shows that each has clear disadvantages. The core of the Transformer-Encoder is the self-attention mechanism, which is simple in structure, can be computed in parallel at high speed, and is easy to stack. However, self-attention has no built-in notion of sequence order or recurrence, so its extraction of temporal information is not sufficiently comprehensive, which is quite different from LSTM and GRU. BiGRU makes full use of memory and is capable of processing long time-series data. However, its structure is complex and its calculation time is long.
The combination of the two allows their strengths and weaknesses to complement each other and makes feature extraction more comprehensive, so that higher accuracy can be achieved at a lower time cost. Therefore, this paper combines the Transformer-Encoder and BiGRU to process time-series information.
The detection process mainly includes three stages: network traffic preprocessing, feature extraction using the fusion framework of the transformer and BiGRU-DNN, and network traffic classification. The overall architecture is shown in Fig. 2.
Fig. 2. Architecture of TEBGD deep learning model.
3.3 Data Preprocessing
3.3.1 Imbalanced data processing
First, the raw data [TeX:] $$(X, Y)$$ are subjected to undersampling using the Tomek Links algorithm to eliminate noisy and boundary overlapping samples. The minority class data are then preliminarily upsampled using the SMOTE algorithm to generate data [TeX:] $$\left(X_{o s}, Y\right)$$, where [TeX:] $$\left(X_{s, l}, Y\right)$$ represents the real training data labeled l generated after the initial upsampling of SMOTE. Next, the random noise data [TeX:] $$X_r$$ pass through the generator to generate forged data [TeX:] $$X_{\text {fake }}$$. [TeX:] $$X_{s, l}$$ and [TeX:] $$X_{\text {fake }}$$ are used to iteratively train the generator for each class. Finally, the trained model is used to generate minority class data [TeX:] $$\left(X_g, Y\right)$$. The detailed process of data balancing is shown in Fig. 3.
Fig. 3. Data balancing processing.
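The undersampling and preliminary oversampling steps can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in the experiments: the function names `tomek_links` and `smote_sample`, the Euclidean distance, the toy data, and the choice to drop only the majority-class member of each link are all hypothetical. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor; SMOTE creates a synthetic sample on the segment between a minority sample and one of its minority-class neighbors.

```python
import math
import random


def nearest(idx, points):
    """Index of the point closest to points[idx], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != idx),
        key=lambda j: math.dist(points[idx], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


def smote_sample(minority, k=2, rng=random):
    """One synthetic point between a random minority sample and one of its k nearest minority neighbors."""
    base = rng.choice(minority)
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    other = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (o - b) for b, o in zip(base, other)]


# Toy data (hypothetical): label 0 is the majority class, label 1 the minority class.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0], [3.0, 3.0], [3.2, 3.1]]
y = [0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
# Undersampling (assumed variant): drop the majority-class member of each Tomek link.
drop = {i if y[i] == 0 else j for i, j in links}
X_clean = [p for n, p in enumerate(X) if n not in drop]
y_clean = [c for n, c in enumerate(y) if n not in drop]

# Preliminary oversampling: interpolate between remaining minority samples.
minority = [p for p, c in zip(X_clean, y_clean) if c == 1]
synthetic = smote_sample(minority, k=2, rng=random.Random(0))
```

In this toy set, the only Tomek link is the boundary pair formed by the majority point at (1.0, 1.0) and the minority point at (1.05, 1.0), so one majority sample is removed before interpolation.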
SMOTE balances the data by creating fresh minority class samples between existing minority class samples. However, this algorithm is prone to blurring the boundary between majority and minority classes, making classification more difficult, and it is unable to solve the problem of data distribution marginalization in imbalanced datasets. Conversely, WGAN cannot fully learn the distribution of minority data if the initial amount of data is too small. To overcome these problems, WGAN is used to learn the distribution of minority class data based on the data provided by Tomek Links and SMOTE, thereby enhancing the quality of the generated data. The loss function is calculated as follows:
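Assuming the standard WGAN objective (the sample index i and the sign convention, in which both losses are minimized, are assumptions made here), the discriminator and generator losses take the form

[TeX:] $$\operatorname{Loss}(c)=\frac{1}{m} \sum_{i=1}^m f_w\left(g_\theta\left(z^{(i)}\right)\right)-\frac{1}{m} \sum_{i=1}^m f_w\left(x^{(i)}\right)$$

[TeX:] $$\operatorname{Loss}(g)=-\frac{1}{m} \sum_{i=1}^m f_w\left(g_\theta\left(z^{(i)}\right)\right)$$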
where Loss(c) and Loss(g) represent the loss functions of the discriminator and generator in WGAN, respectively; [TeX:] $$g_\theta$$ represents a generator in WGAN; [TeX:] $$f_w$$ represents the discriminator in WGAN; x represents real data; z represents random noise data; and m represents the size of a batch. When training the WGAN network, Adam can cause instability during model training. Therefore, RMSProp is used as the optimizer for WGAN network training.
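The two losses and a single RMSProp update can be illustrated with a short sketch. It assumes the standard WGAN sign convention and a hypothetical scalar critic f_w(x) = w · x on one-dimensional toy data; the function names, learning rate, and decay factor are illustrative choices, not values reported in this paper.

```python
import math


def wgan_losses(real_scores, fake_scores):
    """Loss(c) and Loss(g) from critic scores on one batch of size m."""
    m = len(real_scores)
    loss_c = sum(fake_scores) / m - sum(real_scores) / m
    loss_g = -sum(fake_scores) / m
    return loss_c, loss_g


def rmsprop_step(w, grad, cache, lr=5e-5, rho=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter; returns (new_w, new_cache)."""
    cache = rho * cache + (1.0 - rho) * grad * grad
    return w - lr * grad / (math.sqrt(cache) + eps), cache


# Hypothetical scalar critic f_w(x) = w * x.
w, cache = 0.5, 0.0
real = [1.0, 1.2, 0.8, 1.0]  # x: real minority-class samples
fake = [0.2, 0.1, 0.3, 0.2]  # g_theta(z): generated samples

loss_c, loss_g = wgan_losses([w * x for x in real], [w * x for x in fake])

# For this linear critic, dLoss(c)/dw = mean(fake) - mean(real).
grad_w = sum(fake) / len(fake) - sum(real) / len(real)
w_new, cache = rmsprop_step(w, grad_w, cache)
```

Because the real samples score higher than the generated ones, the gradient is negative and the update increases w, which widens the gap between real and fake scores and lowers Loss(c).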
3.3.2 MLP encoding
Some features of the data used in the experiments are discrete; therefore, one-hot encoding is applied to them, the encoded features are inserted alongside the initial features and trained as part of the whole, and standard normalization is then performed on all data.
Similar to the word embedding layer in natural language processing, a multi-layer perceptron (MLP) [29] is used to encode each feature. The MLP amplifies the features and maps them to different subspaces, which extracts richer features and produces the input dimensions required by the model. The MLP parameters are adjusted dynamically during training.
3.4 Transformer-Encoder Model
The original model structure of the transformer includes two parts: encoding and decoding. Owing to the specific requirements of the detection task and the fixed length of each record in the dataset, this model uses only the encoding part of the transformer and fine-tunes some of its parameters. The attention mechanism uses dot-product attention, which takes three inputs: query, key, and value. Dot-product attention can be computed in parallel, which reduces training time:
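In the standard scaled dot-product form (the function name Attention is notation assumed here), the attention output is

[TeX:] $$\operatorname{Attention}(A, B, C)=\operatorname{softmax}\left(\frac{A B^T}{\sqrt{w}}\right) C$$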
where A, B, and C represent the query, key, and value matrices, respectively, and w represents the dimension of the key. To enrich the extracted features, a multi-head attention structure is used. The multi-head attention is defined as
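Following the standard multi-head formulation (the superscripts on Q, which distinguish the individual projection matrices, and the name Attention for the single-head dot-product attention are notation assumed here), the definition can be written as

[TeX:] $$H_i=\operatorname{Attention}\left(A Q_i^A, B Q_i^B, C Q_i^C\right)$$

[TeX:] $$\operatorname{MH}(A, B, C)=\operatorname{Concat}\left(H_1, H_2, \ldots, H_n\right) Q^O$$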
where [TeX:] $$i=1,2,3, \ldots, n,$$ Q represents the weight, H represents head, and MH represents multi-head. The feed-forward neural network part is a perceptron with only one hidden layer, and its input and output dimensions are the same. The total number of neurons in the hidden layer is set to be twice that of the input layer because of the limited nonlinear mapping ability of the single hidden layer network and the trade-off between computational cost and mapping ability.
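Assuming the standard position-wise form, the feed-forward part can be written as shown here; this is presumably the expression that the text refers to as Formula (5). The symbols [TeX:] $$W_1, b_1$$ (hidden-layer weights and bias) and [TeX:] $$W_2, b_2$$ (output-layer weights and bias) are hypothetical names introduced for illustration.

[TeX:] $$\operatorname{FFN}(x)=\operatorname{GELU}\left(x W_1+b_1\right) W_2+b_2$$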
As is apparent from Formula (5), the activation function is a Gaussian error linear unit (GELU):
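In its exact form, which is assumed here rather than the tanh approximation that some implementations use, GELU weights the input by the standard normal cumulative distribution function [TeX:] $$\Phi$$:

[TeX:] $$\operatorname{GELU}(x)=x \Phi(x)=\frac{x}{2}\left[1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$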
Fig. 4 depicts the overall structural layout of the Transformer-Encoder module. The module employs residual connections to mitigate vanishing gradients.
Fig. 4. Structure of Transformer-Encoder module.
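The attention and activation computations described in this section can be made concrete with a small sketch. It is a pure-Python illustration on list-of-lists matrices, assuming the standard scaled dot-product form and the exact erf-based GELU; the function names and toy matrices are hypothetical, and a practical model would rely on a DL framework instead.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(A, B, C):
    """softmax(A B^T / sqrt(w)) C, where w is the key dimension."""
    w = len(B[0])
    out = []
    for query in A:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(w) for key in B
        ]
        weights = softmax(scores)
        out.append(
            [sum(wt * value[d] for wt, value in zip(weights, C)) for d in range(len(C[0]))]
        )
    return out


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Toy inputs: both keys are identical, so every query attends to them equally.
A = [[1.0, 0.0], [0.0, 1.0]]  # queries
B = [[1.0, 1.0], [1.0, 1.0]]  # keys
C = [[2.0, 0.0], [0.0, 4.0]]  # values
out = attention(A, B, C)
```

With identical keys, the attention weights are uniform and each output row is simply the average of the value rows, which is a convenient sanity check for the implementation.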
3.5 BiGRU-DNN
The processing of BiGRU is shown in Fig. 5.
The processing of the BiGRU model is represented by the following formula:
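Under the usual bidirectional formulation, the forward and backward hidden states are computed separately and then combined. The symbols [TeX:] $$x_t$$ (input at time t), [TeX:] $$\overrightarrow{h_t}$$ and [TeX:] $$\overleftarrow{h_t}$$ (forward and backward hidden states), and [TeX:] $$b_t$$ (bias) are assumptions introduced for this sketch:

[TeX:] $$\overrightarrow{h_t}=\operatorname{GRU}\left(x_t, \overrightarrow{h_{t-1}}\right)$$

[TeX:] $$\overleftarrow{h_t}=\operatorname{GRU}\left(x_t, \overleftarrow{h_{t+1}}\right)$$

[TeX:] $$h_t=Q_F \overrightarrow{h_t}+Q_B \overleftarrow{h_t}+b_t$$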
where [TeX:] $$Q_F \text { and } Q_B$$ are the weighted factors of the BiGRU output layer.
The DNN in the proposed model has two hidden layers with ReLU activation and Dropout, where the dropout rate is set to 0.5. The DNN structure is shown in Table 1, where b_s represents the batch size and * represents a setting based on actual needs. For an N-class classification task, it is set to N. The general formula for the DNN calculation is as follows:
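For a generic fully connected layer, the calculation can be sketched as shown here. All symbols are assumptions introduced for illustration: [TeX:] $$a^{(l)}$$ is the output of layer l, [TeX:] $$W^{(l)}$$ and [TeX:] $$b^{(l)}$$ are its weights and bias, and [TeX:] $$\sigma$$ is the activation function (ReLU in the hidden layers).

[TeX:] $$a^{(l)}=\sigma\left(W^{(l)} a^{(l-1)}+b^{(l)}\right)$$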
3.6 Classifier Design
After the model training is completed, the trained model is used to classify the test set and obtain the prediction type. A k-fold cross-validation method is used to test the model and ensure credibility of the test results. The softmax function is used to calculate the probability of the classification results and compare them with the original labels. The calculation of softmax is as follows:
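In its standard form, with [TeX:] $$z_i$$ denoting the score that the network outputs for class i (a symbol assumed here), the function is

[TeX:] $$\operatorname{softmax}\left(z_i\right)=\frac{e^{z_i}}{\sum_j e^{z_j}}$$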
In addition, for the multi-classification problem of abnormal traffic, a multi-classification calculation formula for softmax was designed as follows:
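Assuming the usual softmax regression hypothesis over k classes, which matches the normalization factor described below, the formula can be written as

[TeX:] $$h_\theta\left(x^{(i)}\right)=\left[\begin{array}{c} p\left(y^{(i)}=0 \mid x^{(i)} ; \theta\right) \\ p\left(y^{(i)}=1 \mid x^{(i)} ; \theta\right) \\ \vdots \\ p\left(y^{(i)}=k-1 \mid x^{(i)} ; \theta\right) \end{array}\right]=\frac{1}{\sum_{j=0}^{k-1} e^{\theta_j^T x^{(i)}}}\left[\begin{array}{c} e^{\theta_0^T x^{(i)}} \\ e^{\theta_1^T x^{(i)}} \\ \vdots \\ e^{\theta_{k-1}^T x^{(i)}} \end{array}\right]$$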
where [TeX:] $$h_\theta\left(x^{(i)}\right)$$ is the vector of predicted class probabilities for sample [TeX:] $$x^{(i)}$$, [TeX:] $$\theta_0, \theta_1, \ldots, \theta_{k-1}$$ are the parameters to be determined, and [TeX:] $$\frac{1}{\sum_{j=0}^{k-1} e^{\theta_j^T x^{(i)}}}$$ is the normalization factor of the function.
4. Experiments
4.1 Environment
The abnormal detection system was developed on PC devices. It mainly uses Python and its related libraries to implement data collection, data preprocessing, behavior detection, and analysis functions. The MySQL database management system is used for data storage. The Django framework is used to build a front-end interface based on the HTTP protocol, which outputs the behavior detection results.
Table 3. Software configuration environment
The hardware configuration of the abnormal detection system is described in Table 2. The software environment of the abnormal detection system is described in Table 3.
4.2 Datasets
Three datasets, namely NSLKDD [30], UNSWNB15 [31], and CICIDS2017 [32], were used for the experimental evaluations. In the NSLKDD dataset, the redundant and repetitive data in the training and test sets were eliminated so that the composition of the dataset is more reasonable and a more accurate detection rate can be obtained. The UNSWNB15 dataset, released in 2015, overcomes the limitations of the NSLKDD dataset to a certain extent. The CICIDS2017 dataset was derived from network data collected by the Canadian Institute for Cybersecurity from July 3 to 7, 2017, and includes benign traffic and the latest common attacks in the field of network intrusion, filling the gap left by the absence of network-based attacks in the UNSWNB15 dataset.
These three datasets are primarily used for multi-category attack prediction. The NSLKDD dataset contains five types: normal traffic, denial-of-service (DoS) attacks, probe attacks, user-to-root (U2R) privilege escalation attacks, and remote-to-local (R2L) attacks. The UNSWNB15 dataset contains ten types: normal, DoS, exploits, generic, reconnaissance, worms, shellcode, analysis, backdoor, and fuzzers. The CICIDS2017 dataset contains 15 types in total, which were merged into nine traffic types according to the similarity of the attacks: BENIGN, DoS, PortScan, DDoS, Patator, Bot, Web Attack, Infiltration, and Heartbleed.
4.3 Evaluation Indicators
In general, the performance evaluation of abnormal detection methods includes five evaluation criteria: accuracy, precision, recall, F1 value, and false alarm rate (FAR). These evaluation criteria are defined using four quantities: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP is the number of attack samples correctly predicted as attack categories, TN is the number of normal samples correctly predicted as the normal category, FP is the number of normal samples wrongly predicted as attack categories, and FN is the number of attack samples wrongly predicted as the normal category. These quantities can be obtained from the confusion matrix C. The element [TeX:] $$C_{ij}$$ of confusion matrix C represents the number of samples that belong to category i and are predicted as category j. The formulas for accuracy, precision, recall, F1 value, and false alarm rate are as follows:
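Under the standard definitions, which are assumed here, the five indicators are computed as

[TeX:] $$\text { Accuracy }=\frac{T P+T N}{T P+T N+F P+F N}$$

[TeX:] $$\text { Precision }=\frac{T P}{T P+F P}$$

[TeX:] $$\text { Recall }=\frac{T P}{T P+F N}$$

[TeX:] $$F 1=\frac{2 \times \text { Precision } \times \text { Recall }}{\text { Precision }+\text { Recall }}$$

[TeX:] $$F A R=\frac{F P}{F P+T N}$$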
4.4 Model Performance Training
The proposed model was tested for multiple indicators, and the change trends of accuracy, precision, recall, F1 value, and false alarm rate with epochs on the NSLKDD, UNSWNB15, and CICIDS2017 datasets are shown in Fig. 6.
Fig. 6. Experimental results of different evaluation indicators: (a) accuracy vs. epochs, (b) precision vs. epochs, (c) recall vs. epochs, (d) F1 value vs. epochs, and (e) false alarm rate vs. epochs.
The time consumed by the TEBGD-based abnormal detection method when using different datasets varied with the amount of data, as shown in Fig. 7.
As shown in Fig. 6, as the number of epochs increases, the detection performance of the proposed model shows an upward trend when using the three different datasets. Ultimately, convergence can be achieved on all three datasets, with accuracy rates as high as 98.96%, 83.23%, and 99.78%, respectively. The false alarm rates were 0.35%, 3.98%, and 0.28%, respectively, all of which are below 4%.
As shown in Fig. 7, when the number of samples is above [TeX:] $$2.5 \times 10^5$$, it takes only approximately 3 seconds to detect the user's abnormal behavior in the data. Therefore, the proposed abnormal detection method has superior detection performance and detection efficiency.
4.5 Model Performance Analysis Experiment
4.5.1 Using k-fold cross-validation analysis
The k-fold cross-validation method was used to verify the intrusion detection performance of the proposed model. The accuracy and F1 values obtained as a function of k are shown in Fig. 8.
As shown in Fig. 8, for k values ranging from 2 to 10, the accuracy and F1 values on the NSLKDD, UNSWNB15, and CICIDS2017 datasets all increased with increasing k. This is because, as k increases, the dataset is divided into more partitions, and a larger share of the data is used as the training set. The more data used in training, the higher the final evaluation indicators, such as test accuracy, will be. The optimal multi-classification accuracy and F1 value on the NSLKDD dataset were both obtained at k = 10, with values of 99.72% and 99.52%, respectively. The optimal accuracy and F1 value for multi-classification on the UNSWNB15 dataset were also obtained at k = 10, with values of 84.86% and 84.12%, respectively. Similarly, the optimal multi-classification performance on the CICIDS2017 dataset was achieved at k = 10, with an accuracy and F1 value of 99.89% and 99.28%, respectively. When k = 10, the proposed model achieved good multi-classification performance because, as the number of folds increased, the number of training samples for each attack or normal type also increased, allowing the model to classify them better.
Fig. 8. Detection performance changes on the three datasets as a function of the number of folds k: (a) accuracy and (b) F1 value.
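The relationship between k and the share of data used for training can be checked with a short sketch. It assumes plain contiguous folds without shuffling or stratification, and the function name `k_fold_indices` is hypothetical; the splitting procedure actually used in the experiments is not specified here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n_samples)."""
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


n = 1000
train_fraction = {}
for k in (2, 5, 10):
    folds = list(k_fold_indices(n, k))
    # Every fold trains on n - n/k samples; record the share for the first fold.
    train_fraction[k] = len(folds[0][0]) / n
```

Each fold trains on a fraction (k - 1)/k of the data, so the training share rises from 50% at k = 2 to 90% at k = 10, which is consistent with the trend described above.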
4.5.2 Performance impact of different encoding methods
Because DL can only process numerical data, it is necessary to convert the irregular string content in the original dataset into numerical data. Currently, two common encoding methods exist: one-hot encoding and label encoding. To demonstrate the reliability of one-hot encoding in the proposed model, the three datasets were preprocessed using both encoding methods. The detection results obtained after inputting the encoded data into the proposed DL network model are listed in Table 4.
Table 4. Comparison of detection accuracy (%) corresponding to one-hot encoding and label encoding
As shown in Table 4, using one-hot encoding to digitize the string-type features in the dataset yields a slightly higher detection accuracy than using label encoding. This is because label encoding assigns consecutive integers to unordered categories, which introduces an artificial size relationship and partial ordering between feature values and thus has a certain impact on classification performance. In contrast, one-hot encoding maps the discrete features of the original data into Euclidean space such that the distances between feature values remain consistent, which avoids the aforementioned problem and improves detection accuracy. Therefore, all experiments with the proposed model used one-hot encoding to preprocess the data.
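The difference between the two encodings can be seen in a small example. The sketch assumes a categorical feature with the values tcp, udp, and icmp, chosen only for illustration, and the helper names are hypothetical.

```python
import math


def label_encode(values):
    """Map each category to a consecutive integer (as a float), in sorted order."""
    categories = sorted(set(values))
    return {c: float(i) for i, c in enumerate(categories)}


def one_hot_encode(values):
    """Map each category to a unit vector with a single 1."""
    categories = sorted(set(values))
    return {
        c: [1.0 if i == j else 0.0 for j in range(len(categories))]
        for i, c in enumerate(categories)
    }


protocols = ["tcp", "udp", "icmp", "tcp"]
lab = label_encode(protocols)    # icmp -> 0.0, tcp -> 1.0, udp -> 2.0
hot = one_hot_encode(protocols)

pairs = [("icmp", "tcp"), ("tcp", "udp"), ("icmp", "udp")]
label_dist = [abs(lab[a] - lab[b]) for a, b in pairs]
one_hot_dist = [math.dist(hot[a], hot[b]) for a, b in pairs]
```

Label encoding places icmp twice as far from udp as from tcp, although the categories have no natural order, whereas one-hot encoding keeps every pair of categories at the same Euclidean distance.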
4.5.3 Performance influence of different pooling methods
In the pooling stage of the proposed model, average and max pooling are fused to improve the feature extraction capabilities. Three different pooling methods were used in the experiments to verify the superiority of the fusion method used. The detection results on the NSLKDD, UNSWNB15, and CICIDS2017 datasets are listed in Table 5.
Table 5. Impact of three different pooling methods on detection results (%)
As shown in Table 5, compared with either pooling scheme alone, the fusion method achieves a higher detection accuracy. Average pooling can extract features with global significance, whereas max pooling can extract local features with certain significance. By integrating the two pooling methods, essential features can be extracted more accurately, thereby maximizing the feature extraction capability of the proposed model.
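One possible way to fuse the two pooling operations is sketched here. The text does not state how the results are combined, so concatenating the average-pooled and max-pooled vectors is an assumption, and the function name `fused_pooling` is hypothetical.

```python
def fused_pooling(sequence):
    """Pool over time steps and concatenate average-pooled and max-pooled features.

    sequence: list of time steps, each a list of feature values.
    """
    n_steps = len(sequence)
    columns = list(zip(*sequence))          # one tuple per feature
    avg = [sum(col) / n_steps for col in columns]
    peak = [max(col) for col in columns]
    return avg + peak                        # assumed fusion: concatenation


# Toy sequence with three time steps and two features.
seq = [[1.0, 0.0], [3.0, 2.0], [2.0, 10.0]]
pooled = fused_pooling(seq)
```

For the second feature, the average (4.0) summarizes the whole sequence, while the maximum (10.0) preserves the single strongest response, so the fused vector carries both kinds of information.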
4.6 Comparison of Proposed and Several Other Advanced Models
The proposed model was compared with CNN-DNN-LSTM [15], BiLSTM-AE [17], and ICVAE-BSM [22] under the conditions of the three datasets, and the results are listed in Table 6 and Fig. 9.
Table 6. Results of different methods using three datasets (%)
Based on the above results, several comparative models achieved satisfactory results on the three datasets. An analysis of the reasons for this shows that, due to the strong nonlinear fitting ability of neural networks, they can map any complex nonlinear relationship, and the feature extraction ability of several comparative models is strong, resulting in high accuracy.
Compared with the other models, the proposed model achieved the best results. The reasons are analyzed as follows. Both the CNN-DNN-LSTM and BiLSTM-AE models achieved good results. Compared with CNN-DNN-LSTM, the BiLSTM-AE model performed better, mainly owing to the powerful feature filtering ability of AE. Moreover, BiLSTM, by combining a forward and a backward LSTM layer, breaks away from the limitation that the model can only predict the output of the next time step based on the temporal information of the previous time steps and can better combine context for output. However, the results of these two models are inferior to those of ICVAE-BSM, mainly because ICVAE has a strong feature encoding ability and alleviates the problem of data class imbalance through BSM before feature extraction. The problem of data class imbalance can significantly affect the performance of a model. The proposed model not only considers the problem of data class imbalance but also introduces a transformer with strong encoding ability. In addition, BiGRU can effectively extract the temporal features of data like BiLSTM, and DNN can be used to extract deep-level features. In other words, the proposed model simultaneously possesses the advantages of the CNN-DNN-LSTM, BiLSTM-AE, and ICVAE-BSM models; therefore, the proposed model can achieve the best results.
Fig. 9. Results of different methods using three datasets: (a) NSLKDD, (b) UNSWNB15, and (c) CICIDS2017.
5. Conclusion
A combined framework integrating the transformer and neural network models was proposed to alleviate data imbalance factors in network traffic data that affect network intrusion detection performance. Experiments were conducted on the NSLKDD, UNSWNB15, and CICIDS2017 datasets, with detection accuracy rates of up to 99.72%, 84.86%, and 99.89%, respectively. Compared to other relatively new DL network models, the proposed model effectively improved the detection results, thereby improving the communication security of network data. Through an analysis of the experimental results, the following conclusions were drawn:
1) Integrating Tomek Links, SMOTE, and WGAN can effectively solve class imbalance problems.
2) Using the transformer to encode data can effectively extract the correlation between network traffic, making it more conducive for the model to identify abnormal traffic.
3) Integrating BiGRU and DNN can effectively extract long-dependent temporal features of data, as well as deep-level features, thereby enhancing the global and local feature extraction capabilities of the model.
4) Although the proposed model achieved superior intrusion detection results, the model is more complex and needs to be trained in advance before it can be deployed to an edge device.
The proposed model can be improved to a lightweight model without affecting the performance of model detection for better application to real edge computing scenarios. In addition, the proposed model will be deployed in many real edge computing scenarios, and joint training will be conducted using big data from real network traffic in actual application scenarios, which will help improve the generalization ability of the model.