1. Introduction
Information applications, which have become foundational services of modern society, suffer from various threats, such as worms, Trojans, denial-of-service attacks, phishing and botnets. As the second line of security defense after the firewall, intrusion detection has become an irreplaceable component for ensuring system security.
To detect invasion threats, various machine learning models [1,2] have been used in intrusion detection. However, they suffer from long training times, parameter tuning issues and poor generalization [3-5]. On the other hand, both the kernel trick and the extreme learning machine (ELM) [6] have been widely employed in machine learning. In recent years, although the ELM and its variants have been applied to intrusion detection [7-14], little work has been published on the optimization of the ELM and its variants.
The grid search method [7] is often used to optimize model parameters, which have a great impact on model performance. However, the quality of its results depends heavily on the chosen step size. The Levenberg–Marquardt (LM) method [13,14], based on the principle of gradient descent, easily becomes trapped in local optima. Particle swarm optimization (PSO) [15], which has few parameters to tune, is an effective heuristic method. Feature selection, which is divided into two types, wrapper and filter, also directly affects the accuracy and generalization performance of classifiers. The filter method is independent of the subsequent learning algorithm, whereas the wrapper method is guided by the data processing task (e.g., higher classification accuracy) [16,17]. The wrapper method usually produces better results than the filter method. The grid search and LM methods cannot be used for feature selection, but PSO can both optimize parameters and select features. Therefore, PSO is the preferred algorithm for model optimization.
This work proposes the combination of the kernel extreme learning machine (KELM) [18] and PSO for intrusion detection. To improve the classification accuracy and generalization performance of the ELM, a kernel function is applied in the KELM. Because of its effectiveness and the small number of parameters to adjust, this paper uses the Gaussian function as the kernel function [19]. Two parameters, $\gamma$ and C, have great influence on the KELM [7,18]: $\gamma$ is the kernel parameter affecting the classification result, and C is a positive value that stabilizes the performance. Meanwhile, feature selection is also a key problem in building an efficient classification model. Although the raw feature set offers a detailed description, it may contain redundant information that directly affects the performance of the classifier [20]. Selecting effective features can markedly accelerate the training and testing processes while maintaining performance. In other words, the effectiveness of the classifier is influenced by two factors: parameter and feature selection [21]. The model parameters and the feature subset must be determined simultaneously because the feature subset and kernel parameter affect each other [22]. In our model, the hybrid PSO is used to determine the parameters and the feature subset. Simulations conducted on three public datasets show that the parameters and the feature subset can be determined simultaneously. The proposed model obtains almost the same or better accuracy using approximately one quarter of the raw input features.
The remainder of this article is organized as follows. Section 2 reviews related work. Section 3 provides the background. Section 4 describes the hybrid PSO-BPSO based KELM model. Section 5 presents the experimental results and discussion. Conclusions and possible extensions are given in Section 6.
2. Related Work
The ELM and its modifications have been applied to intrusion detection systems (IDSs). Cheng et al. [7] showed that the ELM is faster than support vector machines (SVMs) and that the accuracy of the KELM is better than that of SVMs in multiclass classification. To cope with the large network traffic volume and low detection rates involved in intrusion detection, Singh et al. [9] presented a method based on the online sequential ELM, using alpha profiling to reduce time complexity and filtering to select features. To account for the low accuracy of individual classifiers, Fossaceca et al. [10] introduced a framework named MARK-ELM, which combines the results of multiple classifiers to reach a final decision; the KELM was chosen as its key classification algorithm. The authors emphasized multiple kernel learning and did not consider the parameter optimization and feature selection problems.
Many techniques, such as evolutionary algorithms and swarm intelligence [12], can be adopted to optimize classifier models. Kuang et al. [23] applied kernel principal component analysis (KPCA) to determine features and adopted a genetic algorithm (GA) to optimize the parameters of the SVM detection engine. Onan et al. [24] introduced a weighted voting ensemble model in which a differential evolution algorithm determines the weight of each classifier in the ensemble. Zhang et al. [25] and Huang and Dun [26] applied ant colony optimization (ACO) and PSO, respectively, to determine the parameters of the SVM. Shen et al. [27] adopted a bat algorithm to optimize an ensemble model. Bao et al. [28] applied a PSO-based memetic method to determine the parameters of an SVM. Ahila et al. [29] used PSO to select the feature subset and the number of hidden nodes in an ELM.
Many studies have optimized the KELM model. Pan et al. [12] applied quantum PSO to modify the hidden nodes of the kernel extreme learning machine. Jayaprakash and Murugappan [13] employed the LM algorithm to determine its hidden nodes. Jaiganesh and Sumathi [14] presented a kernelized ELM with LM learning, in which the kernel parameters were tuned using LM. The KELM used in [12-14], which merely replaces the activation function with a kernel function, is essentially the basic ELM and thus differs from the optimized KELM described here; consequently, only the number of hidden-layer nodes or the kernel parameter needs to be optimized [7]. In addition, the models they proposed are unstable because they do not apply the positive value C. Ma et al. [30] applied a self-adaptive artificial bee colony algorithm to optimize the kernel parameters and the parameter C of the KELM. Unfortunately, the authors neither considered feature selection nor applied the model in the intrusion detection area.
3. Background
3.1 Kernel Extreme Learning Machine (KELM)
The ELM evolved from the single-hidden-layer feed-forward neural network (SLFN). The SLFN comprises an input layer, a hidden layer and an output layer, with n, L and m nodes, respectively. There are Q samples $\{(x_i, t_i)\}$, where $i = 1, \ldots, Q$, $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^{\mathrm{T}} \in R^{n}$, and $t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^{\mathrm{T}} \in R^{m}$. Here, $x_i$ and $t_i$ represent the features and the label of the i-th instance, respectively. The model can be expressed as

$$\sum_{i=1}^{L} \beta_i \, g(\omega_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, Q \qquad (1)$$
where $\omega_i$ represents the input weight, $b_i$ represents the threshold of the i-th hidden neuron, $\beta_i$ is the output weight, and $g(x)$ is the activation function. Eq. (1) can be expressed in matrix form as

$$\mathbf{H}\beta = \mathbf{T} \qquad (2)$$
where $\mathbf{H}$ is the hidden-layer output matrix, and $\omega_i$ and $b_i$ are produced randomly. In the case of $L \ll Q$, $\beta$ can be acquired by calculating the least-square-error solution of the linear system $\mathbf{T} = \mathbf{H}\beta$ [6]:

$$\hat{\beta} = \mathbf{H}^{\dagger}\mathbf{T} \qquad (3)$$
where $\mathbf{H}^{\dagger}$ represents the Moore-Penrose generalized inverse matrix. Here, the singular value decomposition (SVD) method [31] is adopted to calculate $\mathbf{H}^{\dagger}$.
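To make Eqs. (1)-(3) concrete, the following minimal numpy sketch trains a basic ELM with a sigmoid activation; the function name elm_fit and the fixed seed are illustrative choices, and np.linalg.pinv computes the Moore-Penrose inverse via SVD, matching the approach of [31].

```python
import numpy as np

def elm_fit(X, T, L, seed=0):
    """Sketch of basic ELM training: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((L, X.shape[1]))   # random input weights, one row per hidden node
    b = rng.standard_normal(L)                 # random hidden-node thresholds b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # hidden-layer output matrix H, shape (Q, L)
    beta = np.linalg.pinv(H) @ T               # Eq. (3): beta = pinv(H) @ T, pinv uses SVD
    return W, b, beta
```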
However, in practical applications, the final estimation may be inaccurate because of multicollinearity. Huang et al. [18] introduced a positive term $\mathbf{I}/C$ added to the diagonal of $\mathbf{H}\mathbf{H}^{\mathrm{T}}$, which makes the performance of the ELM more stable. The improved generalized inverse matrix and the output weight are represented as

$$\mathbf{H}^{\dagger} = \mathbf{H}^{\mathrm{T}}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{\mathrm{T}}\right)^{-1} \qquad (4)$$

$$\beta = \mathbf{H}^{\mathrm{T}}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{\mathrm{T}}\right)^{-1}\mathbf{T} \qquad (5)$$
To further enhance the stability and generalization capability of the ELM, the KELM was introduced [18]. A matrix $\Omega_{ELM}$ can be constructed to replace $\mathbf{H}\mathbf{H}^{\mathrm{T}}$. The kernel matrix can be written as

$$\Omega_{ELM} = \mathbf{H}\mathbf{H}^{\mathrm{T}}: \quad \Omega_{ELM\,i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j) \qquad (6)$$
The Gaussian kernel function is employed in this paper:

$$K(u, v) = \exp\left(-\|u - v\|^{2} / \gamma\right) \qquad (7)$$

where $\gamma$ is a kernel parameter. The output of the KELM is

$$f(x) = \left[K(x, x_1), \ldots, K(x, x_Q)\right]\left(\frac{\mathbf{I}}{C} + \Omega_{ELM}\right)^{-1}\mathbf{T} \qquad (8)$$
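As an illustration of Eqs. (6)-(8), the sketch below implements Gaussian-kernel KELM training and prediction in numpy; the function names are ours, and for simplicity the regularized kernel system is solved directly rather than explicitly inverted.

```python
import numpy as np

def gaussian_kernel(U, V, gamma):
    """Pairwise K(u, v) = exp(-||u - v||^2 / gamma), per Eq. (7)."""
    d2 = (U**2).sum(1)[:, None] + (V**2).sum(1)[None, :] - 2.0 * U @ V.T
    return np.exp(-d2 / gamma)

def kelm_fit(X_train, T, C, gamma):
    """Solve (I/C + Omega)^(-1) T once; X_train must be kept for prediction."""
    Omega = gaussian_kernel(X_train, X_train, gamma)        # kernel matrix of Eq. (6)
    return np.linalg.solve(np.eye(len(X_train)) / C + Omega, T)

def kelm_predict(X_test, X_train, alpha, gamma):
    """Eq. (8): f(x) = [K(x, x_1), ..., K(x, x_Q)] alpha."""
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```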
3.2 Particle Swarm Optimization (PSO)
PSO can be divided into standard PSO and binary PSO. In standard PSO, the velocity and position of the i-th particle are denoted by $V_i = (v_1, v_2, \ldots, v_r)$ and $X_i = (x_1, x_2, \ldots, x_r)$, respectively. Another important attribute is fitness, the metric by which a particle is evaluated. Each particle tracks its own optimal position (pbest), found through its own experience, and the global optimal position (gbest), found through the experience of the whole swarm. The update rules for velocity and position are as follows:

$$v_{id}^{k+1} = w v_{id}^{k} + c_1 r_1 \left(pbest_{id} - x_{id}^{k}\right) + c_2 r_2 \left(gbest_{d} - x_{id}^{k}\right) \qquad (9)$$

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1} \qquad (10)$$
where k indicates the current iteration number, w is the inertia weight, $c_1$ and $c_2$ are positive acceleration coefficients, and $r_1$ and $r_2$ are random numbers uniformly distributed in [0, 1].
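A minimal numpy sketch of one standard PSO iteration (Eqs. (9)-(10)) over a swarm stored as a (pop, dim) array follows; the default coefficients mirror the settings reported in Section 5.3, and the velocity clamping bound v_max is supplied by the caller.

```python
import numpy as np

def pso_step(X, V, pbest, gbest, v_max, w=0.72, c1=2.0, c2=2.0, rng=None):
    """One standard PSO update, Eqs. (9)-(10); X, V, pbest: (pop, dim); gbest: (dim,)."""
    if rng is None:
        rng = np.random.default_rng()
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)   # Eq. (9)
    V = np.clip(V, -v_max, v_max)                               # velocity clamping
    return X + V, V                                             # Eq. (10)
```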
The positions in standard PSO are assumed to be continuous, but in practice they may be discrete. Therefore, a discrete variant, binary PSO (BPSO) [32], was developed. In BPSO, the position of a particle is encoded in binary and updated as

$$x_{id}^{k+1} = \begin{cases} 1, & \text{if } rand() < s\left(v_{id}^{k+1}\right) \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$
where $s(v) = 1/(1 + e^{-v})$ is the sigmoid function and $rand()$ is a uniform random number in [0, 1].
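The corresponding BPSO update (Eq. (11)) reuses the same velocity rule but passes each velocity through the sigmoid to obtain the probability of a bit being set; again a sketch with our own function name.

```python
import numpy as np

def bpso_step(X, V, pbest, gbest, v_max=4.0, w=0.72, c1=2.0, c2=2.0, rng=None):
    """One binary-PSO update, Eq. (11); positions X are 0/1 arrays of shape (pop, dim)."""
    if rng is None:
        rng = np.random.default_rng()
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = np.clip(w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X), -v_max, v_max)
    s = 1.0 / (1.0 + np.exp(-V))               # s(v): probability of setting the bit to 1
    return (rng.random(X.shape) < s).astype(int), V
```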
4. The Hybrid PSO-BPSO based KELM Model
This section describes the hybrid PSO-BPSO based KELM model (hereafter referred to as the hybrid PSO-KELM). The mechanism of the method is shown in Fig. 1.
In the training phase, PSO and BPSO are used to train the KELM on the training dataset to obtain the optimal parameters and features. The training procedure is an iterative PSO process. After this first stage, the structure of the KELM network is fixed. Then, according to Eq. (8), the final classification result (i.e., the prediction of the labels) is obtained.
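The two stages can be summarized by the following hypothetical top-level flow; hybrid_pso_search stands in for the optimization loop described in this section, while kelm_fit and kelm_predict are the sketches from Section 3.1.

```python
# Stage 1 (hypothetical helper): hybrid PSO returns the feature mask and (C, gamma).
mask, C, gamma = hybrid_pso_search(X_train, T_train)

# Stage 2: with the KELM structure fixed, train once and predict labels via Eq. (8).
alpha = kelm_fit(X_train[:, mask == 1], T_train, C, gamma)
pred = kelm_predict(X_test[:, mask == 1], X_train[:, mask == 1], alpha, gamma)
```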
Fig. 1. The mechanism of the hybrid PSO-KELM.
4.1 Particle Representation
The particle consists of the feature mask and the parameters C and $\gamma$. The feature mask is represented as binary data, where “1” indicates that the corresponding feature is chosen and “0” indicates that it is not. Suppose the dataset has n features; the feature mask then comprises n bits. Parameters C and $\gamma$ are real numbers. Therefore, the particle can be represented by the (n+2)-dimensional vector shown in Table 1.
This $1 \times (n+2)$ vector represents the position of the particle. However, basic PSO can only handle continuous positions, while binary PSO can only update binary positions. To update the position of our hybrid vector, basic PSO and binary PSO are combined [31]. If dimensions 1 through n of a particle’s position are to be updated, binary PSO is applied; if dimensions (n+1) and (n+2) are to be updated, standard PSO is applied, as sketched below.
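A sketch of this split update, reusing the pso_step and bpso_step functions above; for brevity, the separate velocity limits for C and $\gamma$ are collapsed into a single v_max_real argument here.

```python
import numpy as np

def hybrid_step(X, V, pbest, gbest, n, v_max_real, rng=None):
    """Update a (pop, n+2) swarm: BPSO on the n feature bits, standard PSO on (C, gamma)."""
    Xb, Vb = bpso_step(X[:, :n], V[:, :n], pbest[:, :n], gbest[:n], rng=rng)
    Xr, Vr = pso_step(X[:, n:], V[:, n:], pbest[:, n:], gbest[n:], v_max_real, rng=rng)
    return np.hstack([Xb, Xr]), np.hstack([Vb, Vr])
```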
4.2 Fitness Function Definition
The fitness function is designed to improve the accuracy while reducing the number of features used. It can be defined as

$$fitness = w_1 \times acc + w_2 \times \left(1 - \frac{\sum_{i=1}^{n_F} f_i}{n_F}\right) \qquad (12)$$
where acc denotes the accuracy; $n_F$ denotes the total number of features; $f_i$ is the state of the i-th feature, with $f_i = 1$ if the i-th feature is used and $f_i = 0$ otherwise; and $w_1$ and $w_2$ represent the weights of the corresponding terms.
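A direct transcription of Eq. (12); mask is the binary feature vector taken from the particle, and the default weights are the values eventually chosen in Section 5.3.

```python
def fitness(acc, mask, w1=0.95, w2=0.05):
    """Eq. (12): weighted accuracy plus a reward for dropping features."""
    return w1 * acc + w2 * (1.0 - sum(mask) / len(mask))
```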
5. Experimental Results and Discussion
5.1 Evaluation
The accuracy, detection rate and false positive rate are the most common performance evaluation criteria. Therefore, the performance evaluation of the proposed hybrid model uses these three criteria, as given in Eqs. (13)-(15).
The accuracy (Acc) indicates the proportion of samples correctly judged:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \qquad (13)$$
The detection rate (DR) represents the proportion of attack instances correctly detected by the model:

$$DR = \frac{TP}{TP + FN} \qquad (14)$$
The false positive rate (FPR) represents the percentage of normal instances judged as attack instances:

$$FPR = \frac{FP}{FP + TN} \qquad (15)$$
where TP denotes the number of attack instances correctly judged, FP denotes the number of normal instances misjudged to be attacks, FN indicates the number of attacks classified as normal, and TN denotes the number of correctly judged normal instances.
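For reference, Eqs. (13)-(15) translate into a few lines of Python:

```python
def evaluate(TP, FP, FN, TN):
    """Eqs. (13)-(15): accuracy, detection rate, false positive rate."""
    acc = (TP + TN) / (TP + TN + FP + FN)
    dr = TP / (TP + FN)
    fpr = FP / (FP + TN)
    return acc, dr, fpr
```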
5.2 Datasets
The KDD99 [33], NSL [34], and Kyoto [35] datasets were employed to validate the proposed model. KDD99 is the most widely used dataset for intrusion detection [36]. The NSL dataset evolved from KDD99 by removing a large number of duplicate records; its features are the same as those of KDD99. A more recent labeled dataset, Kyoto, is also used. The Kyoto dataset, obtained from diverse types of honeypots, consists of 24 features, of which 14 statistical features are derived from KDD99 and 10 are newly added. We selected 17 features, shown in Table 2 [37], for the experiments in this paper. Due to space limitations, the features of KDD99 are omitted here.
Table 2. Features used and their representations in the Kyoto dataset
5.3 Experimental Procedure and Results
The value ranges of C and $\gamma$ are $[2^{-10}, 2^{3}]$ and $[2^{-3}, 2^{10}]$, respectively. The maximum velocity is set to approximately 20% of the range of each variable, so the velocity of parameter C is restricted to [-1.6, 1.6] and that of $\gamma$ to [-204.8, 204.8]. For the discrete bits updated by binary PSO, the maximum velocity is restricted to [-4, 4]. The personal and social learning factors $(c_1, c_2)$ are set to (2, 2). The inertia weight w is set to 0.72. The population size and the number of iterations are 20 and 80, respectively. We randomly selected 50 groups of data from the original datasets for the experiments and report the averaged results below.
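For readability, the search-space and swarm settings above can be collected in one place; this configuration dictionary is merely a restatement of the values in this paragraph, with names of our own choosing.

```python
CONFIG = {
    "C_range": (2**-10, 2**3),      # search range of the parameter C
    "gamma_range": (2**-3, 2**10),  # search range of the kernel parameter gamma
    "v_max_C": 1.6,                 # ~20% of the C range
    "v_max_gamma": 204.8,           # ~20% of the gamma range
    "v_max_bits": 4.0,              # velocity limit for the binary feature bits
    "c1": 2.0, "c2": 2.0,           # personal and social learning factors
    "w": 0.72,                      # inertia weight
    "pop_size": 20,
    "iterations": 80,
}
```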
Fig. 2. Adjusting the values of parameters $w_1$ and $w_2$ on KDD99: (a) accuracy as $w_1$ increases and (b) number of selected features as $w_2$ increases.
Fig. 3. Adjusting the values of parameters $w_1$ and $w_2$ on NSL: (a) accuracy as $w_1$ increases and (b) number of selected features as $w_2$ increases.
Fig. 4. Adjusting the values of parameters $w_1$ and $w_2$ on Kyoto: (a) accuracy as $w_1$ increases and (b) number of selected features as $w_2$ increases.
First, as discussed in Section 4.2, suitable values for $w_1$ and $w_2$ need to be determined. To select proper values for the proposed hybrid PSO-KELM, different values of $w_1$ and $w_2$ were analyzed on the three datasets, as seen in Figs. 2–4. In any case, the fitness is determined mainly by the accuracy rather than by the number of selected features, so only a few sets of weight values were tested. With $w_1$ varied from 0.8 to 0.99 in steps of 0.05 (the last step being 0.04), the training and test accuracies are shown in Figs. 2(a), 3(a), and 4(a), which reveal that the test accuracy remains almost unchanged once $w_1$ exceeds 0.95. Furthermore, the number of selected features under different values of $w_2$ is presented in Figs. 2(b), 3(b), and 4(b): the number of selected features decreases as $w_2$ increases, meaning that the number of chosen features depends on the value of $w_2$. We chose $w_1 = 0.95$ and $w_2 = 0.05$ for further testing.
The experimental results of the Grid-KELM, continuous PSO-KELM, hybrid PSO-KELM and GA-KELM methods on the three datasets are shown in Tables 3–5. Grid-KELM and continuous PSO-KELM can only perform parameter optimization, not feature selection; the hybrid PSO-KELM carries out parameter optimization and feature selection simultaneously, as can the GA-KELM. The table entries are laid out to reflect this difference. In Table 3, the average test accuracy of Grid-KELM is 98.2997%. For the continuous PSO-KELM method, the average test accuracy is 98.3165%. For the hybrid PSO-KELM, the average test accuracy is 98.2492%, the average number of selected features is 14, and the testing time is 0.7791 seconds. The hybrid PSO-KELM method can clearly determine the parameters and the features used at the same time. Compared with the continuous PSO-KELM method, the testing time of the hybrid PSO-KELM decreased by 1.6%. Even though the dimensionality was reduced to 14, the reduction in testing time is modest. This result is understandable because the dimension of the input space has no direct effect on the size of the kernel matrix; in general, the computational complexity of the kernel matrix depends on both the number of samples and the dimension of the input features.
Table 3. Experimental results of the Grid-KELM, continuous PSO-KELM, hybrid PSO-KELM and GA-KELM methods on KDD99
Table 4. Experimental results of the Grid-KELM, continuous PSO-KELM, hybrid PSO-KELM and GA-KELM methods on NSL
Table 5. Experimental results of the Grid-KELM, continuous PSO-KELM, hybrid PSO-KELM and GA-KELM methods on Kyoto
In other words, feature selection is still important for reducing the computational complexity of the kernel mapping. Table 4 shows that the NSL dataset is more demanding: the testing accuracy on NSL is not as good as that on KDD99. The proposed approach was also compared with GA-KELM. Taken together, although the GA can perform parameter optimization and feature selection, its results are slightly worse than those of the hybrid PSO. Tables 3–5 show that a competitive or better level of accuracy can be achieved with fewer features, which indicates that some features are not critical to the performance of the classifier.
In the hybrid PSO-KELM, the frequencies of the features used across ten runs are listed in Tables 6 and 7. A feature selected in at least four of the ten runs is considered significant (a tallying sketch is given after Table 7). There are 41 features in the KDD99 and NSL datasets, represented by $M_1, M_2, \ldots, M_{41}$. The significant features are $M_2, M_3, M_{16}, M_{23}, M_{32}, M_{33}, M_{35}, M_{36}$ and $M_{40}$. The important features of Kyoto are $P_4, P_{10}$, and $P_{14}$. The features chosen above are important for judging invasion. For instance, $M_{23}$ represents the number of connections to the same target host as the current connection in the last 2 seconds, and it is closely related to DoS attacks. In contrast, the features that were not chosen even once, represented by “others” in the last column, are regarded as redundant.
Table 6. Frequency of the selected features on KDD99 and NSL
Table 7. Frequency of the selected features on the Kyoto dataset
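Assuming the ten runs' feature masks are stacked in a (10, n) binary array named masks (a hypothetical variable) and that the criterion is at least four selections out of ten runs, the tally reduces to two lines:

```python
import numpy as np

freq = masks.sum(axis=0)                 # how often each feature was selected in 10 runs
significant = np.where(freq >= 4)[0]     # indices of features chosen at least 4 times
```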
Finally, performance comparisons of different methods on KDD99 were carried out. The average simulation results are shown in Table 8 [7,38]. Feature selection was performed, and the three methods used the same selected features. The proposed model has the best overall performance among the popular machine learning methods considered. Among activation functions, the ELM with the sigmoid (sig) activation function performs best; its performance depends on the number of hidden neurons, which is set to 80 here. Table 8 shows that the hybrid PSO-KELM surpasses the ELM in terms of accuracy. This is because the KELM replaces the random mappings of the ELM with a stable kernel mapping and has stable network output weights; it thus avoids the random fluctuations in model output caused by random assignments in the ELM and enhances the stability and generalization ability of the model. The table also shows that the SVM has relatively stable performance, but its false positive rate is not as good as that of our proposed approach.
Table 8. Comparison of accuracy of different models (%)
6. Conclusion
A hybrid PSO-BPSO based KELM model was proposed and applied to intrusion detection. Standard PSO and binary PSO are both adopted to optimize the parameter combination and the input features, and a fitness function is designed for the hybrid PSO. Experiments show that the method can determine the parameters and the appropriate features at the same time, and that it outperforms the GA-KELM model. The Gaussian kernel was chosen in this work; in future work, other kernels or multiple kernel learning can be studied, and other metaheuristic methods can be developed to optimize the model.
Acknowledgement
This work is supported by the Fundamental Research Funds for the Central Universities (No. ZY20215151).