1. Introduction
Image semantic segmentation technology labels each pixel in an image with semantic information, thereby segmenting the image into regions with different attributes and categories. It is a fundamental research topic in computer vision [1] and can be applied to many fields, such as medical imaging and geographic remote sensing, supporting applications such as computer-aided diagnosis and remote sensing image interpretation [2]. However, current semantic segmentation technology still suffers from problems such as the loss of small-scale objects, discontinuous segmentation, and misclassification. Enhancing the representation of spatial detail information is therefore key to improving segmentation accuracy [3].
In traditional image segmentation, the classic methods range from threshold segmentation, the simplest, to region growing, edge detection, and graph partitioning. Among these, normalized cut and GrabCut are two classic segmentation methods based on graph partitioning [4]. Normalized cut applies the minimum-cut algorithm from graph theory to segment the image semantically. GrabCut is an interactive segmentation method that uses image texture and boundary information to obtain good foreground-background separation with a small amount of user interaction [5]. Although their computational complexity is low, traditional segmentation algorithms lack a data-driven training stage and show limited performance on more difficult segmentation tasks.
With the continuous improvement of classification network performance, there has been growing interest in solving the pixel-level labeling problem with semantic segmentation [6]. Compared with traditional methods, deep learning-based semantic segmentation learns features automatically from data rather than relying on hand-crafted features, and deep neural networks enable end-to-end segmentation prediction [7]. Deep learning uses multilayer neural networks to learn high-level features from large amounts of training data and has been widely applied to a variety of computer vision tasks [8].
However, the technical difficulties of image semantic segmentation concerning target, category, and background still need to be resolved [9,10]. Therefore, an image semantic segmentation method using an improved ENet network is proposed. The innovations of the proposed method are:
(1) The proposed model reduces the convolution operations in the decoder and adopts an initialization operation to generate fused features. In addition, an adaptive bottleneck structure is used to accelerate segmentation to a great extent.
(2) To improve the accuracy of image semantic segmentation in complex environments, the squeeze-and-excitation (SE) module is incorporated into the proposed model. Through learning, the importance of each feature channel is obtained automatically: the weights of useful features are increased and features that are not useful for the current segmentation task are suppressed, so as to achieve accurate segmentation of small targets.
The rest of this paper includes: Section 2 summarizes related work, classifies existing image semantic segmentation methods, and analyzes their advantages and disadvantages. Section 3 elaborates on the proposed method, applying the improved ENet model to image semantic segmentation. In Section 4, experiments and discussions are carried out, and the performance of the method in this paper is evaluated. Section 5 is the conclusion.
2. Related Work
The traditional image segmentation algorithm divides an image into different regions based on color, texture, and spatial structure: pixels within a region share consistent semantic information, while the attributes of different regions differ [11]. At present, mainstream image semantic segmentation algorithms obtain the target regions of interest through four processes: feature extraction, feature restoration, feature fusion, and feature optimization.
In the feature extraction stage, the large number of downsampling and pooling operations leads to the loss of spatial and detail information, as in fully convolutional networks (FCN). Dilated convolution and spatial pyramid pooling modules were therefore proposed to enhance global semantic information [12]. Network models such as DeepLabV1 and the dense relation network (DRN) enlarge the receptive field through serial dilated convolutions and obtain richer spatial features. In [13], the authors extend the traditional approach by developing a deeper network architecture with smaller kernels to enhance its discrimination ability. Guo et al. [14] proposed a novel dense-Gram network that trains on clean and degraded images through a restoration-based pre-processing module and fine-tunes the pre-trained network; unlike traditional image semantic segmentation strategies, it narrows the gap more effectively and handles segmentation of degraded images. Reference [15] proposed a prototype convolutional neural network (CNN) segmentation architecture for automatic laparoscopic control in cholecystectomy, establishing a recursive network structure that reuses sub-networks to alleviate overfitting. The amount of computation, however, is tremendous for this type of method [16]. Network models such as DeepLabV2 and DenseASPP use spatial pyramid pooling to extract global semantic information and achieve denser multi-scale feature extraction. These methods, however, cause a checkerboard effect, resulting in the loss of local information and discontinuous semantic information.
Feature restoration recovers the resolution of the feature map by upsampling, which is then used for classification prediction [17]. Methods such as bilinear interpolation and deconvolution have certain limitations in restoring feature-map resolution. Zheng et al. [18] extract high-level semantic features from the network and build a dense deconvolution network; superpixel segmentation and spatial smoothness constraints are then used to further improve the recognition results. Although these methods enhance feature expression, their ability to segment small-scale targets still needs improvement [19].
Feature fusion obtains richer semantic information through additive fusion, splicing (concatenation) fusion, and cross-layer fusion of feature maps to improve segmentation accuracy; addition or splicing is often used to fuse multi-scale features [20]. FCN, U-Net, RefineNet, DeepLabV3+, and other network models adopt cross-layer fusion, combining shallow detail features with deep abstract features. This enhances the representation of high-resolution detail and opened a new direction for semantic segmentation research. Inspired by residual and deconvolution network architectures, Ozturk and Akdemir [21] proposed automatic cell-type-based semantic segmentation using a new deep CNN (DCNN), defining four kinds of semantic information for medical image recognition and creating a new DCNN architecture. Feature optimization usually uses conditional random fields or Markov random fields to refine the segmentation predictions; by combining low-level image information with pixel-wise classification results, the model's ability to capture fine-grained details is improved [22]. In [23], the authors proposed Graph-FCN, a graph model initialized by a fully convolutional network, in which graph convolution is used for semantic segmentation and achieves very good results.
With small-target segmentation and recognition at the heart of the issue, most existing research uses multi-scale fusion-enhanced semantic segmentation networks to improve the accuracy of small-scale target segmentation. Zhou et al. [24] constructed a difference merging module in a DCNN to extract object edge gradients and obtain better boundaries in the segmentation result. A pyramid pooling module and an atrous spatial pyramid pooling module are then combined to extract global image features and contextual structure information by establishing long-distance dependencies between pixels. This method stands out from traditional methods because it simplifies the original preprocessing and post-processing steps.
3. Proposed Method
3.1 Overview of the Proposed Method
The ENet network is a lightweight image semantic segmentation network capable of pixel-level semantic segmentation. It has few parameters and a fast calculation speed, meeting the real-time and accuracy requirements of image semantic segmentation, and it also offers a certain degree of plasticity. On this basis, the ENet network is pruned and its convolutions are optimized, and the SE module is integrated to automatically learn the importance of each channel. The resulting improved ENet network performs semantic segmentation tasks better.
3.2 ENet Network Structure
The ENet network adopts a lightweight encoder-decoder structure. As a network specially designed for low-latency tasks, it has a large advantage in model size and parameter count [25]. ENet departs from the usual symmetric encoder-decoder structure by reducing the convolution operations in the decoder, which greatly increases processing speed. ENet applies an initialization operation to the input image, as shown in Fig. 1; its main purpose is to generate feature maps by merging the outputs of pooling and convolution operations.
Fig. 1. Initialization operation.
The convolution operation uses 14 filters of size 3×3 with a stride of 2, producing 14 feature maps. Max pooling uses a non-overlapping 2×2 sliding window, producing 4 feature maps. After fusion, a total of 18 feature maps is obtained [26]. In addition, a bottleneck convolution structure is used throughout the ENet network, mainly in the encoder and decoder. The specific structure is shown in Fig. 2. Each packaged convolution module contains three convolutional layers [27].
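For concreteness, a minimal TensorFlow/Keras sketch of this initialization block follows (TensorFlow is the framework used in Section 4). The BatchNormalization + PReLU tail is an assumption carried over from the bottleneck description; note that max pooling preserves the input channel count, so the 14 + 4 = 18 bookkeeping above presumes a 4-channel input (the original ENet paper uses 13 + 3 = 16 for RGB).

```python
import tensorflow as tf
from tensorflow.keras import layers

def initial_block(x, conv_filters=14):
    # Convolution branch: 3x3 filters with stride 2 halve the resolution.
    conv = layers.Conv2D(conv_filters, 3, strides=2, padding="same")(x)
    # Pooling branch: non-overlapping 2x2 max pooling, also stride 2.
    # It keeps the input channel count, so the fused map has
    # conv_filters + C_in channels (14 + 4 = 18 per the text above).
    pool = layers.MaxPool2D(pool_size=2, strides=2, padding="same")(x)
    # Fuse the two branches by channel concatenation.
    out = layers.Concatenate(axis=-1)([conv, pool])
    out = layers.BatchNormalization()(out)
    return layers.PReLU(shared_axes=[1, 2])(out)
```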
Fig. 2. Structure of the bottleneck convolution.
From left to right in Fig. 2 are a 1×1 projection (used to reduce dimensionality), a main convolutional layer, and a 1×1 expansion that restores the dimensionality; batch normalization and PReLU activation are applied between the convolutional layers. The bottleneck module is not static and changes according to the specific operation. In the downsampling bottleneck module, the 1×1 projection is replaced and a max pooling layer with a 2×2 kernel and stride 2 is used, with zero-padding to match the feature-map size. Conv is a 3×3 regular convolution, dilated convolution, or full (transposed) convolution, and 1×5 and 5×1 asymmetric convolutions are sometimes used instead. Spatial dropout serves as the regularizer to mitigate overfitting.
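A sketch of this bottleneck in the same Keras style follows. The 4× projection ratio and the downsampling arrangement (a stride-2 2×2 convolution on the residual branch, max pooling with channel zero-padding on the skip branch, matching the zero-filling described above) follow the original ENet paper and are assumptions where the text is ambiguous; for non-downsampling variants the input is assumed to already have `filters` channels.

```python
def bottleneck(x, filters, kind="regular", dilation=1, drop=0.1):
    # Skip branch: identity, or max pooling + channel zero-padding when
    # downsampling (so the element-wise addition at the end is valid).
    skip = x
    if kind == "down":
        skip = layers.MaxPool2D(2, strides=2, padding="same")(skip)
        pad = filters - x.shape[-1]
        if pad > 0:
            skip = layers.Lambda(
                lambda t: tf.pad(t, [[0, 0], [0, 0], [0, 0], [0, pad]]))(skip)
    # Residual branch, layer 1: 1x1 projection to reduce dimensionality
    # (a 2x2 stride-2 convolution when downsampling).
    internal = max(filters // 4, 1)  # assumed reduction ratio
    if kind == "down":
        y = layers.Conv2D(internal, 2, strides=2, padding="same")(x)
    else:
        y = layers.Conv2D(internal, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)
    # Layer 2: main convolution -- regular or dilated 3x3, or a 5x1 + 1x5
    # asymmetric pair.
    if kind == "asymmetric":
        y = layers.Conv2D(internal, (5, 1), padding="same", use_bias=False)(y)
        y = layers.Conv2D(internal, (1, 5), padding="same")(y)
    else:
        y = layers.Conv2D(internal, 3, padding="same", dilation_rate=dilation)(y)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)
    # Layer 3: 1x1 expansion back to `filters` channels, then spatial
    # dropout as the regularizer.
    y = layers.Conv2D(filters, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.SpatialDropout2D(drop)(y)
    return layers.PReLU(shared_axes=[1, 2])(layers.Add()([skip, y]))
```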
The overall ENet architecture consists of five parts between the initialization block and the final full convolution. The first part comprises 1 downsampling bottleneck module and 4 ordinary bottleneck modules. The second part is a max-pooling (downsampling) bottleneck module followed by 8 different bottleneck modules. The third part is 8 further bottleneck modules. The fourth part is 1 upsampling bottleneck and 2 ordinary bottlenecks. The fifth part is 1 upsampling bottleneck and 1 ordinary bottleneck module. Finally, a full convolution outputs the semantic segmentation result. The fourth and fifth parts use no dilated convolution modules, because the encoding modules of the first three parts have already extracted the features completely and there is no need to enlarge the receptive field. The decoder mainly restores image resolution and improves the operating efficiency of the network model [28,29]. A sketch assembling these five parts is given below.
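Building on the `bottleneck` sketch above, the five parts can be assembled as follows. The text fixes only the module counts; the filter widths, dilation rates, and the use of transposed convolutions for the upsampling steps are assumptions for illustration.

```python
def enet_body(x, num_classes):
    # Part 1: one downsampling bottleneck + four ordinary bottlenecks.
    x = bottleneck(x, 64, kind="down")
    for _ in range(4):
        x = bottleneck(x, 64)
    # Part 2: a downsampling bottleneck followed by eight mixed
    # bottlenecks (regular / dilated; rates are illustrative).
    x = bottleneck(x, 128, kind="down")
    for d in (2, 4, 8, 16):
        x = bottleneck(x, 128)
        x = bottleneck(x, 128, dilation=d)
    # Part 3: eight more mixed bottlenecks at the same resolution.
    for d in (2, 4, 8, 16):
        x = bottleneck(x, 128, kind="asymmetric")
        x = bottleneck(x, 128, dilation=d)
    # Part 4 (decoder): one upsampling step + two ordinary bottlenecks.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(x)
    x = bottleneck(x, 64)
    x = bottleneck(x, 64)
    # Part 5: one upsampling step + one ordinary bottleneck.
    x = layers.Conv2DTranspose(16, 3, strides=2, padding="same")(x)
    x = bottleneck(x, 16)
    # Final full (transposed) convolution to per-class score maps.
    return layers.Conv2DTranspose(num_classes, 2, strides=2, padding="same")(x)
```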
3.3 SE Module
The SE structure explicitly models the interdependence between feature channels in the channel domain and is used for feature recalibration. The core of the SE module consists of the squeeze and excitation operations. After a convolution has produced features with multiple channels, the SE module recalibrates the weight of each feature channel [30]. The SE module comprises three steps: squeeze (compression), excitation, and reweighting. The schematic diagram is shown in Fig. 3.
The given feature map is $X \in \mathbb{R}^{H \times W \times K}$, where $H$, $W$, and $K$ denote the height, width, and number of channels of the feature map, respectively. The squeeze operation (global average pooling) produces $y \in \mathbb{R}^{K \times 1}$, where $y_m$ is the $m^{th}$ element of $y$ and $X_m$ is the $m^{th}$ feature map of $X$:

$$y_m = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_m(i, j)$$
The excitation operation uses two weight matrices $W_1$ and $W_2$, realized as two fully connected layers with two activation functions, to generate $\tilde{y} \in \mathbb{R}^{K \times 1}$:

$$\tilde{y} = \sigma\left(W_2 \, \phi\left(W_1 y\right)\right)$$
where $\sigma$ denotes the sigmoid activation function and $\phi$ denotes the ReLU activation function.
The final step is the reweighting operation. The weights obtained by the excitation operation are multiplied channel by channel with the original features, completing the recalibration of the features in the channel domain. This generates the rescaled feature-map cluster $\tilde{X} \in \mathbb{R}^{H \times W \times K}$, in which each feature map $\tilde{X}_m \in \mathbb{R}^{H \times W \times 1}$ is given by

$$\tilde{X}_m = F_{scale}\left(X_m, \tilde{y}_m\right) = \tilde{y}_m \cdot X_m$$

where $F_{scale}(X_m, \tilde{y}_m)$ denotes channel-wise multiplication and $\tilde{X}_m$ is the $m^{th}$ feature map of $\tilde{X}$.
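The three steps map directly onto a few Keras layers, as in the sketch below. The reduction ratio r = 16 in the first fully connected layer is taken from the original SE paper and is an assumption here, since the text does not specify it.

```python
def se_block(x, reduction=16):
    k = x.shape[-1]  # number of channels K
    # Squeeze: global average pooling gives one descriptor y_m per channel.
    y = layers.GlobalAveragePooling2D()(x)
    # Excitation: two fully connected layers -- W1 with ReLU (phi),
    # then W2 with sigmoid (sigma), as in the equations above.
    y = layers.Dense(k // reduction, activation="relu")(y)
    y = layers.Dense(k, activation="sigmoid")(y)
    # Reweight: channel-by-channel multiplication F_scale.
    y = layers.Reshape((1, 1, k))(y)
    return layers.Multiply()([x, y])
```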
3.4 ENet Network Architecture Integrated with SE Module
Image semantic segmentation involves three stages: data preprocessing, training, and testing. In the data preprocessing stage, Labelme is used for manual labeling. Training data and test data are generated by cropping the study area, where the training data comprise the training set and the validation set; K-fold cross-validation automates the split between them [31]. In the training phase, the preprocessed training samples are fed into the improved ENet network fused with the SE module. The model architecture is shown in Fig. 4.
Fig. 4. Improved ENet network architecture with the SE module.
The improved ENet network adopts an encoder-decoder structure. The encoder uses conventional convolution and a residual structure with dilated convolution to extract high-level semantic features, with batch normalization and PReLU activation after each convolutional layer. The decoder reduces the convolution operations but retains the bottleneck convolution structure. The SE module automatically learns the importance of each channel to better perform the segmentation task. The feature map is then restored to the original image size by linear interpolation, and the softmax activation function and the argmax function are used to obtain the final segmentation result, achieving end-to-end classification.
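Combining the earlier sketches, one plausible fusion and output head looks as follows. The paper does not pin down where the SE module is inserted, so recalibrating each bottleneck's output is an assumption; `layers.Resizing` (TensorFlow ≥ 2.6) performs the bilinear restoration to the input size.

```python
def se_bottleneck(x, filters, **kwargs):
    # Assumed fusion point: recalibrate the bottleneck output with SE.
    return se_block(bottleneck(x, filters, **kwargs))

def segmentation_head(score_maps, out_height, out_width):
    # Restore the score maps to the original image size by bilinear
    # interpolation, then apply softmax per pixel.
    x = layers.Resizing(out_height, out_width,
                        interpolation="bilinear")(score_maps)
    probs = layers.Softmax(axis=-1)(x)
    # At inference, the class map is obtained with argmax:
    #   labels = tf.argmax(probs, axis=-1)
    return probs
```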
4. Experimental Results and Analysis
In the experiments, the TensorFlow deep learning framework released by Google was used to construct the improved ENet network, and the proposed model was implemented and evaluated in Python. The GPU is an RTX 2080Ti, the operating system is Ubuntu 16.04, the CPU is an i7-8700K, and the machine has 12 GB of memory.
4.1 Network Parameter Setting
When training the network, the input image undergoes local response normalization before the first convolutional layer, with $\alpha = 0.0001$ and $\beta = 0.75$. The learning rate is set to 0.001, the weight decay to 0.0001, and the number of iterations to 60,000. The dataset is randomly shuffled during training, and the batch size is set to 5. The cross-entropy loss function is used, and L2 regularization is added to the network to prevent overfitting.
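A sketch of this configuration follows. The paper fixes only the LRN parameters, learning rate, weight decay, iteration count, and batch size; the optimizer choice (Adam), the remaining LRN defaults, and the tf.data pipeline are assumptions.

```python
ALPHA, BETA = 0.0001, 0.75   # LRN parameters from Section 4.1
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4          # applied as L2 regularization
BATCH_SIZE = 5
ITERATIONS = 60_000

def lrn(x):
    # Local response normalization applied to the input before the first
    # convolution; depth_radius and bias keep TensorFlow defaults (assumed).
    return tf.nn.local_response_normalization(x, alpha=ALPHA, beta=BETA)

l2 = tf.keras.regularizers.L2(WEIGHT_DECAY)   # pass as kernel_regularizer=l2
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)        # optimizer assumed
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()  # cross entropy
# Random shuffling and batching of the training set, e.g.:
#   train_ds = train_ds.shuffle(1000).batch(BATCH_SIZE).repeat()
```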
4.2 Evaluation Index
In order to evaluate the performance of the network model in this paper, we used the following evaluation indicators.
(1) Running time: including training time and test time. In some cases it is difficult to determine the exact running time of a model, because it depends to a large extent on the hardware and the backend implementation. Nevertheless, reporting the hardware and the running time of the model helps to evaluate its effectiveness.
(2) Accuracy: pixel accuracy (PA) is the ratio of correctly classified pixels to the total number of pixels. When the classes in the test set are unbalanced, pixel accuracy alone is not a reliable indicator of model performance. Two further evaluation indicators are therefore defined: mean pixel accuracy (MPA) and mean intersection-over-union (MIOU).
Suppose there are c + 1 categories in total, and let $p_{ij}$ denote the number of pixels whose true class is $i$ and whose predicted class is $j$; thus $p_{ii}$ is the number of correctly classified pixels of class $i$, and $p_{ji}$ is the number of pixels of true class $j$ predicted as class $i$. Then MIOU is calculated as follows:

$$MIOU = \frac{1}{c+1} \sum_{i=0}^{c} \frac{p_{ii}}{\sum_{j=0}^{c} p_{ij} + \sum_{j=0}^{c} p_{ji} - p_{ii}}$$
MPA is the average of the per-category pixel accuracies and is calculated as follows:

$$MPA = \frac{1}{c+1} \sum_{i=0}^{c} \frac{p_{ii}}{\sum_{j=0}^{c} p_{ij}}$$
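Both metrics (along with PA) fall out of the class confusion matrix. A small NumPy sketch, assuming every class appears at least once so that no division by zero occurs:

```python
import numpy as np

def segmentation_metrics(conf):
    """PA, MPA, and MIoU from a (c+1) x (c+1) confusion matrix whose
    entry conf[i, j] counts pixels of true class i predicted as j."""
    tp = np.diag(conf).astype(float)             # p_ii
    per_class_acc = tp / conf.sum(axis=1)        # p_ii / sum_j p_ij
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    return {
        "PA": tp.sum() / conf.sum(),             # overall pixel accuracy
        "MPA": per_class_acc.mean(),             # mean pixel accuracy
        "MIoU": (tp / union).mean(),             # mean intersection-over-union
    }
```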
4.3 CamVid Dataset
CamVid is the earliest semantic segmentation dataset used in the field of autonomous driving. Five video sequences with a resolution of 960×720 pixels were originally shot from a car dashboard, with a viewing angle essentially the same as the driver's. Using image annotation software, 700 frames from the video sequences were annotated with 32 categories, including buildings, trees, sky, roads, cars, and buses.
To show more intuitively how the improved ENet network improves pixel category consistency, we compare it with the segmentation results of the traditional ENet network, as shown in Fig. 5.
As can be seen from Fig. 5, different from the traditional ENet model, this method adopts initialization operation to generate fusion features. The adaptive bottleneck convolution structure is used to replace the traditional convolution layer, and the fusion of the SE module can significantly improve the category consistency between adjacent pixels. And the misdetection of pixel categories contained in the same target is greatly reduced.
To further demonstrate the segmentation performance of the proposed model, we compare it with the models in [13,18,24]. The MPA and MIOU results of each model on the CamVid dataset are shown in Table 1.
Fig. 5. Segmentation results of different models on the CamVid dataset: (a) original image, (b) manual annotation, (c) ENet, and (d) improved ENet.
Table 1. Comparison of the results of different methods on the CamVid dataset
As can be seen from Table 1, the improved ENet network used in the proposed model outperforms the other methods in both MPA and MIOU. In [13], the authors enhanced discrimination ability with smaller kernels and a deeper network architecture to achieve high-precision segmentation; however, the model's overall performance is low because its extraction accuracy is unsatisfactory in complex environments. Zheng et al. [18] extract high-level semantic features from deep convolutional networks and introduce short connections in the deconvolution stage, with superpixel segmentation and spatial smoothness constraints to further improve the recognition results; however, its accuracy on small-scale targets still needs improvement. Zhou et al. [24] constructed a difference merging module in a DCNN and combined a pyramid pooling module with atrous spatial pyramid pooling to establish long-distance dependencies between pixels and extract global features and contextual structure information, achieving good results. The proposed model adopts the improved ENet network, in which the bottleneck convolution structure adapts the network to complex images and the SE module increases the weight of useful information. The MPA and MIOU values are therefore further improved, reaching 0.8385 and 0.7562, respectively.
Running time is also used as an evaluation index of the image segmentation models. The running times of the proposed model and the three comparison methods [13,18,24] are shown in Table 2.
As can be seen from Table 2, among the three comparison methods, the model in [13] has the shortest running time, 0.0537 seconds; it is less time-consuming because it uses a learning network with smaller kernels and a simple structure. The methods in [18] and [24] have longer running times: although they achieve good segmentation, their computational efficiency is sacrificed to their complex structures. The improved ENet network in the proposed model reduces the convolution operations and integrates the SE module to speed up the extraction of useful features, thus ensuring both segmentation accuracy and operating efficiency.
Table 2. Running time comparison on the CamVid dataset
4.4 Cityscapes Dataset
The Cityscapes dataset contains 5,000 image scenes. Its training, validation, and test sets consist of 2,975, 500, and 1,525 images, respectively, covering 19 categories such as ground, building, sky, people, and vehicles.
Similarly, we compare the proposed model with the segmentation results of the traditional ENet network, as shown in Fig. 6.
As can be seen from the first column of Fig. 6, the proposed algorithm uses the SE module to improve the ENet network, effectively segmenting the pedestrians lost by the traditional ENet network and improving the segmentation of small-scale targets. In the second column, the traditional ENet network mistakes the bus rearview mirror for a pedestrian, while the improved network, with its adaptive bottleneck module, avoids such segmentation errors on small targets. The third column likewise shows that the improved network segments and predicts small-scale targets better than the traditional ENet network.
Fig. 6. Segmentation results of different models on the Cityscapes dataset: (a) original image, (b) manual annotation, (c) ENet, and (d) improved ENet.
In addition, the segmentation performance of the proposed model is compared with that of [13,18,24]. The MPA and MIOU results of each model on the Cityscapes dataset are shown in Table 3.
As shown in Table 3, the proposed model again achieves the best MPA and MIOU, reaching 0.9056 and 0.8465, respectively. Table 4 compares the running times. The structure of [13] is simple and easy to train, and it has the shortest running time. The methods in [18] and [24] both improve the learning network and raise segmentation accuracy through optimized network models, but their running times are longer, at 0.0519 seconds and 0.0668 seconds, respectively. The improved ENet network in the proposed model reduces the convolution operations and integrates the SE module to speed up the extraction of useful features. The proposed model therefore ensures both segmentation accuracy and operating efficiency, with a running time of 0.0437 seconds.
Table 4. Running time comparison on the Cityscapes dataset
5. Conclusion
In recent years, the rapid development of autonomous driving and security monitoring has placed higher requirements on model size, computational cost, and segmentation accuracy in image semantic segmentation. For this reason, an image semantic segmentation model using an improved ENet network is proposed. The ENet network is improved with the initialization operation and the bottleneck convolution structure, and the SE module is integrated so that the importance of each feature channel is learned automatically. The improved ENet network with the SE module is used to segment small targets in complex environments. Finally, the proposed model is evaluated on the CamVid and Cityscapes datasets. The results show that the MPA and MIOU values of the proposed model on both datasets are higher than those of the comparison methods, and the running time is shorter. On the Cityscapes dataset, the MPA, MIOU, and running time are 0.9056, 0.8465, and 0.0437 seconds, respectively; on the CamVid dataset, they are 0.8385, 0.7562, and 0.0692 seconds, respectively. The proposed model thus ensures both operating efficiency and segmentation accuracy.
In future work, we will endeavor to improve the accuracy of target boundary segmentation and the ability to segment small targets, and to overcome the problem of discontinuous target segmentation, further improving the performance of the semantic segmentation model.