Wang Chang, Shijing Han*, Wen Zhang and Shufeng Miao

Building Change Detection Using Deep Learning for Remote Sensing Images

Abstract: To increase building change recognition accuracy, we present a deep learning-based building change detection method using remote sensing images. In the proposed approach, we create the difference image (DI) by merging pixel-level and object-level information of multitemporal remote sensing images, and we use the frequency-domain significance technique to generate the DI saliency map. The fuzzy C-means clustering technique then pre-classifies the coarse change detection map, which is obtained by thresholding the DI saliency map. We then extract the neighborhood features of the unchanged pixels and the changed pixels (buildings) from the pixel-level and object-level feature images and use them as valid deep neural network (DNN) training samples. The trained DNNs are then used to identify changes in the DI. The proposed method was evaluated on two datasets and compared with existing detection methods. The results suggest that the proposed technique can detect more building change information and improve change detection accuracy.

Keywords: Deep Neural Network (DNN), Difference Image, Frequency-Domain Significance, Fuzzy C-Means

1. Introduction

Since buildings are the primary locations for human activities, building change detection has been a research hotspot in photogrammetry, remote sensing, and artificial intelligence. In recent years, scholars have proposed various building change detection technologies. Previous studies have used combinations of image spectral features and morphological building index features [1,2], combinations of image spectral, textural, and shape features [3], and combinations of the image spectral, textural, shape, and morphological building index feature differences [4,5] to detect building changes. Zhang et al. [6] integrated pixel-level and object-level features to increase the change detection accuracy of buildings.
While various multi-feature fusion (MFF)-based approaches for detecting building changes have yielded positive results, they may generate white spot noise in some cases when they fail to highlight building change information effectively. Numerous scholars have used deep learning to identify buildings. For instance, convolutional neural networks (CNNs) were used by Nemoto et al. [7] and El Amin et al. [8] to extract structures from images and detect building changes. To achieve high-precision extraction of local objects, Liu et al. [9] employed a deep neural network (DNN) to classify spectral and textural properties and gathered random samples of diverse ground objects for classification. To dig deeper into image features, Zhao and Du [10] and Liu et al. [11] advocated multi-scale CNNs, which convert the original images into pyramidal structures and then extract roads, buildings, and other features through multiple rounds of training. Most of these deep learning methods only detect changes at the pixel level and are greatly affected by the training samples [10,11]. Therefore, researchers have trained deep networks with buildings as the target samples for building identification. For example, Vakalopoulou et al. [12] used the Fast R-CNN method to train on a large number of labeled building samples, and then used the trained model to identify buildings in remote sensing images. Wang et al. [13] adopted the Faster R-CNN algorithm to analyze changes in remote sensing images and identify various ground objects (such as buildings and roads). The effectiveness of this approach, however, depends critically on the quality and quantity of the training samples.

The goal of this research is to increase the accuracy of building change detection by obtaining high-quality training samples from remote sensing images. We present a building change detection method based on pixel-level and object-level feature fusion (POFF) and DNNs.
To highlight building change information in the difference image (DI) and reduce the white spot noise introduced by the POFF, we use the structural similarity method (SSIM) to identify the textural and shape features with the largest difference (after multi-scale segmentation). The final DI is constructed by fusing the shape feature (SHF) DI, the morphological building index (MBI) DI, the textural feature (TF) DI, and the spectral feature (SF) DI acquired from the change vector analysis (CVA). To provide reliable pixel-level training data, the DI saliency map is created using the frequency-domain significance (FDS) method. The fuzzy C-means (FCM) clustering method pre-classifies the coarse change detection map, which is obtained by setting a threshold on the DI saliency map, into changed, unchanged, and indeterminate pixels. Next, to obtain high-precision building change detection results, we extract the neighborhood features of the unchanged pixels and the changed pixels (buildings) from the multiple feature images and use them as trustworthy samples for DNN training. Finally, we apply the trained DNN classifier to the coarse change detection map to obtain the final building change detection result.

2. The Proposed Method

The proposed method consists of three steps: construction of the difference image by multi-feature fusion, selection of high-quality training samples, and classification by the deep learning network, as shown in Fig. 1.

2.1 POFF to Construct the DI

To properly emphasize the building change information while minimizing noise and data redundancy, the DI is created in three main steps. First, the image's spectral features, textural features, morphological building index feature, and shape features are extracted.
To emphasize the building change information effectively, we take the average of the per-band spectral means as the image spectral feature and the average of the per-band textural features (grey level co-occurrence matrix [GLCM]) [14] as the image textural feature. Multi-scale segmentation of the multitemporal remote sensing images was performed using eCognition software, and the image morphological building index feature was calculated as the mean of the multi-scale top-hat transformations created by the differential morphological profile [15]. By adjusting the shape factor, scale parameter, and compactness, we perform multi-scale segmentation on the remote sensing images to extract shape features effectively. Second, to remove noise and data redundancy efficiently, we use SSIM to compute the differences in the textural features and shape features of the multitemporal remote sensing images, and then select the textural and shape features with the largest difference. Finally, we apply CVA to obtain the DIs of the multiple feature images of the multitemporal remote sensing images, and then fuse the four DIs in a predetermined proportion to create the final DI [16], as shown in Fig. 2.

2.2 Training Sample Acquisition and Pre-classification

Obtaining high-quality training samples requires both identifying the building change regions and extracting changed (building) and unchanged training samples. This paper therefore uses the FDS and FCM methods [15] to obtain high-quality training samples from the coarse change detection maps. FDS analysis is used to discover portions of an image that stand out from other areas because of strong local or global contrast. The FDS approach applies a convolution to the amplitude spectrum to generate the saliency map, so that the shape and location of the salient areas correspond more closely to the changed regions.
In this paper, the saliency map is obtained by convolving the amplitude spectrum with a low-pass Gaussian kernel of appropriate scale. Specifically, the Fourier transform converts an image $$f(x, y)$$ into the frequency domain, $$f(x, y) \rightarrow F(f)(u, v)$$. The image's amplitude spectrum $$A(u, v)=|F(f)|$$ and phase spectrum $$P(u, v)=\text{angle}(F(f))$$ are then calculated, and spikes in the amplitude spectrum $$|F(f)|$$ are suppressed using a Gaussian kernel $$h$$, according to Eq. (1) [16].
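Denoting the smoothed amplitude spectrum by $$A_S$$ and convolution by $$*$$, and assuming the standard spectral-smoothing formulation of [16], Eq. (1) presumably takes the form:

(1) $$A_S(u, v)=|F(f)(u, v)| * h$$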
The inverse transform, computed by combining the smoothed amplitude spectrum $$A_S$$ with the original phase spectrum, generates the saliency map, as given by Eq. (2).
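Assuming the common squared-magnitude form of spectral saliency, and writing the saliency map as $$S$$ (a symbol introduced here for illustration), Eq. (2) would read:

(2) $$S(x, y)=\left|F^{-1}\left\{A_S(u, v)\, e^{i P(u, v)}\right\}\right|^2$$

The two steps can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the kernel scale `sigma`, the wrap-around (circular) convolution, and the final rescaling to [0, 1] are all choices made for the example.

```python
import numpy as np


def smooth_amplitude(amplitude, sigma):
    """Circularly convolve the amplitude spectrum with a Gaussian kernel h (Eq. (1))."""
    rows, cols = amplitude.shape
    # Wrap-around distances, so the kernel is centred on index (0, 0).
    dy = np.minimum(np.arange(rows), rows - np.arange(rows))
    dx = np.minimum(np.arange(cols), cols - np.arange(cols))
    h = np.exp(-(dy[:, None] ** 2 + dx[None, :] ** 2) / (2.0 * sigma ** 2))
    h /= h.sum()  # normalise so that smoothing preserves the mean amplitude
    # Circular convolution computed as a product of 2-D FFTs.
    return np.real(np.fft.ifft2(np.fft.fft2(amplitude) * np.fft.fft2(h)))


def fds_saliency(image, sigma=2.0):
    """Saliency from the smoothed amplitude and the original phase (Eq. (2))."""
    spectrum = np.fft.fft2(np.asarray(image, dtype=float))
    amplitude = np.abs(spectrum)   # A(u, v) = |F(f)|
    phase = np.angle(spectrum)     # P(u, v) = angle(F(f))
    smoothed = smooth_amplitude(amplitude, sigma)
    saliency = np.abs(np.fft.ifft2(smoothed * np.exp(1j * phase))) ** 2
    # Assumed post-processing: rescale to [0, 1] for easier thresholding.
    span = saliency.max() - saliency.min()
    if span > 0:
        saliency = (saliency - saliency.min()) / span
    return saliency
```

A coarse change map in the spirit of Section 3.3 could then be obtained with `fds_saliency(di) > fds_saliency(di).mean()`, where `di` is a hypothetical 2-D array holding the fused DI.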
In this paper, we threshold the DI saliency map to construct the coarse change detection map, and then pre-classify the pixels in the coarse change detection map. The FDS and FCM pre-classification successfully highlights the most probable changed locations in the DI while also narrowing the search range for training samples. It also enables a more precise classification, which increases the accuracy of the acquired training samples.

2.3 DNNs Establishment

After pre-classification, the neighborhood features of the unchanged and changed class pixels in the coarse change map are extracted from the original, morphological building index, textural feature, spectral feature, and shape feature images, converted into vectors, and input into the neural network for training. Given that the multilayer backpropagation neural network does not always achieve satisfactory results, the restricted Boltzmann machine (RBM), which learns one layer of features at a time, is chosen as the network training model. The procedure is shown in Fig. 3, where $$\omega_1, \omega_2, \omega_3, \omega_4$$ are the weights of each layer; $$\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4$$ are the learning rates of each layer; $$t_1$$ and $$t_2$$ are the remote sensing images acquired at different times; and $$V_{ij}$$ represents the feature vector. The neighborhood features of the changed class $$w_{c}$$ and unchanged class $$w_{uc}$$ are first input (Fig. 3(a)). A stack of RBMs is learned for pre-training (Fig. 3(b)), after which the RBMs are unrolled to create a deep neural network (Fig. 3(c)). We fine-tune the deep neural network using backpropagation of the error derivatives (Fig. 3(d)). Fig. 3(e) depicts the basic structure of an RBM network.
An RBM has $$l$$ visible units $$\left(v_1, v_2, \cdots v_l\right)$$ corresponding to its input features and $$n$$ hidden units $$\left(h_1, h_2, \cdots h_n\right)$$ that are trained, with each visible unit connected to the hidden units. $$W_{n \times l}$$ is the weight matrix between the hidden and visible layers, $$a=\left(a_1, a_2, \cdots a_l\right)$$ are the biases of the visible units, and $$b=\left(b_1, b_2, \cdots b_n\right)$$ are the biases of the hidden units. The energy of a joint configuration of the visible and hidden units is given by the expression [17]:
(3) $$E(v, h)=-\sum_{i \in \text{pixel}} a_i v_i-\sum_{j \in \text{features}} b_j h_j-\sum_{i, j} v_i h_j W_{i j}$$

where $$a_i$$ and $$b_j$$ are the visible and hidden biases defined above. Suppose that $$v_i \in\{0,1\}$$ and $$h_j \in\{0,1\}$$ for all $$i, j$$. Then, for a given $$v$$, the binary state $$h_j$$ of each hidden unit is set to 1 with the probability given by Eq. (4).
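Under the standard RBM formulation [17], with $$b_j$$ denoting the bias of hidden unit $$j$$ as defined for the RBM above, Eq. (4) presumably takes the form:

(4) $$p\left(h_j=1 \mid v\right)=\sigma\left(b_j+\sum_i v_i W_{i j}\right)$$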
In Eq. (4), $$\sigma(x)=1 /\left(1+e^{-x}\right)$$ is the sigmoid function. After the binary states of the hidden units have been set, the reconstructed data are produced by setting each $$v_{i}$$ to 1 with the probability given by Eq. (5).
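By symmetry with Eq. (4), and with $$a_i$$ denoting the bias of visible unit $$i$$ as defined for the RBM above, Eq. (5) presumably takes the form:

(5) $$p\left(v_i=1 \mid h\right)=\sigma\left(a_i+\sum_j h_j W_{i j}\right)$$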
The features of the reconstructed data are then represented by updating the states of the hidden units. The change in weight is calculated by:
(6) $$\Delta W_{i j}=\varepsilon\left(\left\langle v_i h_j\right\rangle_{\text{data}}-\left\langle v_i h_j\right\rangle_{\text{re}}\right)$$

where $$\varepsilon$$ is the learning rate, and $$\left\langle v_i h_j\right\rangle_{\text{data}}$$ and $$\left\langle v_i h_j\right\rangle_{\text{re}}$$ are the fractions of times that visible unit $$i$$ and hidden unit $$j$$ are on together when driven by the data and by the reconstructions, respectively. A simplified version of the same learning rule is used for the biases. The two-layer RBM network is then used to model the neighborhood features. The neighborhood features correspond to the RBM's visible units, and the feature detectors correspond to the hidden units. Through the energy function in Eq. (3), the RBM network assigns a probability to every possible input vector. With the previously selected training sample set, we use a stack of RBMs for pre-training; the pre-training does not use any class-label information. The output of each layer is used as the input of the next layer, and the rules described above are applied to each two-layer RBM in the stack. After pre-training, the RBM model is unrolled to generate a deep neural network, which initially uses the same biases and weights. The entire network is then fine-tuned by backpropagating the cross-entropy error to adjust the weights and obtain the optimal classification. The cross-entropy error is given by Eq. (7).
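Assuming the usual binary cross-entropy, with $$e_i$$ the label of sample $$i$$ and $$\hat{e}_i$$ the corresponding network output, Eq. (7) would read:

(7) $$E=-\sum_i\left[e_i \log \hat{e}_i+\left(1-e_i\right) \log \left(1-\hat{e}_i\right)\right]$$

The pre-training rule of Eqs. (4)-(6) and this error can be sketched with NumPy. The sketch is illustrative only and is not the authors' code: it assumes one step of contrastive divergence, uses probabilities rather than sampled states for the reconstruction statistics, and stores the weights as an `(l, n)` array so that `W[i, j]` matches $$W_{ij}$$ in Eq. (3). The names `cd1_step` and `cross_entropy` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_step(v0, W, a, b, eps, rng):
    """One contrastive-divergence update for a binary RBM.

    v0: (batch, l) binary inputs; W: (l, n) weights;
    a: (l,) visible biases; b: (n,) hidden biases; eps: learning rate.
    """
    ph0 = sigmoid(b + v0 @ W)                         # Eq. (4)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample binary hidden states
    pv1 = sigmoid(a + h0 @ W.T)                       # Eq. (5): reconstruction
    ph1 = sigmoid(b + pv1 @ W)                        # hidden units updated on the reconstruction
    batch = v0.shape[0]
    dW = eps * (v0.T @ ph0 - pv1.T @ ph1) / batch     # Eq. (6)
    da = eps * (v0 - pv1).mean(axis=0)                # simplified rule for the visible biases
    db = eps * (ph0 - ph1).mean(axis=0)               # simplified rule for the hidden biases
    return W + dW, a + da, b + db


def cross_entropy(e, e_hat, tiny=1e-12):
    """Binary cross-entropy between labels e and predictions e_hat (Eq. (7))."""
    e_hat = np.clip(e_hat, tiny, 1.0 - tiny)  # avoid log(0)
    return -np.sum(e * np.log(e_hat) + (1.0 - e) * np.log(1.0 - e_hat))
```

Stacking would repeat `cd1_step` layer by layer, feeding the hidden probabilities of one RBM to the next as its visible data, before the unrolled network is fine-tuned against `cross_entropy`.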
Here, $$e_{i}$$ is the label of sample $$i$$, and $$\hat{e}_{i}$$ is its classification result. The final deep neural network is obtained after pre-training and fine-tuning. The neighborhood features of each pixel are input into the deep neural network, which outputs the pixel's class label. The class labels 0 and 1 denote unchanged and changed pixels, respectively.

3. Discussion and Analysis of Experiments

3.1 Remote Image Datasets

To assess the performance of the proposed method, two remote sensing image datasets were chosen. The first dataset comprises a 1-meter resolution IKONOS true-color image from May 2004 and 0.1-meter resolution UAV images obtained in March 2008. The image has blue, green, and red bands (see Fig. 4(a)), and the image size is 612 $$\times$$ 612 m (see Fig. 4(b)). Fig. 4(c) presents the image used for ground-truthing. The second dataset, with an image size of 1418 $$\times$$ 1700 m, is composed of panchromatic images from IKONOS for July 2005 and from WorldView-2 for July 2010. The spatial resolution is 1 m/pixel for the IKONOS image (see Fig. 4(d)) and 0.5 m/pixel for the WorldView-2 image (see Fig. 4(e)). Fig. 4(f) shows the image used for ground-truth verification.

3.2 Model Evaluation Criteria

To assess the change detection accuracy, three metrics were used: completeness (Com), correctness (Cor), and quality (Qua). The three criteria are calculated from the pixel counts TP, FP, and FN.
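Assuming the standard definitions of these criteria in terms of true positives (TP), false positives (FP), and false negatives (FN), the formulas would be:

$$\text{Com}=\frac{TP}{TP+FN}, \quad \text{Cor}=\frac{TP}{TP+FP}, \quad \text{Qua}=\frac{TP}{TP+FP+FN}$$

As a small illustration, the counts and the three scores can be computed from a predicted and a ground-truth binary change map. The function names below are hypothetical, and the maps are assumed to be equally sized nested lists with 1 for changed (building) pixels and 0 for unchanged pixels.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, and FN over two equally sized binary maps."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # building change detected correctly
            elif p and not t:
                fp += 1      # change detected where there is none
            elif t and not p:
                fn += 1      # building change missed
    return tp, fp, fn


def change_detection_scores(tp, fp, fn):
    """Return (Com, Cor, Qua); a score is 0.0 when its denominator is zero."""
    com = tp / (tp + fn) if (tp + fn) else 0.0
    cor = tp / (tp + fp) if (tp + fp) else 0.0
    qua = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return com, cor, qua
```

Because Qua penalizes both false alarms and misses, it can never exceed either Com or Cor.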
Here, TP is the number of correctly detected building pixels, FP is the number of falsely detected building pixels, and FN is the number of undetected building pixels.

3.3 Rough Change Detection Map

The DIs for the two image datasets were created with POFF. In the SSIM computations, the homogeneity texture feature (GLCM) was used for the first dataset, the dissimilarity texture feature (GLCM) was used for the second dataset, and the elliptic fit shape feature was used for both datasets. Based on experimental analysis, the fusion coefficients of the shape feature DI, morphological building index DI, textural feature DI, and spectral feature DI were set to (1, 0.1, 0.1, 1) and (0.2, 0.2, 0.2, 1.5) for the two datasets, respectively. To avoid losing changed pixels at the coarse change detection stage while decreasing the DNN training time, the crude change detection map was generated using the grey-scale mean of the saliency map as the threshold value. The threshold value was set to 0.002 for the first dataset and 0.003 for the second dataset. The DIs, saliency maps, and crude change detection maps for the two datasets are shown in Fig. 5. The DIs created by POFF highlight the change areas, as seen in Fig. 5(a) and 5(d). The salient regions of the DIs are comparable in shape and location to the salient areas generated by FDS, as shown in Fig. 5(b) and 5(e). As demonstrated in Fig. 5(c) and 5(f), the change area in the crude change detection map is virtually identical to the change area of the ground-truth image.

3.4 Building Change Detection Map

The trained DNN classifier generated the final building change detection map. The accuracy of the proposed technique for detecting building changes was compared with that of POFF+FLICM, POFF+SVM, and Zhang's method [18]. For the first dataset, the building change maps generated by POFF+FLICM, POFF+SVM, and Zhang's method [18] contained many white noise points and missed certain building change regions (see Fig. 6).
In comparison, the building change map obtained using POFF+FDS+DNNs contained less white spot noise and identified more building change regions (see Fig. 6). As a consequence, POFF+FDS+DNNs achieved greater Com, Cor, and Qua values than POFF+FLICM, POFF+SVM, and Zhang's method [18] (see Table 1 and Fig. 7). For the second dataset, the building change maps generated by POFF+FLICM, POFF+SVM, and Zhang's method [18] contained numerous white noise points, although fewer building change regions were missed (see Fig. 6). The building change map obtained by POFF+FDS+DNNs again contained fewer white spots and identified more building change areas (see Fig. 6). Accordingly, as shown in Table 1 and Fig. 7, the Com, Cor, and Qua values of POFF+FDS+DNNs are greater than those of POFF+FLICM, POFF+SVM, and Zhang's method [18]. Based on the above analysis, the POFF+FDS+DNN approach is capable of obtaining high-quality training samples and achieving high-accuracy classification.

3.5 Analysis of Parameters

The size of the neighborhood window $$k$$ used for feature extraction and the DNN parameters may affect the change detection accuracy of POFF+FDS+DNNs.

3.5.1 The neighborhood window $$k$$

The neighborhood window $$k$$ is an important parameter affecting the final change detection results, and its value is chosen mainly according to the change detection accuracy (Com, Cor, Qua). Therefore, we compared the values 3, 5, 7, 9, 11, and 13 for $$k$$ experimentally. The experimental analyses show that the change detection accuracy is higher in both datasets when $$k$$ is set to 5.

3.5.2 DNNs parameters

Numerous factors (e.g., batch size, number of iterations, number of layers, and number of nodes per layer) must be adjusted to construct high-performance DNNs. Since the batch size defines the data subset used in training the network, choosing an appropriate batch size is critical.
During training, the number of iterations refers to the number of times Gibbs sampling is applied to each layer. In general, the more layers a DNN has, the better it can detect features. However, a poor choice of the number of nodes per layer degrades performance: with too many nodes the network may overfit because it is too complex for the dataset being analyzed, and with too few it may fail to learn the features. In this paper, the DNN parameters were determined according to the accuracy of building change detection. The batch size was set to 50 for the first dataset and 30 for the second, and the number of iterations was set to 100 for both datasets. In the experiments, as the network deepens, the training time increases and the network becomes more prone to overfitting. Therefore, a deep network with a 50-250-1 architecture is recommended for the two datasets.

Table 1.
4. Conclusion

To increase the effectiveness of detecting building changes in remote sensing images, we developed a deep learning-based building change detection technique in this research. The POFF technique of DI creation can effectively highlight the changed regions. The proposed FDS-FCM method can obtain reliable training samples, and the final building change map is produced using DNN classification. Compared to existing detection techniques, POFF+FDS+DNNs can detect more building change information and achieve higher detection accuracy.

References