Yuanhang Jin, Maolin Xu and Jiayuan Zheng

Automatic Detection of Dead Trees Based on Lightweight YOLOv4 and UAV Imagery

Abstract: Dead trees significantly impact forest production and the ecological environment and constrain the sustainable development of forests. A lightweight YOLOv4 dead tree detection algorithm based on unmanned aerial vehicle (UAV) images is proposed to address the limitations of current dead tree detection, which relies mainly on manual inspections that are inefficient, unsafe and prone to missed detections. An improved logarithmic transformation method was developed in data pre-processing to reveal tree features in shadowed areas. In the model structure, the original CSPDarkNet-53 backbone feature extraction network was replaced by MobileNetV3, and some of the standard convolutional blocks in the original extraction network were replaced by depthwise separable convolution blocks. The ReLU6 activation function replaced the original LeakyReLU activation function to make the network more robust for low-precision computations. The K-means++ clustering method was also integrated to generate anchor boxes better suited to the dataset. The experimental results show that the improved algorithm achieved an accuracy of 97.33%, higher than other methods, and a detection speed higher than that of YOLOv4, improving both the efficiency and the accuracy of the detection process.

Keywords: Dead Tree, Deep Learning, MobileNetV3, Object Detection, YOLOv4

1. Introduction

Tree health assessment in forests has important implications for biodiversity, forest management, and environmental monitoring. Dead trees are a vital indicator of forest biodiversity and ecosystem health, making dead tree detection essential [1]. In practice, dead tree detection remains relatively primitive, relying mainly on manual patrols and observations, which are labor-intensive and difficult to carry out in treacherous terrain. Traditional manual detection methods cannot efficiently meet the increasingly critical needs of dead tree detection. Numerous studies have used remote sensing images to improve the efficiency of dead tree detection. For example, Otsu et al. [2] utilized point cloud data to detect dead trees, achieving an overall accuracy of 94.3%. Kaminska et al. [1] employed multispectral images to classify dead trees, reaching an overall accuracy of 95%. However, the hardware cost of LiDAR and multispectral methods is high, and data acquisition can be complex. As a miniature low-altitude remote sensing platform, the unmanned aerial vehicle (UAV) has numerous advantages, such as low operating costs, convenient operation, and flexible acquisition times. A small UAV with a visible-light camera can easily and quickly monitor a target area on a large scale, which is vital for developing small- and medium-scale remote sensing applications and forestry detection technology. With the growing application of computer vision to target detection, machine learning has been utilized for tree detection in remote sensing images. For example, Malek et al. [3] used the scale-invariant feature transform (SIFT) to extract a set of key points of palm trees and then used an extreme learning machine (ELM) classifier to distinguish palm trees from other vegetation. However, the SIFT algorithm only selects features at a few key points in a sample; these features are local and less accurate than RGB-based global feature methods. Turning to deep learning, Li et al.
[4] applied a deep learning-based convolutional neural network (CNN) approach to detect densely planted Malaysian oil palm trees, with 96% of samples detected correctly. Culman et al. [5] applied a CNN to phoenix tree detection in the Canary Islands and correctly detected 86% of the samples. Guirado et al. [6] proposed a CNN-based shrub detection method using Google Earth images as the data source, achieving better detection results than previous single-tree detection methods. Tao et al. [7] used a CNN to detect dead pine trees photographed by UAV. Yu et al. [8] used YOLOv4 for the early detection of trees infected with pine nematode disease and achieved 57%–63% accuracy; however, their network structure is relatively simple, extracts insufficient information, and handles complex targets poorly. Using YOLO as the base model, Junos et al. [9] made improvements (e.g., adding the swish activation function and optimizing the prior bounding box to detect palm fruits) and obtained better detection results. Liu et al. [10] used Fast-RCNN-based algorithms to detect palm trees at three sites in Malaysia and achieved more than 95% accuracy. Yarak et al. [11] combined high-resolution images with the Fast-RCNN [12] framework for automatic detection and health classification of palm trees, which also achieved good detection results. However, the above methods involve long training and detection times and cannot meet real-time demands.

This paper addresses the shortcomings of costly data acquisition, insufficient information extraction, and lengthy training time. To obtain a complete image dataset, we collected tree images using a consumer-grade UAV platform, enhanced them using an improved logarithmic transformation, and applied enhancement and image rotation for data augmentation. This paper takes the classical one-stage target detection algorithm YOLOv4 as its foundation to reduce network parameters and computation, improve training and detection speed, and preserve detection accuracy. First, the K-means++ clustering algorithm was used to obtain the prior bounding boxes instead of the K-means used in the original YOLOv4 model. The CSPDarkNet-53 backbone network was then replaced with the lighter MobileNetV3 network, and depthwise separable convolution was added to the enhanced feature extraction network to further reduce the number of model parameters while preserving feature extraction capability. Finally, the improved MobileNetV3-YOLOv4 network was trained on the dataset, and the weight files obtained after training were used to produce the detection results. The technical flow chart of this paper is shown in Fig. 1.

2. Materials and Study Area

2.1 Data Acquisition

In this paper, dead trees were the objects of interest. RGB images of dead trees in a scenic area of southern Liaoning were collected by aerial photography of the forest area with drone-borne cameras. Images were captured between 3 PM and 5 PM with a DJI PHANTOM 4 RTK quadcopter flying at an altitude of 80–120 m. The camera pointed vertically downward (90°), with heading and side overlaps of 80%.

2.2 Image Pre-processing

Because the original UAV images are large and each contains only a small amount of useful information, training the deep learning network model on the entire images would be very slow and inefficient.
Therefore, in this paper, the original images were cropped into 912×912 images at 96 dpi, and invalid images were removed manually. In many practical projects, it is difficult to find sufficient data to complete the task [13]; before training, the dataset therefore needs to be augmented to provide more images and to avoid problems such as poor model stability or overfitting. Image flipping and an improved log-transform image enhancement method were used to process the images of the dead tree training set, and the results were added to the training set.
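For illustration, a minimal sketch of this tiling-and-flipping step is given below. The file paths, crop stride, and use of Pillow are assumptions for illustration rather than the authors' tooling; invalid tiles would still be screened out manually as described above.

```python
from pathlib import Path
from PIL import Image

TILE = 912  # crop size used in this paper

def tile_and_flip(src: Path, out_dir: Path) -> None:
    """Crop a large UAV image into 912x912 tiles and add horizontally
    flipped copies as a simple training-set augmentation."""
    img = Image.open(src)
    w, h = img.size
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, left in enumerate(range(0, w - TILE + 1, TILE)):
        for j, top in enumerate(range(0, h - TILE + 1, TILE)):
            tile = img.crop((left, top, left + TILE, top + TILE))
            tile.save(out_dir / f"{src.stem}_{i}_{j}.jpg")
            tile.transpose(Image.FLIP_LEFT_RIGHT).save(
                out_dir / f"{src.stem}_{i}_{j}_flip.jpg")

tile_and_flip(Path("uav/DJI_0001.JPG"), Path("dataset/train"))
```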
The light intensity of a scene shot at different angles can interfere with the captured tree image and affect the algorithm's detection. Adjusting the image brightness can reduce the effect of uneven illumination to a certain extent. Logarithmic transformation [14] is a commonly used image enhancement method that expresses a logarithmic relationship between the grayscale pixel values of the output image and the corresponding input image. It is used as a component of image processing algorithms to emphasize the lower grayscale portions of an image by expanding its lower grayscale values while compressing the higher grayscale values. Its conventional form is Eq. (1):

[TeX:] $$f(i, j)=A \log (1+g(i, j)),$$ (1)

where [TeX:] $$f(i, j) \text{ and } g(i, j)$$ are the grayscale pixel values of the output image and the corresponding input image, respectively, and A is the intensity parameter, which maps the dynamic range of the shadow regions into a suitable interval so that more detail is visible in shadowed areas. When this method was used for image enhancement, the results indicated that the transformation was too strong and caused over-brightening or overexposure of the image. The basic logarithmic transformation can be written in different forms depending on the application. To address the excessive transformation produced when the original logarithmic transformation is applied, an improved transformation, Eq. (2), was designed for dark-area compensation, where v is an adjustment parameter that makes the transformation more flexible; the value chosen for v has an essential effect on the result. Fig. 2 compares the original image with the results of the basic and improved equations.
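As an illustration, the sketch below applies the conventional transform of Eq. (1) and an improved, v-parameterized variant to an 8-bit image. Since Eq. (2) itself is not reproduced here, the normalized form log(1 + v·g)/log(1 + v) used in the improved branch is an assumption chosen to match the described behavior; function and parameter names are likewise illustrative.

```python
import numpy as np

def log_transform(g: np.ndarray, A: float = 1.0) -> np.ndarray:
    """Conventional log transform, Eq. (1): f = A * log(1 + g), on a [0, 1] image."""
    g = g.astype(np.float32) / 255.0            # scale 8-bit input to [0, 1]
    f = A * np.log1p(g)                         # expand dark values, compress bright ones
    return np.clip(f * 255.0 / np.log(2.0), 0, 255).astype(np.uint8)

def improved_log_transform(g: np.ndarray, A: float = 1.0, v: float = 50.0) -> np.ndarray:
    """Assumed improved variant: f = A * log(1 + v*g) / log(1 + v).
    The log(1 + v) divisor keeps the output in range, limiting the
    over-brightening produced by the basic transform."""
    g = g.astype(np.float32) / 255.0
    f = A * np.log1p(v * g) / np.log1p(v)
    return np.clip(f * 255.0, 0, 255).astype(np.uint8)
```

In this form, small values of v approach a linear mapping, while large values lift shadows aggressively, consistent with the observation that the choice of v is critical.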
After enhancement of the original images, the dead tree targets were annotated using the open-source software LabelImg [15], as shown in Fig. 3, and the annotations were saved as XML files in PASCAL VOC [16] format.

3. Method

In this experiment, MobileNetV3, a lightweight network, was used for backbone feature extraction to reduce the number of network parameters and to tackle the redundant structure of YOLOv4 [17]. In addition, because the enhanced feature extraction network of YOLOv4 contains many convolutional blocks, some were replaced with depthwise separable convolution blocks to further reduce the model parameters while improving detection speed. LeakyReLU was replaced with the ReLU6 activation function to improve model robustness, and the K-means++ clustering algorithm replaced the K-means algorithm of the original YOLOv4 to obtain more reasonable, higher-accuracy prior bounding boxes, making the model easier to train.

3.1 Lightweight Network MobileNetV3

At the heart of lightweight networks is the design of more efficient CNN computations that reduce the number of network parameters without losing network performance. In this paper, the CSPDarkNet-53 backbone feature extraction network of YOLOv4 was replaced with the MobileNetV3 structure. The lightweight network MobileNetV3 [18] combines the depthwise separable convolution of MobileNetV1 [19], the inverted residual structure with linear bottleneck of MobileNetV2 [20], and the squeeze-and-excitation (SE) [21] structure of MnasNet, together with an improved lightweight attention model.

3.2 Depthwise Separable Convolution

Depthwise separable convolution consists of depthwise convolution followed by pointwise convolution, as shown in Figs. 4 and 5. In depthwise convolution, each channel is convolved by exactly one convolution kernel (i.e., each kernel is responsible for a single channel of the feature map, extracting features within that channel), so the feature map obtained after channel-by-channel convolution has the same number of channels as the input. Each pointwise convolution kernel has size C×1×1; the pointwise convolution then fuses the feature information across channels to obtain a new feature map. Depthwise separable convolution yields a feature map with the same dimensionality as standard convolution but requires far fewer parameters and computations, significantly improving algorithm efficiency.

3.3 ReLU6

To make the model more robust in low-precision computation, the original LeakyReLU activation function was replaced by ReLU6. The two activation functions differ mainly in their treatment of positive inputs: LeakyReLU leaves x unchanged for x > 0, whereas ReLU6 caps the output at 6 for x > 6, adding an upper bound. The function equation of ReLU6 is Eq. (3):

[TeX:] $$\operatorname{ReLU6}(x)=\min (\max (0, x), 6).$$ (3)
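To make the parameter savings of Sections 3.2 and 3.3 concrete, the following TensorFlow/Keras sketch (consistent with the framework named in Section 4.1, but not the authors' exact code; layer shapes are illustrative) contrasts a standard 3×3 convolution block with a depthwise separable block, both using ReLU6:

```python
import tensorflow as tf
from tensorflow.keras import layers

def standard_block(x, filters):
    # Standard 3x3 convolution: k*k*C_in*C_out weights.
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)        # ReLU6: min(max(0, x), 6)

def depthwise_separable_block(x, filters):
    # Depthwise 3x3 (one kernel per channel) + pointwise 1x1 (channel fusion):
    # k*k*C_in + C_in*C_out weights, far fewer than the standard block.
    x = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)

inp = tf.keras.Input((52, 52, 256))
m1 = tf.keras.Model(inp, standard_block(inp, 512))
m2 = tf.keras.Model(inp, depthwise_separable_block(inp, 512))
print(m1.count_params(), m2.count_params())     # ~1.18M vs ~0.14M weights
```

The roughly eight-fold reduction at a single block is what accumulates into the overall parameter drop reported in Section 3.5.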
3.4 K-Means++

To reduce the dependence of the clustering results on the selection of the initial cluster centers when obtaining the prior bounding boxes, and to keep the initial clustering centers as far apart as possible, the original method was replaced by the K-means++ clustering method [22]. The K-means++ algorithm optimizes the selection of the K initial clustering centers, effectively reducing the bias introduced by poor initial points. It thereby obtains better-sized prior bounding boxes matched to the corresponding feature maps, effectively improving detection accuracy and recall. The K-means++ algorithm first randomly selects one sample point as the first clustering center. The shortest distance between each sample and the existing clustering centers is then calculated, and each sample is assigned to its nearest center. After a probability is calculated for each sample, the sample with the highest probability is selected as the next center using Eq. (4):

[TeX:] $$P(x)=\frac{D(x)^2}{\sum_{x \in X} D(x)^2},$$ (4)

where [TeX:] $$D(x)$$ is the shortest distance from each sample point to the current cluster centers. The cluster centers are then recalculated from the existing clusters, and the process is repeated until no objects are reassigned to other clusters; finally, the K clustering centers are selected.
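A minimal sketch of this clustering over annotated box sizes is shown below. It is illustrative only: scikit-learn's K-means++ initialization and Euclidean distance stand in for the authors' implementation, and random boxes stand in for the real annotations.

```python
import numpy as np
from sklearn.cluster import KMeans

# wh: N x 2 array of ground-truth box (width, height) pairs from the dataset.
rng = np.random.default_rng(0)
wh = rng.uniform(20, 180, size=(500, 2))       # stand-in for real annotations

# K-means++ seeding spreads the 9 initial centers apart (Eq. (4)),
# then standard K-means iterations refine them.
km = KMeans(n_clusters=9, init="k-means++", n_init=10, random_state=0).fit(wh)
anchors = km.cluster_centers_[np.argsort(km.cluster_centers_.prod(axis=1))]
print(np.round(anchors).astype(int))           # 9 anchor (w, h) pairs, small to large
```

Sorting the centers by area reproduces the small-to-large ordering of the nine anchors listed below.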
In this study, the anchor box widths and heights obtained using K-means++ clustering are as follows: (30,30), (59,45), (48,62), (79,64), (67,95), (97,87), (107,135), (149,106), and (172,173).

3.5 Improvements on the Enhanced Feature Extraction Network

To further reduce the number of network parameters and optimize the model, some of the general convolutions in the YOLOv4 enhanced feature extraction network were replaced with depthwise separable convolutions. The specific changes are as follows: (1) the activation function in the 3×3 general convolutions was changed from LeakyReLU to ReLU6 for more robustness in low-precision computation; (2) part of the convolutions in the three-convolution and five-convolution blocks of the PANet were replaced with depthwise separable convolutions to reduce the number of parameters while affecting feature extraction as little as possible; and (3) the 3×3 general convolution in both down-sampling stages was replaced with a depthwise separable convolution. The number of parameters of the resulting MobileNetV3-YOLOv4 network was reduced from 64,363,101 to 11,692,029, roughly 1/5 of the original network, a significant optimization. The improved MobileNetV3-YOLOv4 network structure is shown in Fig. 6, with the orange parts of the figure highlighting the improvements.

4. Results

The dead tree image dataset used in the experiments contained 10,000 images, 80% of which were used for training and validation and the remaining 20% for the test set. Fig. 7 shows some of the image data.

4.1 Experimental Environment & Evaluation Indicators

The experiments used the open-source TensorFlow framework to implement the network model in Python, with an NVIDIA RTX 3080 (10 GB) graphics card and 32 GB of RAM. The initial learning rate of the training model was set to 0.0001, a simulated cosine annealing decay strategy was used to adjust the learning rate of the network, and the number of epochs was set to 600.
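A sketch of such a schedule is given below, assuming plain cosine decay from the stated initial rate down to a small floor; the exact decay settings are not given here, so the floor value is an assumption.

```python
import math

INITIAL_LR, MIN_LR, EPOCHS = 1e-4, 1e-6, 600

def cosine_annealed_lr(epoch: int) -> float:
    # Cosine decay from INITIAL_LR down to MIN_LR over EPOCHS epochs.
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / EPOCHS))
    return MIN_LR + (INITIAL_LR - MIN_LR) * cos

# e.g., pass to tf.keras.callbacks.LearningRateScheduler(cosine_annealed_lr)
print(cosine_annealed_lr(0), cosine_annealed_lr(300), cosine_annealed_lr(599))
```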
Subjective and objective evaluations were used: the subjective evaluation was mainly human visual interpretation, and the objective indicators were frames per second (FPS) and average precision (AP). FPS is the number of image frames processed per second. The precision-recall (P-R) relationship is used to analyze the accuracy of the network predictions, with each metric calculated as follows:

[TeX:] $$P=\frac{T P}{T P+F P},$$ (5)

[TeX:] $$r=\frac{T P}{T P+F N},$$ (6)

[TeX:] $$A P=\int_0^1 P(r) d r,$$ (7)

where true positive (TP) is the number of positive samples correctly detected and judged positive, false positive (FP) is the number of negative samples incorrectly detected and judged positive, false negative (FN) is the number of positive samples not detected and judged negative, [TeX:] $$r$$ is the recall value, and [TeX:] $$P(r)$$ is the precision value corresponding to r [23]. The area under the P-R curve is the AP. In target detection tasks, the AP metric measures the detection accuracy for a particular target class and hence the performance of single-class detection algorithms. Since dead trees were the only detection target, the AP metric was used.
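For reference, a minimal NumPy sketch of these metrics is given below (illustrative only; AP is approximated here by trapezoidal integration over sampled P-R points rather than any particular interpolation convention):

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    p = tp / (tp + fp)   # Eq. (5): fraction of detections that are correct
    r = tp / (tp + fn)   # Eq. (6): fraction of ground-truth dead trees found
    return p, r

def average_precision(recalls, precisions) -> float:
    # Eq. (7): area under the P-R curve, integrated over recall in [0, 1],
    # here via the trapezoidal rule over (recall, precision) samples.
    r, p = np.asarray(recalls), np.asarray(precisions)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

print(precision_recall(tp=90, fp=5, fn=10))  # -> (0.947..., 0.9)
```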
4.2 Analysis of Experimental Results

4.2.1 Impact of different lightweight backbone networks on model performance

To demonstrate the benefit of replacing the backbone network, three lightweight networks (MobileNetV1, MobileNetV2, and MobileNetV3) were substituted for the backbone of the original YOLOv4 model, and the same dataset was trained with all three. The weights obtained after training were used for prediction on the test set. The experimental results are shown in Table 1.

Table 1.

Table 1 shows that the YOLOv4 model using MobileNetV3 as the backbone achieves an AP of 88.49%, outperforming the other two models by 3.79% and 8.39%. Table 1 also lists the number of parameters and the FPS of the three models; the parameter count of MobileNetV3-YOLOv4 lies between those of the other two. Due to its simpler structure, MobileNetV1 has a higher FPS and faster detection speed, while the difference in detection speed between MobileNetV3-YOLOv4 and MobileNetV2-YOLOv4 is negligible. Overall, MobileNetV3-YOLOv4 provides the best dead tree detection of the three models.

4.2.2 Effect of different enhanced feature extraction networks on model performance

To determine whether adding ReLU6 and depthwise separable convolution optimizes the model, the training results of the original and improved extraction networks were analyzed and compared on the same dataset. Table 2 summarizes the results of the two models.

Table 2.
Table 2 shows that the improved extraction network increased both the model's detection accuracy and its detection speed, with an AP value of 89.77%, 1.28% higher than the original network. This increment in AP means that the improved network retains features better and reduces feature loss while enhancing feature learning. The proposed algorithm retains image information to a greater extent and is more suitable for detecting small targets, whose features are usually sparse and difficult to learn. After replacing the general convolution with the lighter depthwise separable convolution, the FPS improved by 10.72, considerably increasing the model's detection speed.

4.2.3 Impact of clustering algorithms on model performance

To evaluate the prior bounding boxes obtained by K-means++, this paper analyzed and compared clustering with the K-means and K-means++ methods in network training. The prior bounding boxes from each clustering method were used for training and prediction on the same test set, with the results shown in Table 3. The AP for clustering with K-means++ was 97.33%, 7.56% higher than YOLOv4 with K-means. The results show that K-means++ achieves optimized clustering centers with aspect ratios that better match the characteristics of the dead tree dataset. Using the K-means++ algorithm for training and testing therefore makes it easier to fit the prior bounding box to the target, reducing training difficulty, enhancing localization, and improving detection accuracy.

4.2.4 Comparison experiments of different detection models

To further evaluate model reliability, the same experimental data were used with four other models (YOLOv4, YOLOv4-tiny, SSD [24], and Junos' method [9]) and the proposed MobileNetV3-YOLOv4 model. Fig. 8 presents the loss trends of the five models during training. As shown in Fig. 8, YOLOv4, YOLOv4-tiny, and Junos' method had higher losses than the other two. The loss of the SSD algorithm plateaus at epoch 200 but remains slightly higher than that of the proposed algorithm, whose loss decreases fastest between epochs 0–200 and flattens out at epoch 400. To verify the effectiveness of the improved algorithm in detecting dead trees and to compare the models visually, each of the five models was used to detect the test set images; Fig. 9 presents some of the results. In Fig. 9(a), the SSD model misidentified similarly colored roads as dead trees, while the YOLOv4, YOLOv4-tiny, and Junos models failed to detect the trees. In Fig. 9(b), the other four models missed detections against a complex background, while the proposed algorithm accurately detected the dead trees in the images. In Fig. 9(c), the proposed algorithm detected the targets well, while the other four models missed them. The five algorithms were then tested on a test set to verify the performance of the improved MobileNetV3-YOLOv4 algorithm, as shown in Table 4. The MobileNetV3-YOLOv4 detection model has an AP value higher than the YOLOv4 model by 14.62%, the YOLOv4-tiny model by 13.66%, the SSD model by 16.65%, and Junos' method by 8.74%. The MobileNetV3-YOLOv4 model has a higher FPS than YOLOv4 and a lower FPS than SSD, YOLOv4-tiny, and Junos' method.

Table 4.
The AP values were recorded during the training process. Fig. 10 shows the P-R curves of the YOLOv4, YOLOv4-tiny, SSD, Junos' method, and MobileNetV3-YOLOv4 models for dead tree detection. The larger the area the P-R curve covers on the coordinate system, the higher the detection accuracy and the better the model. As presented in Table 4, the AP values for YOLOv4, YOLOv4-tiny, and SSD were all around 80%, so the P-R curves of these three models are similar. The curve for MobileNetV3-YOLOv4 covers almost the entire coordinate system and lies above the other curves, indicating that the MobileNetV3-YOLOv4 model outperforms the other four models.

4.2.5 Testing model image adaptation experiments

To measure the effect of illumination on detection accuracy, 600 images of dead trees in bright and dark conditions were manually screened, as shown in Fig. 11. The YOLOv4, YOLOv4-tiny, SSD, and proposed algorithms were then applied to images of different brightness to compare and verify their adaptability. The experimental results are shown in Table 5. The proposed algorithm achieved an accuracy of 97.85% on bright images, which is 12.35%, 14.3%, and 11.29% higher than YOLOv4, YOLOv4-tiny, and SSD, respectively. On darker images, the proposed algorithm also achieved good detection results; the SSD algorithm likewise performed well, although its accuracy was 2.24% lower than that of the proposed algorithm. The YOLOv4 and YOLOv4-tiny algorithms were less effective on dark images, with accuracies of 88.80% and 86.23%, lower than the proposed algorithm by 9.35% and 11.92%. These results suggest that the proposed approach detects dead trees better across different environmental conditions.

Table 5.
4.2.6 Impact of image enhancement on model accuracy

A total of 6,000 images (original and processed) were used to investigate the effect of image enhancement on model detection accuracy; the experimental results are shown in Table 6. The dead tree detection accuracy after training with image enhancement was 95.87%, an improvement of 2.60% over the original images. The results show that the image enhancement process allows the model to detect more feature points and learn dead tree features more accurately, improving its detection capability.

4.2.7 Effects of different parameters on the accuracy of the model

(1) Effect of different batch sizes on model accuracy

Different batch sizes (4, 8, and 16) were used in the experiments to evaluate the influence of batch size on model accuracy. The experimental results are shown in Table 7. From Table 7, the highest AP value was 97.33% at batch size 8, which is 0.69% and 7.03% higher than at batch sizes 4 and 16, respectively. A possible explanation is that when the batch size is too small, the model converges slowly, while when it is too large, memory becomes insufficient and the model's generalization ability weakens. The detection effect is therefore best at batch size 8.

(2) Effect of dataset size on the accuracy of the model

Eight datasets of varying size (1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, and 8,000 images) were used in the experiments to analyze the effect of dataset size on model accuracy. The experimental results are shown in Table 8. As the dataset size increased, the AP value of model detection also increased. The detection accuracy improved by 8.5% and 8.92% when the number of images was increased from 1,000 to 2,000 and from 3,000 to 4,000, respectively. The model achieved good detection accuracy once the dataset reached 6,000 images, after which the accuracy improved by approximately 2% for every additional 1,000 images. The highest detection accuracy, 97.33%, was reached at 8,000 images.

5. Discussion and Conclusion

This paper applied deep learning and drones to dead tree detection, developing an approach that can significantly improve efficiency, reduce costs, and decrease terrain-related dangers. The proposed methodology is based on a MobileNetV3-YOLOv4 network applied to UAV imagery, using consumer-grade UAVs to acquire tree images. An improved logarithmic transformation method is introduced in data pre-processing to address the over-brightening or overexposure of images after enhancement; adjusting the parameter v in the improved equation increases the flexibility of image adjustment and enhances the target features. Images processed with the proposed enhancement method were fed into the model for training, and the trained model improved the accuracy of dead tree detection. For detection, the lightweight MobileNetV3-YOLOv4 model is proposed to reduce the number of model parameters and improve detection accuracy for small targets while preserving extraction performance. The K-means++ method was also introduced to generate prior bounding boxes closer to the ground truth boxes, significantly improving detection performance. The detection accuracy reached 97.33%, confirming the reliability of the proposed dead tree detection approach. By significantly improving the efficiency of dead tree detection, the proposed image detection approach is novel and has practical application value.
This paper has some shortcomings to be addressed in subsequent studies. For example, there were still cases of false detection on complex tree images, and the accuracy needs to be further improved. In addition, the image data used in this paper were taken by a single consumer-grade UAV and hence were relatively homogeneous. In future research, more complex and diverse image data can be collected with different devices. At the same time, the model can be further improved to enhance its generalizability and verify the method's feasibility, maximizing the algorithm's value.

Biography

Yuanhang Jin
https://orcid.org/0000-0003-1482-3027
He is currently pursuing an M.E. degree at the School of Civil Engineering, University of Science and Technology Liaoning, Anshan, China. In 2019, he received a B.S. degree from the School of Civil Engineering, Liaoning University of Science and Technology, Anshan, China. His research interests include remote sensing, image detection and deep learning.

Biography

Jiayuan Zheng
https://orcid.org/0000-0003-1944-1167
She is currently pursuing an M.E. degree at the School of Civil Engineering, University of Science and Technology Liaoning, Anshan, China. In 2019, she received a B.S. degree from the School of Civil Engineering, Liaoning University of Science and Technology, Anshan, China. Her research interests include remote sensing, geographical information science and deep learning.

References
|