Xiuguo Zou*, Qiaomu Ren*, Hongyi Cao*, Yan Qian*, and Shuaitang Zhang*
Identification of Tea Diseases Based on Spectral Reflectance and Machine Learning
Abstract: Machine learning models learn rules from training data and can then classify unknown objects. However, hyperspectral data are usually high-dimensional, which may cause over-fitting. In this research, a methodology for identifying tea diseases was proposed based on spectral reflectance and machine learning, comprising a feature selector based on the decision tree and a tea disease recognizer based on random forest. The proposed methodology was evaluated through experiments. The experimental results showed that the proposed methodology significantly improved the identification accuracy, recall rate, and F1 score of tea disease identification, by an average of 15%, 7%, and 11%, respectively. Therefore, the proposed methodology could select features relatively well and learn from high-dimensional data, achieving non-destructive and efficient identification of different tea diseases. This research provides a new idea for the feature selection of high-dimensional data and the non-destructive identification of crop diseases.
Keywords: High Dimensional Data, Machine Learning, Spectral Reflectance, Tea Diseases
1. Introduction
Tea is one of the oldest drinks in China and a major cash crop. It is valued for clearing heat, detoxifying, and relieving fatigue, and is therefore popular among consumers. The main tea-producing areas in China have a warm and humid climate. However, such climatic conditions are conducive to the breeding of pathogens. Tea may also become diseased during transportation and storage, which significantly reduces its quality and yield. Identifying tea diseases at an early stage has therefore become an active and necessary research topic [2,3]. Conventional disease identification methods include manual methods and physicochemical methods. Manual methods identify tea diseases by sight and touch, which requires experienced experts or agricultural workers. However, the results vary significantly among examiners. Moreover, examiners tire after long periods of observation, which lowers the efficiency and accuracy of identification. Physicochemical methods identify tea diseases with techniques from chemistry and molecular biology, for example, fluorescence immunoassay and the polymerase chain reaction (PCR) method.
However, physicochemical identification is destructive in most cases, meaning that it destroys the object being examined. It is also time-consuming and requires professional skills. Therefore, an efficient and non-destructive method for identifying crop diseases is urgently needed. Recently, researchers have extensively explored image recognition, computer, laser, and hyperspectral imaging technologies to identify crop diseases. For example, Qin et al. segmented images of alfalfa using the K-means clustering algorithm and linear discriminant analysis; the naive Bayes method and the support vector machine (SVM) were also used to establish the disease identification model. Chen et al. employed wavelet transform and textural matrix analytical calculation to enhance images of wheat disease and retrieved disease images by image matching. Tian and Li used chromaticity moments as the eigenvector to identify cucumber disease with the SVM method. Chai and Wang used the Bayesian discriminant method to identify early blight, late blight, and leaf mold of tomato through image processing and pattern recognition. Wei segmented and marked tea images and classified tea quality using the HSI (hue, saturation, intensity) color model. Based on the color and shape of tea and tea stems, Chen classified tea with a multi-feature, multiple-classifier approach derived from SVM and Bayesian classifiers.
Hyperspectral technology combines the advantages of spectral identification and image identification: it can acquire both internal and external information about an object, which has led to its extensive use in monitoring crop growth and identifying crop diseases. For example, Li proposed a non-destructive method for measuring tea quality based on machine vision and spectrum technology, through the measurement of internal tea components and the information diagnosis of tea trees. Chen et al. established a neural network model to examine tea quality based on hyperspectral data of tea. Peng et al. employed spectrum technology for the rapid examination of tea plant growth and tea quality. Zhao et al. proposed an efficient method for detecting slight damage to fruits using spectral imaging technology. Bravo et al. used spectral reflectance in the visible and near-infrared bands to diagnose stripe rust of wheat at an early stage. Leckie et al. detected aphid damage to pine trees from spectroscopic data in the visible and near-infrared bands.
In this research, a feature selector was applied to the spectral reflectance data of tea to remove irrelevant and redundant data from the high-dimensional data and thus avoid the Hughes phenomenon. Based on the selected spectral reflectance, a tea disease recognizer was built to achieve non-destructive and efficient identification of tea diseases.
2. Materials and Methods
2.1 Experimental Materials
The diseased and healthy tea leaves used in the experiment were collected from Pingshan Forest Park, Luhe District, Nanjing. The samples were packed into sealed bags as soon as they were collected. The bags were kept in a refrigerator to keep the leaves fresh, and the experiment was carried out in the laboratory immediately afterwards. After screening and processing by agricultural experts, 80 leaf samples with anthracnose, 72 leaf samples with brown leaf spot, 80 leaf samples with tea white star, and 60 healthy leaf samples were selected for the experiment. The images of the samples are shown in Fig. 1.
2.2 Experimental Device
The experimental device was a hyperspectral imaging system, as shown in Fig. 2. It was composed mainly of an ImSpector V10E spectrograph, a GEV-B1621M CCD camera, an optical halogen lamp, a camera obscura, a control cabinet, an electric displacement console, and a computer. The spectral range of the hyperspectral camera was 358 nm to 1,021 nm, and the spectral resolution was 2.8 nm.
2.3 Data Collection
Data collection was performed according to the following steps:
Step 1: acquire the all-white calibration image W by imaging a standard white calibration board with 99% reflectivity.
Step 2: acquire the all-black calibration image D by imaging with the lens cover on.
Step 3: place each leaf sample on the objective table, adjust the table to the appropriate position, and collect the data for all leaf samples. The 616-dimensional original hyperspectral image I, covering wavelengths of 358–1,021 nm, was obtained using the hyperspectral image capture software (Spectral Image).
The parameter settings for the above steps are presented in Table 1.
2.4 Data Processing
The data processing was performed on a computer with 16 GB of RAM and an Intel Core i5-6500 CPU, on which Excel 2010, MATLAB 2016a (MathWorks, Natick, MA, USA), and ENVI 5.3 (Exelis Visual Information Solutions Inc., Boulder, CO, USA) were installed.
2.4.1 Image correction
To eliminate the interference noise introduced during data collection, the original hyperspectral image I was corrected using Eq. (1), and the corrected image was recorded as R.
$$R = \frac{I - D}{W - D} \tag{1}$$
where R is the corrected image, I is the original hyperspectral image acquired by the hyperspectral system, and D and W are the black and white calibration images introduced in Section 2.3.
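The correction step can be illustrated with a short NumPy sketch. It assumes the standard black-and-white correction R = (I - D) / (W - D) applied element-wise to a (rows, columns, bands) array; the function name `correct_image`, the `eps` guard, and the toy values are illustrative assumptions and are not taken from the experimental software.

```python
import numpy as np

def correct_image(raw, white, dark, eps=1e-12):
    """Black-and-white correction, assumed form: R = (I - D) / (W - D)."""
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    denom = white - dark
    # Mark pixels where the white and dark references coincide as invalid
    # instead of dividing by zero.
    denom = np.where(np.abs(denom) < eps, np.nan, denom)
    return (raw - dark) / denom

# Toy 2 x 2 x 3 cube (rows, columns, bands); a real cube here would have 616 bands.
dark = np.full((2, 2, 3), 10.0)
white = np.full((2, 2, 3), 210.0)
raw = np.full((2, 2, 3), 110.0)
corrected = correct_image(raw, white, dark)
```

In this toy case every corrected value is (110 - 10) / (210 - 10) = 0.5, that is, a relative reflectance halfway between the two references.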
2.4.2 Relative spectral reflectance of ROI region
Each pixel in a hyperspectral image carries the spectral information of the full waveband. According to the average distribution of the disease spots in the samples, a region of 200×200 pixels at the center of the leaf was selected as the region of interest (ROI). The average spectral reflectance of the ROI was extracted from each of the 80 leaves with anthracnose, 72 leaves with brown leaf spot, 80 leaves with tea white star, and 60 healthy leaves. The results are shown in Fig. 3.
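A minimal sketch of the ROI averaging is given below. It assumes that the corrected image is stored as a (rows, columns, bands) NumPy array and, as a simplification, centers the window on the image rather than on the leaf; `roi_mean_spectrum` is a hypothetical helper name.

```python
import numpy as np

def roi_mean_spectrum(cube, size=200):
    """Mean spectrum of a centered size x size window of a (rows, cols, bands) cube."""
    rows, cols, _ = cube.shape
    if rows < size or cols < size:
        raise ValueError("cube is smaller than the requested ROI")
    r0 = (rows - size) // 2
    c0 = (cols - size) // 2
    roi = cube[r0:r0 + size, c0:c0 + size, :]
    # Average over both spatial axes, leaving one value per band.
    return roi.mean(axis=(0, 1))

# Toy cube: 6 x 6 pixels and 4 bands, with a 2 x 2 patch of known reflectance
# in the center and zeros elsewhere.
cube = np.zeros((6, 6, 4))
cube[2:4, 2:4, :] = [0.1, 0.2, 0.3, 0.4]
spectrum = roi_mean_spectrum(cube, size=2)
```

With the default `size=200` and a 616-band cube, the function would return one 616-dimensional feature vector per leaf.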
2.5 Research Methods
The One-vs-All method was used to transform the multi-class classification problem into binary classification problems. The first of the classes was marked as the positive class, and all other classes were marked as the negative class. The second, third, and fourth classes were each treated in the same way.
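The One-vs-All relabeling can be sketched in a few lines of Python. The class names follow the four sample categories of Section 2.1; the function name and the five example labels are illustrative only.

```python
CLASSES = ["anthracnose", "brown leaf spot", "tea white star", "healthy"]

def one_vs_all(labels, positive):
    """Return 1 for samples of the positive class and 0 for all other samples."""
    return [1 if label == positive else 0 for label in labels]

labels = ["anthracnose", "healthy", "brown leaf spot", "anthracnose", "tea white star"]

# One binary task per class: four tasks in total.
binary_tasks = {name: one_vs_all(labels, name) for name in CLASSES}
```

Each sample is positive in exactly one of the four binary tasks.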
In both the training and testing processes, 5-fold cross-validation was used to evaluate the learning performance. To ensure the stability and accuracy of the experimental results, the 5-fold cross-validation was repeated ten times, and the average was used as the evaluation index. In addition, the original data were assigned to each fold according to the sample ratio of 8:7:8:6, so that each fold kept the class distribution of the original sample and every class was represented in training, which improves the performance of the methodology (Fig. 4).
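The stratified fold assignment can be sketched as follows. This is one simple way to keep the class ratio in every fold (shuffle each class, then deal its samples out to the folds in turn); the exact procedure used in the experiment is not specified, so the function and the seeds are assumptions.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds class by class, so that every fold
    keeps roughly the class ratio of the full data set."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Class sizes from Section 2.1 (80 : 72 : 80 : 60).
labels = (["anthracnose"] * 80 + ["brown leaf spot"] * 72
          + ["tea white star"] * 80 + ["healthy"] * 60)

# Ten repetitions with different shuffles, mirroring the repeated 5-fold scheme.
repetitions = [stratified_folds(labels, k=5, seed=rep) for rep in range(10)]
first = repetitions[0]
fold_counts = [
    {name: sum(labels[i] == name for i in fold) for name in set(labels)}
    for fold in first
]
```

Each fold then holds 16 anthracnose, 14 or 15 brown leaf spot, 16 tea white star, and 12 healthy samples, which is close to the 8:7:8:6 ratio.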
The evaluation indexes used in this research included the identification accuracy, recall rate, and F1 score. Through the One-vs-All method, 12 evaluation indexes could be obtained from four categories (Fig. 5).
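As a minimal sketch of how the three indexes can be computed for one One-vs-All task, the function below assumes that the identification accuracy corresponds to precision, the usual partner of recall in the F1 score. This reading, the function name, and the toy labels are assumptions and do not come from the experiment.

```python
def binary_scores(y_true, y_pred):
    """Precision, recall, and F1 score for one binary task (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy task: 3 positive and 5 negative samples, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = binary_scores(y_true, y_pred)
```

Applying the function to each of the four One-vs-All tasks yields the 4 × 3 = 12 evaluation indexes mentioned above.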
2.5.1 Feature selection based on decision tree
The data obtained in this research were hyperspectral, and each original sample had 616 features. Such samples often contain irrelevant and redundant features, which not only slow down learning and increase the training time but also degrade the overall performance of the classifier.
The decision tree has been extensively employed as a feature selection method. It divides the samples into subsets according to information entropy and is well suited to small-sample data [21, 22]. The ID3 decision tree was used to select features from the whole feature space.
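The entropy criterion behind this kind of feature selection can be illustrated with a small sketch. Classic ID3 splits on categorical attributes; for continuous reflectance values the sketch assumes a simple threshold split, and the band values, labels, and thresholds are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the samples at value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

labels = ["diseased", "diseased", "healthy", "healthy"]
band_a = [0.20, 0.25, 0.60, 0.65]  # hypothetical band that separates the classes
band_b = [0.40, 0.41, 0.40, 0.41]  # hypothetical band that carries no class information

gain_a = information_gain(band_a, labels, threshold=0.4)
gain_b = information_gain(band_b, labels, threshold=0.405)
```

A band such as `band_a` yields the full one bit of information gain and would be chosen for a split, whereas a band such as `band_b` yields no gain and would never be selected.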
The original sample, consisting of the 616-dimensional original data, was used to build a tea disease recognizer with the decision tree as the classifier. This recognizer was trained using 5-fold cross-validation repeated ten times.
The sample after feature selection was obtained from the original sample using the feature selector based on the decision tree. According to the information metric of the decision tree, the number of feature dimensions was reduced from 616 to 16. The results of the feature selection are displayed in Table 2.
A second tea disease recognizer was built from the sample after feature selection, again with the decision tree algorithm as the classifier. Both recognizers, the one based on the original data and the decision tree and the one based on the selected data and the decision tree, were obtained using 5-fold cross-validation repeated ten times. The above process is shown in Fig. 6.
2.5.2 Identification of tea diseases based on random forest
Classification is an essential component of machine learning. Traditional classifiers include the SVM algorithm, the naive Bayesian algorithm, the K-nearest-neighbor algorithm, and the decision tree algorithm [26,27]. However, these classifiers are prone to over-fitting, which sometimes reduces accuracy. Therefore, many scholars have combined multiple models to improve the performance of machine learning, building strong classifiers from weak classifiers. These methods are called ensemble learning.
The random forest algorithm, proposed by Breiman, integrates the Bagging ensemble learning theory and the random subspace method. The basic classifier in the random forest is the decision tree, and the random forest consists of several decision trees obtained by ensemble learning and training. The final classification result is determined jointly by the outputs of all the basic classifiers.
The tea disease recognizer was constructed from the sample after feature selection, using the random forest as the classifier. Throughout the process, 5-fold cross-validation repeated ten times was used. Before training, the number of basic classifiers in the random forest was set to 500. Finally, the tea disease recognizer based on the selected data and the random forest was trained. Fig. 7 presents the workflow of the feature selector and the tea disease recognizer.
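The combination of Bagging, random feature subspaces, and majority voting can be illustrated with a deliberately simplified sketch. Depth-one trees (stumps) chosen by training error stand in for full decision trees here, and the two-band toy data are made up, so this is a teaching aid rather than the recognizer used in the experiment; only the ensemble size of 500 follows the text.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, features):
    """Depth-one tree: pick the feature/threshold with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2  # midpoint between neighboring values
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(label != left_label for label in left)
                      + sum(label != right_label for label in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no split possible: always predict the majority label
        return (features[0], float("inf"), majority(y), majority(y))
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

def fit_forest(X, y, n_trees=500, n_features=1, seed=0):
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]        # bootstrap sample (Bagging)
        features = rng.sample(range(len(X[0])), n_features)  # random feature subspace
        forest.append(fit_stump([X[i] for i in sample],
                                [y[i] for i in sample], features))
    return forest

def predict_forest(forest, row):
    """Majority vote over the outputs of all basic classifiers."""
    return majority([predict_stump(stump, row) for stump in forest])

# Made-up reflectance values in two bands for six leaves.
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
     [0.80, 0.90], [0.90, 0.80], [0.85, 0.75]]
y = ["healthy"] * 3 + ["diseased"] * 3

forest = fit_forest(X, y)
predictions = [predict_forest(forest, row) for row in X]
```

Individual stumps trained on an unlucky bootstrap sample may be poor, but the vote over all 500 recovers the correct label for every toy sample, which is the effect the ensemble is meant to provide.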
3. Results and Discussion
3.1 Feature Selection based on Decision Tree
Tables 3 and 4 give the results before and after feature selection, obtained with the same learning strategy and verification method. Each evaluation index after feature selection was superior to that before feature selection, which indicates that the selected features retained some properties of the original features while reducing the noise caused by irrelevant and redundant features.
3.2 Identification of Tea Diseases based on Random Forest
A comparison of Tables 4 and 5 shows that the identification accuracy, recall rate, and F1 score in Table 5 were higher than those in Table 4 by an average of 10%, 4%, and 8%, respectively, with maximum increases of 23%, 7%, and 15%, respectively. The random forest algorithm thus improved the performance of the classifier and made the classification results more accurate.
3.3 F1 Score Distribution in Multiple Cases
The relationship between the F1 score, identification accuracy, and recall rate is given by Eq. (4). The F1 score, which combines the identification accuracy and the recall rate, was adopted to better evaluate the performance of the classification model. A box plot was used to visualize the distribution of F1 scores in the various cases. The upper and lower whiskers of the box plot represent the maximum and minimum values of the F1 score, respectively, and the discrete points represent outliers. The upper and lower edges of the box represent the upper and lower quartiles, respectively, and the horizontal line inside the box represents the median. The box plots of the F1 score are shown in Fig. 8. Fig. 8(a), (b), (c), and (d) show the distributions of F1 scores for anthracnose, brown leaf spot, tea white star, and healthy leaves, respectively, using the tea disease recognizer based on the original data and the decision tree, the one based on the selected data and the decision tree, and the one based on the selected data and the random forest.
It can be observed from Fig. 8(a) that, for anthracnose, the distribution of F1 scores of the recognizer based on the selected data and the random forest (marked A3) was superior to those of the other two cases (A1 and A2). For brown leaf spot, tea white star, and healthy leaves, the best distribution of F1 scores was likewise achieved by the tea disease recognizer based on the selected data and the random forest.
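The quantities drawn in such a box plot can be computed with the Python standard library, as sketched below. The 1.5 × IQR rule for flagging outliers is a common convention and is assumed here, since the rule used for Fig. 8 is not stated; the F1 values are made up.

```python
import statistics

def box_plot_summary(scores):
    """Quartiles, whisker ends, and outliers of a list of F1 scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [s for s in scores if low_fence <= s <= high_fence]
    outliers = [s for s in scores if s < low_fence or s > high_fence]
    return {
        "lower_whisker": min(inliers),  # minimum value that is not an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": max(inliers),  # maximum value that is not an outlier
        "outliers": outliers,
    }

# Made-up F1 scores from seven cross-validation runs of one recognizer.
f1_scores = [0.50, 0.78, 0.80, 0.82, 0.84, 0.86, 0.88]
summary = box_plot_summary(f1_scores)
```

For these values, the score of 0.50 is reported as a discrete outlier point, and the whiskers span 0.78 to 0.88.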
3.4 Discussion and Future Work
The experimental results showed that the feature selection strategy based on the decision tree was able to reduce the dimension of high-dimensional data. They also showed that the decision tree performed well both as a classification strategy and as a feature selection method. The tea disease recognizer based on random forest could effectively learn from the training data and then identify the diseases that tea might have. In terms of identification accuracy, recall rate, and F1 score, the best experimental results were achieved in many experiments by the methodology combining the feature selector based on the decision tree with the tea disease recognizer based on random forest, laying a foundation for the high-efficiency and non-destructive identification of crop diseases.
Since the F1 score is calculated from the identification accuracy and the recall rate, it served as a good overall measure of the performance of the classification model in this research. However, a difference in index values alone could not fully reflect the advantages and disadvantages of the models.
In addition, the field lacks public datasets for comparing similar research work, comparable to the ImageNet dataset and the COCO dataset used in deep learning research. Similar studies often used different data, making their results less comparable.
With the continuous development of artificial intelligence and embedded technology, machine learning algorithms can run on embedded system platforms in real time. Porting the algorithms proposed in this paper to plant protection drones could help reduce the impact of tea diseases on tea quality and yield, and reduce the economic losses caused by tea diseases.
4. Conclusion
The hyperspectral images were obtained, and the relative spectral reflectance of sensitive bands in the ROI was extracted as the feature. The decision tree was used as a feature selection method to remove irrelevant or redundant features. The 16-dimensional features were selected from the 616-dimensional features, and the decision tree was used as the classifier to learn the features before and after feature selection. The F1 score increased by an average of 3% when the decision tree was used for feature selection, indicating the good ability of the decision tree in feature selection.
Compared with the tea disease recognizer based on the original data and the decision tree and the one based on the selected data and the decision tree, the recognizer based on the selected data and the random forest performed better. In the end, the average F1 score of tea disease identification was over 80%.
This work was supported by the Fundamental Research Funds for the Central Universities of China (No. KYTZ201661), the China Postdoctoral Science Foundation (No. 2015M571782), the Jiangsu Agricultural Machinery Foundation (No. GXZ14002), and the University Student Entrepreneurship Training Program of Jiangsu Province (No. 201810307031T).
He holds a Ph.D. and is an associate professor. He received his doctoral degree from Nanjing Agricultural University (China) in 2013 and currently works there. His research interests focus on image processing and pattern recognition. He has authored over 20 technical journal papers in the area of image processing and pattern recognition.
He received his B.S. degree from the College of Engineering, Nanjing Agricultural University, in 2019. Since September 2019, he has been a graduate student at Southeast University, majoring in computer science. His current research interests include big data processing based on machine learning and deep learning.
She holds a Ph.D. and is an associate professor. She received her doctoral degree from Nanjing Agricultural University (China) in 2014 and currently works there. Her research interests focus on machine learning and deep learning. She has authored over 10 technical journal papers in the area of machine learning and deep learning.