1. Introduction
Convolutional neural networks (CNN) [1,2] have shown outstanding performance in detection and recognition tasks, including facial recognition [3-7] and disease classification [8-10]. Several studies [11,12] have demonstrated that a model pre-trained on ImageNet can be transferred to medical images. Transfer learning is a widely used method for training on such datasets [13]. By using a model that has been pre-trained on a sufficiently large-scale dataset to initialize the features, a small-scale dataset can be trained more efficiently than it could be from scratch. However, recent CNN models have very deep structures, with hundreds of layers, so it is not easy to retrain these models on small-scale datasets.
Generally, it is difficult to collect large-scale medical imaging datasets due to patient privacy or ethics-related concerns. In the case of rare diseases, small-scale datasets are inevitable. Furthermore, networks for medical images are often applied locally (e.g., within a local community). In other words, it is sometimes unnecessary to collect large-scale datasets such as ImageNet [14].
For example, a hospital in one country rarely needs information about patients from other countries, and local hospitals mainly need information about local residents. Furthermore, the infrastructure necessary for collecting medical images is often not well established. Therefore, an effective deep learning method for training on small-scale medical imaging datasets is needed.
Fine-tuning is a method that retrains all layers and classifiers within a network; it is the approach typically used for transfer learning across various fields. However, the possibility of overfitting is very high when fine-tuning a pre-trained model on a small-scale dataset. To address this limitation, we utilized feature extraction (FE). FE is a method that modifies the last fully connected layer to fit the number of output classes of the dataset while keeping the existing parameters of the convolutional layers frozen [15]. The FE method for transfer learning can be applied in two ways: (1) freeze all convolutional layers, use them only to extract features, and train only the linear classifiers; or (2) divide the convolutional layers into levels and train each level selectively. The convolutional layers can be divided into a low level, a middle level, and a high level, with deeper levels extracting features that are more complex and closer to the shape of the object [16]. Because layers at each level learn different features, it is necessary to determine which level is the most effective to retrain for small-scale medical imaging datasets. Previous studies applying transfer learning to medical imaging [11,12,17,18] all use fine-tuning, which retrains all layers. They report good performance, but fine-tuning on a small-scale medical imaging dataset can lead to overfitting, and updating all layers consumes unnecessary computing power.
Therefore, we devised a method that consumes less computing power and avoids overfitting when learning small-scale medical image datasets. We propose that mid-level FE, which retrains the layers responsible for features such as image texture and parts of the object, is the most effective model. We compared the proposed method with transfer-learning models implemented with fine-tuning and with FE at each level. We also visualize and compare all convolutional layers in the mid-level FE and fine-tuning models. As a result, we confirmed that our model enables efficient small-scale medical image analysis and clearly extracts the features of the medical image.
In this paper, our contributions are as follows:
For training a small-scale medical image dataset, we propose a mid-level FE method that retrains only the middle-level layers. Our proposed method shows good classification accuracy and reduces computation by converging faster than the other baselines. The method is also robust to unseen datasets.
By visualizing the layers of the network, we confirmed that our network learns valid features of the lesion area.
The rest of the paper is organized as follows. In Section 2, we review related work on the structure of transfer learning, studies applying transfer learning to medical imaging, and visualization methods for CNNs. In Section 3, we describe the detailed design of our proposed network, which layers are updated, and the rationale for this design based on visualization studies. Section 4 compares the performance and efficiency of the networks and presents the results of visualizing all layers in the network. Section 5 presents the conclusions.
2. Related Works
Before CNNs showed their tremendous performance, most studies applied pattern recognition or machine learning approaches to computer-aided diagnosis (CAD). There are studies that apply local binary patterns (LBP) [19] and bag-of-visual-words (BoVW) [20,21] to disease classification, and there is also a disease classification study using a support vector machine (SVM) [22]. Since the advent of AlexNet [2], CNNs have been used in most CAD applications, and many researchers are studying the excellent FE capabilities of CNNs.
2.1 Transfer Learning
A CNN has a significant number of parameters, so it needs a large-scale dataset to learn them. Transfer learning [23] is a means of training new datasets that are limited in size. Rather than randomly initializing a new network, the parameters of a pre-trained model are loaded and used during the initialization process. In this case, the pre-trained model should be trained on a very large dataset in order to guarantee good performance. In one study, researchers [23] conducted an experiment to transfer the parameters of AlexNet [2], as learned on ImageNet [24], to the Pascal VOC 2007 and 2012 datasets [25]. Their results demonstrated that the transfer-learning model from ImageNet is much better than a randomly initialized model. In other words, instead of random initialization, transfer learning can be used effectively for learning a small-scale dataset by transferring a model trained on a huge dataset. In terms of implementation, transfer learning can be realized through fine-tuning, which updates all the layers to adjust to a new dataset, or through FE, which extracts input features [15] and trains only the fully connected layers. When implemented with FE, in addition to training only the fully connected layer, FE can specify which layers to train; this is a very important design choice. Fig. 1 shows the transfer-learning structure with fine-tuning and FE, and a minimal code sketch of both variants is given below.
Transfer-learning structure as FE and fine-tuning.
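The following is a minimal sketch of the two implementations, assuming PyTorch and torchvision's AlexNet (the number of target classes and the `pretrained` loading style are illustrative, not taken from the studies cited above).

    import torch.nn as nn
    from torchvision import models

    num_classes = 2  # hypothetical target dataset

    # Fine-tuning: load ImageNet weights, replace the classifier head, update all layers.
    ft_model = models.alexnet(pretrained=True)  # newer torchvision uses the weights= argument
    ft_model.classifier[6] = nn.Linear(4096, num_classes)
    # All parameters keep requires_grad=True, so every layer is retrained.

    # Feature extraction (FE): freeze the pre-trained layers, train only the new head.
    fe_model = models.alexnet(pretrained=True)
    for p in fe_model.parameters():
        p.requires_grad = False
    fe_model.classifier[6] = nn.Linear(4096, num_classes)  # new layer is trainable by default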
2.2 Transfer Learning for Medical Imaging
Research applying transfer learning to medical images has become increasingly common. Bar et al. [11] applied AlexNet [2] trained on ImageNet to chest X-rays and compared the CNN with other methods, namely GIST [26] and BoVW [27]; CNN performance was the best, demonstrating that it is possible to transfer from non-medical datasets to medical imaging datasets. In addition, CheXNet [17], trained with transfer learning, detected pneumonia in chest X-rays with better accuracy than doctors; its authors chose a DenseNet [28] model pre-trained on ImageNet and updated all the layers. Various studies have also been carried out on datasets other than X-rays. Tan et al. [18] studied classification of bronchoscopy images using sequential fine-tuning based on DenseNet, obtaining good results. Shin et al. [12] showed that CifarNet [29], AlexNet, and GoogLeNet [30] trained on ImageNet can be used as pre-trained models for transfer learning to CT datasets. However, all the aforementioned studies implemented transfer learning through fine-tuning and did not compare the convolutional layer levels. Therefore, we should consider whether fine-tuning is appropriate for small-scale medical image sets and compare the training results according to the characteristics of each convolutional layer level.
2.3 Visualizing a CNN Network
AlexNet is an early deep CNN model, but its structure is simple and its performance is good, so it is actively used in various studies. The aforementioned studies also conducted transfer learning using AlexNet. AlexNet has a simpler structure than the latest deep CNN networks, consisting of five convolutional layers, three max-pooling layers, and three fully connected layers.
However, compared to non-CNN models such as general neural networks, AlexNet is not overly simple. The CNN model has a complex structure, which makes its learning process difficult to understand intuitively. For example, the feature extraction, the operations between layers, and the very high-dimensional tensors make it hard to see whether the network is being learned properly. Therefore, to interpret the learning process of CNN networks, many efforts have been made to visualize the inside of the network [11,12,16,31]. Many researchers [16,31,32] used AlexNet in CNN visualization studies. These studies show that the higher-level layers in the network learn object-specific features, while the lower-level layers learn more general features, such as lines, edges, and corners.
Chitra et al. [10] proposed a deconvolution structure that visualizes the feature map by unpooling. Unpooling is the process of returning the rectified feature map, which has passed through the rectified linear unit (ReLU) function, back to the original image dimensions. Specifically, unpooling proceeds in two passes. First, the pooled locations are stored during the max-pooling process (the variables that store the locations are called switches). Second, the feature map is reconstructed by placing the stored values back at the switch locations. Once the unpooled feature maps are obtained, they are rectified through ReLU, which yields valid activation maps. Finally, the rectified features undergo filtering, performed with a reversed version of the convolutional layer that uses the transposed filters of the existing convolutional layer [31]. The visualization of each convolutional layer is obtained by performing all of the unpooling, rectification, and filtering processes described above. In this paper, we note the mid-level FE capability of CNNs and visualize our model through the deconvolution structure of Zeiler et al. [16].
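To make the unpool-rectify-filter sequence concrete, the following sketch reverses one convolution-plus-pooling step using the switches recorded by max pooling. It assumes PyTorch; the layer sizes are illustrative and not taken from any cited network.

    import torch
    import torch.nn as nn

    # Forward pass: record the pooling switches (return_indices=True).
    conv = nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
    pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)
    x = torch.randn(1, 3, 224, 224)
    feat = torch.relu(conv(x))          # rectified feature map
    pooled, switches = pool(feat)       # pooled map plus switch locations

    # Reverse pass: unpool with the stored switches, rectify, then filter with a
    # transposed convolution whose filters are shared with the forward layer.
    unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)
    deconv = nn.ConvTranspose2d(64, 3, kernel_size=11, stride=4, padding=2,
                                output_padding=1, bias=False)
    deconv.weight.data = conv.weight.data   # filters are reused, not trained
    recon = deconv(torch.relu(unpool(pooled, switches, output_size=feat.shape)))
    # recon is back in input pixel space and shows which pixels drove the activations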
3. Mid-level FE for Medical Image Transfer Learning
The CNN model effectively extracts mid-level image representations. However, as the number of parameters in this type of network is very large, learning from an insufficient, small-scale dataset can lead to overfitting [6]. The same problem applies to fine-tuning, which retrains all layers. Medical imaging datasets are limited in size; furthermore, their features are completely different from those of real-life images such as those in ImageNet. We therefore set out to find a method that extracts mid-level image features effectively and prevents the overfitting of small-scale datasets.
First, we constructed the transfer-learning network with the FE method to prevent overfitting by reducing the number of parameters to retrain. Second, we selected which convolutional layers to retrain and update. The low-level layers capture the edges and colors of the image, the mid-level layers represent the texture of the image or specific parts of the objects, and the high-level layers capture larger parts of the objects or the entire objects [8,10,16]. Therefore, we reasoned that the mid-level layers could represent both common and more class-specific features, and we chose the mid-level FE method.
Fig. 2 shows the entire network structure of the mid-level FE. The network is based on AlexNet [2] pre-trained on ImageNet, whose simple structure allows us to compare the effects of training each mid-level convolutional layer (CL). According to [9], AlexNet's CL3 and CL4 respond to textures, such as mesh patterns, and to more complex, class-specific features of the objects in the image. We trained only these two layers; that is, we chose CL3 and CL4 as the mid-level layers among the five CLs of AlexNet. We updated our network by retraining only the mid-level layers, CL3 and CL4, and the last fully connected layer. CL1 and CL2, the low-level layers, and CL5, the high-level layer, were frozen and kept the parameters of AlexNet trained on ImageNet. In addition, the first two classifiers (FC1 and FC2) were frozen, and only FC3 was updated. A minimal sketch of this layer-freezing scheme is given below.
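The sketch below assumes PyTorch and torchvision's AlexNet; the layer indices (features[6] and features[8] for CL3 and CL4, classifier[6] for FC3) follow torchvision's layout and are our own mapping, since the paper does not specify a framework. The single-unit output head reflects the binary TB / non-TB setting described in Section 3.1.

    import torch.nn as nn
    from torchvision import models

    model = models.alexnet(pretrained=True)

    # Freeze everything first (CL1, CL2, CL5, FC1, FC2 stay at ImageNet weights).
    for p in model.parameters():
        p.requires_grad = False

    # Unfreeze the mid-level convolutional layers CL3 and CL4.
    for idx in (6, 8):                      # features[6] = CL3, features[8] = CL4
        for p in model.features[idx].parameters():
            p.requires_grad = True

    # Replace and train only the last fully connected layer (FC3) for TB / non-TB.
    model.classifier[6] = nn.Linear(4096, 1)   # new layer is trainable by default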
The benefit of layer freezing also appears in the computation. Table 1 shows the detailed CL architecture of our mid-level FE. The parameters of CL3 and CL4 account for 62% of the total CL parameters; by training only the mid-level layers (and none of the large fully connected layers except FC3), we greatly reduce the number of parameters to update. A small check of this share is sketched after Table 1.
Mid-level FE structure. Our network is pre-trained on ImageNet. We train only the mid-level layers (CL3, CL4) and the last fully connected layer.
The architecture of convolutional layers (CL) in mid-level feature extractor
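As a quick check of the parameter share quoted above, the following snippet counts the convolutional parameters of torchvision's AlexNet (whose layout may differ slightly from the paper's Table 1):

    import torch.nn as nn
    from torchvision import models

    convs = [m for m in models.alexnet().features if isinstance(m, nn.Conv2d)]
    counts = [sum(p.numel() for p in c.parameters()) for c in convs]
    mid_share = (counts[2] + counts[3]) / sum(counts)   # CL3 + CL4
    print(f"CL3+CL4 share of CL parameters: {mid_share:.0%}")   # close to the ~62% quoted above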
3.1 Dataset and Training
Our dataset consisted of frontal chest X-ray images labeled as pulmonary tuberculosis (TB) or non-TB, from [33]. There were two datasets: the Shenzhen dataset, from Shenzhen No. 3 People's Hospital (Guangdong, China), and the Montgomery dataset, from the Department of Health and Human Services of Montgomery County (Rockville, MD, USA). The Shenzhen dataset consisted of 336 TB and 326 non-TB images, while the Montgomery dataset consisted of 103 TB and 296 non-TB images. The Shenzhen dataset was used as the training and validation set and was randomly split into training (80%) and validation (20%) subsets. The Montgomery dataset was used only as a test set to examine the possibility of overfitting, because its features are very different from those of the Shenzhen dataset; the two datasets have very different distributions, as shown in Fig. 3. To match AlexNet's input size, all input images were resized to 224×224. We also normalized all the images using the mean and standard deviation of the ImageNet dataset.
Gray-scale histograms of (a) Shenzhen dataset and (b) Montgomery dataset.
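A minimal data-preparation sketch, assuming PyTorch/torchvision with the images arranged in hypothetical ImageFolder-style directories (the paper does not specify its loading pipeline):

    import torch
    from torchvision import datasets, transforms

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),                      # AlexNet input size
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                             std=[0.229, 0.224, 0.225]),
    ])

    shenzhen = datasets.ImageFolder("data/shenzhen", transform=preprocess)     # train/val
    montgomery = datasets.ImageFolder("data/montgomery", transform=preprocess) # test only

    # 80/20 random split of the Shenzhen set into training and validation.
    n_train = int(0.8 * len(shenzhen))
    train_set, val_set = torch.utils.data.random_split(
        shenzhen, [n_train, len(shenzhen) - n_train])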
We evaluated the performance of our method, the mid-level FE, by implementing a TB classification model. This is a binary classification that registers the existence or absence of TB as a 1 or 0 label for each input datum, so the output of our model is a single TB class score. Our loss function is the un-weighted binary cross-entropy for this binary classification problem, defined as
$$\mathcal{L}=-\frac{1}{N}\sum_{n=1}^{N}\left[y_{n}\log\hat{y}_{n}+\left(1-y_{n}\right)\log\left(1-\hat{y}_{n}\right)\right]$$
where $x_{n}$ is a frontal chest X-ray input, $\hat{y}_{n}$ is the predicted TB probability for $x_{n}$, $y_{n}$ is the corresponding label, which can only be 0 or 1, and $N$ is the batch size. We trained with the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$; the initial learning rate was 0.001, decayed by a factor of 0.1 every 7 epochs.
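A training-configuration sketch matching the hyper-parameters stated above, assuming PyTorch; `model` and `train_set` come from the earlier sketches, and the batch size is our own choice since the paper does not state one.

    import torch
    import torch.nn as nn

    train_loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

    criterion = nn.BCEWithLogitsLoss()          # un-weighted binary cross-entropy
    params_to_update = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params_to_update, lr=0.001, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

    for epoch in range(80):
        for x, y in train_loader:
            optimizer.zero_grad()
            logits = model(x).squeeze(1)        # one TB score per image
            loss = criterion(logits, y.float())
            loss.backward()
            optimizer.step()
        scheduler.step()                        # decay the learning rate every 7 epochs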
The experiment was separated into nine cases, depending on which CLs were frozen; specifically, the cases were divided into per-level FE and per-layer FE. The low-level layers are CL1 and CL2, the mid-level layers are CL3 and CL4, and the high-level layer is CL5. Table 2 shows which layers were updated or frozen during training. For example, mid-level FE, our method, updates only CL3 and CL4; CL3-FE means that only CL3 was updated; high-level FE means that only CL5 was updated; and H-M level FE means that the high- and mid-level layers, CL3 to CL5, were updated. Dividing the experiment cases by level and by layer shows which levels and layers are most effective for learning a small-scale medical image dataset. In all cases, only the 3rd (last) fully connected layer was updated among the classifiers, assuming the possibility of similar experiments in more restrictive situations. A sketch of these case configurations is given after Table 2.
The nine experiment cases, indicating update/freeze status
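The mapping below is our reading of the nine cases in Table 2, expressed as which convolutional layers are unfrozen (the last fully connected layer is updated in every case); the exact composition of each case is inferred from the text, so treat it as an assumption.

    CASES = {
        "fine-tuning":   ["CL1", "CL2", "CL3", "CL4", "CL5"],
        "low-level FE":  ["CL1", "CL2"],
        "mid-level FE":  ["CL3", "CL4"],        # proposed method
        "high-level FE": ["CL5"],
        "H-M level FE":  ["CL3", "CL4", "CL5"],
        "CL1-FE": ["CL1"],
        "CL2-FE": ["CL2"],
        "CL3-FE": ["CL3"],
        "CL4-FE": ["CL4"],
    }

    def trainable(layer_name: str, case: str) -> bool:
        """Return True if the named convolutional layer is unfrozen in the given case."""
        return layer_name in CASES[case]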
3.2 Visualizing Network
Fig. 4 shows our feature visualization structure. First, convolution proceeds according to the structure of AlexNet. The features generated by the 1st and 2nd CLs pass consecutively through the ReLU function and max pooling. Similarly, the features created in the 3rd, 4th, and 5th CLs pass through ReLU; however, the last (3rd) max pooling is performed at the end of the convolution path, after the 5th CL. The features created in the last max-pooling layer enter the deconvolution path through max unpooling. The CLs of the deconvolution path are the transposed versions of the existing CLs, which we call deconvolution layers [16]. For example, if the 1st CL maps 3 input channels to 64 output channels, the 1st deconvolution layer has the converse, 64 inputs and 3 outputs. Thus, the deconvolution path visualizes the feature map by reversing the convolution process. This deconvolution process could be constructed symmetrically, as in U-Net [34] and SegNet [35]. However, the deconvolution structure implemented in this paper is not a symmetric model: the convolution path and the deconvolution path are performed independently. Because training both the convolution and the deconvolution path requires a lot of computing power, we simplified the deconvolution structure for visualizing the network. Our deconvolution layers share weights by transposing the weights of the paired convolutional layers, without any training.
The structure of the simplified deconvolution network for visualization.
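A minimal sketch, assuming PyTorch, of how the independent, training-free deconvolution path can be built by pairing each convolutional layer with a transposed layer that reuses its weights; the helper name and the commented usage line are our own.

    import torch.nn as nn

    def make_deconv(conv: nn.Conv2d) -> nn.ConvTranspose2d:
        """Build the paired deconvolution layer: same geometry, reversed channels,
        weights copied from the convolutional layer, and no training."""
        deconv = nn.ConvTranspose2d(conv.out_channels, conv.in_channels,
                                    kernel_size=conv.kernel_size,
                                    stride=conv.stride,
                                    padding=conv.padding,
                                    bias=False)
        deconv.weight.data.copy_(conv.weight.data)   # shared weights, used transposed
        for p in deconv.parameters():
            p.requires_grad = False                  # the deconvolution path is not trained
        return deconv

    # Example: pair every Conv2d in the feature path with a deconvolution layer.
    # deconv_layers = [make_deconv(m) for m in model.features if isinstance(m, nn.Conv2d)]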
4. Evaluation and Analysis
The experiments in this paper are divided into three parts, using the nine cases in Table 2: first, a comparison of validation losses; second, an accuracy and overfitting test through receiver operating characteristic (ROC) curves; and third, a visualization of each layer.
4.1 Validation Loss
Fig. 5 compares the validation losses for each level's FE and for fine-tuning, and Fig. 6 compares the validation losses for each layer's FE. The mid-level FE (ours) shows stable convergence within 80 epochs; therefore, all the graphs are compared up to epoch 80. The loss was lowest and most stable for the mid-level FE. Mid-level FE (Fig. 5), CL3-FE, and CL4-FE (Fig. 6), which update the middle layers that extract class-specific yet generic features, perform well compared to the other methods. Mid-level FE shows the lowest losses, with a maximum of 0.4 and a minimum of 0.02, and converges stably after 60 epochs. CL3-FE and CL4-FE also show low losses in Fig. 6. In particular, CL3-FE reaches a minimum of 0.024, a small difference from mid-level FE; however, CL3-FE was less stable than mid-level FE, with loss values rising to 0.5 during training. This result shows that training the middle level as a whole is more effective than training any single layer within it. Meanwhile, the fine-tuning case, which updates all layers (Fig. 5), trains unstably and its loss remains high. Because the dataset was small, the number of epochs was insufficient; we increased the number of epochs to 200, but the losses did not converge.
This shows that fine-tuning is not suitable for learning small-scale medical image datasets. High-level FE (Figs. 5 and 6) and H-M level FE (Fig. 5), which update the upper layers of the network, generally train stably, except that the loss values fluctuate at the beginning of training in the high-level FE case. However, their loss values stay higher than those of the mid-level FE. This result shows that training the high-level layers, which extract entire objects from the input, is useful for small-scale medical image sets but not as good as training the mid-level layers. Low-level FE (Fig. 5), CL1-FE, and CL2-FE (Fig. 6) were poorly trained; their losses were too high and could not be considered stable. That is, small-scale medical image datasets are insufficient for training the low-level layers. Because the low-level features of a medical image (such as lines or patterns) are too general, they do not distinguish it from other images.
Therefore, we confirmed that training the mid-level layers is more effective than training the other levels. The mid-level FE requires the fewest epochs to converge, in other words, it has the lowest computing cost, and it also shows the best loss performance and a stable learning tendency.
Validation loss graph for each level’s FE.
Validation loss graph for each layer's FE.
4.2 Overfitting Test with ROC Curve
Fig. 7 presents ROC curves computed on the Montgomery test set, and Table 3 shows the area under the ROC curve (AUC) for each case. The AUC of the mid-level FE is 0.87, and the AUC of fine-tuning is 0.77; in other words, the difference is 0.1, and the accuracy of the mid-level FE is higher than that of fine-tuning. This demonstrates that our method outperforms the fine-tuning method. It can also be observed that the mid-level FE does not overfit the original training dataset. As in the previous validation loss experiment, low-level FE performance is very low, which implies that overfitting has occurred. The H-M level FE performance is better than fine-tuning but lower than the mid-level FE. Therefore, our proposed method, mid-level FE, showed the best ability to prevent overfitting.
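A minimal evaluation sketch for the ROC/AUC comparison, assuming scikit-learn and the `model` and `montgomery` objects from the earlier sketches (the paper does not describe its evaluation code):

    import torch
    from sklearn.metrics import roc_auc_score, roc_curve

    test_loader = torch.utils.data.DataLoader(montgomery, batch_size=16)

    model.eval()
    scores, labels = [], []
    with torch.no_grad():
        for x, y in test_loader:
            scores.extend(torch.sigmoid(model(x).squeeze(1)).tolist())  # predicted TB probability
            labels.extend(y.tolist())

    fpr, tpr, _ = roc_curve(labels, scores)      # points of the ROC curve
    print("AUC:", roc_auc_score(labels, scores))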
We found the mid-level FE to be the most effective method for implementing transfer learning on small-scale medical images. It trained more stably than the models trained at the other levels, and it demonstrated the lowest loss and the fastest convergence of all the cases. Therefore, the mid-level FE can be an effective means of training small-scale medical images.
Receiver operating characteristic curves of mid-FE and fine-tuning on test set.
4.3 Visualizing the Mid-level Feature Extractor
The two previous experiments established the mid-level FE as the most effective method for transfer learning on small-scale medical images. To see which features the network learns, and to explain why our method performs well, we visualized the feature maps inside the network.
We visualized the input images through the deconvolution structure. We arbitrarily selected nine input images and randomly selected one of the feature maps extracted from each input image. Figs. 8 and 9 show the visualization results for the mid-level FE. The features of the low-level layers (CL1, CL2) appear as the lines of the ribs and vertebrae, and the contours of the lungs can also be seen. Even without updating these layers, they capture the features well, because the low-level layers learn general features such as lines, edges, and corners (Fig. 8). The mid-level layers (CL3, CL4), whose parameters were updated, find more object-specific features than the low-level layers do: the visualization results show that the right and left lungs are activated. This shows that transferring the parameters from the real-world dataset (ImageNet) to the medical imaging dataset was successful (Fig. 9). In the case of the high-level layer (CL5), the entire lung is captured; even though this layer was not updated, its features are properly activated (Fig. 9).
Fig. 10 shows the visualization results of CL3 and CL4 in the fine-tuning model. Unlike the visualization results of the proposed method, the features did not appear correctly in any layer of the fine-tuned model. In other words, the performance of the pre-trained model was lost through inadequate fine-tuning. This result shows that fine-tuning on a small-scale medical image dataset is very risky; fine-tuning can be the worst approach unless a dataset large enough to update all the parameters of the model is guaranteed.
Visualization results for low-level convolutional layers (CL1, CL2) in the mid-level feature extractor.
Visualization results for mid-level convolutional layers (CL3, CL4) and high-level convolutional layer (CL5) in the mid-level feature extractor.
Visualization results of CL3 and CL4 in the fine-tuning model.
5. Conclusion
We propose a mid-level FE approach for training small-scale medical imaging datasets. To evaluate its performance, we compared it with low-level FE, high-level FE, and fine-tuning methods. Compared with the other methods, our proposed method shows the lowest loss, ranging from 0.4 down to 0.02, the most stable training tendency, and the lowest computing cost for convergence. In the overfitting experiment, we used a test dataset different from the training set; the AUC obtained on the test set was 0.87, which is 0.1 higher than the fine-tuning method, showing that our method also prevents overfitting. We also conducted a visualization experiment on the convolutional layers of our method using a deconvolution structure. Our method was verified to extract meaningful features throughout the network, whereas the fine-tuning method did not extract the features correctly in any layer. Thus, our method is an efficient alternative for the classification of small-scale medical imaging datasets through its prevention of overfitting, maintenance of accuracy, and reduction in computing costs. For future work, we plan to study the training of neurons inside the layers: if we can selectively train valid neurons inside the mid-level layers, more efficient learning may be possible.
Acknowledgement
This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korean government (MSIT) (No. 2017-0-018715, Development of AR-based Surgery Toolkit and Applications).