1. Introduction
In computer vision, image classification is a fundamental task of visual recognition [1]. Various researchers have focused on image classification [2–6], for instance, to extract and classify regions of interest from satellite images [2], to diagnose melanoma by learning skin images in the medical domain [3], and to classify animal species by learning animal images [4–6]. Image classification aims to predict the category of a test object after learning from a large number of photos through artificial intelligence. Several methods have been reported to enhance the accuracy of image classification, a promising one being the collection of a large dataset. However, when classifying animal species, collecting a large dataset is challenging because it is difficult to photograph animals in an adequately static manner, and many of the captured images are therefore blurred or hazy [7,8]. Consequently, researchers have attempted to dehaze such photos using artificial intelligence [9].
However, even for dehazing, a dataset of hazy photos must be used to train artificial intelligence methods [10]. In this context, data augmentation can be performed to supplement insufficient datasets and enhance the accuracy of image classification. In the field of image processing, standard data augmentation is based on image editing operations. However, when standard data augmentation is applied to animal species classification, external characteristics of animals, such as spots and mixed coat colors, cannot be reflected.
Thus, data augmentation methods based on generative adversarial networks (GANs) have been established [11,12]. GAN models can generate new images and have been proven to be effective in data generation tasks [12]. For instance, certain researchers generated images using GANs to obtain illustrations based on the text of children's books [13]. In data augmentation through GANs, a GAN image is created based on the original dataset and used as an image classification learning dataset.
Among other methods to increase image classification accuracy, image classification has been performed using deep neural network (DNN) models [14–16]. With progress in research on DNNs, various DNN models have been established [17–20].
In this study, 10 dog breeds are selected, and we attempt to supplement an inadequate dataset, while reflecting the external features of the animals, by using a GAN to increase the image classification accuracy. The performance of different DNN models in terms of image classification is then evaluated.
The remainder of this paper is organized as follows: Section 2 introduces related research on image classification through DNN models and data augmentation using a GAN. Section 3 describes the experimental setup and environment. Section 4 describes the datasets and the process of selecting dog breeds, along with GAN image creation through CycleGAN and data augmentation. Section 5 describes the composition of the learning groups and the performance evaluation process for each DNN model. Section 6 presents the concluding remarks.
2. Related Work
2.1 Image Classification via DNN Models
Various DNN models have been used to increase the accuracy of image classification [14–16]. Certain researchers learned malaria-infected and normal cells through ResNet to classify the infected cells [14]. Other researchers classified food categories by learning food photos through MobileNet, or disease types by learning the leaves of diseased plants through NASNet [15,16]. With the emergence of image classification methods based on DNN models, the performance of these models has been extensively evaluated [21,22]. In 2018, the image classification and prediction performance of DNN models was evaluated on the ImageNet-1k validation set (the full ImageNet database contains more than 14 million images). Moreover, the top-1 and top-5 accuracies of the models were reported against the number of operations in giga floating-point operations (G-FLOPs) [22].
Notably, the top-N accuracy is the fraction of samples for which the true class appears among the N classes with the highest softmax scores. Fig. 1(a) shows that the top-1 accuracies for MobileNet_v2, ResNet-152, InceptionResNet_v2, and NASNet_Large are 71.81%, 78.25%, 80.28%, and 82.5%, respectively. Fig. 1(b) shows the top-5 accuracies of the models. Because exact values were not presented in the abovementioned study [22], approximate values for MobileNet_v2, ResNet-152, InceptionResNet_v2, and NASNet_Large are given as 90.5%, 94.5%, 95.5%, and 96%, respectively.
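For clarity, the metric can be computed as follows; this is a minimal NumPy sketch, with illustrative array names:

```python
import numpy as np

def top_n_accuracy(probs, labels, n):
    """Fraction of samples whose true label is among the n classes
    with the highest predicted scores."""
    # Indices of the n largest scores per sample (order irrelevant).
    top_n = np.argpartition(probs, -n, axis=1)[:, -n:]
    hits = (top_n == labels[:, None]).any(axis=1)
    return hits.mean()

# Example: 3 samples, 4 classes.
probs = np.array([[0.10, 0.60, 0.20, 0.10],
                  [0.50, 0.20, 0.20, 0.10],
                  [0.05, 0.15, 0.35, 0.45]])
labels = np.array([1, 2, 2])
print(top_n_accuracy(probs, labels, 1))  # 0.333... (only the first sample is a top-1 hit)
print(top_n_accuracy(probs, labels, 3))  # 1.0 (every true label is in the top 3)
```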
Based on the abovementioned performance evaluation study, four DNN models are selected: MobileNet_v3_Large [17], ResNet-152 [18], InceptionResNet_v2 [19], and NASNet_Large [20]. MobileNet_v3_Large exhibits a 3.2% higher accuracy than that of MobileNet_v2 owing to the introduction of a nonlinear function, h-swish [17].
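For reference, h-swish is a piecewise-linear approximation of the swish activation, defined as h-swish(x) = x · ReLU6(x + 3)/6 [17]; a one-line TensorFlow sketch:

```python
import tensorflow as tf

def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6, cheaper than swish on
    mobile hardware because it avoids the sigmoid."""
    return x * tf.nn.relu6(x + 3.0) / 6.0
```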
Fig. 1. Performance evaluation of various DNN models: (a) top-1 accuracy and (b) top-5 accuracy.
The ResNet model introduces residual learning, a "shortcut" concept, to solve the degradation problem, in which accuracy decreases and loss increases as more learning layers are stacked in multilayered structures [18]. InceptionResNet_v2 exhibits a 3% higher accuracy than Inception_v3 owing to the addition of residual blocks to the Inception_v3 model composed of Inception modules [19]. NASNet_Large can be applied to various datasets because its blocks are designed through a recurrent neural network and reinforcement learning [20]. In the abovementioned study [22], NASNet_Large exhibits the highest top-1 and top-5 accuracies. In this study, we compare the image classification performance of the DNN models considering these metrics.
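To illustrate the shortcut concept, the following is a minimal Keras-style sketch of an identity residual block; it is a simplification (ResNet-152 uses deeper bottleneck blocks with batch normalization) and assumes the input already has `filters` channels:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Identity residual block: the stacked layers learn a residual F(x),
    and the block outputs F(x) + x via the shortcut connection."""
    shortcut = x  # assumes x already has `filters` channels
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])  # the "shortcut" addition
    return layers.Activation("relu")(y)
```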
2.2 Data Augmentation with GAN
In the field of image processing, standard data augmentation is based on image editing operations. For example, the amount of photo data can be increased through resizing, rotating, cropping, and random erasing [23,24]. Certain researchers classified images using data augmentation and evaluated the classification accuracy [25,26]. With random image cropping, the image classification test error rate on a dataset subjected to data augmentation was approximately 23% lower than that on the existing dataset without data augmentation [25]. Through random erasing, the test error rate could be decreased by approximately 9% [26] (Fig. 2).
However, standard data augmentation generates limited data [11]. For example, when augmenting an animal image using standard techniques, external characteristics of animals such as spots and mixed coat colors may not be reflected. Therefore, data augmentation methods based on GANs have been proposed [11,12]. Various GAN models, such as CycleGAN, progressive growing GAN (PGGAN), unsupervised image-to-image translation (UNIT), and multimodal UNIT (MUNIT), have been proposed [27]. Recently, in the medical field, data augmentation has been performed through GAN frameworks [28,29].
Fig. 2. Example of standard data augmentation.
For instance, MUNIT and PGGAN were used to create and learn GAN brain images with tumors and thereby detect tumors [28], and CycleGAN and UNIT were used in an attempt to enhance medical image classification accuracy [29] (Fig. 3).
Fig. 3. Fake dog image creation using GAN.
Notably, a vanilla GAN exhibits the limitation that its training datasets must be generated in pairs. In a paired dataset, the resolution and shape of each image pair are identical. A GAN using a paired dataset must learn a mapping between the input and output images; if the resolution and shape do not match, incorrect images may be derived [30]. CycleGAN, in contrast, can be trained on unpaired datasets [31]. Therefore, CycleGAN is more suitable for dog photos, for which paired datasets cannot be constructed owing to the various angles and appearances in the images. Fig. 4 shows an example of paired and unpaired datasets.
Fig. 4. Examples of paired and unpaired datasets.
In addition to requiring paired datasets, vanilla GANs exhibit another key problem. A GAN consists of two networks: a generator and a discriminator. The generator transforms data to fool the discriminator; in the process, the labels or distribution of the generated data can collapse onto a specific mode, a failure known as mode collapse [32,33]. CycleGAN applies cycle-consistency and adversarial losses to mitigate the mode collapse problem. The cycle-consistency loss function is defined in Eq. (1):

L_cyc(G, F) = E_{x∼p_data(x)}[‖F(G(x)) − x‖₁] + E_{y∼p_data(y)}[‖G(F(y)) − y‖₁]   (1)
In Eq. (1), X and Y denote domains; x and y denote samples belonging to X and Y, respectively; and G : X → Y and F : Y → X denote the mapping functions (translators or generators).
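A minimal TensorFlow sketch of Eq. (1), assuming G and F are callable generator models:

```python
import tensorflow as tf

def cycle_consistency_loss(x, y, G, F):
    """L1 penalty of Eq. (1): translating a sample to the other domain
    and back should recover the original sample."""
    forward = tf.reduce_mean(tf.abs(F(G(x)) - x))   # x -> G(x) -> F(G(x)) ~ x
    backward = tf.reduce_mean(tf.abs(G(F(y)) - y))  # y -> F(y) -> G(F(y)) ~ y
    return forward + backward
```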
In this study, we perform data augmentation for an image classification task by combining dog photos generated through CycleGAN with standard data augmentation.
3. Experiment Overview
First, a dataset is constructed by selecting 10 dog breeds from the Stanford Dogs Dataset and Oxford-IIIT Pet Dataset. Subsequently, the dataset is separated and used for GAN image generation. Next, GAN images are generated through CycleGAN. Four learning groups are created using the original datasets and GAN image datasets, and the number of images is increased through standard data augmentation. Finally, we train the four DNN models and evaluate the classification accuracy (Fig. 5). The experimental environment is summarized in Table 1.
Fig. 5. Process flow of the experiment.
4. CycleGAN-based Data Augmentation
4.1 Datasets
The Stanford Dogs Dataset and Oxford-IIIT Pet Dataset are used. Ten dog breeds (Basset Hound, Beagle, Boxer, English Setter, German Shorthaired, Keeshond, Leonberger, Miniature Pinscher, Pomeranian, and Pug), which are included in both datasets, are selected. Breeds with only a single color are not included. Moreover, pictures involving a person with a dog, pictures involving two dogs of different breeds, and pictures with effects such as sepia and grayscale are excluded. The resulting dataset has 3,000 photos, with 300 photos per dog breed. The contents of the dataset are presented in Table 2.
Table 2. Datasets used in the experiment.
4.2 GAN Image Creation Using CycleGAN
The training data for CycleGAN are divided into two groups, TrainX and TrainY. GAN image generation is performed in the X→Y and Y→X directions based on the two trained groups. In this study, single-color and mixed-color images are configured as TrainX and TrainY, respectively, and 1,000 GAN images are generated, as shown in Fig. 6.
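The generation step can be sketched as follows; the file names, normalization convention, and `load_model` checkpoints are illustrative assumptions rather than the exact scripts used in this study:

```python
import tensorflow as tf

# Hypothetical checkpoints of the two trained CycleGAN generators.
G = tf.keras.models.load_model("generator_x2y.h5")  # single-color -> mixed-color
F = tf.keras.models.load_model("generator_y2x.h5")  # mixed-color -> single-color

def translate(generator, image):
    """Map one 256x256 image (normalized to [-1, 1]) to the other domain."""
    fake = generator(image[tf.newaxis, ...], training=False)[0]
    return (fake + 1.0) / 2.0  # rescale to [0, 1] for saving
```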
Fig. 6. Generated images via CycleGAN.
4.3 Image Data Augmentation Using Augmentor
Augmentor is a package that supports Python and Julia and provides the image editing functions required for data augmentation [34]. For the dataset subjected to GAN image creation and grouping (described in Section 5), the number of images is increased using Augmentor. Each image is resized to 256×256, rotated by 90° three times (90°, 180°, and 270°), and flipped left-right and top-bottom. For each rotation and flip, two erasing steps and one distortion are performed. Random image cropping is not performed: because the size of the dog differs across images, random cropping may retain only the background or only specific areas such as the eyes, feet, or nose, which decreases the learning accuracy. Through this process, the amount of image data is augmented 24-fold. The results of data augmentation based on Augmentor are shown in Fig. 7.
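A plausible Augmentor pipeline for the operations described above; the directory path, probabilities, and sampling count are illustrative assumptions:

```python
import Augmentor

p = Augmentor.Pipeline("dogs/train")  # source image directory
p.resize(probability=1.0, width=256, height=256)
p.rotate90(probability=0.5)           # the three 90-degree rotations
p.rotate180(probability=0.5)
p.rotate270(probability=0.5)
p.flip_left_right(probability=0.5)    # left-right flip
p.flip_top_bottom(probability=0.5)    # top-bottom flip
p.random_erasing(probability=0.5, rectangle_area=0.2)
p.random_distortion(probability=0.5, grid_width=4, grid_height=4, magnitude=8)
# No cropping operator, for the reason given above.
p.sample(24 * 1000)                   # e.g., a 24-fold expansion of 1,000 source images
```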
Fig. 7. Results of standard data augmentation using Augmentor.
5. Dog Classification
5.1 Training Group Configuration
One-third of the 3,000-picture dataset of 10 dog breeds is selected, GAN images are created by applying CycleGAN to these 1,000 photos, and six groups are formed. NDA-I and NDA-II denote datasets to which standard data augmentation is not applied. Groups I and III are composed of 2,000 and 3,000 original images, respectively, and are used to evaluate the basic performance of the DNN models. Groups II and IV contain 2,000 original images + 1,000 GAN images and 3,000 original images + 1,000 GAN images, respectively, and are used to compare the performance of the DNN models when GAN images are added to the dataset. Subsequently, the amount of data is augmented through Augmentor. The composition of each learning group is shown in Table 3.
Table 3. Composition of training groups.
5.2 Comparison and Performance Evaluation for Each Model
For each group, training is conducted on MobileNet_v3_Large, ResNet-152, InceptionResNet_v2, and NASNet_Large. Using the ImageDataGenerator utility of TensorFlow, pixel values are preprocessed into the range of -1 to 1; the data augmentation functions of ImageDataGenerator are not applied. For each DNN model, ImageNet pretrained weights are not used. The batch size is 16; up to 100 epochs are implemented; and learning is terminated through the EarlyStopping callback when the validation accuracy does not increase within five epochs. The learning results for each model are graphed in Fig. 8.
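A minimal sketch of this setup (the model choice, directory paths, input size, and optimizer are illustrative; the study's exact scripts are not reproduced here):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixels to [-1, 1]; no augmentation options are enabled.
datagen = ImageDataGenerator(preprocessing_function=lambda x: x / 127.5 - 1.0,
                             validation_split=0.2)
train = datagen.flow_from_directory("dogs/train", target_size=(299, 299),
                                    batch_size=16, subset="training")
val = datagen.flow_from_directory("dogs/train", target_size=(299, 299),
                                  batch_size=16, subset="validation")

# weights=None means random initialization, i.e., no ImageNet pretraining.
model = tf.keras.applications.InceptionResNetV2(weights=None, classes=10)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                              restore_best_weights=True)
model.fit(train, validation_data=val, epochs=100, callbacks=[early_stop])
```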
Fig. 8. Performance evaluation of different models: (a) top-1 accuracy and (b) top-3 accuracy for each model; (c) final loss when the highest prediction accuracy is achieved; (d) number of parameters and learning speed per epoch for each model.
As indicated in Table 4, among the four DNN models, InceptionResNet_v2 exhibits the most satisfactory classification performance. When GAN images are added for MobileNet_v3_Large, the top-3 accuracy decreases. ResNet-152 exhibits the lowest overall performance; however, its accuracy is enhanced when GAN-based augmentation is performed. InceptionResNet_v2 exhibits the highest classification accuracy, although its accuracy is not enhanced when the GAN images are added. NASNet_Large exhibits the second-highest accuracy among the four models, and its accuracy is enhanced through the augmentation. Specifically, the top-3 accuracies of MobileNet_v3_Large and InceptionResNet_v2 decrease by 1.2% and 1.36%, respectively, whereas those of ResNet-152 and NASNet_Large increase by 0.91% and 1.92%, respectively. We also compare the accuracy of NDA-I and NDA-II, to which standard data augmentation was not applied, with that of Groups I and II, to which it was applied: the top-3 accuracy of Group I is approximately 1.88 times that of NDA-I, and that of Group II is approximately 1.68 times that of NDA-II.
Table 4. Training results of different DNN models.
6. Conclusion
We compare the image classification accuracy of four DNN models after generating GAN images through CycleGAN, applying the standard data augmentation process, and training the DNN models. Most existing image classification studies focus on classifying images belonging to broad categories; in contrast, this study focuses on classifying dog images belonging to a fine-grained category. GAN images are generated using CycleGAN to reflect dog characteristics such as spots and mixed coat colors, which cannot be reflected through standard data augmentation techniques. Subsequently, standard data augmentation is performed to increase the amount of image data. The augmented data are applied for DNN model training, and the image classification accuracies are compared.
When GAN-based data augmentation is performed, the accuracy of certain models increases. However, the accuracy gain from the GAN image data is not sufficient for these data to replace the original data, and the accuracy of several DNN models decreases. To address this problem and increase the accuracy of image classification, other GAN models should be explored or the standard data augmentation process should be enhanced.