1. Introduction
In the information technology (IT) field, there is a consistent need for high-quality, up-to-date image datasets. Whether for machine vision systems or smaller student projects, an image dataset of sufficient quality is required. There are three primary methods for generating such datasets, namely, manual, semi-automatic, and automatic methods. Manual dataset generation is the most time-consuming but also the most effective method; every individual image is guaranteed to be of high quality, because the creator of the dataset can determine whether each image meets the selection criteria. An automatically generated dataset uses a set of algorithms to collect images; the most common approach is using a web scraper to obtain images of interest via a search engine. The semi-automatic process is characterized by two approaches. The more modern approach involves manually creating a small high-quality dataset and then augmenting it through the automatic method to gather a larger dataset. The other approach entails using an automatic algorithm to collect images and then handpicking the appropriate high-quality images from the results. As increased optimization becomes more commonly required in practice, the need for new automatic methods will increase. This can be observed in the literature, such as the commercially available study conducted by Rosebrock [1].
Simple web scraping methods proposed by Thomas and Mathur [2] and Glez-Pena et al. [3] provide an easy approach for automatically collecting an expansive image dataset based on any possible query. However, the precision of these methods is typically insufficient for large-scale use, as observed by Schroff et al. [4], who proposed a fully automated image dataset collector that ranks images to avoid selecting inadequate ones. However, the solution proposed by Schroff et al. [4] is constrained by certain search limits and yielded low accuracy across multiple tests. In the following sections, we discuss web scraping of search engines, query expansion, dataset quality, and issues that can occur during web scraping. Finally, we recommend guidelines for creating a fully automated image dataset generator.
2. Overview of Web Scraping-based Image Dataset Generation
2.1 Automated Dataset Generation
To achieve automatic dataset generation, two approaches have been proposed, namely, fully automatic and semi-automatic. Schroff et al. [4] suggested a completely automated system that collects the initial results of a search-engine query and re-ranks them based on the text and metadata surrounding the images. In their research, they compared the precision of Google Images (32%) with that of their own fully automated system. Yao et al. [5] expanded on the study conducted by Schroff et al. by adding query expansion based on the Google Books Ngram Corpus (GBNC) [6]: not only the initially given query is searched, but slight variations of it are considered as well. The query expansion used by Yao et al. [5] focused primarily on adjectives. For example, the search query "Zebra" would be expanded through GBNC to "Young Zebra" or "Wild Zebra" to obtain more specific results. A prime example of semi-automatic dataset generation is the research conducted by Zink [7], in which a smaller high-quality dataset was first constructed manually and later used as a control set to verify the images obtained through a web scraper. Their research demonstrated that the accuracy of semi-automatic dataset generation has the potential to reach that of professionally, manually collected datasets, and that accuracy increased as the number of scraped images increased.
2.2 Query Expansion
Recently, there have been multiple developments in the field of semantic relations between words. The oldest approach, which is still frequently used, is WordNet [8], a large English lexicon that encodes many links between words. One of the most recent developments is Word2Vec, which uses a corpus, such as the aforementioned GBNC, to learn vectors that capture relations between words [9]. Although the two are used in similar ways, the comparative study conducted by Handler [10] underlines two differences between WordNet and Word2Vec. First, Word2Vec can handle a much wider array of words when trained on the GBNC, which contains roughly 100 billion tokens, whereas WordNet contains approximately 116,000 entries. Second, Word2Vec is particularly strong at detecting holonyms, meronyms, and hypernyms. An example of the data obtained by Word2Vec is presented in Fig. 1.
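As an illustration, this kind of vector-based lookup takes only a few lines of Python with the gensim library. The following is a minimal sketch, assuming the publicly distributed "word2vec-google-news-300" pretrained vectors as a stand-in for a GBNC-trained model:

```python
# Minimal sketch: query expansion with pretrained word vectors via gensim.
# "word2vec-google-news-300" is a publicly distributed pretrained model and
# stands in here for the GBNC-trained vectors discussed above.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # large download on first use

def expand_query(term, topn=5):
    # most_similar returns (word, cosine similarity) pairs.
    return [word for word, _ in model.most_similar(term, topn=topn)]

print(expand_query("zebra"))  # related terms usable as expanded queries
```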
There is, of course, also the manual option of query expansion, in which the user provides extra terms that can be used to refine the primary query. Yu et al. [11] demonstrated that having humans assist in qualifying images increases the quality of the resulting dataset and of the models later trained on it. The same principle could be applied by having a user modify an initial query before images are retrieved from the Internet.
Fig. 1. An example Word2Vec datum demonstrating holonymy, synonymy, hypernymy, and meronymy, constructed by Schwab and Lafourcade [12].
2.3 Web Scraping
Zhao [13] aptly defined web scraping as "a technique to extract data from the World Wide Web (WWW) and save it to a file system or database for later retrieval or analysis." They also mentioned the related controversies, such as copyright infringement and unintentional distributed denial-of-service (DDoS) attacks caused by poorly executed scraping. Accordingly, a large collection of web scrapers has been created for varying purposes using different techniques. Sirisuriya [14] described nine different techniques that can be used for web scraping, the most popular being the use of a dedicated web scraping program to obtain the desired dataset. Many such applications exist, each suited to different dataset characteristics; at least 20 commonly used web scraping applications are readily available. Comparative studies of their performance and targets have been conducted, in particular by Glez-Pena et al. [3] and Sirisuriya [14]. Besides existing frameworks, numerous toolkits allow users to create their own web scraping application tailored to a specific project. Upadhyay et al. [15] described how a new and robust web scraping tool can be constructed for any task. Moreover, there are numerous examples of researchers building web scrapers on various frameworks and extending them to their requirements. For example, Thomas and Mathur [2] used Scrapy as their primary tool for text-based web crawling, whereas Zink [7] extended the iCrawler toolkit owing to its ease of modification and support for multiple search engines. Although various web scrapers exist, in the end, all of them follow the basic principle depicted in Fig. 2.
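As an illustration of this basic principle, the following minimal Python sketch fetches a single page and extracts the image URLs it embeds, using the requests and BeautifulSoup libraries. The target URL is a placeholder; a production scraper would also need pagination, rate limiting, and robots.txt handling:

```python
# Minimal sketch of the scraping loop in Fig. 2: fetch a page, parse it,
# and extract candidate image URLs. The target URL is a placeholder.
import requests
from bs4 import BeautifulSoup

def scrape_image_urls(page_url):
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    # Every <img> tag with a src attribute contributes one candidate URL.
    return [img["src"] for img in soup.find_all("img") if img.get("src")]

print(scrape_image_urls("https://example.com/gallery"))  # placeholder page
```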
2.4 Dataset Quality
Dataset quality is challenging to assess automatically because it is characterized by many aspects of a dataset. Pipino et al. [16] defined 16 possible dimensions that can be used to determine dataset quality and concluded that, among these 16 dimensions, there is no "one size fits all" solution. However, commercial sources [17-20] suggest a more conclusive answer.
Fig. 2. Basic web scraping system.
The five main factors identified in these commercial sources are also mentioned as important, but not decisive, by Pipino et al. [16]. The factors are as follows:
· Characteristic: how are the data measured?
· Accuracy: is every detail within the information correct?
· Completeness: how comprehensive is the information?
· Relevance: is this information needed?
· Timeliness: are the data up to date?
Image datasets follow these five characteristics as well, but add two more. First, the image quality should be as high as possible: the images should have a high resolution, with as little blur or other manipulation as possible [21-23]. Second, the image dataset should consist of a sufficient number of high-quality, unique images. Both commercial [24,25] and academic research [20,22,26] indicate that the ideal threshold is a minimum of 1,000 images for any dataset to possess reasonable quality, and the research conducted by Zink [7] demonstrated a correlation between the number of images in a dataset and the accuracy of the resulting image recognition model. The images must also be unique to avoid bias whenever the dataset is used, as described by both Rosebrock [27] and Hofesmann [28].
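Uniqueness can be enforced cheaply at collection time. The following minimal Python sketch removes byte-identical duplicates with an MD5 hash over file contents; near-duplicates (resized or re-encoded copies) would additionally require a perceptual hash such as the one provided by the imagehash library, and the folder name is hypothetical:

```python
# Minimal sketch: drop byte-identical duplicates by hashing file contents.
# "dataset/zebra" is a hypothetical folder of scraped images.
import hashlib
from pathlib import Path

def deduplicate(image_dir):
    seen = set()
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()  # second and later copies are removed
        else:
            seen.add(digest)

deduplicate("dataset/zebra")
```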
3. Discussion
Based on the concepts introduced in the previous section, we define the requirements that a fully automated image web scraper should satisfy.
3.1 Collecting an Image Dataset
Collecting an image dataset for any use must consider the seven characteristics mentioned above. When an automated web scraper collects an image dataset via search engines, relevance, timeliness, quality, and reliability are handled by the search engines themselves, as demonstrated by Zhang and Rui [29], and are hard to adjust manually. Completeness is something that a fully automated system cannot assess; however, as mentioned by Zink [7], completeness increases as the quantity of images increases. Quantity itself can simply be a user-defined parameter that the web scraper tries to reach. The main obstacle to achieving a high quantity is that every search engine displays only a relatively small number of results; for example, Google Images displays a maximum of 400 results [30]. Therefore, multiple search engines must be used to obtain a higher quantity of unique images, and query expansion can be applied when even more images are needed for the dataset. To evaluate the accuracy of the scraped dataset, any of the current state-of-the-art neural networks can be trained on it and assessed.
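This multi-engine strategy can be expressed as a simple aggregation loop. The sketch below is illustrative only: google_images and bing_images are hypothetical stand-ins for real per-engine scrapers, each capped by its engine's result limit, and results are deduplicated by URL until the user-defined quantity is reached:

```python
# Minimal sketch: merge several engines until the requested quantity is
# reached. The engine functions are hypothetical stand-ins for real
# per-engine scrapers.
def collect(target_count, engines):
    collected = set()  # deduplicate by URL across engines
    for search in engines:
        for url in search():
            collected.add(url)
            if len(collected) >= target_count:
                return collected
    return collected  # may fall short; query expansion can then add more

def google_images():
    return ["https://example.com/a.jpg", "https://example.com/b.jpg"]

def bing_images():
    return ["https://example.com/b.jpg", "https://example.com/c.jpg"]

urls = collect(1000, [google_images, bing_images])
```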
3.2 Dealing with Noise
Within any dataset, there is a probability of "noise" appearing, which dramatically decreases the classification accuracy of the resulting model. Zhu and Wu [31] identify two types of noise within datasets, class noise and attribute noise, both of which significantly impact the accuracy of the final model. Class noise occurs when an item in the dataset has been misclassified, whereas attribute noise occurs when a specific value corresponding to an item is incorrect. An image classification example is presented in Fig. 3, which depicts image search results for the color red: the class-noise example is a completely wrong image, whereas the attribute-noise example is an image of a blue square.
Fig. 3. An example of class and attribute noise occurring when searching for images of red.
The authors of [32] compared 79 studies and techniques regarding the identification and handling of noise; almost all the considered techniques involve data preprocessing. Tableau [33] defined data preprocessing, also known as data cleaning, as the process in which a collected dataset is processed to remove any items that contain noise.
Within a fully automated environment, checking for noise is extremely difficult, because verifying noise in a large dataset normally requires human intervention. For automatically gathered image datasets, the preferred method of data preprocessing is demonstrated by Zink [7], who used a smaller high-quality dataset to check every image in the scraped dataset and verify that it fits the input query. In a fully automated system, however, there is no proven way to guarantee that an image is correctly classified, so this method cannot be applied directly. A potential alternative is to build the smaller high-quality dataset from the first few items returned by a search query, assuming they are free of noise because they are the best results provided by the search engine, and then use the model generated from them to preprocess the remaining gathered images.
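One way such a bootstrap filter could be sketched is with off-the-shelf CNN embeddings: embed the trusted seed images, average them into a centroid, and keep only candidates close to it. The snippet below assumes torchvision's pretrained ResNet-18 and an arbitrary similarity threshold; it is a sketch of the idea, not the method of Zink [7]:

```python
# Minimal sketch of the bootstrap filter: embed images with a pretrained
# ResNet-18, average the trusted seed embeddings into a centroid, and keep
# candidates whose cosine similarity to it clears a tunable threshold.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
backbone = torch.nn.Sequential(*list(models.resnet18(weights=weights).children())[:-1])
backbone.eval()
preprocess = weights.transforms()  # resizing/normalization the model expects

def embed(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        v = backbone(x).flatten()
    return v / v.norm()  # unit length, so a dot product is cosine similarity

def filter_noise(seed_paths, candidate_paths, threshold=0.5):
    centroid = torch.stack([embed(p) for p in seed_paths]).mean(dim=0)
    centroid = centroid / centroid.norm()
    return [p for p in candidate_paths if float(embed(p) @ centroid) >= threshold]
```

The threshold of 0.5 is an assumption that would need to be tuned per dataset; setting it too high discards valid but atypical images, while setting it too low lets class noise through.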
3.3 Regional Differences
All search engines suffer from regional differences. Country-based bias in search results, where certain results are not shown, was demonstrated by both Vaughan and Thelwall [34] and Mowshowitz and Kawaguchi [35]. Combined with language-based differences, which are more prevalent in image searches, this bias means that a single query can return a completely different set of results depending on the country from which it is issued, as shown in Fig. 4.
Fig. 4. A comparison between the search results for "Gazelle" initiated from Korea and the Netherlands. The results differ because Gazelle is a famous bicycle brand in the Netherlands.
The main solution to this problem would be either to force the search query to always be issued from a specific country using a VPN, or to allow the user to clarify the initial search query with supporting queries. These supporting queries would "guide" the search engine toward the user's intended meaning. This can be done in both a positive and a negative sense, as most modern search engines support hyphenated (-) query terms that omit certain results [36]. A user can thus prevent problems caused by their location. For the example presented in Fig. 4, the main query Gazelle could be supported by the queries Animal and Africa so that the search engine is far more likely to return the intended data.
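Constructing such a guided query is straightforward. The following minimal sketch assembles the positive support terms and hyphenated exclusions into a single query string, using the exclusion syntax most major engines support [36]:

```python
# Minimal sketch: build a guided query with support terms and "-" exclusions.
def build_query(primary, include=(), exclude=()):
    parts = [primary, *include, *("-" + term for term in exclude)]
    return " ".join(parts)

print(build_query("gazelle", include=["animal", "africa"], exclude=["bike"]))
# -> gazelle animal africa -bike
```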
3.4 Query Expansion
As discussed in Section 2.2, the most effective way to automatically establish links between words is to use a trained Word2Vec model. When a query expansion algorithm such as Word2Vec is used to enhance a search query, there are potentially thousands of links between each word. As mentioned earlier, the easiest links to establish using Word2Vec are synonyms, holonyms, meronyms, and hypernyms. For image classification, synonyms should be avoided because they lead to potentially completely unrelated queries. For example, YourDictionary [37] lists the top-rated synonyms for a Labrador dog as Golden Retriever, Spaniel, and Dachshund, which are completely different dog breeds; searching for these synonyms would introduce a higher probability of noise. A meronym does correctly help specify the primary query, but it narrows the visual focus of the results. For example, Word2Vec shows that one of the direct meronyms of Labrador is snout; searching for "Labrador snout" correctly returns images of the Labrador and its snout, but primarily images focused on the snout. Hypernyms, by contrast, are useful: using both once- and twice-removed hypernyms yields search results with more detailed images of the initial query. All these examples can be seen in Fig. 5.
Fig. 5. Examples of synonym-, meronym-, and hypernym-expanded searches.
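Although the discussion above uses Word2Vec, the same lexical relations can be read directly from WordNet [8]. The following minimal sketch uses NLTK's WordNet interface to list synonyms, meronyms, and hypernyms for the Labrador example; it requires a one-time nltk.download("wordnet"), and relation lists may be empty for some synsets:

```python
# Minimal sketch: reading lexical relations from WordNet with NLTK.
# Requires a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

synset = wn.synsets("Labrador_retriever")[0]  # the dog sense

synonyms = [lemma.name() for lemma in synset.lemmas()]
hypernyms = [l.name() for s in synset.hypernyms() for l in s.lemmas()]
meronyms = [l.name() for s in synset.part_meronyms() for l in s.lemmas()]

print(synonyms, hypernyms, meronyms)  # e.g., the hypernym "retriever"
```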
However, there is still a chance that a hypernym is incorrect. This can occur when a word belongs to multiple collections. For example, Labrador is a dog, but there also exists a Labrador Island, which results in a hypernym link to Island; if this hypernym were used to expand the search query, it would return incorrect data. A solution to this problem would be to use a manual secondary query provided by the user and check the link between the secondary queries and the hypernyms of the primary query. In this case, a user-provided primary query of Labrador could be supported by Dog and Pet as secondary queries. By testing whether a link exists between the primary query's hypernyms and its secondary queries, it is possible to determine whether a hypernym can be used to expand the search query.
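One hedged way to implement this check is WordNet path similarity between each candidate hypernym and the secondary queries. In the sketch below, the 0.2 threshold is an assumption to be tuned, and the expected outputs are indicative rather than guaranteed:

```python
# Minimal sketch: accept a candidate hypernym only if it is semantically
# close to at least one user-supplied secondary query. The 0.2 threshold
# is an assumption; path_similarity returns None when no path exists.
from nltk.corpus import wordnet as wn

def hypernym_is_valid(candidate, secondary_terms, threshold=0.2):
    for cand in wn.synsets(candidate):
        for term in secondary_terms:
            for sec in wn.synsets(term):
                sim = cand.path_similarity(sec)
                if sim is not None and sim >= threshold:
                    return True
    return False

print(hypernym_is_valid("retriever", ["dog", "pet"]))  # expected: True
print(hypernym_is_valid("island", ["dog", "pet"]))     # expected: False
```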
3.5 Copyright Infringement
Rappaport et al. [38] stated that even though web scraping is currently legal, laws regarding the potential copyright infringement caused by web scraping are changing rapidly. Paul [39] proposed multiple tips for limiting potential copyright infringement. For an automated image web scraper, the main applicable tip is to focus on public data and APIs when collecting the data. Accordingly, a web scraper should primarily be limited to searching through search engines in a way that limits the number of requests sent, because search engines themselves actively remove copyrighted materials from their pages, as can be seen in Google Support [40]. Preferably, the dataset collected through web scraping would be discarded after its immediate use. When generating a model, the web scraping application could discard the images immediately so that only the trained model persists, resulting in the lowest possible copyright exposure. If the user wishes the image data to remain in local storage, they must actively select an option to do so.
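This discard-after-use policy maps naturally onto a temporary working directory. In the following sketch, download_images and train_model are hypothetical placeholders for the scraper and trainer; the point is only that the scraped files are deleted automatically unless the user opts in to keeping them:

```python
# Minimal sketch of the discard-after-use policy. download_images and
# train_model are hypothetical placeholders for the scraper and trainer.
import shutil
import tempfile

def scrape_and_train(query, keep_images=False):
    workdir = tempfile.mkdtemp(prefix="scraped_")
    try:
        download_images(query, workdir)  # hypothetical: fill workdir with images
        return train_model(workdir)      # hypothetical: only the model persists
    finally:
        if not keep_images:
            shutil.rmtree(workdir)       # images removed unless the user opts in
```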
4. Conclusion and Future Work
In conclusion, there is a plethora of web scraping applications, as demonstrated in the studies conducted by Glez-Pena et al. [3] and Sirisuriya [14]. However, the options for image web scraping are limited. Each study on image web scraping [2,5,7] used a different framework to construct a web scraper that downloads images in a particular way. None of the solutions offered in the existing research provides a high-accuracy, fully automated web scraping solution that includes query expansion. Accordingly, there is an opportunity to create a modern web scraping application that satisfies these requirements while implementing both query expansion and noise filtration measures.
Based on our review of the web scraping field, we determined that a fully automated web scraper, such as that proposed by Yao et al. [5], can be created using both manual and automated query expansion to generate image datasets. Owing to advancements in search engine behavior, implementing noise-checking code may not be necessary; however, if the resulting accuracy is too low, a noise management method can be applied to further improve the web scraper. The preferred web scraper, using the concepts discussed earlier, would offer a simple user flow, as depicted in Fig. 6.
After creating a web scraping tool that combines all these technologies, most modern machine learning models can be used to test the accuracy of the created datasets, which can in turn be compared with the results obtained by Zink [7] or with image datasets already available online, such as ImageNet. If the results are adequate, the proposed web scraper could be made available for research purposes, while taking measures to minimize any potential copyright infringement during automated web scraping.
Fig. 6. Potential fully automated image dataset model.