Article Information
Corresponding Author: Chao-Hsi Huang*** (chhuang@niu.edu.tw; spurs20406@yahoo.com.tw)
Kathiravan Srinivasan*, School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India, kathiravan.srinivasan@vit.ac.in
Chuan-Yu Chang**, Dept. of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Yunlin, Taiwan, chuanyu@yuntech.edu.tw
Chao-Hsi Huang***, Dept. of Computer Science and Information Engineering, National Ilan University, Yilan City, Taiwan, chhuang@niu.edu.tw; spurs20406@yahoo.com.tw
Min-Hao Chang***, Dept. of Computer Science and Information Engineering, National Ilan University, Yilan City, Taiwan
Anant Sharma****, Dept. of Computer Science and Engineering, The LNM Institute of Information Technology, Jaipur, India, y13uc025@lnmiit.ac.in
Avinash Ankur****, Dept. of Computer Science and Engineering, The LNM Institute of Information Technology, Jaipur, India, y13uc058@lnmiit.ac.in
Received: November 15, 2017
Revision received: December 12, 2017
Accepted: January 29, 2018
Published (Print): August 31, 2018
Published (Electronic): August 31, 2018
1. Introduction
In the present times, massive amounts of data are generated from various sources such as social media, science and technology institutions, weather organizations, corporate firms, web services, and so on. Such a swift upsurge in data volumes across various fields has captured the attention of the business, academic, and scientific communities. Consequently, offering tools for the storage, handling, and retrieval of information from vast volumes of data is today among the most significant issues in information technology research [1]. To meet the escalating need for data storage, manipulation, and information recovery, new data centers are being created. Moreover, the servers in these data centers consume a large proportion of energy resources for storing, analyzing, and processing such immense quantities of data, also referred to as big data.
Usually, big data denotes an assortment of data volumes so large that they can hardly be handled and manipulated with conventional data management tools, owing to their sheer size and complexity [2,3]. The term 'big data' was coined in 2005 by Roger Magoulas of O'Reilly Media. The specific and inescapable challenges of big data include the fact that the infrastructure required for handling enormous data volumes has to be built using limited resources and within severely limited processing time-periods. Also, extracting meaningful facts and figures from such data requires the use of storage clusters and intricate data applications [4].
Furthermore, these applications should possess features or functionalities such as extreme scalability, data distribution, load balancing, fault tolerance, superior availability, manipulation and information retrieval. To resolve these issues, Dean and Ghemawat [5] established the MapReduce model for processing massive volumes of data on large clusters.
Apache Hadoop is an open-source software framework employed for the distributed storage, handling, and analysis of significant data volumes [6]. The crux of Apache Hadoop comprises a storage portion, referred to as the Hadoop Distributed File System (HDFS), and a handling and analysis part termed MapReduce. A single-board computer (SBC) is an all-purpose computer constructed on a single circuit board along with the required microprocessor(s), memory, input/output and other functionalities essential for a well-designed computer [7]. The Mobile Raspberry Pi is an inexpensive SBC that can be exploited for many applications. The vital contribution of this research is the deployment of Hadoop clusters on Mobile Raspberry Pi SBCs, which provides parallel and distributed processing with augmented and robust performance. Section 2 illustrates the related work. Section 3 presents the background and system description, Section 4 elucidates the system design and implementation, and Section 5 elaborates the results and discussion. Section 6 presents the conclusion.
2. Related Work
In recent years, the Raspberry Pi has been proclaimed one of the most popular single-board computers in the world. Moreover, SBC clusters have been implemented for several business, scientific, and academic tasks. Raspberry Pi SBCs have the competitive advantages of being inexpensive and consuming little energy while offering functionality similar to that of a fully built computer. Besides, many Raspberry Pi clusters have been constructed and implemented for academic and research utilities [8-11]. Abrahamsson et al. [8] constructed a Raspberry Pi cluster comprising 300 nodes. Each node in this cluster included a Raspberry Pi Model B device, and the cluster was built as a low-cost, low-power-consumption test bed for green research and mobile data center applications.
Cox et al. [9] built a low-energy, handy, relatively cheap, and passively ventilated cluster for academic purposes, named the Iridis-Pi cluster. The Iridis-Pi cluster includes 64 Raspberry Pi Model B nodes, each equipped with a 700-MHz Acorn RISC Machine (ARM) processor, 256 MB of random-access memory, and a 16-GB Secure Digital card for local storage. Kiepert [10] constructed a Beowulf cluster by deploying 32 Raspberry Pi Model B devices, a 48-port 10/100 switch, Arch Linux ARM, and the MPICH3 implementation of the Message Passing Interface (MPI). Besides, an MPI program that computes the value of π using the Monte Carlo technique was employed for measuring the performance of this cluster.
Tso and his colleagues constructed the Glasgow Raspberry Pi cloud, comprising 56 nodes. Each node in this cluster was made up of a Raspberry Pi Model B device along with the PiCloud software stack built over Linux containers for system virtualization. PiCloud imitates every layer of a cloud stack, from the virtualization of resources to the behavior of the networks. Additionally, this framework offers an environment for academic and cloud computing research [11]. Schot [12] constructed a Raspberry Pi cluster as a substitute for several 1U rack servers; his purpose was to build a low-cost, low-energy, air-ventilated cluster that could stand in for typical rack servers in a big data setting. Kaewkasi and Srisuruk [13] constructed a Hadoop cluster for processing big data, built over 22 ARM devices. Subsequently, Hadoop's MapReduce was substituted by Spark, and their research studied the power consumption and the input/output performance of the hardware. Qureshi et al. [14] used inexpensive Hadoop clusters to analyze the performance of a cloud computing conceptualization for running several applications in computer vision and robotics. Hajji and Tso [15] established their research by building a cloud model based on the Raspberry Pi for investigating and analyzing real-time big data in different ambient scenarios. Morabito [16] built a model to assess the performance of various low-power devices for managing container virtualization.
Compared to the existing frameworks, this research focuses on evaluating the performance of the designed system under realistic conditions. The clusters are formed using third-generation Raspberry Pi units combined with Apache Hadoop to efficiently process the big data generated. Compared with previous methods, the combination of Hadoop and the Raspberry Pi 3 offers qualities that earlier approaches often lacked: the setup is easily scalable, cheap, secure, and highly fault tolerant. The low cost of Raspberry Pi clusters and the scalability of the Hadoop environment make the platform suitable for a variety of complex operations in this era of big data.
3. Background and System Description
In this section, the details of components such as the SBC, Hadoop, and the SURF (Speeded-Up Robust Features) algorithm are presented.
3.1 Single Board Computer
In general, an SBC is a concise and complete computer assembled on a single circuit board with microprocessor(s), memory, input/output ports, and the other functional modules necessary for a fully functional computer. In the present scenario, smart devices and gadgets such as the Raspberry Pi, cell phones, Arduino boards, tablets, and notebook computers are built around commonly available single-board computers. In comparison with desktop personal computers, an SBC does not rely on expansion slots for peripheral functions. Currently, the term SBC is also associated with an architecture in which the SBC is plugged into a backplane to form a computer bus. Utilizing an SBC lessens the total cost by decreasing the number of circuit boards required and by eliminating the need for bus driver circuits and connectors.
3.2 Hadoop
Hadoop is an open-source software framework built by Apache. It uses the MapReduce programming model to process big data sets and manage distributed storage. Hadoop clusters are built from commodity hardware under the assumption that hardware failures are a common occurrence and should be handled automatically by the framework.
HDFS, which handles the storage part, is at the core of Apache Hadoop, while the MapReduce programming model handles the processing part. In the Hadoop model, all files are split into large blocks, which are then distributed among multiple nodes of the cluster. The data is then processed in parallel on these nodes using transferred packaged code. In this method, each node manipulates and processes the data that it can access locally. Taking advantage of this data locality allows the dataset to be processed more efficiently and faster than in a traditional supercomputer architecture using a parallel file system. In the conventional architecture, computation and data distribution are done via a high-speed network, whereas in the Hadoop architecture they are accomplished on the same node.
The Hadoop framework [19] consists of multiple modules such as YARN, HDFS, and so on. The libraries and utilities required by the other Hadoop modules are contained in the Hadoop Common module. YARN provides a resource management platform capable of managing computing resources across the cluster, allowing them to be used efficiently for scheduling user applications. The distributed file system known as HDFS stores the data on the nodes and provides high aggregate bandwidth across the cluster. The Hadoop MapReduce module implements the MapReduce programming model for large-scale data processing. The architectures of two different versions of Hadoop can be seen in Fig. 1.
Fig. 1. The architectural difference between Hadoop 1.0 and Hadoop 2.0.
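As a minimal illustration of the MapReduce programming model described above, the classic word-count job written against Hadoop's Java MapReduce API is sketched below. This is a generic textbook example, not the feature-matching program developed later in this paper.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input split processed locally by a node.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
      }
    }
  }
  // Reduce phase: sum the counts collected for each word across all mappers.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}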
3.3 OpenCV
OpenCV (Open Source Computer Vision Library) is a vision library originally developed by Intel that provides a large number of functions for image processing. It offers C and C++ interfaces that help in implementing standard procedures and algorithms for image processing and computer vision. Some features are based on state-of-the-art papers, and using them directly instead of re-implementing them saves much time. The OpenCV code is optimized, leading to efficient processing. SURF is the OpenCV algorithm implementation commonly used to capture image feature points, as explored below.
3.3.1 SURF algorithm
SURF, which improves upon SIFT, was proposed by Bay et al. [17] in 2006. Its processing speed is faster than that of the SIFT algorithm, since the computations are significantly reduced. The algorithm is divided into six parts.
i) Integral images: The use of integral images is the main reason for the overall performance improvement of the algorithm. Only three additions (and four memory accesses) are needed to calculate the sum of grayscale values over any rectangular region. The box filter is used to make the calculations faster.
ii) Hessian matrix based interest points: The determinant of the Hessian matrix is used to detect the feature points in the SURF algorithm. Candidate points are localized more stably because the determinant is used to select the local extrema. The Hessian matrix H(χ, σ) is defined in Eq. (4) for an input image I, a point χ, and a scale σ (the equation is reconstructed after this list).
In Eq. (4), Lxx(χ, σ) denotes the convolution of the Gaussian second-order derivative with the image I at point χ, and a box filter is used to approximate this Gaussian second-order differential operation. As a result, the box-filter approximation can be evaluated faster than the difference-of-Gaussians (DoG) used in the SIFT algorithm.
iii) Scale space representation: In the usual pyramid approach, the input image is convolved with a Gaussian, subsampled, and then convolved with a Gaussian again at the next level. The SURF algorithm, by contrast, does not change the size of the original image: different scales are obtained by applying filters of the appropriate (increasing) size, which makes each layer independent of the information in the previous layer.
iv) Interest point localization: This part of the algorithm is similar to that of SIFT, except that the comparative value used here is the determinant of the approximated Hessian matrix, obtained through Eq. (5) (also reconstructed after this list).
v) Orientation assignment: The orientation of the point of interest is found to achieve rotational invariance. Keeping the candidate point as the center, the Haar wavelet responses are computed within a circular neighborhood of radius 6s, where s is the scale at which the point was detected. The responses are then weighted with a Gaussian function. The dominant orientation is estimated by summing the horizontal and vertical responses within a sliding orientation window of size π/3. The two summed responses yield a local orientation vector, and the longest such vector defines the orientation of the point of interest.
vi) Descriptor based on the sum of Haar wavelet responses: A square region of size 20s is extracted, centered on the point of interest and oriented as found in the previous step. This region is split into 4×4 subregions, and the Haar wavelet responses are computed at 5×5 regularly spaced sample points within each subregion. The descriptor is thus represented as a 64-dimensional vector (see the reconstruction that follows this list).
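The equations referenced in steps ii), iv), and vi) are not reproduced in this version of the text; the following reconstruction follows the original SURF formulation of Bay et al. [17] and is provided only as a reader's aid:

H(\chi, \sigma) =
\begin{pmatrix}
L_{xx}(\chi,\sigma) & L_{xy}(\chi,\sigma) \\
L_{xy}(\chi,\sigma) & L_{yy}(\chi,\sigma)
\end{pmatrix}
\qquad \text{(4)}

\det(H_{\mathrm{approx}}) = D_{xx} D_{yy} - \left(0.9\, D_{xy}\right)^{2}
\qquad \text{(5)}

where D_{xx}, D_{yy}, and D_{xy} are the box-filter approximations of the corresponding Gaussian second-order derivatives and the weight 0.9 balances the approximation. For step vi), each of the 4 \times 4 subregions contributes the four sums

v_{\text{sub}} = \left( \sum d_x,\ \sum d_y,\ \sum |d_x|,\ \sum |d_y| \right)

of the horizontal and vertical Haar wavelet responses d_x and d_y, yielding a 4 \times 4 \times 4 = 64-dimensional descriptor.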
4. System Design and Implementation
In this section, we introduce the method and the experimental setup used for the research. First, the "Development Tools and System" section explains the choice of tools and system, and then the parallel cluster setup is explained in the next two sections. Finally, the "Image Feature Point Processing" section explains how we determine whether image feature points are similar.
4.1 Development Tools and System
Eclipse LUNA is used to develop the Hadoop MapReduce program. The MapReduce code is written in Java on the Eclipse development platform and then exported as a JAR file, which is submitted to Hadoop for execution.
4.2 Experimental Equipment and System Platform
We use several sets of Mobile Raspberry Pi devices to build several Hadoop clusters, so that we can compare image feature points on each cluster unit. First, we create the Hadoop platform to calculate the massive number of image feature points, and then we observe the effect of these calculations on each cluster. Table 1 below shows the experimental setup used in the process.
Table 1. Details about the experimental setup used in the process
4.2.1 Setting the cluster network environment
In Hadoop, one master controls all of the DataNode and TaskTracker nodes, and Hadoop relies on SSH for communication and data transmission between nodes. A LAN setup connects all the nodes, and only the master can communicate outside the network. This design avoids exposing all the nodes on a public network and also provides a faster transmission rate. Table 2 displays an example of the network configuration used for a cluster.
Table 2. Example of network configuration for a cluster
4.3 Hadoop Settings
While establishing the Hadoop cluster units, some of the necessary parameters of the nodes and the master should be identical, while parameters such as memory settings, CPU usage, and so on can differ. Each slave node reports its resources to the master, which then assigns the unified workload. The main profiles in Hadoop 2.X include hadoop-env.sh, mapred-env.sh, slaves, and yarn-env.sh, which can be adjusted according to individual needs and the hardware. The primary settings include the Java path, Java memory use, the run file path, various resource configuration settings, the log file location, log memory use, computing resource usage, MapReduce memory usage, the slave node names, and so on. Most of these settings are configured on the master, while the slaves only configure a few files, including core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml, as explained further.
4.3.1 Core-site setting
The primary purpose of this parameter file is to specify the NameNode hostname and the network traffic port number used, which is set to 9000 for this research. Fig. 2 shows the details regarding the other parameter settings of core-site.xml.
Fig. 2. Code snippet for core-site.xml.
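Since the snippet in Fig. 2 is reproduced only as a figure, an illustrative core-site.xml consistent with the port described above is shown here; the hostname "master" and the temporary directory are assumptions, not the exact values used in the experiments.

<?xml version="1.0"?>
<!-- Illustrative core-site.xml: point all nodes at the NameNode on the master, port 9000. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/pi/hadoop/tmp</value> <!-- assumed local path on each Raspberry Pi -->
  </property>
</configuration>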
4.3.2 HDFS-site setting
This parameter file controls the HDFS NameNode storage path, the data storage path, and so on, as shown in Table 3. We set the master as the HDFS host, the master (NameNode) communication port to 50070, and the secondary NameNode communication port to 50090. The replication factor is set to three, so that each block is copied into three parts. The size of each block is set to 64 MB, which helps prevent wastage of space; the speed of operation is not affected by these parameter adjustments. Fig. 3 shows the details regarding the other parameter settings of hdfs-site.xml.
Table 3. Some parameters and their details for hdfs-site.xml
Fig. 3. Code snippet for hdfs-site.xml.
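An illustrative hdfs-site.xml consistent with the replication factor, block size, and ports given above is sketched below; the storage paths are assumptions.

<?xml version="1.0"?>
<!-- Illustrative hdfs-site.xml: 3-way replication, 64-MB blocks, NameNode and secondary NameNode ports. -->
<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.blocksize</name><value>67108864</value></property> <!-- 64 MB -->
  <property><name>dfs.namenode.http-address</name><value>master:50070</value></property>
  <property><name>dfs.namenode.secondary.http-address</name><value>master:50090</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/home/pi/hadoop/namenode</value></property> <!-- assumed -->
  <property><name>dfs.datanode.data.dir</name><value>/home/pi/hadoop/datanode</value></property> <!-- assumed -->
</configuration>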
4.3.3 Mapred-site setting
This XML parameter file manages some parameters, shown in Table 4, and the amount of resources used by MapReduce at execution time. Usually, the map and reduce operations do not start at the same time; the reduce operation typically starts after about 5% of the map operation has completed. Fig. 4 shows the details regarding the other parameter settings of mapred-site.xml.
Fig. 4. Code snippet for mapred-site.xml.
Table 4. Some parameters and their details for mapred-site.xml
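An illustrative mapred-site.xml consistent with the description above (MapReduce running on YARN, with the reduce phase starting after 5% of the maps complete) is shown below; the memory values are assumptions.

<?xml version="1.0"?>
<!-- Illustrative mapred-site.xml: run MapReduce on YARN; reduce starts after 5% of maps complete. -->
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <property><name>mapreduce.job.reduce.slowstart.completedmaps</name><value>0.05</value></property>
  <property><name>mapreduce.map.memory.mb</name><value>1024</value></property> <!-- assumed -->
  <property><name>mapreduce.reduce.memory.mb</name><value>1024</value></property> <!-- assumed -->
</configuration>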
4.3.4 Yarn-site setting
This parameter file controls the usage of resources, some of the virtual memory settings, the NodeManager (NM) related parameters, and so on, as shown in Table 5, as well as the virtual CPU number settings and the ResourceManager (RM) related parameters. In this experimental setup, the communication port between the RM and the NMs is set to 8025, and port 8050 is used for client-RM communication, i.e., to submit jobs and kill tasks.
Table 5. Some parameters and their details for yarn-site.xml
The number of virtual cores used by each node is set equal to the number of physical cores, since the Raspberry Pi core clock is not high. We do not change the lower and upper limits on the memory usage of a single task, which are 1024 MB and 8192 MB, respectively. The other parameter settings can be seen in Fig. 5.
Fig. 5. Code snippet for yarn-site.xml.
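An illustrative yarn-site.xml consistent with the ports, memory limits, and virtual core setting described above is sketched below; the shuffle service entry and the value of four vcores (matching the four physical cores of the Raspberry Pi 3) are assumptions.

<?xml version="1.0"?>
<!-- Illustrative yarn-site.xml: RM/NM and client ports, per-task memory limits, and vcores as described. -->
<configuration>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>master:8025</value></property>
  <property><name>yarn.resourcemanager.address</name><value>master:8050</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>4</value></property> <!-- matches the physical cores -->
</configuration>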
4.4 Image Feature Point Processing
4.4.1 Feature point extraction
We use the SURF algorithm in OpenCV on the input image to extract its feature points. The result is an N × 64 matrix of feature point values, where N is the number of feature points. The feature points are stored as a two-dimensional array in which each row corresponds to one feature point and holds the 64 descriptor values used for comparison, as shown in Fig. 6.
Fig. 6. SURF feature point matrix.
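Although the extraction in this work is described in terms of OpenCV's C/C++ interface, the same step can be sketched with OpenCV's Java bindings as below, assuming OpenCV has been built with the xfeatures2d contrib module that contains SURF; the image file name is a placeholder.

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfKeyPoint;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.xfeatures2d.SURF;

public class SurfExtract {
  public static void main(String[] args) {
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME);               // load the native OpenCV library
    Mat image = Imgcodecs.imread("input.jpg", Imgcodecs.IMREAD_GRAYSCALE); // placeholder file name
    SURF surf = SURF.create();                                  // default Hessian threshold, 64-D descriptors
    MatOfKeyPoint keypoints = new MatOfKeyPoint();
    Mat descriptors = new Mat();                                // becomes the N x 64 matrix of Fig. 6
    surf.detectAndCompute(image, new Mat(), keypoints, descriptors);
    System.out.println("Feature points: " + descriptors.rows()
        + ", descriptor dimensions: " + descriptors.cols());
  }
}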
4.4.2 Scale distance algorithm
Each feature point comprises 16 subregions, and each subregion contains four values; thus each feature point is a 16 × 4 = 64 dimensional vector. We compare feature points that are similar with regard to scale distance. A threshold value is used to judge the similarity of two feature points; if the distance between two feature points is not within the threshold, we do not treat them as a match, as shown in Fig. 7.
If we did not set a threshold value, the algorithm might end up matching feature points that are relatively close in scale but not actually similar. Typically, the smaller the threshold value, the more stringent the matching; however, a value that is too small may reject genuine matches, since two images of the same object will not have identical feature point values.
Fig. 7. The feature-point scale algorithm flow chart.
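The comparison just described can be sketched as follows, assuming the scale distance is the Euclidean distance between two 64-dimensional SURF descriptors (the usual choice for SURF matching); the threshold value and class names are illustrative only.

// A minimal sketch of the threshold-based scale-distance comparison.
public final class FeatureDistance {
  private FeatureDistance() {}

  // Euclidean distance between two 64-dimensional SURF descriptors.
  public static double distance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // Counts how many sample descriptors lie within the threshold of any source descriptor.
  public static int countMatches(float[][] sample, float[][] source, double threshold) {
    int matches = 0;
    for (float[] s : sample) {
      for (float[] t : source) {
        if (distance(s, t) < threshold) { matches++; break; }
      }
    }
    return matches;
  }
}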
4.5 Performance Evaluation
We first use OpenCV to extract all the image feature points and then store them in text format in Hadoop HDFS. The MapReduce program runs the scale-distance algorithm in the map phase, which counts, at the feature point scale, all the matches between the samples and the source images; the matched samples are therefore quite similar to the source image. This information is passed to the reduce phase for the final screening and comparison, which returns each sample number together with its number of similar points.
To evaluate performance, we observe the average time taken by the Raspberry Pi clusters and by a single PC.
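As a concrete illustration of the pipeline just described, the sketch below shows how such a matching job could be structured with Hadoop's Java MapReduce API. It is not the authors' program: the input format (one line per sample feature point, holding a sample identifier and 64 comma-separated descriptor values), the side file source.txt of source-image descriptors, the threshold value, and all class names are assumptions made for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for each sample feature point, emit (sampleId, 1) when it lies within the
// scale-distance threshold of any feature point of the source image.
public class MatchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private static final double THRESHOLD = 0.15;              // assumed threshold value
  private final List<float[]> source = new ArrayList<>();    // source-image descriptors

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the source-image descriptors from a side file shipped with the job
    // (e.g., via the distributed cache); "source.txt" is a hypothetical name.
    try (BufferedReader r = new BufferedReader(new FileReader("source.txt"))) {
      String line;
      while ((line = r.readLine()) != null) source.add(parseDescriptor(line));
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);        // sampleId <TAB> 64 comma-separated values
    if (parts.length < 2) return;
    float[] desc = parseDescriptor(parts[1]);
    for (float[] src : source) {
      if (distance(desc, src) < THRESHOLD) {
        ctx.write(new Text(parts[0]), ONE);
        break;
      }
    }
  }

  private static float[] parseDescriptor(String csv) {
    String[] tokens = csv.split(",");
    float[] v = new float[tokens.length];
    for (int i = 0; i < v.length; i++) v[i] = Float.parseFloat(tokens[i]);
    return v;
  }

  private static double distance(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
    return Math.sqrt(sum);
  }
}

// Reduce phase: sum the matches per sample to obtain the number of similar feature points.
class MatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
      throws IOException, InterruptedException {
    int matches = 0;
    for (IntWritable v : values) matches += v.get();
    ctx.write(key, new IntWritable(matches));
  }
}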
5. Results and Discussion
The experiment is set up in an IDE with a working MapReduce environment that outputs a Java executable (JAR) file, which is then passed to Hadoop on the master host. To set up the Hadoop environment, we use SSH through the Linux terminal to confirm the connections between the nodes, and we then issue the other necessary instructions from the master node to operate Hadoop.
5.1 Hadoop Setup
The NameNode format instruction is run at the master node to complete the Hadoop settings and count the number of nodes in the cluster through SSH communication, so that the nodes can create the HDFS folders correctly. To complete the process, the master node must then start the distributed file storage system.
After completing the above process, the image feature point samples are imported into Hadoop's distributed storage system. First, we start Hadoop HDFS by running start-dfs.sh in the master terminal. The nodes are then connected through SSH, and the master can view the HDFS information via hdfs dfsadmin -report. Finally, the master node issues the hadoop dfs -put command to upload all the information to the distributed storage system.
Issuing the jps instruction in a node terminal shows the NodeManager running, while issuing jps in the master terminal shows the ResourceManager running; this means that the Hadoop YARN system is active and running smoothly. We can also use the YARN browser interface to view information about the cluster nodes and to observe the MapReduce jobs running.
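For readability, the command sequence described in this and the previous subsection can be summarized as follows; the directory names and the JAR file name are placeholders rather than the exact values used in the experiments.

hdfs namenode -format                        # format HDFS once on the master
start-dfs.sh                                 # start the NameNode and the DataNodes
hdfs dfsadmin -report                        # confirm that all DataNodes have joined
hadoop dfs -put features/ /input             # upload the feature point files to HDFS (paths are placeholders)
start-yarn.sh                                # start the ResourceManager and the NodeManagers
jps                                          # list the running daemons on the master or on a node
hadoop jar FeatureMatch.jar /input /output   # submit the exported MapReduce JAR (name is hypothetical)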
5.2 Experimental Results
After completing the above Hadoop cluster setup, the MapReduce program is used to identify the feature points of the image and find the image similar to the input data, to test the overall performance. Table 6 shows the details about the characteristics of the HDFS data file.
Table 6. Details about the characteristics of the HDFS data file
5.2.1 Single computer performance
We use an Intel Core i5-4440 CPU (3.1 GHz) with 8 GB of memory for this experimental setup and calculate the time consumed per 100 operations from the total operation time, as shown in Figs. 8 and 9 and Table 7. From the experimental results, we can observe that the amount of data does not affect processing efficiency, as the values of average time consumed are very close to each other.
Fig. 8. Graph showing total average time consumed by a single computer for different classes of dataset.
Fig. 9. Graph showing average time consumed by a single computer per 100 datasets for different classes of dataset.
Table 7. The average range of time consumed by a single computer per 100 datasets for each data test group
5.2.2 Raspberry Pi cluster performance
We set up six Mobile Raspberry Pi Hadoop clusters, containing 5, 6, 7, 8, 9, and 10 Raspberry Pi nodes, respectively. As with the single computer, we observe the changes in the volumes of data and their influence on the performance of each cluster, as shown in Figs. 10-15 and Tables 8-13.
Fig. 10. Graph showing average time consumed by a 5-node cluster for different classes of dataset.
Table 8. The average range of time consumed by a 5-node cluster per 100 datasets for each data test group
Fig. 11. Graph showing average time consumed by a 6-node cluster for different classes of dataset.
Table 9. The average range of time consumed by a 6-node cluster per 100 datasets for each data test group
Fig. 12. Graph showing average time consumed by a 7-node cluster for different classes of dataset.
Table 10. The average range of time consumed by a 7-node cluster per 100 datasets for each data test group
Fig. 13. Graph showing average time consumed by an 8-node cluster for different classes of dataset.
Table 11. The average range of time consumed by an 8-node cluster per 100 datasets for each data test group
Fig. 14. Graph showing average time consumed by a 9-node cluster for different classes of dataset.
Table 12. The average range of time consumed by a 9-node cluster per 100 datasets for each data test group
Fig. 15. Graph showing average time consumed by a 10-node cluster for different classes of dataset.
Table 13. The average range of time consumed by a 10-node cluster per 100 datasets for each data test group
Fig. 16. Graph showing a comparison of total average time consumed by each node cluster for different classes of dataset.
We can observe in Figs. 16 and 17 that the performance of the cluster with ten nodes is about 90% higher than that of the cluster with five nodes, and that the computation time decreases as the number of nodes increases. Further, this indicates that Hadoop treats all nodes as individual workers and does not share the resources of individual nodes. For example, with 128,000 data test groups, each additional cluster node enhances performance by about 10% over the previous arrangement. If the dataset is too small, increasing the number of nodes in the cluster does not significantly impact computing efficiency.
Fig. 17. Graph showing a comparison of average time consumed by each node cluster per 100 datasets for different classes of dataset.
Fig. 18. Graph showing a comparison of total average time consumed by a 10-node cluster and by a single computer for different classes of dataset.
5.2.3 Performance comparison
We compare the computing performance of the single computer and the 10-node Raspberry Pi cluster for 128,000 data test groups. The single computer takes about 1.41 seconds to process 100 items of image feature information, whereas the Raspberry Pi cluster takes only 1.13 seconds. Under the same conditions with 16,000 data test groups, the single computer takes about 1.4 seconds, while the Raspberry Pi cluster takes 1.6 seconds. The detailed statistics can be seen in Fig. 18.
6. Conclusion
We use Apache Hadoop combined with a Raspberry Pi cluster to process the big data generated by feature point extraction on images. As the study shows, the MapReduce operation in Hadoop and the salient features of the Raspberry Pi 3 help us perform the experiment more efficiently than a high-performance single computer. However, the same cannot be said when we deal with a smaller dataset.
Hadoop provides the flexibility to quickly scale the setup for handling even more massive datasets. The HDFS storage used by Hadoop provides better security and higher fault tolerance than single-host storage. HDFS breaks the data into many blocks scattered over various nodes, which makes it difficult to interpret the contents of the information from a single node, thus providing better security for the data.
The low cost of the Raspberry Pi clusters and the scalable Hadoop environment can be used for a variety of complex operations in this era of big data.