1. Introduction
An innovative and unprecedented service in mobile augmented reality (AR) is providing users with the required information anywhere through an Internet of Things (IoT) network. This type of service can be found in many research studies and products [1]. Enterprises have begun to use these services to increase productivity and effectiveness in the industrial field [2]. Although these services help reduce time and cost, many issues remain to be resolved. One of them is continuously providing optimal information (or objects) adapted to users’ changing requirements and constraints. Most current AR services provide text or information to users through simple image overlay technology [3]. With this scheme, unexpected text or images are displayed on the screens of mobile devices [4]. This happens because augmented objects are not updated precisely in real time, as a number of factors must be considered to meet users’ current requirements [6]. To address this issue, many technologies have been proposed. One is to provide efficient context awareness to meet users’ requirements [5,6]. Here, context refers to environmental information that characterizes the place where a smartphone user is located.
Other researchers attempted to address this problem by providing efficient location-based services [7,8].
To date, these solutions have contributed to improving the quality of adaptive and continuous AR experiences on mobile devices. However, these studies have some limitations. First, they did not consider a selection method for the optimal object to be augmented onto the input video stream. To enable adaptive AR, the required objects should be changed intelligently and augmented on the screen of the mobile device in real time, adapting to the user’s dynamic context. Second, they did not present a scheme for handling a large number of augmented objects, which are typically too numerous to store on a single mobile device. To overcome these problems, a real-time mobile AR framework based on object similarity is presented in this study.
The remainder of this paper is organized as follows. Section 2 discusses the design and structure of the proposed method in detail, Section 3 presents an evaluation of the results, and Section 4 concludes the study.
2. Proposed Framework
The basic concept behind this approach is illustrated in Fig. 1.
Fig. 1. Service scenario using real-time mobile AR.
The idea is to allow users to acquire information using mobile devices equipped with a real-time AR framework. Whenever a user captures the image of an object using the mobile camera, the AR subsystem connects to a remote server. This system has three types of servers: a metadata server, object providers, and a control server. The metadata server contains information on each augmented object, such as its name, location, characteristics, and size. Real objects are saved on separate servers, illustrated as the object providers in Fig. 1. Object providers can be cloud servers or third-party commercial providers; a third-party commercial provider may offer useful objects through object provision services. The control server is responsible for controlling the distribution across the object providers. This network architecture extends our previous work [9,10] by adding a control server for distribution control. To realize this idea, we propose an enhanced framework, as shown in Fig. 2. The proposed framework is composed of six blocks: object detection, object augmentation, context awareness, rendering, a neural network for context detection, and neural networks for intelligent object selection. The object detection block detects real-world objects, such as faces, soccer balls, hands, automobiles, and books, from input images. The object augmentation block overlays images or text on the detected object area of the input images. This must be applied to all consecutive video frames and processed at a user-defined speed.
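The per-object record held by the metadata server can be sketched as a simple data structure. The paper lists only the fields name, location, characteristics, and size; the exact field types and the example values below are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class AugmentedObjectMetadata:
    """Metadata describing one augmented object (field types are illustrative)."""
    name: str              # object name, e.g., "basketball"
    location: str          # object provider holding the real object
    characteristics: dict = field(default_factory=dict)  # type, material, etc.
    size: int = 0          # payload size in bytes

# Hypothetical record as the metadata server might store it
meta = AugmentedObjectMetadata(
    name="basketball",
    location="object-provider-1.example.com",
    characteristics={"type": "sports", "material": "rubber"},
    size=20480,
)
print(meta.name)  # → basketball
```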
The augmented data are classified into images and text. Images are processed by applying an overlay algorithm from existing image processing techniques; text is simply overlaid on the images. The rendering block displays the augmented images on the screen. Finally, the context awareness block automatically determines the user’s location, place, purpose of visit, etc., and transfers that information to the augmentation block. In this architecture, machine learning techniques are used to detect context and to choose an object intelligently. One of the issues to be resolved in this study is searching for a target augmentation object and transmitting it to mobile devices. Hence, this study presents an effective searching scheme based on the user’s context.
The scheme consists of four steps. First, the objects of interest are identified from the input video on a mobile device. Each identified object is defined by the object’s name and the context information. The context information includes the user’s location and the type of building currently being visited, such as a school, mall, or theater. For example, suppose that a user visits a bookstore in a Kyobo building and captures a book on video. Then, the context information to be used for the augmentation object is {book, Kyobo}. Context awareness is dealt with separately. Thereafter, the name of the identified object and the context information are transmitted to an external metadata server. The metadata server keeps the metadata of objects to be augmented and chooses an optimal object for augmentation. The server searches for the most suitable object using the received information. The search is performed in two steps. First, the server searches the augmented object database for all objects that have the same name as the requested object; more than one object may match. Second, from the objects found, the server determines an optimal object for augmentation by analyzing the context information and the features of the input object. To select an optimal augmentation object, we propose a choice scheme based on the context similarity of the augmentation object. The context information contains environmental information about mobile users and should be based on some auxiliary information. The most frequently collected information for context reasoning is based on six principles: “when”, “where”, “who”, “how”, “what”, and “why” [11,12]. Here, “when” refers to the user’s visiting time, “where” represents the place of visit, “who” denotes the user’s characteristics, “how” represents the means, “what” represents a domain area, and “why” represents a purpose.
From here, we use only “who”, “when”, and “where” to calculate augmentation context similarity. Let AOC_i represent the received context information for object i and AOC_j represent the context information of augmentation object j stored in a metadata server. Then, the context similarity between the two augmentation objects can be computed using cosine similarity [13], as shown in Eq. (1):

similarity(AOC_i, AOC_j) = (AOC_i · AOC_j) / (‖AOC_i‖ ‖AOC_j‖)    (1)
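The cosine similarity of Eq. (1) can be sketched in a few lines. Here each context is assumed to be already encoded as a numeric vector; the vector values below are illustrative only:

```python
import math

def cosine_similarity(aoc_i, aoc_j):
    """Cosine similarity between two context vectors, as in Eq. (1)."""
    dot = sum(a * b for a, b in zip(aoc_i, aoc_j))
    norm_i = math.sqrt(sum(a * a for a in aoc_i))
    norm_j = math.sqrt(sum(b * b for b in aoc_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty context matches nothing
    return dot / (norm_i * norm_j)

# Identical context vectors yield the maximum similarity of 1.0
print(cosine_similarity([1, 25, 3], [1, 25, 3]))  # → 1.0
```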
Here, AOC_i is defined as the combination of the mobile user context, the location context, and the found object context, described below.
The mobile user context contains information about the user operating a mobile device, including gender, age, and job. Each context field is encoded as described below.
The gender field may take the value 1 (woman) or 2 (man), and the user’s actual age is assigned to the age field. For the job field, the overall domain value of the job is assigned. For example, a mobile user context takes the form {gender, age, job}. The location context contains the location information of the mobile device at the user’s current position, including country, city, GPS coordinates, and type of place visited. The found object context represents information on the object detected by the camera of the user’s mobile device, including the object’s type and features, such as its size. An algorithm for finding an optimal object in the metadata server is as follows:
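The two-step search described above (name filtering, then context-similarity ranking) can be sketched as follows. The database layout, field names, and provider addresses are assumptions for illustration, not the paper’s actual implementation:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two numeric context vectors (Eq. (1))."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def find_optimal_object(name, request_context, metadata_db):
    """Two-step search: filter by object name, then rank by context similarity."""
    # Step 1: all stored objects with the same name as the requested object
    candidates = [m for m in metadata_db if m["name"] == name]
    if not candidates:
        return None
    # Step 2: the candidate whose stored context best matches the request
    return max(candidates,
               key=lambda m: cosine_similarity(m["context"], request_context))

# Hypothetical metadata database: context vectors encode {gender, age, job}
db = [
    {"name": "book", "context": [1, 25, 3], "provider": "10.0.0.5"},
    {"name": "book", "context": [2, 60, 1], "provider": "10.0.0.6"},
    {"name": "shoe", "context": [1, 25, 3], "provider": "10.0.0.7"},
]
best = find_optimal_object("book", [1, 24, 3], db)
print(best["provider"])  # → 10.0.0.5
```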
Metadata include a server’s location, such as an IP address, where the optimal objects are stored. In the fourth step, the user’s mobile device obtains an augmented object from that server using the received location information, and the object is combined with the input image. The entire process for searching augmentation objects is illustrated in Fig. 3.
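The fourth step, fetching the augmented object from the location carried in the metadata, might look like the sketch below. The URL scheme and field names are assumptions; the paper only states that the metadata carry the storing server’s location (e.g., an IP address):

```python
def build_fetch_url(metadata):
    """Compose the download URL for an augmented object from its metadata.
    The /objects/<id> path scheme is a hypothetical convention."""
    return f"http://{metadata['provider']}/objects/{metadata['object_id']}"

# Hypothetical metadata entry returned by the metadata server
url = build_fetch_url({"provider": "10.0.0.5", "object_id": "book-042"})
print(url)  # → http://10.0.0.5/objects/book-042
```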
Fig. 3. Optimal object searching procedure.
3. Evaluation
3.1 Implementation Results
To evaluate the feasibility of the proposed approach, a prototype system was implemented. Additional software components were added on top of the Android operating system, which provides application programming interfaces (APIs) for implementing applications. Table 1 presents a summary of the specifications of the prototype system.
Table 1. Prototype system specifications
Fig. 4 illustrates a screen from the implementation results. On the screen, the basketball and shoe images are displayed on the camera preview of the smartphone, and relevant details such as price, name, and material are displayed on each object. The current real-time AR phone detects only basketballs and shoes; in future work, more objects need to be detected and more functions implemented. In this version, only a small part of the proposed framework was implemented, merely to evaluate its feasibility and effectiveness.
Using the prototype system, a user experience evaluation was conducted with mobile users. Thirty students, aged 21–30 years, participated in the experiment. A room was designed to resemble a store, in which a basketball, a book, and shoes were placed on a desk. Thereafter, two mobile phones were distributed to the students, and they were allowed to enter the room. One mobile phone was not equipped with real-time AR services, while the other phone was. Each subject first tried to determine the retail price and to find similar products by searching the Internet using the non-AR phone, and then repeated the exercise using the AR-enabled phone. When a product was captured using the AR-enabled phone, images of similar products and their prices were automatically displayed on the screen. Subsequently, the subjects were asked to answer each question on the test sheet, as shown in Table 2.
Fig. 4. Screen of the prototype implementation based on the proposed framework.
Table 2. Questions given for the user experience (UX) evaluation
The results of the experiment are shown in Table 3. Clearly, the majority of the 30 subjects responded “yes” to questions Q1, Q2, and Q5. For Q3, only 11 subjects responded affirmatively, while 19 responded negatively. This is because the AR-enabled mobile phone used in the experiment was not complete; it was a prototype system. Some subjects noted that the AR-enabled mobile phone was interesting, but that they would only use it once it was fully developed.
For question Q4, most subjects responded “no”, except for five students. This implies that more augmented objects need to be included on the provided servers. In reality, not enough objects had been accumulated because the evaluation system was intended only as a prototype.
Table 3. User experience test results for the real-time AR-enabled mobile phone
3.2 Qualitative Evaluation
To discuss the feasibility of the proposed approach qualitatively, we compared it with existing AR applications. A comparison between the intelligent AR service and existing AR services is presented in Table 4.
Table 4. Comparison between the intelligent AR service and existing AR services
To provide AR services on a mobile phone, it is necessary to effectively store and search for the information or images to be added to the input images. Existing AR systems typically store augmented objects as files on the mobile phone and read a file to add it to the corresponding image. This method is suitable when there are few augmented objects, but it cannot be used when there are a large number of them. This work proposes a technique that can effectively store and search for a large number of augmentation objects based on a metadata server. Next, to enhance the accuracy of the information provided, the AR system should find, among the many augmentation objects, the object closest to the one required by the user. However, existing AR applications combine the input video with stored objects without considering the user’s intentions and situation. In the proposed scheme, the object to be augmented is intelligently selected by considering object similarity. Furthermore, to provide context awareness, existing AR applications utilize users’ information and location. Users’ information includes personal details such as age, job, and gender, along with their location and GPS coordinates. In our approach, one more factor is considered in addition to this information: the detected object’s characteristics. The detected object is an image or text detected from the input preview through object detection technology. To enhance the accuracy of the sensed context, characteristics including the object’s name, type, and features are used. For the object combining scheme, dynamic object augmentation is proposed: a required object is flexibly combined with a detected object, whereas in existing AR a stored image is simply overlaid on an input video frame.
The process of finding and transferring the desired information during a video call is currently cumbersome and inconvenient because either the current call must be canceled or the screen must be switched to search mode. However, if users use the proposed scheme, they can immediately transfer the necessary information or images to the other party’s phone without going through the aforementioned steps.
4. Conclusion
To enable adaptive AR, the required objects or information should be intelligently changed and augmented in real time on the preview screen of the mobile device, adapting to the user’s dynamic context. To date, many solutions have been proposed to increase this adaptability by providing efficient context awareness. However, existing AR solutions have limitations. First, they do not consider selection criteria for the optimal object to be augmented onto the input video stream. Second, there has been no research on how to efficiently handle a large number of augmented objects. Third, there is no scheme to integrate mobile AR with Session Initiation Protocol (SIP)-based video telephony. To overcome these limitations, a real-time AR framework was proposed in this work. To evaluate the feasibility of the proposed scheme, a prototype system was implemented on an Android smartphone. Using this system, a qualitative evaluation based on questionnaires was conducted, and a comparison with existing AR solutions was performed. The experimental results show that the proposed framework provides a better user experience than existing smartphones, and that users can conveniently obtain additional information on products or objects through fast AR services. Future work is as follows. First, more subjects need to participate in a comparison experiment to increase the reliability and validity of the evaluation. Second, an object selection scheme based on machine learning needs to be integrated to improve the accuracy of the information provided.
Acknowledgement
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2018R1D1A1B07045589).