Delivering Augmented Information in a Session Initiation Protocol-Based Video Telephony Using Real-Time AR

Sung-Bong Jang and Young-Woong Ko*

Abstract: Online video telephony systems have seen increasing use across many industries because of the spread of coronavirus disease 2019 (COVID-19). The existing session initiation protocol (SIP)-based video call system is widely and usefully deployed, but it has a limitation: it is very inconvenient for users to transmit additional information to the other party in real time during a conversation. To overcome this problem, an enhanced scheme based on real-time augmented reality (AR) is presented. In this scheme, augmented information is automatically retrieved from the Internet and displayed on the user's device during video telephony. The proposed approach was qualitatively evaluated by comparing it with other conferencing systems. Furthermore, to evaluate its feasibility, we implemented a simple network application that can generate SIP call requests and answers with AR object pre-fetching. Using this application, the call setup time was measured and compared between the original SIP scheme and the pre-fetching scheme. The advantage of this approach is that it increases the convenience of a user's mobile phone by providing a way to automatically deliver the required text or images to the receiving side.

Keywords: Augmented Information Delivery, Augmented Reality, Session Initiation Protocol, Video Telephony

1. Introduction

The spread of coronavirus disease 2019 (COVID-19) is rapidly shifting work toward no-contact modes across a wide range of industries [1]. Online conferencing systems based on video telephony, such as Zoom or Webex, have become indispensable tools in most industries [2]. The session initiation protocol (SIP) is a set of procedures defined by the Internet Engineering Task Force (IETF) to support seamless video communication [3]. The procedures are divided into four categories: call setup, call connection, traffic exchange, and tear-down [4]. The existing SIP-based video call system is used efficiently, but it has a limitation: it is very inconvenient for users to transmit additional information about a specific object to the other party in real time [5]. Recently, numerous studies related to video conferencing have been actively conducted. Cavaleri et al. [6] presented a scheme based on a video conferencing system to install, maintain, and train workers on factory equipment without contact during COVID-19. By using this system, workers can avoid infection while improving efficiency. Gupta [7] described a hybrid teaching approach that mixes in online video lectures for fashion design students to cope with COVID-19. The biggest advantage of this method is that students can repeatedly access the lecture material they need at any time without direct contact. Portillo and Alvarado [8] described their experiences from a case study in which they conducted classes with the help of video conferencing systems to continue teaching during the COVID-19 pandemic. In this study, they conducted an experiment to measure learners' feelings and motivation when using distance learning tools. The results confirmed that the objectives of the courses can be attained through online conferencing systems. Melenli and Topkaya [9] presented a system that efficiently identifies faces in input streaming frames during online video conferencing to improve social distancing during the COVID-19 pandemic.
In this system, a scheme based on big data was proposed for calculating the distance between people. To implement the system, the OpenCV vision stack and the Hadoop distributed file system (HDFS) were used. Spathis and Dey [10] argue that education worldwide needs to shift from conversational methods to indirect online teaching because of COVID-19; in particular, universities are adopting video-teaching platforms as a favorite choice. In this study, they compared the attention paid by students in face-to-face and non-face-to-face classes using Zoom. The results showed almost no difference. Mamone et al. [11] presented an alternative method based on augmented reality (AR) for the surgical process. When using this method, a surgeon can obtain a more natural view of the operation because no perspective change is required. They evaluated it by comparing it with another transparent-window system. The experimental results showed that the proposed approach can be a good alternative to a head-mounted display (HMD)-based operation system. Si-Mohammed et al. [12] discussed interesting research in which a brain-computer interface (BCI) was combined with AR. A BCI is an interface through which computers and devices are controlled by brain signals without moving a mouse or keyboard or touching a screen; its purpose is to help disabled people use computers. In this research, they evaluated the feasibility of BCI-based AR by combining a BCI device with an AR-enabled HMD device. The results showed that the two devices are compatible and operable together, with the advantage that little movement is necessary to control the HMD device. Marto et al. [13] evaluated the impact of AR on perceived senses, where perceived senses include delight, personal experience, attendance, and knowledge. The experimental results reveal that the scores for delight and knowledge improved, but the personal experience score was not affected. Cejka et al. [14] discussed the adoption of AR for underwater historical heritage to improve the experiences of diving tourists. In this research, they proposed two types of AR solutions: marker-based and acoustic-based systems. They performed a qualitative evaluation in which 60 divers were asked to score AR-based and non-AR systems. The experiments showed that the divers' experience improved when using the system. Beyond this research, there are studies related to SIP video conferencing systems, but most have focused on security [3,15]. The limitations of the aforementioned studies are as follows. First, if additional information (e.g., the price of a book) needs to be sent during an SIP video call, users must find it and reply or send it separately through chat. Second, when there is a lot of additional information, transmitting it all takes a long time. Third, to search for additional information during the SIP call, the user has to exit the SIP program currently in use and restart it later. To resolve these issues, this study proposes an AR-based technique that automatically displays augmented information on the remote video during an SIP-based video call. AR is attracting great attention because it can be employed in several areas, and researchers are actively investigating how to combine it with traditional services. The contributions of this study are as follows.
First, a technique that allows users to easily deliver necessary information by using AR in SIP-based video calls is presented. Second, an AR object encoding technique based on the Real-time Transport Protocol (RTP) is presented to transmit the information efficiently. Third, a pre-fetching technique is presented to reduce the transmission time required to deliver augmented objects during a video call. In particular, in a situation where non-face-to-face meetings are increasing because of the COVID-19 pandemic, SIP-based AR delivery is expected to be very useful. The remainder of this paper is organized as follows. Section 2 describes the proposed scheme in detail. In Section 3, an evaluation is presented. Finally, Section 4 concludes the paper.

2. Proposed Framework

The internal structure of a system (software) that supports this method is shown in Fig. 1. In Fig. 1, the blocks on the left constitute the intelligent mobile augmented reality (IMAR) framework. As the IMAR framework was presented in a previous study [16], a detailed description is not provided here. The green portion on the right of the figure depicts the international standard protocol established by the IETF for video calls. To enable intelligent augmented reality (IAR)-based video calls, video telephony augmented reality (VTAR) blocks were added, which control the video call between the IMAR blocks and the SIP protocol stacks. The details of these roles are as follows. First, the VTAR block performs SIP call connection control according to the predefined AR setting. The call connection is performed based on the procedures specified in the RFC 3261 international standard; to support AR object transmission efficiently, the call connection procedure was changed slightly. In a straightforward design, the block overlays the augmented object on the input image before sending, then encodes the result with a video codec (e.g., H.264) and transmits the encoded video to the receiving end using RTP. This method is not difficult to implement, but it can cause substantial delays during video calls. To solve this problem, a method is proposed in which the transmitting end sends the camera input image and the augmented object separately. The procedure of the proposed method is as follows. In the first step, after the SIP call connection is completed (after receiving SIP 200 OK or ACK), the method checks whether an AR-based video call is feasible on both the transmitting and receiving ends. If an AR-based video call is not possible on either end, a normal video call is performed without AR. If it is possible on both sides, the second step is performed. In the second step, the transmitting side receives metadata about the augmented object from the metadata registry (MDR) server and delivers it to the receiving end. The metadata includes the name of the augmented data and the location (Internet Protocol [IP] address) of the server where the augmented object is stored. In the third step, the receiving end retrieves the augmented object data using the server location included in the metadata. The advantage of the proposed method is that it reduces the time required to overlay augmented objects on the transmitting side.
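To make the three-step flow concrete, the following Python sketch outlines the receiving-end logic. The wire format is not specified in this paper, so the JSON-over-TCP exchange, the port numbers, and the field names (AR_METADATA_REQUEST, store_ip, and so on) are hypothetical placeholders, not the actual implementation.

```python
import json
import socket

MDR_PORT = 5080   # hypothetical port for the metadata registry (MDR) server
OBJ_PORT = 5081   # hypothetical port for the augmented-object store server

def fetch_json(host, port, request):
    """Send a JSON request over TCP and return the parsed JSON reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(json.dumps(request).encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)
        reply = b""
        while chunk := sock.recv(4096):
            reply += chunk
        return json.loads(reply.decode("utf-8"))

def ar_call_setup(local_supports_ar, peer_supports_ar, mdr_host, object_name):
    # Step 1: after 200 OK/ACK, verify that both ends can handle an AR call.
    if not (local_supports_ar and peer_supports_ar):
        return None  # fall back to a plain video call without AR

    # Step 2: ask the MDR server for metadata about the augmented object;
    # the metadata carries the object name and the store server's IP address.
    meta = fetch_json(mdr_host, MDR_PORT,
                      {"type": "AR_METADATA_REQUEST", "name": object_name})

    # Step 3: retrieve the augmented object itself from the store server.
    return fetch_json(meta["store_ip"], OBJ_PORT,
                      {"type": "AR_OBJECT_REQUEST", "name": meta["name"]})
```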
The augmented object is encoded in an RTP packet, as shown in Fig. 2(a); the structure of the packet is shown in Fig. 2(b). The fields are described as follows. The RTP header is the standard header specified in RFC 3550, which includes the 16-bit sequence number. The SIP_AROBJ_TX_START_FLAG field indicates the beginning of a message that transfers an augmented object, and its value is set to 0xF8 (0b11111000). SIP_AROBJ_TX_END_FLAG indicates the end of the message and uses the same value, 0xF8, as the end flag. The SIP_AROBJ_TX_ID field contains a unique identification value for the current message; this 8-bit value is created from a random number by the device that initially generates the message. SIP_AROBJ_TYPE indicates the type of data that the transmitted message carries. Message types are divided into four categories: an AR augmented object search request (0), a request for a searched AR augmented object (1), a response to an AR augmented object search request (2), and a response to a request for a searched AR augmented object (3). If a new type is needed, the value range can be extended. The PAYLOAD field contains the data transmitted from the sender. For example, an AR augmented object search request message includes an object name and context information, whereas a response message includes the augmented object name and information about the server containing it.
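As an illustration, the Python sketch below packs and parses the message body described above. The field order (start flag, ID, type, payload, end flag) follows the text, but since Fig. 2(b) is not reproduced here, the exact byte layout is an assumption rather than the paper's definitive format.

```python
import struct

START_FLAG = 0xF8   # SIP_AROBJ_TX_START_FLAG (0b11111000)
END_FLAG = 0xF8     # SIP_AROBJ_TX_END_FLAG uses the same value

# SIP_AROBJ_TYPE values, as enumerated in the text
TYPE_SEARCH_REQUEST = 0
TYPE_OBJECT_REQUEST = 1
TYPE_SEARCH_RESPONSE = 2
TYPE_OBJECT_RESPONSE = 3

def pack_ar_message(tx_id: int, msg_type: int, payload: bytes) -> bytes:
    """Build the AR-object message carried in the RTP payload:
    start flag, 8-bit message ID, message type, payload, end flag."""
    header = struct.pack("!BBB", START_FLAG, tx_id & 0xFF, msg_type)
    return header + payload + struct.pack("!B", END_FLAG)

def unpack_ar_message(data: bytes):
    """Parse a message built by pack_ar_message; raise on malformed input."""
    start, tx_id, msg_type = struct.unpack("!BBB", data[:3])
    if start != START_FLAG or data[-1] != END_FLAG:
        raise ValueError("missing start or end flag")
    return tx_id, msg_type, data[3:-1]

# Example: a search request carrying an object name and context information
msg = pack_ar_message(0x42, TYPE_SEARCH_REQUEST, b"book:AR Handbook;ctx=store")
print(unpack_ar_message(msg))
```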
Second, the VTAR block controls the SIP call setup procedures between the user agent client (UAC) and the user agent server (UAS). No special control is required if the call connection is made without AR; however, for an AR-based video call, the procedures must be modified to transmit augmented objects to the mobile device. Two methods can be used to achieve this. The first method is to send an overlaid augmented object combined with the input image from the transmitting side. In the second method, the receiving side fetches the augmented object separately, combines it with the video, and displays it to the user. In addition, the VTAR block controls turning the AR function on and off during video call setup according to the mobile device interface. AR can be toggled by adding a dedicated button on the mobile phone or by pressing an icon on the video call screen. If AR is activated, the call is connected according to the modified SIP call connection procedure shown in Fig. 3. In the modified procedure, the AR-enable indication is delivered using a session description protocol (SDP) capability exchange message. With this scheme, setting up a video call takes more time because an AR object must be requested and received on the sending side. To decrease this time, a pre-fetching scheme is presented in our work: the objects to be augmented are determined during call setup and transferred to the mobile device in advance, so users feel that the objects are augmented in real time. As mentioned earlier, the call setup time of the proposed system is longer than the original SIP setup time. It can be calculated using Eq. (1):

$$\mathrm{SIP}_{\mathrm{call\_setup\_time}} = \mathrm{INVITE}_t \times \mathrm{INVITE}_n + \mathrm{OK}_t \times \mathrm{OK}_n + \mathrm{Object\_Augmentation}_t \qquad (1)$$

Here, $\mathrm{INVITE}_t$ represents the time to transmit an SIP INVITE message, and $\mathrm{INVITE}_n$ represents the number of INVITE messages transmitted to set up the call. The number can be greater than one because retransmissions occur to recover lost packets when network conditions worsen. In Eq. (1), $\mathrm{OK}_t$ represents the time to transmit a call connection (OK) message, and $\mathrm{OK}_n$ is the number of transmitted OK messages; like INVITE, the OK message is retransmitted repeatedly to recover lost packets when network conditions worsen. $\mathrm{Object\_Augmentation}_t$ represents the time required to augment the target object onto the outgoing video frame.
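A small worked example of Eq. (1) is given below. The numeric values are hypothetical and chosen only for illustration; they are not measurements from this paper.

```python
def sip_call_setup_time(invite_t, invite_n, ok_t, ok_n, object_augmentation_t):
    """Eq. (1): total AR-based SIP call setup time in milliseconds."""
    return invite_t * invite_n + ok_t * ok_n + object_augmentation_t

# Hypothetical values (ms): one INVITE retransmission (n = 2),
# a single OK message, and 20 ms of object augmentation.
print(sip_call_setup_time(invite_t=2.0, invite_n=2,
                          ok_t=1.5, ok_n=1,
                          object_augmentation_t=20.0))  # -> 25.5
```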
Traffic distribution can be modeled in several ways. One way is to model it using an exponential distribution, as shown in Eqs. (2) and (3). Eq. (2) represents the probability density function, and Eq. (3) gives the corresponding cumulative distribution function used to model AR-based SIP call setup traffic. Here, c is a local parameter of the density function, and the parameter function a(x) is added to account for the time taken to augment the target object. The VTAR block ensures that the modified call connection procedure is performed when a user makes an AR-based video call. The proposed idea can be applied to searching for products in commercial stores. Consider a bookstore, for example. Many people find it difficult to locate a preferred book because a single store holds numerous books, and bookstores aim to stock as many books as possible. A few stores provide a separate search system to help customers find books easily; however, users of such systems may feel uncomfortable and need a significantly long time because they are unfamiliar with them. Furthermore, to look at the abstract or summary of a book, customers have to open each book or look it up on the Internet. Occasionally, the required book may be located on a high shelf; in such scenarios, if store clerks are unavailable, the book cannot be reached. The proposed system could be a suitable solution for such inconvenient situations. Future work will involve implementing the entire system and performing a complete experiment.

3. Evaluation

To discuss the feasibility of the proposed approach, our solution is qualitatively compared with other video communication schemes. Video communication is indispensable in the medical and educational areas in particular: telemedicine is a representative video application in the medical industry, and Zoom and Webex are widely used in education. Additional information transmission, however, is not indispensable in these areas. Table 1 presents the results of the comparison. From the viewpoint of additional transfer content, all solutions support text and image transfer during peer-to-peer video communication. Interestingly, the telemedicine framework can transmit biodata to a central server in real time: patients' biodata, such as blood pressure, temperature, and pulse, is collected to check their status, encapsulated, and transferred together with the video data. All other solutions support the transfer of text, images, and additional file data. For the signaling protocol, Zoom and Webex use proprietary protocols, meaning that they created their own protocols for setting up and tearing down video calls; in this case, it is difficult to guarantee compatibility with other video communication systems. The telemedicine framework uses the H.245 protocol specified by the ITU-T (International Telecommunication Union-Telecommunication Standardization Sector). The protocol is part of 3G-324M, which is specified by the 3GPP (third-generation partnership project) standards group. The proposed approach assumes that RFC 3261, specified by the IETF, is used to handle call setup and tear-down. Which protocol is better for the transfer of additional content depends on the objective of the application: for mobile video calls, 3G-324M is preferred, whereas in wired networks, RFC 3261 is better. One disadvantage of using standard protocols such as RFC 3261 is their vulnerability to security attacks.

Table 1. Comparison of the proposed approach with other video communication schemes
For traffic control, the telemedicine framework uses the H.223 protocol together with RTP. H.223 is used to multiplex and de-multiplex audio, video, and data so that they stay synchronized with each other, and RTP is used to deliver the multiplexed packets to the receiver in real time; a main function of the protocol is to recover lost or erroneous packets. SIP-based systems likewise use the RTP specification to deliver video and audio traffic; the difference is that SIP does not specify a separate multiplexing protocol. Zoom and Webex developed proprietary protocols for transferring video and audio data. For the additional data transfer protocol, telemedicine uses H.245, while the proposed approach uses the RTP and SDP protocols; the header of the SDP packet is longer than that of H.223. Zoom and Webex use HTTPS to transmit additional data; in this protocol, data are encoded using HTTP and encrypted using transport layer security (TLS) to enhance security. For context provision, only the proposed approach fully supports it, by providing an IAR. The telemedicine framework partially supports context-based data transfer through the H.245 protocol: location-based context information is provided with the biomedical data and is used to find a patient's location, mainly by an emergency ambulance or medical helicopter, to respond immediately to an emergency. Zoom and Webex do not support this function because these systems were developed for online meetings rather than emergency situations. To quantitatively evaluate the feasibility, we conducted an experiment to measure the call setup time for conventional SIP and for SIP with AR object pre-fetching. In the AR object pre-fetching scheme, the augmented objects are fetched in advance from the object store server, which reduces the time spent on data augmentation during video communication. For this experiment, we used three desktop computers, PC1, PC2, and PC3, where PC1 and PC2 served as the SIP UAC and UAS, respectively, and PC3 served as the AR object store server. Details of the specifications are presented in Table 2.

Table 2. Specifications of the experimental computers
In this experiment, we first attempted to use a widely available open-source SIP stack (version 3.0) to decrease the implementation time. However, modifying the stack to integrate AR object requests and transmissions would have consumed a significant amount of time, so we instead implemented a simple network application using the socket interface. To simplify the experiment, only the INVITE and 200 OK messages were implemented; other messages, such as 180 RINGING, ACK, and BYE, were not. In addition, we used a peer-to-peer call setup scheme in which no proxy server is involved. The test scenario is shown in Fig. 4. Furthermore, to simulate AR object request and transmission, we implemented a simple application in which the UAC requests AR objects and receives them from the AR object server. The objects used are not real AR objects but simple images whose sizes range from VGA to large HD. The images were stored on a separate server that waits for a client request; when the server receives a request, it randomly transmits one image file to the receiver. Therefore, the main delay is the transmission time of the requested object (a stored image). We did not implement the entire system, because doing so would consume a large amount of time; rather, we attempted to establish the feasibility of the approach. The experimental network is shown in Fig. 5. Using this application, we measured the SIP call setup time. The measurements were divided into two categories: first, the call setup time without AR object pre-fetching, and second, the time with AR object pre-fetching. The AR object to be pre-fetched is chosen randomly, so the transmission time of the object depends on the image size. The call setup time results are presented in Fig. 6. When calculating the results, sleep time was subtracted from the total call setup time; for example, for 50 calls in the conventional SIP call scheme, 50 ms are subtracted from the total call setup time of 55.21 ms, giving a final call setup time of 5.21 ms. From the results, we observe that the proposed scheme has a longer call setup time than the conventional scheme. This is natural, because before setting up the SIP call it sends an additional request for AR objects and receives the requested object; when the AR object store server receives a request, it randomly chooses one image (object) from its image file store. We also observe that the call setup time of the object pre-fetching scheme jumps from 29.23 ms to 46.65 ms between 200 and 250 calls. The reason is that if a large image is selected at the object store server, the time taken to receive it increases dramatically.
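A minimal sketch of this kind of measurement client, under stated assumptions, is shown below. The message contents, the TCP transport, the AR store port, and the prefetch toggle are illustrative guesses; the actual application is not reproduced in the paper.

```python
import socket
import time

SIP_PORT = 5060        # standard SIP port; the UAS answers an INVITE with 200 OK
AR_STORE_PORT = 6060   # hypothetical port of the AR object store server

def measure_call_setup(uas_host, ar_host, prefetch=False):
    """Measure one call setup: optional AR object pre-fetch, then INVITE/200 OK."""
    start = time.perf_counter()

    if prefetch:
        # Pre-fetch the AR object (an image chosen randomly by the server).
        with socket.create_connection((ar_host, AR_STORE_PORT)) as s:
            s.sendall(b"AR_OBJECT_REQUEST")
            s.shutdown(socket.SHUT_WR)
            while s.recv(65536):   # drain the image until the server closes
                pass

    # Simplified SIP exchange: only INVITE and 200 OK are implemented.
    with socket.create_connection((uas_host, SIP_PORT)) as s:
        s.sendall(b"INVITE sip:uas@example.org SIP/2.0\r\n\r\n")
        reply = s.recv(4096)
        assert b"200 OK" in reply

    return (time.perf_counter() - start) * 1000.0  # setup time in milliseconds

# Example run: average setup time over 50 calls with pre-fetching enabled.
times = [measure_call_setup("192.0.2.1", "192.0.2.2", prefetch=True)
         for _ in range(50)]
print(sum(times) / len(times), "ms")
```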
4. Conclusion

COVID-19 and climate change are transforming human society, and people must adapt to the changing environment to survive. The adaptation will be painful, but nature will be revived in proportion to that pain. One way to reduce carbon emissions is to minimize face-to-face contact between people; to this end, online conferencing systems will be used ever more widely. In this study, we have proposed a new AR-based scheme to overcome the inconvenience of additional information delivery in SIP-based video calls. We have evaluated the approach qualitatively by comparing it with other solutions, and quantitatively by measuring the call setup time using a simple call generation application. Future work is to implement the whole system and fully evaluate the approach.

Biography

Sung-Bong Jang
https://orcid.org/0000-0003-3187-6585
He received his B.S., M.S., and Ph.D. degrees from Korea University, Seoul, Korea, in 1997, 1999, and 2010, respectively. He worked at the Mobile Handset R&D Center, LG Electronics, from 1999 to 2012. Currently, he is an associate professor in the Department of Industry-Academy, Kumoh National Institute of Technology, Korea. His interests include augmented reality, big data privacy, and prediction based on artificial neural networks.

Biography

Young-Woong Ko
https://orcid.org/0000-0002-6292-0799
He received his M.S. and Ph.D. in computer science from Korea University, Seoul, Korea, in 1999 and 2003, respectively. He is now a professor in the Department of Computer Engineering, Hallym University, Korea. His research interests include operating systems, embedded systems, and multimedia systems.

References