# Delivering Augmented Information in a Session Initiation Protocol-Based Video Telephony Using Real-Time AR

Sung-Bong Jang and Young-Woong Ko*

## Abstract

Online video telephony systems have been increasingly used in several industrial areas because of the spread of coronavirus disease 2019 (COVID-19). Existing session initiation protocol (SIP)-based video call systems are useful; however, they make it very inconvenient for users to transmit additional information to the other party in real time during a conversation. To overcome this problem, an enhanced scheme based on real-time augmented reality (AR) is presented. In this scheme, augmented information is automatically searched for on the Internet and displayed on the user’s device during video telephony. The proposed approach was qualitatively evaluated by comparing it with other conferencing systems. Furthermore, to evaluate the feasibility of the approach, we implemented a simple network application that can generate SIP call requests and answer them with AR object pre-fetching. Using this application, the call setup time was measured and compared between the original SIP and pre-fetching schemes. The advantage of this approach is that it makes a user’s mobile phone more convenient by providing a way to automatically deliver the required text or images to the receiving side.

Keywords: Augmented Information Delivery, Augmented Reality, Session Initiation Protocol, Video Telephony

## 1. Introduction

The limitations of the aforementioned studies are as follows. First, if additional information (e.g., the price of a book) needs to be sent during an SIP video call, users must find it and then reply or send it separately through chat. Second, when there is a lot of additional information, it takes a long time to transmit all of it at once. Third, to search for additional information during the SIP call, the user has to end the SIP program currently in use and start it again later.

To resolve these issues, this study proposes an AR-based technique that automatically displays augmented information on the remote video during an SIP-based video call. AR is attracting great attention because it can be employed in several areas, and researchers are actively investigating how to combine it with traditional services. The contributions of this study are as follows. First, a technique that uses AR in SIP-based video calls to let users easily deliver the necessary information is presented. Second, an AR object encoding technique based on the real-time transport protocol (RTP) is presented to transmit the information efficiently. Third, a prefetching technique is presented to reduce the transmission time required to deliver augmented objects during a video call. SIP-based AR delivery is expected to be very useful, particularly as non-face-to-face meetings increase because of the COVID-19 pandemic.

The remainder of this paper is organized as follows. Section 2 describes the proposed scheme in detail. In Section 3, an evaluation is presented. Finally, Section 4 concludes the paper.

## 2. Proposed Framework

The internal structure of an efficient system (software) that supports this method is shown in Fig. 1.

Fig. 1. Architecture for delivering augmented information during SIP-based video telephony.

In Fig. 1, the blocks on the left refer to the intelligent mobile augmented reality (IMAR) framework. As the IMAR framework was presented in a previous study [16], a detailed description of it is not provided here. The green portion on the right in the figure depicts the international standard protocol established by the IETF for video calls. To enable intelligent augmented reality (IAR)-based video calls, video telephony augmented reality (VTAR) blocks were added, which play the role of video call control between IMAR blocks and SIP protocol stacks. The details of these roles are as follows:

First, the VTAR block controls the SIP call connection according to the predefined AR setting. The call connection follows the procedures specified in RFC 3261, changed slightly to support AR object transmission efficiently. In the simplest form of the system, the block overlays the augmented object on the camera input image before sending it, encodes the result with a video codec (e.g., H.264), and transmits the encoded video to the receiving end using RTP. This method is not difficult to implement, but it can cause substantial delays during video calls. To solve this problem, a method that sends the camera input image and the augmented object separately from the transmission end is proposed.

The proposed procedure has three steps. In the first step, after the SIP call connection is completed (after receiving SIP 200 OK or ACK), the method checks whether an AR-based video call is feasible on both the transmission and receiving ends. If it is not possible on either end, the video call proceeds without AR; if it is possible on both ends, the second step is performed. In the second step, the transmission side receives metadata about the augmented object from the metadata registry (MDR) server and delivers it to the receiving end. The metadata includes the name of the augmented data and the location (Internet Protocol [IP] address) of the server where the augmented object is stored. In the third step, the receiving end retrieves the augmented object from the server location given in the metadata. The advantage of the proposed method is that it reduces the time required to overlay the augmented objects on the transmission side. The augmented object is carried in an RTP packet, as shown in Fig. 2(a).
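The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `ARObjectMetadata`, its field names, and the two callables standing in for the MDR query and the object download are hypothetical, and no real SIP or RTP signaling is performed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ARObjectMetadata:
    """Metadata returned by the MDR server (field names are assumed)."""
    name: str        # name of the augmented data
    server_ip: str   # IP address of the server storing the augmented object


def run_ar_call_steps(
    sender_supports_ar: bool,
    receiver_supports_ar: bool,
    query_mdr: Callable[[], ARObjectMetadata],
    fetch_object: Callable[[ARObjectMetadata], bytes],
) -> Optional[bytes]:
    """Follow the three steps that run after the SIP call is connected.

    Returns the augmented object bytes obtained by the receiving end,
    or None when the call falls back to plain video without AR.
    """
    # Step 1: both ends must support AR; otherwise continue without AR.
    if not (sender_supports_ar and receiver_supports_ar):
        return None
    # Step 2: the transmission side obtains metadata from the MDR server
    # and delivers it to the receiving end.
    metadata = query_mdr()
    # Step 3: the receiving end downloads the object from the server
    # named in the metadata.
    return fetch_object(metadata)
```

Passing the MDR query and the download as callables keeps the sketch independent of any particular transport, which the text does not specify at this level of detail.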

Fig. 2. (a) Message sequence for transmitting AR augmented object using RTP. (b) Internal structure of the AR object transmission packet.

The structure of the packet is shown in Fig. 2(b). The fields are as follows. The RTP header is the standard header specified in RFC 3550, which includes a 16-bit sequence number. The SIP_AROBJ_TX_START_FLAG field indicates the beginning of the message that transfers the augmented object, and its value is set to 0xF8 (0b1111 1000). SIP_AROBJ_TX_END_FLAG indicates the end of the message and uses the same value, 0xF8, as the start flag. The SIP_AROBJ_TX_ID field contains a unique identification value for the current message. This 8-bit value is generated from a random number on the device that initially created the message. SIP_AROBJ_TYPE indicates the type of data that the transmitted message carries. Message types are divided into four categories: the AR augmented object search request, the searched AR augmented object request, the response to the AR augmented object search request, and the response to the searched AR augmented object request, which are assigned the values 0, 1, 2, and 3, respectively. If a new type is to be added, the value range can be extended. The PAYLOAD field contains the data transmitted from the sender. For example, an AR augmented object search request message includes an object name and context information, and a response message includes the augmented object name and information about the server containing it.
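A minimal Python sketch of how such a message body might be framed is given below. The text fixes the flag value (0xF8), the 8-bit identifier, and the four type values; the byte order of the fields, the one-byte width of SIP_AROBJ_TYPE, and the function names are assumptions made for illustration, and the RTP header itself is left out.

```python
import os
import struct
from enum import IntEnum
from typing import Optional, Tuple

START_FLAG = 0xF8  # SIP_AROBJ_TX_START_FLAG
END_FLAG = 0xF8    # SIP_AROBJ_TX_END_FLAG (same value as the start flag)


class ARObjType(IntEnum):
    """SIP_AROBJ_TYPE values 0 to 3 as listed in the text (names assumed)."""
    SEARCH_REQUEST = 0
    OBJECT_REQUEST = 1
    SEARCH_RESPONSE = 2
    OBJECT_RESPONSE = 3


def pack_ar_message(msg_type: ARObjType, payload: bytes,
                    tx_id: Optional[int] = None) -> bytes:
    """Frame a payload as START_FLAG | TX_ID | TYPE | PAYLOAD | END_FLAG.

    The field order and the 1-byte TYPE width are assumptions.
    """
    if tx_id is None:
        tx_id = os.urandom(1)[0]  # random 8-bit identifier
    header = struct.pack("!BBB", START_FLAG, tx_id, int(msg_type))
    return header + payload + bytes([END_FLAG])


def unpack_ar_message(data: bytes) -> Tuple[int, ARObjType, bytes]:
    """Return (tx_id, type, payload) or raise ValueError on bad framing."""
    if len(data) < 4 or data[0] != START_FLAG or data[-1] != END_FLAG:
        raise ValueError("not a framed AR object message")
    _, tx_id, raw_type = struct.unpack("!BBB", data[:3])
    return tx_id, ARObjType(raw_type), data[3:-1]
```

Because the start and end flags share one value, the parser here relies on the datagram boundary and checks only the first and last bytes; a stream-based transport would need a length field or byte stuffing, which the text does not describe.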

Second, the VTAR block controls the SIP call setup procedures between the user agent client (UAC) and the user agent server (UAS). No special control is required if the call is connected without AR; for an AR-based video call, however, the procedures must be modified to transmit augmented objects to a mobile device. Two methods can be used to achieve this. In the first method, the transmitting side sends the augmented object already overlaid on the input image. In the second method, the receiving side obtains the augmented object separately, combines it with the video, and displays it to the user. In addition, the VTAR block controls turning the AR function on and off during video call setup according to the mobile device interface. AR can be switched on and off through a dedicated button on the mobile phone or by pressing an icon on the video call screen. If AR is activated, the call is connected according to the changed SIP call connection procedure, as shown in Fig. 3.

Fig. 3. (a) AR-based SIP outgoing call setup procedures. (b) AR-based SIP incoming call setup procedures.

In the modified call procedure, the AR enable indication is delivered using a session description protocol (SDP) capability exchange message. With this scheme, setting up a video call takes more time because it requires an AR object request and reception at the sending side. To reduce this time, we present a pre-fetching scheme. In this scheme, the objects to be augmented are determined during call setup and transferred to the mobile device in advance, so that users perceive the objects as being augmented in real time. As mentioned earlier, the call setup time of the proposed system is longer than that of the original SIP. The call setup time can be calculated using Eq. (1).

##### (1)
$$\mathrm{SIP}_{\mathrm{call\_setup\_time}} = \mathrm{INVITE}_{t} \times \mathrm{INVITE}_{n} + \mathrm{OK}_{t} \times \mathrm{OK}_{n} + \mathrm{Object\_Augmentation}_{t}$$

Here, $\mathrm{INVITE}_{t}$ represents the time to transmit the SIP INVITE message and $\mathrm{INVITE}_{n}$ represents the number of INVITE messages transmitted to set up the call. This number can exceed one because the INVITE message is retransmitted to recover lost packets when the network condition worsens. In Eq. (1), $\mathrm{OK}_{t}$ represents the time to transmit the call connection (OK) message and $\mathrm{OK}_{n}$ is the number of OK messages transmitted, including retransmissions. As with INVITE, the OK message is retransmitted to recover lost packets when the network condition worsens. $\mathrm{Object\_Augmentation}_{t}$ represents the time required to augment a target object onto the sending video frame. Traffic distribution can be modeled in several ways. One way is to model it using an exponential distribution, as shown in Eqs. (2) and (3).

##### (2)
$$\mathrm{ARSIP}_{\mathrm{pdf}}(x) = c\,e^{-cx}\,a(x)$$

##### (3)
$$\mathrm{ARSIP}_{\mathrm{cdf}}(x) = \left[1 - e^{-cx}\right] a(x)$$

Eq. (2) gives the probability density function, and Eq. (3) gives the cumulative distribution function used to model the AR-based SIP call setup traffic. Here, c is the rate parameter of the exponential density. The function a(x) is added to account for the time taken to augment the target object. The VTAR block ensures that the changed call connection procedure is performed when a user makes an AR-based video call.
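Eqs. (1) to (3) can be evaluated numerically with a short Python sketch such as the one below. The function names are hypothetical, all numeric inputs in any call are illustrative rather than measured, and because the text does not give a concrete form for a(x), the default a(x) = 1 is a placeholder assumption.

```python
import math
from typing import Callable


def sip_call_setup_time(invite_t: float, invite_n: int,
                        ok_t: float, ok_n: int,
                        object_augmentation_t: float) -> float:
    """Eq. (1): INVITE_t * INVITE_n + OK_t * OK_n + Object_Augmentation_t."""
    return invite_t * invite_n + ok_t * ok_n + object_augmentation_t


def arsip_pdf(x: float, c: float,
              a: Callable[[float], float] = lambda x: 1.0) -> float:
    """Eq. (2): c * exp(-c * x) * a(x); a(x) = 1 is a placeholder."""
    return c * math.exp(-c * x) * a(x)


def arsip_cdf(x: float, c: float,
              a: Callable[[float], float] = lambda x: 1.0) -> float:
    """Eq. (3): (1 - exp(-c * x)) * a(x); a(x) = 1 is a placeholder."""
    return (1.0 - math.exp(-c * x)) * a(x)
```

For instance, with one retransmitted INVITE (two transmissions of 2 ms each), a single 1 ms OK, and 10 ms of augmentation, `sip_call_setup_time(2.0, 2, 1.0, 1, 10.0)` evaluates to 15 ms, which shows how the augmentation term dominates unless the object is fetched in advance.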

The proposed idea can be applied to searching for products in commercial stores. Consider a bookstore, for example. Many people find it difficult to locate the book they want because a store holds numerous books, and bookstores aim to stock as many books as possible in a single store. A few stores provide a separate search system to help customers find books easily. However, customers who are unfamiliar with such a system may find it uncomfortable and time-consuming to use. Furthermore, to read the abstract or summary of a book, they have to open each book or look it up on the Internet. Occasionally, the required book is located on a high shelf; if no store clerk is available, the book cannot be reached. The proposed system could be a suitable solution for such inconvenient situations. Future work will involve implementing the entire system and performing the complete experiment.

## 3. Evaluation

To discuss the feasibility of the proposed approach, our solution is qualitatively compared with other video communication schemes. In particular, video communication is indispensable in the medical and educational areas. Telemedicine is a representative video application in the medical industry, and Zoom and Webex are widely used in education. Additional information transmission is not indispensable in these areas. Table 1 presents the results of the comparison.

From the point of view of additional transfer content, all solutions support text and image transfer during peer-to-peer video communication. Interestingly, the telemedicine framework can also transmit biodata to a central server in real time: patient biodata such as pressure, temperature, and pulse are collected to check the patient’s status, encapsulated, and transferred together with the video data. All other solutions support the transfer of text, images, and additional file data. For the signaling protocol, Zoom and Webex use proprietary protocols; that is, they created their own protocols for setting up and tearing down video calls. In this case, it is difficult to guarantee compatibility with other video communication systems. The telemedicine framework uses the H.245 protocol specified by the ITU-T (International Telecommunication Union-Telecommunication Standardization Sector). This protocol is used as part of 3G-324M, which is specified by the 3GPP (Third Generation Partnership Project) standards group. In the proposed approach, it is assumed that RFC 3261, specified by the IETF, is used to handle call setup and teardown. Which protocol is better for transferring additional content depends on the objective of the application. For mobile video calls, 3G-324M is preferred; in wired networks, however, RFC 3261 is better. One disadvantage of standard protocols such as RFC 3261 is that they are vulnerable to security attacks.

A comparison between proposed approach and existing video communication system

For traffic control, the telemedicine framework uses the H.223 protocol and RTP. H.223 multiplexes and demultiplexes audio, video, and data so that they stay synchronized with each other. RTP delivers the multiplexed packets to the receiver in real time. The main function of the protocol is to recover lost or erroneous packets. SIP uses RTP to deliver video and audio traffic in the same manner; the difference is that SIP does not specify a separate multiplexing protocol. Zoom and Webex developed proprietary protocols for transferring video and audio data. For additional data transfer, the telemedicine framework uses the H.245 protocol, whereas the proposed approach uses RTP and SDP. The header of an SDP packet is longer than that of H.223. Zoom and Webex use HTTPS to transmit additional data. In this protocol, data are carried over HTTP and encrypted using transport layer security (TLS) to enhance security.

For context provision, only the proposed approach fully supports it, by providing IAR. The telemedicine framework partially supports context-based data transfer through the H.245 protocol. In that system, location-based context information is provided with the biomedical data and is used mainly by an emergency ambulance or medical helicopter to find the patient and respond immediately to an emergency. Zoom and Webex do not support this function because they were developed for online meetings rather than emergency situations.

To quantitatively evaluate the feasibility, we conducted an experiment to measure the call setup time for conventional SIP and for AR object prefetching. In the AR object prefetching scheme, the augmented objects are fetched in advance from the object store server, which reduces the time spent on data augmentation during video communication. For this experiment, we used three desktop computers, PC1, PC2, and PC3, where PC1 and PC2 served as the SIP UAC and UAS, respectively, and PC3 served as the AR object storage server. Details of the specifications are presented in Table 2.

Prototype system specifications

In this experiment, we first attempted to use an open-source SIP stack (version 3.0) that is widely used in the public domain, to reduce the implementation time. However, changing the stack to integrate AR object requests and transmissions took a significant amount of time. Therefore, we implemented a simple network application using a socket interface. To simplify the experiment, only the INVITE and 200 OK messages were implemented; other messages, such as 180 RINGING, ACK, and BYE, were not. In addition, we used a peer-to-peer call setup scheme in which no proxy server is involved. The test scenario is shown in Fig. 4.
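A minimal Python sketch of such a socket-based exchange is shown below. It is not the application used in the experiment: the use of UDP on the loopback interface, the function names, and the abbreviated message strings (which omit the headers a real SIP message requires) are all assumptions made to keep the example self-contained.

```python
import socket
import threading
import time
from typing import Tuple


def run_uas(sock: socket.socket) -> None:
    """Answer a single INVITE with 200 OK (no 180 RINGING, ACK, or BYE)."""
    data, addr = sock.recvfrom(2048)
    if data.startswith(b"INVITE"):
        sock.sendto(b"SIP/2.0 200 OK\r\n\r\n", addr)


def measure_call_setup(uas_addr: Tuple[str, int]) -> float:
    """Send one INVITE from the UAC; return seconds until 200 OK arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as uac:
        uac.settimeout(2.0)
        start = time.perf_counter()
        # Abbreviated request line only; example.invalid is a placeholder host.
        uac.sendto(b"INVITE sip:uas@example.invalid SIP/2.0\r\n\r\n", uas_addr)
        reply, _ = uac.recvfrom(2048)
        elapsed = time.perf_counter() - start
    if not reply.startswith(b"SIP/2.0 200"):
        raise RuntimeError("unexpected response")
    return elapsed


def demo() -> float:
    """Run UAC and UAS peer-to-peer on localhost, without a proxy server."""
    uas = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    uas.bind(("127.0.0.1", 0))  # let the OS pick a free port
    worker = threading.Thread(target=run_uas, args=(uas,), daemon=True)
    worker.start()
    try:
        return measure_call_setup(uas.getsockname())
    finally:
        worker.join(timeout=2.0)
        uas.close()
```

Calling `demo()` returns the elapsed time of one INVITE/200 OK round trip in seconds; repeating it and summing the results would give a total comparable in spirit to the per-batch totals reported later, although loopback timings are not comparable to timings between separate PCs.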

Furthermore, to simulate AR object request and transmission, we implemented a simple application in which the UAC requests an AR object and receives it from the AR object server. The objects used are not real AR objects but simple images whose sizes range from VGA to large HD. The images are stored on a separate server, which waits for client requests. When the server receives a request, it randomly selects one image file and transmits it to the receiver. Therefore, the main delay is the transmission time of the requested object (stored image). In this experiment, we did not implement the entire system because doing so would take a large amount of time; rather, we aimed to establish its feasibility. The experimental network is shown in Fig. 5.
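The server-side behavior described above, together with the reason the delay depends on image size, can be sketched in Python as follows. The in-memory store, the function names, and the `link_mbps` parameter are hypothetical; the link speed of the experimental network is not stated in the text, and the estimate ignores latency and protocol overhead.

```python
import random
from typing import Dict, Tuple


def choose_object(store: Dict[str, bytes],
                  rng: random.Random) -> Tuple[str, bytes]:
    """Pick one stored image at random, as the AR object server does."""
    name = rng.choice(sorted(store))
    return name, store[name]


def estimated_transfer_ms(size_bytes: int, link_mbps: float) -> float:
    """Idealized transfer time in ms for a payload over a given link speed.

    link_mbps is an assumed parameter (megabits per second).
    """
    return size_bytes * 8 * 1000 / (link_mbps * 1_000_000)
```

Under these assumptions the transfer time grows linearly with the size of the selected image, so a randomly chosen large image produces a visibly longer setup time than a small one.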

Using this application, we measured the SIP call setup time. The measurements were divided into two categories. First, we measured the call setup time using the application without AR object prefetching. Next, we checked the time using the application with AR object prefetching. The AR object to be pre-fetched is chosen randomly; therefore, the transmission time of the object depends on the image size. The call setup time results are presented in Fig. 6.

Fig. 4. (a) Message sequence for measuring SIP call setup time without AR object pre-fetching. (b) Message sequence for measuring SIP call setup time with AR object pre-fetching.

Fig. 5. Networks and components set up for experiment.

Fig. 6. A comparison of call setup time between conventional SIP and AR object prefetching.

When calculating the results, the sleep time was subtracted from the total call setup time. For example, for 50 calls in the conventional SIP scheme, 50 ms are subtracted from the total call setup time of 55.21 ms, so the final total call setup time is 5.21 ms. The results show that the proposed scheme has a longer call setup time than the conventional scheme. This is expected because, before setting up the SIP call, it sends an additional request for an AR object and receives the requested object. When the AR object store server receives a request, it randomly chooses one image (object) from the stored image files. We also observe that the call setup time of the object prefetching scheme jumps from 29.23 ms to 46.65 ms as the number of calls goes from 200 to 250. The reason is that when a large image is selected at the object store server, the time taken to receive it increases dramatically.
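The sleep-time correction can be written as a one-line Python helper. The function name is hypothetical, and the default of 1 ms of sleep per call is an inference from the example above (50 ms subtracted for 50 calls), not a value stated explicitly in the text.

```python
def net_call_setup_ms(total_ms: float, n_calls: int,
                      sleep_ms_per_call: float = 1.0) -> float:
    """Subtract the accumulated sleep time from the measured total.

    sleep_ms_per_call = 1.0 is assumed from the 50-call example.
    """
    return round(total_ms - n_calls * sleep_ms_per_call, 2)
```

With the figures from the example, `net_call_setup_ms(55.21, 50)` gives 5.21 ms; rounding to two decimals matches the precision of the reported values and hides floating-point noise.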

## 4. Conclusion

COVID-19 and climate change are changing human society, and people have to adapt to the changing environment to survive. People will suffer, but nature will be revived in proportion to that suffering. One way to reduce carbon emissions is to minimize face-to-face contact between people. To this end, online conferencing systems will be used more and more widely. In this study, we proposed a new AR-based scheme to overcome the inconvenience of delivering additional information in SIP-based video calls. We evaluated the approach qualitatively by comparing it with other solutions, and quantitatively by measuring the call setup time using a simple call generation application. Future work is to implement the whole system to evaluate the approach fully.

## Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (No. NRF-2018R1D1A1B07045589, NRF-2021R1F1A106406911).

## Biography

##### Sung-Bong Jang
https://orcid.org/0000-0003-3187-6585

He received his B.S., M.S., and Ph.D. degrees from Korea University, Seoul, Korea in 1997, 1999, and 2010, respectively. He worked at the Mobile Handset R&D Center, LG Electronics from 1999 to 2012. Currently, he is an associate professor in the Department of Industry-Academy, Kumoh National Institute of Technology in Korea. His interests include augmented reality, big data privacy, and prediction based on artificial neural networks.

## Biography

##### Young-Woong Ko
https://orcid.org/0000-0002-6292-0799

He received both an M.S. and a Ph.D. in computer science from Korea University, Seoul, Korea, in 1999 and 2003, respectively. He is now a professor in the Department of Computer Engineering, Hallym University in Korea. His research interests include operating systems, embedded systems, and multimedia systems.

## References

[1] Y. Shi, "Research of the development of distance learning under the COVID-19 circumstances based on video conferencing software and MOOCs," in Proceedings of 2021 2nd International Conference on Education, Knowledge and Information Management (ICEKIM), Xiamen, China, 2021, pp. 154-158.
[2] M. Schmidtner, C. Doering, H. Timinger, "Agile working during COVID-19 pandemic," IEEE Engineering Management Review, vol. 49, no. 2, pp. 18-32, 2021.
[3] S. H. Islam, P. Vijayakumar, M. Z. A. Bhuiyan, R. Amin, B. Balusamy, "A provably secure three-factor session initiation protocol for multimedia big data communications," IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3408-3418, 2018. doi:10.1109/JIOT.2017.2739921
[4] X. Y. Guo, D. Z. Sun, Y. Yang, "An improved three-factor session initiation protocol using Chebyshev chaotic map," IEEE Access, vol. 8, pp. 111265-111277, 2020.
[5] A. Montazerolghaem, M. H. Y. Moghaddam, A. Leon-Garcia, "OpenSIP: Toward software-defined SIP networking," IEEE Transactions on Network and Service Management, vol. 15, no. 1, pp. 184-199, 2018. doi:10.1109/TNSM.2017.2741258
[6] J. Cavaleri, R. Tolentino, B. Swales, L. Kirschbaum, "Remote video collaboration during COVID-19," in Proceedings of 2021 32nd Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC), Milpitas, CA, 2021.
[7] R. Gupta, "Hybrid-flipped class room approach for fashion design students: mitigating impacts to learning activities due to emergence of COVID-19," in Proceedings of 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2020, pp. 1-6.
[8] R. Portillo, A. Alvarado, "Plenary: real-time transformation of the freshmen mathematics engineering courses during COVID-19 outbreak," in Proceedings of 2021 IEEE World Conference on Engineering Education (EDUNINE), Guatemala City, Guatemala, 2021, pp. 1-2.
[9] S. Melenli, A. Topkaya, "Real-time maintaining of social distance in covid-19 environment using image processing and big data," in Trends Data Engineering Methods for Intelligent Systems. Cham, Switzerland: Springer, 2020, pp. 1-5.
[10] P. Spathis, R. Dey, "Online teaching amid COVID-19: the case of zoom," in Proceedings of 2021 IEEE Global Engineering Education Conference (EDUCON), Vienna, Austria, 2021, pp. 1398-1406.
[11] V. Mamone, V. Ferrari, S. Condino, F. Cutolo, "Projected augmented reality to drive osteotomy surgery: implementation and comparison with video see-through technology," IEEE Access, vol. 8, pp. 169024-169035, 2020.
[12] H. Si-Mohammed, J. Petit, C. Jeunet, F. Argelaguet, F. Spindler, A. Evain, N. Roussel, G. Casiez, A. Lecuyer, "Towards BCI-based interfaces for augmented reality: feasibility, design and evaluation," IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 3, pp. 1608-1621, 2020.
[13] A. Marto, M. Melo, A. Gonçalves, M. Bessa, "Multisensory augmented reality in cultural heritage: impact of different stimuli on presence, enjoyment, knowledge and value of the experience," IEEE Access, vol. 8, pp. 193744-193756, 2020.
[14] J. Cejka, M. Mangeruga, F. Bruno, D. Skarlatos, F. Liarokapis, "Evaluating the potential of augmented reality interfaces for exploring underwater historical sites," IEEE Access, vol. 9, pp. 45017-45031, 2021.
[15] L. Zhang, Z. Wei, W. Ren, X. Zheng, K. K. R. Choo, N. Xiong, "SIP: an efficient and secure information propagation scheme in e-health networks," IEEE Transactions on Network Science and Engineering, vol. 8, no. 2, pp. 1502-1516, 2021.
[16] S. B. Jang, Y. W. Ko, "An efficient object augmentation scheme for supporting pervasiveness in a mobile augmented reality," Journal of Information Processing Systems, vol. 16, no. 5, pp. 1214-1222, 2020.

Table 1.

A comparison between proposed approach and existing video communication system
| Comparison category | Telemedicine framework | Proposed approach | Zoom | Webex |
|---|---|---|---|---|
| Additional transfer contents | Text, Image, Biodata | Text, Images | Text, Images | Text, Images |
| Signaling protocol | H.245 | RFC 3261 | Proprietary protocol | Proprietary protocol |
| Traffic control | H.223 and RTP | RTP | Proprietary protocol | Proprietary protocol |
| Additional data transfer protocol | H.245 | RTP and SDP | HTTPS | HTTPS |
| Context provision | Partially provided | Provided | Not provided | Not provided |
| Augmented data transfer | RFC | SDP | HTTPS | HTTPS |

Table 2.

Prototype system specifications
| Experiment devices | CPU | Display (inch) | Memory (GB) | OS |
|---|---|---|---|---|
| PC1 (UAC) | Intel Core i7-3770K CPU 3.5 GHz | LCD 15 | 16.0 | Windows 10 Pro |
| PC2 (UAS) | Intel Core i7-4790 CPU 3.6 GHz | LCD 15 | 4.0 | Windows 10 Pro |
| PC3 (AR server) | Intel Core i5-8500 CPU 3.0 GHz | LCD 15 | 16.0 | Windows 10 Pro |