1. Introduction
Thanks to advances in wearable technology, human activity recognition from egocentric vision, often referred to as first-person vision, offers considerable research potential. These advances make it possible to detect the surroundings and the subject's activities from his or her viewpoint. Fig. 1 presents some examples of wearable equipment.
Fig. 1. Examples of wearable equipment.
This type of recognition aims to identify which action is being performed in a given egocentric video segment, a task useful in a variety of settings such as mobile/ambient assisted living, personal health care assistance, human-computer interaction, industrial environments, surveillance systems, and smart buildings. It can also be used to locate visitors to a cultural or natural outdoor site, analyze their activity automatically to better understand their preferences, and inform them about where they are and what they can view.
For this purpose, a variety of methodologies have been developed. They may be classified into two categories: machine learning algorithms and neural network techniques [1,2]. In summary, the first category includes decision trees, support vector machines, hidden Markov models, and the k-nearest neighbor method. The second includes artificial neural networks, recurrent neural networks (RNNs), and convolutional neural networks (CNNs), the latter being the most widely used deep learning algorithm. Owing to the advent of big data and increasingly powerful computing hardware, powerful but data-intensive deep learning algorithms have overtaken most other methods.
That is why, in this paper, we apply deep learning to egocentric human activity recognition. Deep learning tends to keep improving with large volumes of data, whereas more traditional machine learning models, which are powerful tools that can, in particular, recognize images by automatically assigning to each input a label corresponding to its class, stop improving after a saturation point.
Our investigations were conducted in the context of kitchens, in particular on the Carnegie Mellon University Multi-Modal Activity (CMU-MMAC) database [3]. We collected data from only the ego-vision videos, in which subjects were captured cooking five different recipes: brownies, pizza, sandwich, salad, and scrambled eggs (Fig. 2).
Fig. 2. Egocentric video database frames from the CMU-MMAC dataset (http://kitchen.cs.cmu.edu/main.php): preparing brownie (a), pizza (b), sandwich (c), salad (d), and scrambled eggs (e).
Our proposed method is simple and efficient. It improves on the accuracy obtained by state-of-the-art approaches on the same database [3], whether they use only egocentric videos or combine egocentric videos with inertial measurement units (IMUs).
This paper is organized as follows: Section 2 discusses the state of the art, Section 3 summarizes the methods used, and Section 4 describes the dataset. Sections 5 and 6 present the experiments and future objectives, respectively. Finally, Section 7 draws some conclusions.
2. Related Work
Widely studied in previous research, egocentric action recognition uses a variety of sensor modalities. In this section, we give a non-exhaustive summary of previously published works in chronological order.
In [4], the authors used a wearable camera and IMUs from the CMU-MMAC database [3] to investigate first-person perception. They performed supervised and unsupervised temporal segmentation of human motion into actions and classified activities. Fathi et al. [5] showed that jointly modeling activities, actions, and objects yields better performance than analyzing them separately. Later, in [6], using two new datasets containing egocentric videos of daily activities together with gaze, they showed improvements in action recognition rates and gaze prediction accuracy compared to state-of-the-art approaches. In [7], the authors developed several models of daily activities based on object-centric representations.
Afterward, Ryoo and Matthies [8] investigated multichannel kernels as a way to combine global and local motion information, describing a new activity learning/recognition approach that takes into account the temporal structures present in first-person activity videos. Next, Song et al. [9] used Google Glass to create an egocentric video dataset called LENA (Life-logging EgoceNtric Activities). They used LENA to evaluate state-of-the-art activity recognition methods and examined how popular descriptors perform in egocentric activity recognition. Later, in [10], the authors used a bilinear maximum-margin model to learn camera importance factors that maximize action prediction accuracy. Ryoo et al. [11] introduced a model for temporally pooling features in order to recognize egocentric actions and, like [10], used the CMU-MMAC database [3]. In [12], the authors evaluated how different egocentric cues (such as gaze, the presence of hands, objects, and head movement) can be employed to perform the task.
Thereafter, Ma et al. [13] created a deep learning architecture that combines various egocentric features to identify actions. Song et al. [14], to address the egocentric activity recognition challenge, suggested combining video with temporally enhanced sensor features using the Fisher kernel framework, and later proposed, in [15], a multimodal multi-stream deep learning system that uses both video and sensor data. Singh et al. [16] proposed CNNs for classifying the wearer's actions, capturing hand pose, head motion, and saliency maps as egocentric cues. Moreover, in [17], the authors explored CNNs and temporal segment networks, using hand movements and the object being manipulated to analyze first-person actions.
Furthermore, Khalid et al. [18] began by surveying existing egocentric datasets. The authors then incorporated Swain's distance into a dynamic time warping method and used it to construct an algorithm that employs visual lifelogs to automatically classify daily activities. Singh et al. [19] used improved dense trajectories to address the difficulty of recognizing egocentric actions.
In another area, Liu et al. [20] applied a beam search to recognize the fluent item in each frame concurrently, the procedure then being repeated for the duration of the video. Possas et al. [21] developed a model-free reinforcement learning technique for learning energy-aware policies that maximize the use of low-energy-cost predictors while maintaining competitive accuracy. They demonstrated that a policy trained on an egocentric dataset can efficiently trade off energy expenditure and accuracy by exploiting the synergy between motion and vision sensors. Li et al. [22] introduced a novel deep model for simultaneous gaze estimation and egocentric action identification.
In [23], the authors developed a spatial attention mechanism that allows the CNN to focus on regions containing objects related to the activity, before using them for spatiotemporal encoding of the video with a long short-term memory (LSTM) network. Later, in [24], they proposed long short-term attention as a technique for focusing on features from relevant spatial parts while attention is tracked smoothly over a video sequence. In [25], a multimodal fusion architecture was proposed and trained end to end, outperforming the individual modalities and late fusion of modalities. Thereafter, Lu and Velipasalar [26], employing 10 videos representing five different subjects (two videos per subject) for training and testing, developed and implemented a genetic algorithm-based method for optimizing multiple parameters of their network architecture autonomously and simultaneously. They used the CMU-MMAC database [3].
On the other hand, Diete and Stuckenschmidt [27] investigated the transfer of deep learning models from vision to activity recognition and object detection by combining inertial and video features. In [28], the authors addressed the issue of egocentric action anticipation, presenting the rolling-unrolling (RU) LSTM as a learning architecture for anticipating actions from egocentric videos. In the same context, Rodin et al. [29] proposed ideas on how to improve the quality of predictions and reviewed current approaches for action anticipation from egocentric video; they proposed extending the RU-LSTM model [28] by introducing and benchmarking different changes based on the objectives stated in their paper.
Besides, Min and Corso [30] presented a probabilistic method for integrating human gaze into spatiotemporal attention for egocentric activity recognition. In another direction, Ragusa et al. [31] proposed a new dataset named "MECCANO", establishing the egocentric human-object interaction (EHOI) detection task and conducting baseline experiments to demonstrate the dataset's potential; the dataset focuses on exploring EHOIs in an industrial setting. In [2], the authors used first-person camera data from the CMU-MMAC database [3] and, considering only three actions (or recipes: brownies, scrambled eggs, and sandwiches) instead of all five present in this database, applied deep learning to extract and recognize features. The methods used in this research are described in the following section.
3. Methods Used
In this work, we exploit a specific type of deep learning, the CNN. It is considered one of the most efficient deep learning algorithms because of its performance in image classification and action recognition [32]. A brief description of the CNN follows.
3.1 Convolutional Neural Network
A deep CNN model is composed of a number of processing layers that learn different characteristics of the input data (for example, an image). These layers are illustrated in Fig. 3.
Fig. 3. Example of CNN processing for image classification of a brownie preparation.
3.1.1 Layers in CNN
A CNN is composed of three basic types of layers, each performing a different function to transform one volume into another: the convolutional layer, the pooling layer, and the flattening and fully connected layers.
Convolutional layer
The most crucial layer in any CNN design is the convolutional layer. It consists of a set of convolutional kernels (also known as filters) that are convolved with the input image (an N-dimensional matrix) through a simple mathematical convolution to produce an output feature map.
The convolutional layer is characterized by the following hyperparameters: the first is the size and number of filters; the second is the stride "S", the step with which the filter window is slid over the image; and the third is the zero-padding "P". For this last hyperparameter, padding is needed to preserve information at the borders of the input image: a black outline (gray level = 0) of thickness "P" pixels is added around the input image. Without padding, features at the borders are discarded too quickly.
Fig. 4. Example of 2D convolution with no padding of the input image and a kernel stride of 1.
Fig. 4 presents an example of 2D convolution with no padding of the input image and a kernel stride of 1, while Fig. 5 shows an example of 2D convolution with zero-padding P = 1 of the input image and a kernel stride of 3.
Fig. 5. Example of 2D convolution with zero-padding P = 1 of the input image and a kernel stride of 3.
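To make the convolution arithmetic of Figs. 4 and 5 concrete, the following minimal NumPy sketch (an illustration under our own assumptions, not the implementation used in this work) applies a single filter with a configurable stride S and zero-padding P:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """2D convolution of a single-channel image with one filter,
    using stride S and zero-padding P as described above."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1   # output height
    ow = (iw - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3)) / 9.0
# No padding, stride 1 (cf. Fig. 4): a 5x5 input gives a 3x3 feature map.
print(conv2d(img, k, stride=1, padding=0).shape)   # (3, 3)
# Zero-padding P=1, stride 3 (cf. Fig. 5): output size (5 - 3 + 2*1)//3 + 1 = 2.
print(conv2d(img, k, stride=3, padding=1).shape)   # (2, 2)
```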
Pooling layer
After the convolution operations, the pooling layer is used to sub-sample the output feature maps in order to reduce the size of the convolved features. This is useful for obtaining dominant features that are invariant to position and rotation [33]. This layer has two hyperparameters: the pool size "F", used to split the image into square cells of F×F pixels, and the stride, defined as a vector of two positive integers [a b], where "a" is the vertical step size and "b" is the horizontal step size. The stride can also be set as a scalar when the layer is created, to use the same step size in both the vertical and horizontal dimensions. An example of max pooling is illustrated in Fig. 6. In this case, the pooling operation replaces all values in each 2×2 cell by the maximum value under the mask.
Fig. 6. Example of max pooling with a stride value of 2 and pool size F = 2.
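The max pooling operation of Fig. 6 can be sketched as follows (a hypothetical NumPy illustration with pool size F = 2 and stride 2, not the code used in this work):

```python
import numpy as np

def max_pool(feature_map, pool_size=2, stride=2):
    """Replace each pool_size x pool_size cell by its maximum value."""
    h, w = feature_map.shape
    oh = (h - pool_size) // stride + 1
    ow = (w - pool_size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            cell = feature_map[i * stride:i * stride + pool_size,
                               j * stride:j * stride + pool_size]
            out[i, j] = cell.max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [3, 4, 1, 8]], dtype=float)
print(max_pool(fm))   # [[6. 4.]
                      #  [7. 9.]]
```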
Flattening and fully connected layers
Flattening and fully connected layers form the last part of every CNN architecture (Fig. 7). Flattening converts the data into a one-dimensional array so that they can be fed into the fully connected layers. The term fully connected means that every neuron in a layer is linked to every neuron of the preceding one. The final fully connected layer, which is the output of the CNN, acts as the classifier: each of its neurons assigns to the image a probability of belonging to one of the possible classes.
Fig. 7. Example of flattening and fully connected layers.
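As a minimal illustration of the flattening and fully connected stages (a hypothetical Keras sketch whose feature-map shape and layer sizes are arbitrary assumptions, not values taken from this work), the final dense layer with a softmax activation outputs one probability per recipe class:

```python
import tensorflow as tf

# Hypothetical classifier head: flatten the last feature maps, then two
# fully connected layers, the last one producing 5 class probabilities
# (one per recipe) via softmax.
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 64)),   # assumed feature-map shape
    tf.keras.layers.Flatten(),                 # 8*8*64 -> 4096-element vector
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])
head.summary()
```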
3.1.2 ReLU activation function
The rectified linear unit (ReLU) activation function (Fig. 8) is widely used in convolutional neural networks, between the convolutional and pooling layers, because it requires less computation than other activation functions used in this field.
In our proposed CNN deep learning method, we employ the ReLU activation function given by Eq. (1):

f(x) = max(0, x). (1)
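A one-line NumPy check of Eq. (1), applied elementwise to a feature map (illustrative only):

```python
import numpy as np

def relu(x):
    """Eq. (1): f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```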
3.1.3 Epoch
An epoch is defined as one complete pass through the entire training dataset. Although there is no guarantee that increasing the number of epochs will improve network convergence, training a CNN generally takes several epochs. Each epoch is a way to revisit the training data and readjust the parameters of the model.
Fig. 8. ReLU activation function.
3.1.4 Evaluation metric
As the evaluation metric, we employ recognition accuracy. It summarizes how the model performs across all classes and is appropriate when all classes are equally important. It is the ratio of the number of correct predictions to the total number of predictions.
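A minimal sketch of this metric (a hypothetical helper, not the evaluation code used in this work):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Recognition accuracy: correct predictions / total predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true == y_pred)

# Example with the five recipe classes encoded as 0..4
print(accuracy([0, 1, 2, 3, 4, 0], [0, 1, 2, 3, 0, 0]))  # 5 correct of 6 ~ 0.833
```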
The dataset utilized in this paper is described in the following section.
4. The Dataset Utilized
In our experiments, we use the CMU-MMAC database [3], which contains multimodal measurements of human activity for persons performing actions related to food preparation and cooking. This database was recorded in the Motion Capture Lab at Carnegie Mellon University, where 25 subjects were captured preparing five different recipes: sandwich, salad, brownie, pizza, and scrambled eggs (Fig. 2).
Video, audio, motion capture, and inertial measurements were recorded using cameras, microphones, a Vicon motion capture system, and wired/Bluetooth IMUs, respectively. A BodyMedia device and an eWatch were employed as wearable devices. The detailed characteristics of each piece of equipment are given in [3]. The database includes a main dataset in which subjects cook the five recipes, as well as an auxiliary dataset containing anomalous situations, in which three subjects cook while some atypical events occur (falling dishes, fire and smoke, distractions, etc.).
In the proposed method, we exploit only the first-person video from the main dataset, for each subject cooking the five different recipes cited above.
The next section presents our experiments.
5. Experiments
In this section, we first describe the materials and software employed in this study, the pre-processing of the CMU-MMAC dataset, and the architecture of the proposed deep learning CNN model, before giving the specifications of the training options. Results and discussion then follow.
5.1 Materials and Software
In the following, we present the characteristics of the computer, the numerical computing and programming platform, and the video frame extraction tool [34] used in this work. Table 1 summarizes these characteristics.
Table 1. Materials and software used
5.2 Pre-processing of CMU-MMAC Dataset
From the CMU-MMAC database [3], we generate a new dataset containing five labels or classes, namely sandwich, salad, brownie, pizza, and scrambled eggs, corresponding to the five prepared recipes.
Under each label, we place the ego-videos of the different subjects recorded while cooking. Then, we use the Free Video to JPG Converter [34], a software tool dedicated to extracting frames from videos (Fig. 9). The process performs temporal sampling of each video, where the sampling period is a parameter to be chosen; the total number of frames obtained from each ego-video provides the necessary amount of information. In our case, we take one frame every half second.
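This work uses the Free Video to JPG Converter for this step; as an illustrative alternative only (a sketch assuming an OpenCV installation, with hypothetical folder and file names), the same half-second temporal sampling could be scripted as follows:

```python
import os
import cv2  # OpenCV

def sample_frames(video_path, out_dir, period_s=0.5):
    """Extract one frame every `period_s` seconds from an egocentric video."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # fall back if FPS is unknown
    step = max(int(round(fps * period_s)), 1)    # frames between two samples
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Hypothetical layout: one folder per recipe label.
# sample_frames("videos/brownie/subject_07.avi", "dataset/brownie")
```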
Finally, we recognize the activity being carried out, and therefore which recipe is being prepared, by using the proposed CNN, described in the next section, to classify every test input image into its corresponding class.
Fig. 9. Free Video to JPG Converter interface (https://www.dvdvideosoft.com/products/dvd/Free-Video-to-JPG-Converter.htm).
5.3 Architecture of the Proposed Deep Learning CNN Model
Fig. 3 illustrates the architecture of our proposed deep learning CNN model. It is composed of four convolutional layers and four max pooling layers. Table 2 summarizes the corresponding hyperparameter values.
Table 2. Architecture details of the proposed deep learning CNN model
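For concreteness, a Keras sketch of a comparable four-stage architecture is given below. The input resolution, filter counts, and kernel sizes are placeholder assumptions (the actual values are those of Table 2), and the platform used in this work may differ.

```python
import tensorflow as tf

NUM_CLASSES = 5  # sandwich, salad, brownie, pizza, scrambled eggs

# Four convolution + max pooling stages followed by flattening and a
# fully connected softmax classifier. All hyperparameter values below
# are illustrative placeholders, not those of Table 2.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.summary()
```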
5.4 Specifications of the Training Options for the Proposed Deep Learning CNN Model
We create a set of options for training the network using stochastic gradient descent with momentum (SGDM) [35,36]. This method helps the network accelerate the gradient vectors in the right directions and avoid local minima.
Before giving the corresponding values of the training options in Table 3, we present some definitions [35] (an illustrative training-configuration sketch follows Table 3):
· Initial learning rate: a positive scalar. If it is too low, training will take a long time; if it is too high, training may produce unsatisfactory results.
· Mini-batch size: a positive integer defining the size of the mini-batch used at each training iteration. A mini-batch is a subset of the training set used to compute the gradient of the loss function and update the weights.
· Shuffle: the data shuffling option, which can be one of the following:
"once": shuffle the training and validation data once, before training.
"never": do not shuffle the data.
"every-epoch": shuffle the training data before each training epoch and the validation data before each network validation. To avoid discarding the same data every epoch, set the shuffle option to "every-epoch".
· Validation frequency: a positive integer representing the frequency of network validation, expressed in number of iterations.
Table 3. Training option values of the proposed method
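For illustration, the snippet below sketches how such training options might be set up in Python/Keras, an assumed environment that is not necessarily the platform used in this work. The numeric values are placeholders (the actual ones are given in Table 3), the folder name is hypothetical, and the snippet reuses the hypothetical `model` from the sketch in Section 5.3.

```python
import tensorflow as tf

# Placeholder training options mirroring the definitions above
# (the actual values are those of Table 3).
INITIAL_LR = 0.01
MINI_BATCH = 32
EPOCHS = 30

# Assumed layout: dataset/<label>/*.jpg, with a 90%/10% train/validation split.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", validation_split=0.1, subset="training", seed=1,
    image_size=(128, 128), batch_size=MINI_BATCH, shuffle=True)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", validation_split=0.1, subset="validation", seed=1,
    image_size=(128, 128), batch_size=MINI_BATCH)

# Stochastic gradient descent with momentum (SGDM).
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=INITIAL_LR, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

# Keras shuffles the training data every epoch by default, analogous to the
# "every-epoch" shuffle option; here validation runs once per epoch rather
# than every fixed number of iterations.
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
```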
5.5 Results
To test the performance of our CNN deep learning model, we considered, for the preparation of each recipe, one, two, three, four, five, and six subjects. For each case, we varied the number of epochs (20, 30, 40, 50, 60, and 70) and the percentage of training images (80%, 85%, and 90%). The egocentric activity recognition accuracy for these different cases is shown in Table 4.
These results will be discussed in the next section.
5.6 Discussion
One can see from Table 4 that a maximum accuracy of 99.41% is reached at 30 epochs in the case of one subject per recipe with 90% of the images used for training. For the remaining cases, the maximum accuracy varies between 96.45% and 99.13%, depending on the number of epochs, which ranges from 30 to 70, and on the percentage of training images considered.
This means that our proposed method largely overcomes the variability problem in action execution. Indeed, subjects in the CMU-MMAC database [3] were not given instructions on how to prepare the recipes [4].
Our model remains competitive and efficient regardless of the number of subjects considered and for all the recipes present in this database [3]. For comparison purposes, we selected related works that used the CMU-MMAC dataset and the same evaluation metric, the accuracy rate.
Table 4. Accuracy of the proposed method in various cases
As can be observed in Table 5, the accuracy of the proposed method outperforms that of [4] by 41.61%. Compared with [10], the reported accuracy is lower than ours by 61.49% when only egocentric camera data are used and by 44.79% when egocentric camera data are combined with multiple static cameras. Finally, the accuracy reported in [26], which uses genetic algorithms, is 12.77% lower than ours.
The proposed method is already better in terms of accuracy, as shown in Table 5. It performs global recognition without exceptions or constraints. As shown in Table 4, we considered 5, 10, 15, 20, 25, and 30 videos, representing one, two, three, four, five, and six different subjects preparing each of the five recipes in the database. The proposed method works in all cases, while maintaining a maximum accuracy between 96.45% and 99.41% depending on the case considered. If there were other recipes, we could easily integrate and recognize them with our proposed method.
Table 5. Comparison of the proposed method versus different approaches using the CMU-MMAC dataset
The proposed algorithm is simple and easy to apply: it consists of taking a few frames by sampling an egocentric video only, classifying them, and recognizing the activity in question. Sampling every 0.5 seconds allows us to approach real-time activity recognition.
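As a final illustration of this pipeline, a minimal inference sketch is given below. The file name, image size, and alphabetical class ordering are assumptions, and the snippet reuses the trained hypothetical `model` from the earlier sketches.

```python
import numpy as np
import tensorflow as tf

# Assumed alphabetical class order, matching image_dataset_from_directory.
CLASS_NAMES = ["brownie", "pizza", "salad", "sandwich", "scrambled eggs"]

# Classify one sampled frame with the trained model from the previous sketches.
img = tf.keras.utils.load_img("frames/frame_00042.jpg", target_size=(128, 128))
x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]   # shape (1, 128, 128, 3)
probs = model.predict(x)[0]
print("Predicted recipe:", CLASS_NAMES[int(np.argmax(probs))])
```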
On the other hand, Soran et al. [10] used both egocentric and multiple static cameras in their method, while the studies in [2], [4], and [26], for example, used both first-person camera data and IMUs to extract actions from different activities before applying their proposed methods. The computational load of the proposed method is essentially that required by the deep learning algorithm.
6. Future Objectives
In future research, using the same CMU-MMAC database, the following situations are targeted:
· Anomalous situations that can occur (fire and smoke, falling dishes, distractions, etc.). Here, using deep learning to detect such cases could be useful for triggering an intervention to help or rescue.
· Predefined situations, where the subjects follow a weekly cooking program. Hence, knowing the recipe cooked today, one can anticipate tomorrow's. This in turn can be useful for checking the availability of all necessary ingredients, or simply for setting a reminder. This could be achieved using an RNN model.
Another suggestion for future work is to use the MECCANO dataset to investigate human-object interactions in an industrial context. The aim is to detect the current action in a production chain and then anticipate the next one. A verification-by-recognition process using RNN-based deep learning could then be conducted and a decision made: if the anticipated next action is executed correctly, no intervention is needed; otherwise, an error message is triggered so that the action can be corrected or resumed.
7. Conclusion
We have presented a simple and efficient classification method using only egocentric camera data from the CMU-MMAC database. Reducing the amount of data used accelerates the human activity recognition process. We extract frames by temporal sampling of the egocentric videos, taking one frame every half second to approach real-time activity recognition. We then prepared a new database containing five labels corresponding to the five recipes of database [3], on which we applied classification using our proposed deep learning CNN algorithm.
This algorithm proved effective in recognizing the activities in question, with a very satisfactory accuracy of 99.41% when one subject was performing the five recipes, and a maximum accuracy varying between 96.45% and 99.13% when several subjects were preparing these recipes, each in his or her own way. It is important to note that the proposed method remains effective regardless of the egocentric video data and of the manner in which the subjects carry out their activities. The accuracy of the proposed method exceeds that of [4] by 41.61%. Compared to [10], the reported accuracy is lower than ours by 61.49% and 44.79% for egocentric cameras alone and for egocentric cameras combined with multiple static cameras, respectively. With regard to [26], the accuracy provided by the genetic algorithm approach is 12.77% lower than ours.