1. Introduction
In recent years, using point clouds for three-dimensional (3D) object classification has become popular in various fields, such as face recognition [1], 3D modeling [2], intelligent surveillance [3], and robotic missions [4]. Characteristics of point clouds, such as their high precision, make them valuable for object classification. Compared to traditional two-dimensional (2D) images captured by a camera, point clouds have significant advantages; for example, they are invariant to lighting, rotation, and color. However, the unordered arrangement, inhomogeneous densities, and non-structural distributions of point clouds present challenges for object classification.
To address these problems, researchers use point cloud features as the basis for classification rather than processing raw point cloud data. Most methods, such as 3D shape context [5], point signature [6], and clustered viewpoint feature histogram [7], focus on analyzing and extracting point cloud features, such as geometric, shape, and structural attributes, or combinations of multiple attributes [8]. Traditionally, because convolutional networks require regular inputs, point clouds must be converted to regular representations before being fed to such networks, which can increase the data size unnecessarily. Thus, PointNet [9], a neural network structure that takes the original point clouds as input, was developed. This paper proposes a lightweight pointwise convolutional neural network (CNN) structure for 3D object classification. The proposed pointwise CNN requires fewer training epochs than PointNet and achieves 87.07% classification accuracy on the ModelNet10 dataset.
The remainder of this paper is organized as follows. Section 2 reviews previous studies on 3D object classification in point clouds. Section 3 describes the proposed network structure. Section 4 describes the experiments and evaluates the classification results. Conclusions and suggestions for future work are provided in Section 5.
2. Related Work
Researchers have done considerable work in object classification using 3D point cloud data. Such studies primarily employed feature-based, graph matching-based, and machine learning-based object classification methods.
In general, feature-based object classification methods can be divided into global and local methods. Methods that combine both categories of features have also appeared in the face recognition domain [10]. In global feature-based methods, target objects must first be separated from the background; their geometric features are then matched to classify the objects. Drost et al. [11] introduced oriented point pair features and employed a fast voting scheme to match models locally. Their method grouped similar features of a model when mapping the point pair feature space to the model, which increased the speed of the algorithm when processing point cloud data. Rusu et al. [12] introduced the viewpoint feature histogram (VFH), which added viewpoint information to the extended fast point feature histogram (FPFH) introduced in their previous work [13]. The VFH method could recognize both objects and their pose. In addition, it collected statistics of the relative angle between the surface normals and the direction of the central viewpoint. Therefore, the computational cost of the VFH was reduced, which allowed it to be used in real-time processing. Rusu et al. [14] also proposed global fast point feature histograms to capture the local geometric relationships in an object. In that study, a support vector machine was employed to label and recognize objects, and high classification accuracy was achieved. Marton et al. [15] introduced a global radius-based surface descriptor (RSD) that first rasterized the point cloud and sorted the surfaces by curvature; the RSD was then used to calculate features for classification. Wohlkinger and Vincze [16] introduced an ensemble of shape functions (ESF) for real-time 3D object recognition. This descriptor combined angle, point-pair distance, and shape functions, and significantly improved the recognition rate. However, experimental results indicated that the ESF descriptor had difficulty identifying similarly shaped objects. Chen et al. [17] proposed the global Fourier histogram (GFH) descriptor, which used cylindrical angular coordinates to achieve rotational invariance around the vertical axis. Their experimental results indicated that the GFH classified almost all objects correctly; however, it encountered the same problem as the ESF, i.e., the descriptor might return incorrect results if two objects had a similar shape. In general, global feature-based methods fail to describe some details of an object and are easily influenced by noise and occlusion.
To address the limitations of global features, researchers focused on developing methods that use local features for classification, for example, an object's key points. The spin image (SI) is a classic method introduced by Johnson and Hebert [18], in which 2D spin images are used to represent 3D point clouds. Although the SI showed robustness against noise and occlusion, it required a uniform distribution of point cloud data. Guo et al. [19] introduced the Tri-Spin-Image (TriSI) descriptor, which recognized 3D objects under clutter and occlusion conditions. Their system constructed a local reference frame and generated signatures; these signatures were then used to compress TriSI features, and a hierarchical feature matching method was employed for object recognition. TriSI showed robustness against noise and varying resolutions, which resolved the weakness of the classic SI method. Rusu et al. [20] introduced the point feature histogram (PFH) and the FPFH [13]. The PFH calculated features, such as the distance between two points and the angle between their normal vectors, and then mapped those features to a histogram to obtain statistics. However, the PFH algorithm was computationally complex; thus, the FPFH was developed to improve its efficiency. The FPFH used the relationship between a point's normal vector and those of its k-nearest neighbors, which made it more efficient than the PFH. Salti et al. [21] introduced the signature of histograms of orientations (SHOT) descriptor, which was robust against noise and rotation. Prakhya et al. [22] improved the SHOT method [21] by introducing a binary 3D feature descriptor called the binary signature of histograms of orientations (B-SHOT). They employed binary quantization to convert a real-valued vector to a binary vector, which made B-SHOT faster and less memory intensive than SHOT. He and Mei [23] proposed a new spin image-based registration (SIR) algorithm using a 3D feature space that included the Tsallis entropy of the spin image and the laser sensor's reflection intensity. The new SIR algorithm improved the robustness and computational speed of previous SIR algorithms.
With the successful application of deep learning systems in many fields, researchers have also focused on applying them to 3D data. Maimaitimin et al. [24] converted point clouds to a surface-condition feature map for feature extraction using an autoencoder. The extracted geometric features worked well even with extremely noisy data. Su et al. [25] proposed a multi-view CNN that rendered point clouds into 2D images (views) and then combined information from different views for classification. A view-pooling layer was employed to pool the extracted view-based features from the different views. The classification accuracy of their method was significantly better than that of other 3D shape descriptors. Bobkov et al. [26] developed a point pair descriptor and a four-dimensional (4D) CNN for object classification. Their method showed robustness to noise and occlusion in point clouds; however, tuning the network hyperparameters required optimization, and the computational cost of the 4D convolution was quite high. Ben-Shabat et al. [27] introduced a new representation of a 3D point cloud called 3D modified Fisher vectors. Employing continuous generalized Fisher vectors and a coarse discrete grid structure provided robustness against data loss. Li et al. [28] introduced a new deep neural network structure called the field probing neural network, which employed field probing filters to extract features from volumetric space effectively; therefore, it ran much faster than traditional 3D CNNs. Zhou and Tuzel [29] proposed VoxelNet, a 3D detection network that partitions a point cloud into equally spaced 3D voxels and then calculates a unified feature for each voxel using a voxel feature encoding layer. Engelcke et al. [30] proposed an efficient CNN to recognize 3D objects in point clouds, introducing convolutional layers with a feature-centric voting scheme for recognition. Their model used fewer layers than previous models while achieving competitive processing time. Fang et al. [31] defined a deep shape descriptor (DeepSD) for a deep neural network. DeepSD demonstrated robustness against noise, incompleteness, and structural variations. Klokov and Lempitsky [32] proposed a deep learning network structure named the Kd-network. Unlike traditional convolutional networks, Kd-networks do not require rasterization prior to processing point clouds, which eliminates the time required for rasterization and makes the network run faster.
Researchers have also attempted to find ways to handle unordered input data because significant time is required to convert unordered data to ordered data. Vinyals et al. [33] proposed a framework with attention mechanisms that includes read, process, and write blocks. Although this framework could handle unordered input sets, it was designed for language processing rather than 3D object classification, and employing it for 3D object classification tasks is difficult. In this paper, we employ a CNN to classify 3D objects from point cloud data.
3. Pointwise CNN-Based 3D Object Classification Method
We propose a pointwise CNN for 3D object classification (Fig. 1). In our work, raw point cloud data are input to the network directly without any preprocessing. The developed pointwise CNN consists of four convolutional layers (Conv1–Conv4), one max pooling layer, and four fully connected layers (FC1–FC4). The Conv1 layer applies 64 1×3 convolution kernels, and the other three convolutional layers apply 128, 256, and 512 1×1 kernels, respectively. The max pooling layer employs a global max pooling method to extract 512 features from the convolution results of the Conv4 layer. The four fully connected layers, which contain 512, 256, 128, and k neurons, respectively, provide the classification results. A backward process is implemented during training to update the weight matrices in the convolutional and fully connected layers and the bias vectors in the fully connected layers. A dropout layer with a drop rate of 0.2 is implemented before FC4 to prevent overfitting. The variable N in Fig. 1 is the number of points.
Fig. 1. Proposed pointwise CNN for 3D object classification.
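For illustration, the following is a minimal PyTorch-style sketch of the architecture described above. It is not the authors' implementation (the paper uses its own hand-written forward and backward passes); the leaky ReLU slope, batch handling, and class name are assumptions.

```python
import torch
import torch.nn as nn

class PointwiseCNN(nn.Module):
    """Sketch: four convolutional layers, global max pooling, four fully connected layers."""
    def __init__(self, k=10, neg_slope=0.01):  # k categories; slope value is an assumption
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(1, 3), bias=False),  # Conv1: 64 kernels of size 1x3, no bias
            nn.LeakyReLU(neg_slope),
            nn.Conv2d(64, 128, kernel_size=1, bias=False),      # Conv2: 128 kernels of size 1x1
            nn.LeakyReLU(neg_slope),
            nn.Conv2d(128, 256, kernel_size=1, bias=False),     # Conv3: 256 kernels of size 1x1
            nn.LeakyReLU(neg_slope),
            nn.Conv2d(256, 512, kernel_size=1, bias=False),     # Conv4: 512 kernels of size 1x1
            nn.LeakyReLU(neg_slope),
        )
        self.fc = nn.Sequential(
            nn.Linear(512, 512), nn.LeakyReLU(neg_slope),       # FC1
            nn.Linear(512, 256), nn.LeakyReLU(neg_slope),       # FC2
            nn.Linear(256, 128), nn.LeakyReLU(neg_slope),       # FC3
            nn.Dropout(p=0.2),                                  # dropout (rate 0.2) before FC4
            nn.Linear(128, k),                                  # FC4: k output neurons
        )

    def forward(self, points):                # points: (B, N, 3)
        x = points.unsqueeze(1)               # (B, 1, N, 3)
        x = self.conv(x)                      # (B, 512, N, 1)
        x = torch.amax(x, dim=(2, 3))         # global max pooling over all points -> (B, 512)
        return self.fc(x)                     # class scores; softmax is applied in the loss
```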
3.1 Forward Process for Training and Testing
Two types of convolutional kernels are used in this study. In the first convolutional layer, we use 1×3 kernels to combine the coordinate information of each point in the point cloud. Fig. 2 and Eq. (1) illustrate how these kernels work.
In Eq. (1), the operator * represents convolution, $p_n$ is the coordinate information of a point, where n ranges from 1 to N and N is the number of points, $r_{u,n}$ is the result of a convolution, and $k_u$ is the $u$-th convolutional kernel, where u ranges from 1 to m and m is the number of convolution kernels.
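Based on these definitions, a plausible form of Eq. (1) is

$$r_{u,n} = k_u * p_n, \qquad u = 1, \ldots, m, \quad n = 1, \ldots, N.$$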
The size of the output array is calculated in two steps. First, the width and height are calculated as $O_w = I_w + 1 - k_w$ and $O_h = I_h + 1 - k_h$, respectively. Second, $O_w$ and $O_h$ are multiplied to obtain the array size. Here, $O_w$, $O_h$, $I_w$, $I_h$, $k_w$, and $k_h$ denote the width and height of the output array, input array, and kernel, respectively.
Fig. 2. Convolution computed by 1×3 kernels.
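As a worked example of these size formulas for the first convolutional layer, an $N \times 3$ input convolved with a $1 \times 3$ kernel gives

$$O_w = 3 + 1 - 3 = 1, \qquad O_h = N + 1 - 1 = N,$$

so each of the 64 kernels in Conv1 produces an $N \times 1$ output array.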
In the other three convolutional layers, 1×1 convolution kernels are used to combine the information from different channels. Each kernel convolves the values at the same position across the different channels of the input arrays, and the convolution result is saved at the same position in the output array. The output arrays and the kernels are in one-to-one correspondence. Fig. 3 illustrates how a 1×1 convolutional kernel works, where d is the number of channels in the input and the kernel.
Fig. 3. Convolution computed by the 1×1 kernel.
Note that no biases are added to the outputs of any convolutional layer before the results are sent to the activation function. The activation function for all four convolutional layers is the leaky ReLU defined in Eq. (2).
Here, m represents the values in output arrays.
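A standard leaky ReLU consistent with this description is

$$\sigma(m)=\begin{cases} m, & m > 0, \\ \alpha m, & m \le 0, \end{cases}$$

where the negative slope $\alpha$ ($0 < \alpha < 1$) is an assumption, as its value is not stated in the text.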
To handle the unordered and irregular input, we need a symmetric function as the pooling method. Thus, we employ a global max pooling function to extract 512 features from the last convolutional layer, as shown in Fig. 4. The pool size of this max pooling operation equals the size of the output array of the last convolutional layer.
Fig. 4. Global max pooling operation.
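A short NumPy sketch illustrates why global max pooling is a suitable symmetric function: the pooled feature vector does not change when the rows (points) of the Conv4 output are permuted. The array shape (N = 1024 points) is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
conv4_out = rng.normal(size=(1024, 512))            # assumed shape: N = 1024 points, 512 channels

pooled = conv4_out.max(axis=0)                      # global max pooling over the point dimension
permuted = rng.permutation(conv4_out, axis=0)       # reorder the points arbitrarily

assert np.array_equal(pooled, permuted.max(axis=0)) # same 512-D feature vector either way
```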
The features extracted from the last convolutional layer are then sent to the four fully connected layers. The operating principle of the fully connected layers is given in Eqs. (3) and (4), where $W_i$ is the matrix containing the weights of the $i$-th layer, $\sigma(\cdot)$ is the leaky ReLU function defined in Eq. (2), and $b_i$ and $x_i$ are the bias and input vectors of the $i$-th layer, respectively. The activation function for the first three fully connected layers is the same as that defined in Eq. (2).
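A plausible reconstruction of Eqs. (3) and (4), following the notation above, is

$$z_i = W_i x_i + b_i, \qquad a_i = \sigma(z_i).$$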
The results of the last layer are then sent to the softmax activation function shown in Eq. (5), where $s_i$ is the result of the softmax operation for a neuron, $a_i$ is calculated by Eq. (4), and c is the number of object categories.
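The softmax function referred to in Eq. (5) has the standard form

$$s_i = \frac{e^{a_i}}{\sum_{j=1}^{c} e^{a_j}}, \qquad i = 1, \ldots, c.$$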
We select the maximum value among the $s_i$ and take its index i as the classification result. Testing ends at this stage; however, training additionally requires a process that updates the weights and kernels in the network.
3.2 Backward Process
The first step of the backward process used for training is calculating the loss rate q using a loss function. We employ the cross-entropy function shown in Eq. (6), where $y_i$ is the one-hot encoded label for the $i$-th neuron, $s_i$ is the output of this neuron after the softmax operation, and c is the number of categories.
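With the one-hot labels $y_i$ and softmax outputs $s_i$ defined above, the cross-entropy loss of Eq. (6) takes the standard form

$$q = -\sum_{i=1}^{c} y_i \ln s_i.$$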
The gradient descent method is implemented to update the weights and kernels in the network. Thus, we need to calculate the gradient for each layer and then update the weights or kernels of each layer accordingly. The gradients are calculated using the chain rule, which means that the gradient of the last fully connected layer is computed first and then propagated backward. Eq. (7) shows how to calculate the gradient with respect to the $i$-th output $z_i$ of the last fully connected layer. In Eq. (7), q and $s_j$ have the same meaning as in Eq. (6), $z_i$ is calculated by Eq. (3), and c is the number of categories.
Eqs. (8) and (9) provide the results of the two partial derivatives in Eq. (7).
Combining the above results, Eq. (7) is equivalent to Eq. (10), where c has the same meaning as in Eq. (7) and $y_i$ has the same meaning as in Eq. (6). Because one-hot encoding is employed in this study, the gradient of the last fully connected layer can be calculated using Eq. (11), where $y_i$ and $s_i$ have the same meaning as in Eq. (6) and $z_i$ is calculated by Eq. (3).
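Combining the standard derivatives of the cross-entropy loss and the softmax function gives the well-known result that presumably corresponds to Eqs. (10) and (11):

$$\frac{\partial q}{\partial z_i} = \sum_{j=1}^{c} \frac{\partial q}{\partial s_j}\,\frac{\partial s_j}{\partial z_i} = s_i - y_i,$$

where the simplification uses $\sum_{j=1}^{c} y_j = 1$ for a one-hot label.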
The gradients are then used to update the weights and biases in the fully connected layers. The error rates of the last layer and other fully connected layers can be calculated by Eq. (12). Eqs. (13) and (14) show how to calculate the new weight of a layer.
In Eqs. (12)–(14), the operator ⊙ denotes the Hadamard product, $e_i$ is the error rate of the $i$-th layer, $W_i$ is the weight matrix of the $i$-th layer, and l is the learning rate for the fully connected layers. $b_i$ is the bias vector of the $i$-th layer, $a_{i-1}$ is the input of the $i$-th layer, and $z_i$ is the output of the $i$-th layer before the activation function is applied. According to the leaky ReLU function defined in Eq. (2), the values of $\sigma'(z_i)$ are given in Eq. (15).
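Under these definitions, a plausible reconstruction of the error propagation and update rules in Eqs. (12)–(15) is

$$e_{i-1} = \left(W_i^{\mathrm{T}} e_i\right) \odot \sigma'(z_{i-1}), \qquad
W_i \leftarrow W_i - l\, e_i\, a_{i-1}^{\mathrm{T}}, \qquad
b_i \leftarrow b_i - l\, e_i,$$

$$\sigma'(z_i) = \begin{cases} 1, & z_i > 0, \\ \alpha, & z_i \le 0, \end{cases}$$

where $\alpha$ is the leaky ReLU slope assumed for Eq. (2); the exact form used by the authors may differ in indexing.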
In the backward process, the max pooling layer performs an upsampling operation that places the error rates transmitted from the adjacent fully connected layer into the positions where the maxima were located during the forward process. The remaining positions in the pool are filled with zeroes.
There are two ways to obtain the error rate for convolutional layers in the backward process.
1. If the previous layer is the max pooling layer, the error rate of the previous layer is transmitted to the current layer directly.
2. If the previous layer is a convolutional layer, the error rate of the current layer should be calculated based on the error rate of the previous layer. To calculate the error rate, a convolution is implemented, as shown in Eq. (16). In Eq. (16), $e_i$ is the error rate of the $i$-th layer, $k_i$ denotes the kernels in the $i$-th layer, the operator * means convolution, and the function $\operatorname{rot}(\cdot)$ rotates the kernel by 180°. Note that $\sigma'(z_i)$ and the operator ⊙ are defined as in Eq. (12).
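A plausible form of Eq. (16), consistent with this description (the indexing convention is an assumption), is

$$e_{i-1} = \left(e_i * \operatorname{rot}(k_i)\right) \odot \sigma'(z_{i-1}).$$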
Eq. (17) shows how the kernels are updated. Here, $k_i$ is the kernel in the $i$-th convolutional layer, and $e_i$, $a_{i-1}$, and l are the error rate, input data, and learning rate for this layer, respectively. The operator * denotes convolution.
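Following the same notation, a plausible form of the kernel update in Eq. (17) is

$$k_i \leftarrow k_i - l\left(a_{i-1} * e_i\right).$$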
4. Object Recognition Experiments
The proposed network was tested on the ModelNet10 dataset [34]. The dataset contains 4,899 objects from 10 categories, i.e., bathtub, bed, chair, desk, dresser, monitor, nightstand, sofa, table, and toilet, as shown in Fig. 5. Note that the objects in this dataset are aligned in orientation.
Fig. 5. Examples of categories in the ModelNet10 dataset.
In our experiments, we employed balanced and unbalanced datasets. The partition of the unbalanced set was the same as that of the original ModelNet10 dataset. For the balanced set, we selected 106 samples from the original training set of each of the 10 categories; the remaining samples in the original training set were added to the test set. Table 1 shows the partition of the training and test sets for each category in the balanced and unbalanced datasets. The learning rate for all fully connected and convolutional layers was set to 0.0001. The experiments were performed on a computer with an Intel Xeon E5-2670 v3 CPU @ 2.30 GHz and 56 GB RAM running the Windows 10 operating system.
Table 1. Partition of training and test sets
Table 2. Classification accuracy of each category on the test datasets
Fig. 6. Confusion matrix of classification results on the balanced set.
Fig. 7. Confusion matrix of classification results on the unbalanced set.
We extracted the x, y, and z coordinate information from the original ModelNet10 Object File Format (OFF) files and saved the information in individual files, each containing the coordinate information of one 3D object. In both the training and testing processes, we input the files one at a time. Table 2 shows the classification accuracy for each category on the balanced and unbalanced datasets. With the proposed network, the average accuracy reached 87.07% on the unbalanced dataset after 90 iterations over the training dataset and 85.34% on the balanced set after 140 iterations over the training dataset. The proposed network performed better on the unbalanced dataset because more samples were used for training.
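The coordinate extraction step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; it assumes standard OFF headers (with the known ModelNet quirk where the vertex count is fused onto the 'OFF' keyword), and the file paths are hypothetical.

```python
import numpy as np

def read_off_vertices(path):
    """Return the x, y, z coordinates of an OFF file as an (N, 3) array."""
    with open(path) as f:
        tokens = f.read().split()
    assert tokens[0].startswith("OFF"), "not an OFF file"
    if tokens[0] == "OFF":
        n_vertices = int(tokens[1])
        start = 4                      # skip 'OFF', n_vertices, n_faces, n_edges
    else:
        n_vertices = int(tokens[0][3:])  # header like 'OFF490 336 0'
        start = 3                        # counts share the first token with 'OFF'
    coords = np.array(tokens[start:start + 3 * n_vertices], dtype=np.float64)
    return coords.reshape(n_vertices, 3)

# Hypothetical usage: save each object's coordinates to its own file.
points = read_off_vertices("ModelNet10/chair/train/chair_0001.off")
np.savetxt("chair_0001.xyz", points)
```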
Figs. 6 and 7 show the confusion matrices of the classification results for the balanced and unbalanced datasets, respectively. Note that the values in the matrices are numbers of examples, not percentages. As shown in Fig. 6, the network confused samples in the table and desk classes on the balanced dataset. We believe this occurred because the objects have a similar shape, although there are some exceptions, e.g., the object in Fig. 5(i) is a round table. As mentioned in Section 2, global features have difficulty distinguishing objects with similar shapes. On the unbalanced set, however, the network misclassified most samples in the desk category as sofa and nightstand. We attribute this result to the greater number of sofa samples in the unbalanced training dataset. Another reason for the misclassification was information loss in the point cloud data, which made objects in the desk category share more similarities with samples in the sofa and nightstand categories. This also explains why sofa and nightstand were misclassified as desk (the second and third most frequent confusions) on the balanced set.
We compared our results on the unbalanced set against the PointNet structure [9] and the 3D ShapeNets developed by Wu et al. [34], as shown in Table 3. Compared with these previous studies, the classification accuracy of our work increased by 9.47% and 3.53%, respectively. Compared to the PointNet structure [9], the number of training iterations also decreased from 200 to 90.
Table 3. Classification accuracy on the test set
Fig. 8 is a line chart of the average accuracy on the test set under different numbers of training iterations, where the horizontal axis is the number of iterations and the vertical axis is the average accuracy. Our proposed network outperformed the PointNet structure [9] after 20 iterations (81.91%), and its average accuracy was almost the same as that of the method developed by Wu et al. [34] after 30 iterations.
Fig. 8. Testing accuracy under different numbers of iterations.
5. Conclusion
This paper proposed a pointwise CNN for 3D point cloud classification that consumes raw point cloud data directly. The proposed system employs four convolutional layers and a max pooling layer to extract features from the original point cloud data, and four fully connected layers are then employed for classification. The system was tested on the ModelNet10 dataset and achieved average accuracies of 87.07% on the unbalanced training set after 90 iterations and 85.34% on the balanced training set after 140 iterations. Average accuracy improved by 9.47% compared to the existing PointNet structure while requiring fewer iterations. However, the proposed network confused objects with similar shapes, particularly objects in the desk, nightstand, and table categories. We believe this occurred because we employed global features for object classification and because of information loss in the point cloud data. In future work, we will consider incorporating local features to improve classification accuracy.
Acknowledgement
This research was funded by the MSIT (Ministry of Science and ICT), Korea, under the High-Potential Individuals Global Training Program (No. 2020-0-01576) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation), the National Natural Science Foundation of China (No. 61503005), the Great Wall Scholar Program (No. CITTCD20190305), and NCUT funding (No. 110052972027/008).