## Hui Zeng*, Qi Wang*, Chen Li** and Wei Song**

| Conv1 | Conv2 | Conv3 | Conv4 | Conv5 | Fc6 | Fc7 | Fc8 |
|---|---|---|---|---|---|---|---|
| 96×11×11 | 256×5×5 | 512×3×3 | 512×3×3 | 512×3×3 | 4096 | 4096 | classes |
| stride=2 | stride=2 | stride=1 | stride=1 | stride=1 | dropout | dropout | softmax |
| pad=1 | pad=1 | pad=1 | pad=1 | pad=1 | - | - | - |
| 2×2 pool | 2×2 pool | - | - | 2×2 pool | - | - | - |

In this paper, the purpose of the view-pooling layer is to convert the multi-view feature maps into a single, effective descriptor. The results in [11] show that element-wise maximum pooling across the views works better than omitting the view-pooling layer, and that element-wise maximum pooling is more effective than element-wise mean pooling. However, neither of these two pooling methods is necessarily optimal. Selecting only the maximum or only the average as the activation of the view-pooling layer can discard significant information and can make the model prone to overfitting. To address these issues, we put forward the LMPF method. It aggregates the features of multiple views by learning a set of weights, and in this way fuses the different view-pooling methods. In other words, it introduces a “learning” ability into the previously hand-crafted view-pooling method, so that the pooling itself is adjusted to minimize the training error throughout the training phase.

Fig. 2 illustrates the three kinds of view-pooling methods. Fig. 2(a) shows the max pooling method in the view-pooling layer, which performs the maximum operation. The red rectangle represents the maximum value in the corresponding area of all n views. Fig. 2(b) shows the mean pooling method in the view-pooling layer, which performs the average operation. The red rectangles indicate that all elements of the same pooled area are involved in the operation. Fig. 2(c) shows the LMPF method, which combines the maximum operation and the average operation through a set of learnable weights.

As shown in Fig. 2, we design our view-pooling method on the basis of the max pooling method and the mean pooling method. These two methods act like the traditional max or average operations in ordinary pooling layers; the only difference is that the pooling region changes from an area within one feature map to the set of corresponding elements across the views (the sub-rectangles of each view in Fig. 2). Suppose that the last feature maps before view pooling are $\left[m_{1}, m_{2}, \ldots, m_{n}\right]$, where $n$ is the number of views. The value at a given point on feature map $m_{k}$ is written as $\alpha_{k}(p, q)$, in which $p$ represents the abscissa and $q$ represents the ordinate. For the max pooling method in the view-pooling layer, the output $O_{\max}(p, q)$ at location $(p, q)$ is the maximum of $\left[\alpha_{1}(p, q), \alpha_{2}(p, q), \ldots, \alpha_{n}(p, q)\right]$:

$$O_{\max}(p, q)=\max \left(\alpha_{1}(p, q), \alpha_{2}(p, q), \ldots, \alpha_{n}(p, q)\right) \qquad (1)$$

For the mean pooling method in the view-pooling layer, the output $O_{\operatorname{mean}}(p, q)$ at location $(p, q)$ is the mean value of $\left[\alpha_{1}(p, q), \alpha_{2}(p, q), \ldots, \alpha_{n}(p, q)\right]$:

$$O_{\operatorname{mean}}(p, q)=\frac{1}{n} \sum_{k=1}^{n} \alpha_{k}(p, q) \qquad (2)$$

For our proposed learning-based view-pooling method, the output is the weighted sum of the results obtained from the max pooling method and the mean pooling method in the view-pooling layer:

$$O_{\mathrm{LMPF}}(p, q)=w_{1}\, O_{\max}(p, q)+w_{2}\, O_{\operatorname{mean}}(p, q) \qquad (3)$$

where $w_{1}$ and $w_{2}$ are the weights of max pooling and mean pooling, respectively. The weights are initialized with small random values, subject to the constraint that they sum to 1, and the standard back-propagation (BP) algorithm is used to search for their optimal values throughout the training phase. From Eq. (3) we can see that our method fuses the max pooling strategy and the mean pooling strategy over multiple feature maps, and that its purpose is to select optimal values for $w_{1}$ and $w_{2}$ by end-to-end learning. The proposed method can therefore combine max pooling and mean pooling effectively and reduce information loss in the view-pooling stage.
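The three view-pooling rules described above can be illustrated with a small sketch. It is not the authors' MatConvNet implementation: the function name `view_pool`, the nested-list representation of feature maps, and the sample values are assumptions made for illustration. Each output element is the weighted sum of the element-wise maximum and the element-wise mean across the views.

```python
from typing import List, Sequence

FeatureMap = List[List[float]]  # one 2D feature map, indexed as m[p][q]


def view_pool(maps: Sequence[FeatureMap], w1: float, w2: float) -> FeatureMap:
    """Fuse n per-view feature maps into one map: w1 * max + w2 * mean."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    fused = []
    for p in range(rows):
        row = []
        for q in range(cols):
            values = [m[p][q] for m in maps]      # alpha_1(p,q) ... alpha_n(p,q)
            o_max = max(values)                   # max pooling across views
            o_mean = sum(values) / n              # mean pooling across views
            row.append(w1 * o_max + w2 * o_mean)  # weighted fusion of the two
        fused.append(row)
    return fused


# Three hypothetical 1x2 feature maps, one per view.
views = [[[1.0, 4.0]], [[3.0, 0.0]], [[2.0, 2.0]]]
max_only = view_pool(views, 1.0, 0.0)    # reduces to plain max pooling
mean_only = view_pool(views, 0.0, 1.0)   # reduces to plain mean pooling
blended = view_pool(views, 0.5, 0.5)     # an equal mix of both
```

Setting the weights to (1, 0) or (0, 1) recovers the two hand-crafted methods, which is why the weighted form can be seen as a generalization of both.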

In summary, the implementation procedures of our experiments can be recapitulated as follows:

1) **Input:** Generate multiple projected images of each 3D model in the dataset as the input of the MVCNN model.

2) **Initialization:** Initialize each convolutional layer of MVCNN randomly, and set proper values for the related parameters, such as the learning rate, the momentum, and the weights $\left[w_{1}, w_{2}\right]$ of LMPF.

3) **Training phase:** Choose the SGD algorithm to fine-tune the MVCNN model on the training dataset. Then we can obtain the optimal values of the weights.

4) **Classification/Retrieval:** Use linear SVMs [21] for classification and the $L_{2}$ distance [6] for retrieval on the testing dataset.
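The retrieval part of step 4 amounts to ranking gallery descriptors by their $L_{2}$ distance to the query descriptor. The following is a minimal sketch of that ranking only; the function names and the two-dimensional toy descriptors are hypothetical, and the SVM classification step is not covered.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices ordered from nearest to farthest from the query."""
    return sorted(range(len(gallery)), key=lambda i: l2_distance(query, gallery[i]))


# Toy descriptors standing in for the pooled MVCNN features.
gallery = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
ranking = retrieve([0.5, 0.0], gallery)
```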

To validate the accuracy and feasibility of LMPF objectively, we use two datasets in our experiments, ModelNet40 [22] and McGill [23]. All CNN models in our experiments are built with the MatConvNet toolbox [24]. The experiments run in MATLAB R2014a on a Lenovo computer with an i7-6700 CPU at 3.40 GHz and 12.0 GB of memory. To analyze the experimental results comprehensively, we use the following indicators: accuracy for classification, and mean average precision (mAP), nearest neighbor (NN), first tier (FT), second tier (ST), and discounted cumulative gain (DCG) for retrieval.
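The paper does not spell out how the retrieval indicators are computed. The sketch below follows the definitions commonly used in 3D shape retrieval benchmarks, which is an assumption here: for one query, `rel` is the ranked result list (query excluded) with `True` marking results of the query's class, and R is the number of such results. All function names are illustrative, and `tier` assumes at least one relevant result.

```python
import math
from typing import Sequence


def nearest_neighbor(rel: Sequence[bool]) -> float:
    """NN: 1.0 if the top-ranked result shares the query's class, else 0.0."""
    return 1.0 if rel[0] else 0.0


def tier(rel: Sequence[bool], multiple: int) -> float:
    """Recall within the top multiple * R results (1 = FT, 2 = ST)."""
    r = sum(rel)
    return sum(rel[: multiple * r]) / r


def average_precision(rel: Sequence[bool]) -> float:
    """Mean of the precision values measured at each relevant result."""
    hits, total = 0, 0.0
    for rank, is_rel in enumerate(rel, start=1):
        if is_rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def _dcg(rel: Sequence[bool]) -> float:
    """Gain of rank 1 counts fully; later ranks are discounted by log2(rank)."""
    return sum(float(g) if i == 1 else g / math.log2(i)
               for i, g in enumerate(rel, start=1))


def normalized_dcg(rel: Sequence[bool]) -> float:
    """DCG divided by the DCG of the ideal ordering of the same results."""
    return _dcg(rel) / _dcg(sorted(rel, reverse=True))


# One hypothetical query with R = 3 relevant results among six.
rel = [True, False, True, False, False, True]
scores = {
    "NN": nearest_neighbor(rel),
    "FT": tier(rel, 1),
    "ST": tier(rel, 2),
    "AP": average_precision(rel),
    "DCG": normalized_dcg(rel),
}
```

Averaging each score over all queries gives the dataset-level figure; mAP is the mean of the per-query AP values.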

In our experiments, we first obtained the multiple projected images of the 3D models from different views. Then we used the LMPF-based MVCNN to perform the 3D classification/retrieval experiments. The detailed steps have been listed in Section 3.3. Finally, we verified the effectiveness of the LMPF method and compared it with other methods. The momentum of MVCNN is set to 0.5, and the initial values of $\left[w_{1}, w_{2}\right]$ are set to [1, 0]. The view-pooling layer is placed after Conv5, and the SGD algorithm is used to update the parameters in the training phase.
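To make the update of the fusion weights concrete, the sketch below applies one step of SGD with classical momentum to the two weights, starting from [1, 0] with momentum 0.5 as stated above. It is an illustration under assumptions, not the MatConvNet code: the learning rate 0.1, the upstream gradient values, and the function names are hypothetical, and the sketch does not re-normalize the weights after the step because the text does not say whether that is done. Since the fused output is linear in the weights, the gradient of the loss with respect to each weight is the upstream gradient summed against the corresponding pooled map.

```python
from typing import List, Sequence, Tuple


def fusion_weight_grads(upstream: Sequence[float],
                        o_max: Sequence[float],
                        o_mean: Sequence[float]) -> List[float]:
    """Gradients of the loss w.r.t. the two fusion weights.

    With O = w1 * O_max + w2 * O_mean, dL/dw1 = sum(dL/dO * O_max)
    and dL/dw2 = sum(dL/dO * O_mean), summed over all output elements.
    """
    g1 = sum(u * a for u, a in zip(upstream, o_max))
    g2 = sum(u * b for u, b in zip(upstream, o_mean))
    return [g1, g2]


def sgd_momentum_step(w: Sequence[float], grad: Sequence[float],
                      velocity: Sequence[float], lr: float,
                      momentum: float = 0.5) -> Tuple[List[float], List[float]]:
    """One classical-momentum SGD step: v <- mu * v - lr * g, w <- w + v."""
    new_v = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


# Flattened pooled maps and a made-up upstream gradient dL/dO.
o_max, o_mean = [3.0, 4.0], [2.0, 2.0]
upstream = [0.5, -1.0]

weights, velocity = [1.0, 0.0], [0.0, 0.0]   # initial [w1, w2] from the text
grads = fusion_weight_grads(upstream, o_max, o_mean)
weights, velocity = sgd_momentum_step(weights, grads, velocity, lr=0.1)
```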

ModelNet40 is a subset of ModelNet which is published on the Princeton ModelNet website [22]. This dataset contains a total of 12,311 well-annotated shapes from 40 common categories. We construct the training dataset and testing dataset of ModelNet40 in accordance with the study [11]. Fig. 3 shows some sample models of ModelNet40.

We use the first camera setup described in [11] to obtain the multiple projected images that serve as inputs to the MVCNN model. This setup requires the input 3D model to be oriented upright along a consistent axis (most 3D model datasets conform to this assumption, including ModelNet40). For each 3D model, 12 virtual cameras are placed around it at intervals of 30°. Each camera aims at the center of the model and is elevated 30° from the horizontal. In this way we capture 12 views of each 3D model. An illustration is provided in Fig. 4.
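The camera placement just described can be written down as a short geometric sketch. The assumptions here are not from the source: the model center sits at the origin, the upright axis is z, the cameras lie on a unit-radius sphere, and the function name is illustrative. Each camera looks toward the origin, so its viewing direction is the negative of its position.

```python
import math
from typing import List, Tuple


def camera_positions(n_views: int = 12, elevation_deg: float = 30.0,
                     radius: float = 1.0) -> List[Tuple[float, float, float]]:
    """Positions of cameras spaced evenly in azimuth at a fixed elevation."""
    elev = math.radians(elevation_deg)
    positions = []
    for k in range(n_views):
        azim = math.radians(k * 360.0 / n_views)   # 30 degree steps for 12 views
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)                # same height for every camera
        positions.append((x, y, z))
    return positions


cams = camera_positions()
```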

First, we compare our proposed LMPF method with the two hand-crafted approaches (max pooling and mean pooling). Table 2 summarizes our 3D model classification/retrieval results on the ModelNet40 dataset with the three kinds of view-pooling methods. According to Table 2, LMPF attains the best result on every retrieval measure and matches max pooling in classification accuracy (89.90%). In mAP it outperforms mean pooling by 6.60 percentage points (71.00% vs. 64.40%) and max pooling by 0.90 percentage points (71.00% vs. 70.10%). We conclude that LMPF can decrease the information loss effectively on account of its “learning” ability in the training phase.

Then we perform experiments on the ModelNet40 dataset to compare our method with other classification methods. Table 3 gives the comparative classification/retrieval results on ModelNet40. Our method outperforms the others by roughly 13–22 percentage points in classification accuracy and 22–38 percentage points in mAP. These results further support the conclusion that LMPF can boost the performance effectively.

Table 2.

| Pooling method | Classification accuracy (%) | Retrieval mAP (%) | Retrieval NN (%) | Retrieval FT (%) | Retrieval ST (%) | Retrieval DCG (%) |
|---|---|---|---|---|---|---|
| Mean pooling | 88.00 | 64.40 | 86.38 | 64.42 | 75.25 | 87.63 |
| Max pooling | 89.90 | 70.10 | 88.13 | 70.81 | 79.62 | 90.11 |
| LMPF | 89.90 | 71.00 | 88.88 | 71.61 | 81.08 | 90.54 |

Table 3.

| Method | Classification accuracy (%) | Retrieval mAP (%) | Retrieval NN (%) | Retrieval FT (%) | Retrieval ST (%) | Retrieval DCG (%) |
|---|---|---|---|---|---|---|
| SPH [5] | 68.20 | 33.30 | - | - | - | - |
| LFD [25] | 75.50 | 40.90 | - | - | - | - |
| 3D ShapeNets [26] | 77.30 | 49.20 | - | - | - | - |
| LMPF | 89.90 | 71.00 | 88.88 | 71.61 | 81.08 | 90.54 |

The McGill dataset is provided on the McGill 3D Shape Benchmark website, which hosts a variety of 3D models [23]. The McGill dataset used in our experiments consists of nonrigid 3D models selected from this website. There are 255 models in this dataset, divided into 10 classes. Each class has about 25 models in different postures and appearances. Fig. 4 shows a series of sample models of this dataset.

In our experiments, 24 virtual cameras are placed on the surface of a sphere surrounding the model to produce the multi-view projection images. The model center coincides with the sphere center, and all cameras aim at the center of the 3D model. The camera locations are obtained with the Isocube spherical mapping method [27], which consists of two steps. First, the sphere is divided into six areas of equal size: two parallel circles split the sphere into an equatorial region and two polar caps, and the equatorial region is then divided into four symmetrical regions (see [27] for the mathematical proof that the six regions are equal in size). Second, each area is subdivided at a segmentation accuracy N into smaller areas of equal size, and a camera is placed at the center of each small area. In this paper we choose N = 2, so each of the six areas is split into four, yielding a total of 24 views per model.
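The equal-area claim and the view count can be checked with a few lines of arithmetic. This sketch does not reproduce the Isocube mapping of [27] or the actual camera coordinates. It relies on two assumptions that are derived here rather than quoted from the source: the two parallel circles lie at heights ±2/3 on a unit sphere (the value that makes each polar cap one sixth of the surface), and each region is split into N × N cells, which is consistent with the 24 views reported for N = 2.

```python
import math


def cap_area(z0: float, radius: float = 1.0) -> float:
    """Area of the spherical cap above height z0: 2 * pi * R * (R - z0)."""
    return 2.0 * math.pi * radius * (radius - z0)


def band_area(z0: float, radius: float = 1.0) -> float:
    """Area of the equatorial band between heights -z0 and +z0."""
    return 2.0 * math.pi * radius * (2.0 * z0)


z0 = 2.0 / 3.0                         # assumed height of the dividing circles
sixth = 4.0 * math.pi / 6.0            # one sixth of the unit sphere's surface
regions = [cap_area(z0), cap_area(z0)] + [band_area(z0) / 4.0] * 4

N = 2                                  # segmentation accuracy used in the paper
n_views = len(regions) * N * N         # assumed N x N cells per region
```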

As in the experiments on the ModelNet40 dataset, we first compare the three view-pooling methods described above. All the results are recorded in Table 4. Table 4 shows that LMPF achieves the best results in both classification and retrieval, outperforming the max pooling method and the mean pooling method by nearly 1%–4% in classification accuracy and 2%–5% in the retrieval measures. Then we compare LMPF with other retrieval methods on the McGill dataset and summarize the retrieval indicators in Table 5. Table 5 supports the conclusion that the LMPF method outperforms the other methods in 3D model retrieval. In summary, the LMPF strategy is a better view-pooling method for MVCNN than the hand-crafted methods.

In summary, we have proposed a view-pooling method named Learning-based Multiple Pooling Fusion (LMPF). Multiple experiments verify that this method can be successfully applied to the MVCNN model. First, we generate multiple projected images of the 3D models and use them as the inputs of the MVCNN model. Second, we initialize each convolutional layer of MVCNN randomly and set proper values for the related parameters. Then we fine-tune the network with the SGD algorithm to obtain a group of optimal weights for the MVCNN model. Finally, the linear SVM is used for classification and the $L_{2}$ distance is used for retrieval. The results show that LMPF performs better than the traditional hand-crafted view-pooling methods. In general, the LMPF method proposed in this paper combines learning-based pooling with hand-crafted pooling and can decrease the information loss effectively. In the future, we will further optimize the architecture of the MVCNN and investigate more effective view-pooling methods.

She received B.S. and M.S. degrees from Shandong University in 2001 and 2004, respectively, and received the Ph.D. degree from National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences in 2007. She is currently an associate professor at School of Automation and Electrical Engineering, University of Science and Technology Beijing, China. Her main research interests include computer vision, pattern recognition and machine learning.

She received the B.S. degree from University of Science & Technology Beijing in 2016. She is currently a graduate student at School of Automation and Electrical Engineering, University of Science and Technology Beijing, China. Her current research direction is computer vision and pattern recognition.

She received the Ph.D. in Control Science and Control Engineering from the University of Science and Technology Beijing, China, in 2013. She has been an associate professor at North China University of Technology, China, since 2017. She has long been engaged in research, development, and teaching in image processing, pattern recognition, and information hiding.

He received his B.Eng. degree in Software Engineering from Northeastern University, Shenyang, China, in 2005 and his M.Eng. and Dr.Eng. degrees in the Department of Multimedia from Dongguk University, Seoul, Korea, in 2008 and 2013, respectively. Since September 2013, he has been an Associate Professor at the department of Digital Media Technology of North China University of Technology. His current research interests are focused on IoT, virtual reality, and multimedia technologies.

- [1] M. Ankerst, G. Kastenmuller, H. P. Kriegel, T. Seidl, "3D shape histograms for similarity search and classification in spatial databases," in *Advances in Spatial Databases*. Heidelberg: Springer, 1999, pp. 207-226. doi: 10.1007/3-540-48482-5_14
- [2] M. T. Suzuki, T. Kato, N. Otsu, "A similarity retrieval of 3D polygonal models using rotation invariant shape descriptors," in *Proceedings of 2000 IEEE International Conference on Systems, Man and Cybernetics*, Nashville, TN, 2000, pp. 2946-2952.
- [3] R. Osada, T. Funkhouser, B. Chazelle, D. Dobkin, "Shape distributions," *ACM Transactions on Graphics (TOG)*, vol. 21, no. 4, pp. 807-832, 2002. doi: 10.1145/571647.571648
- [4] B. K. P. Horn, "Extended Gaussian images," *Proceedings of the IEEE*, vol. 72, no. 12, pp. 1671-1686, 1984.
- [5] M. Kazhdan, T. Funkhouser, S. Rusinkiewicz, "Rotation invariant spherical harmonic representation of 3D shape descriptors," in *Proceedings of the 2003 Eurographics Symposium on Geometry Processing*, Aachen, Germany, 2003, pp. 156-164.
- [6] S. K. Vipparthi, S. K. Nagar, "Color directional local quinary patterns for content based indexing and retrieval," *Human-centric Computing and Information Sciences*, vol. 4, no. 6, 2014. doi: 10.1186/s13673-014-0006-x
- [7] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," *International Journal of Computer Vision*, vol. 60, no. 2, pp. 91-110, 2004. doi: 10.1023/B:VISI.0000029664.99615.94
- [8] H. Bay, T. Tuytelaars, L. Van Gool, "Surf: speeded up robust features," in *Computer Vision-ECCV 2006*. Heidelberg: Springer, 2006, pp. 404-417. doi: 10.1007/11744023_32
- [9] J. Zhu, R. San-Segundo, J. M. Pardo, "Feature extraction for robust physical activity recognition," *Human-centric Computing and Information Sciences*, vol. 7, no. 16, 2017. doi: 10.1186/s13673-017-0097-2
- [10] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, "Gradient-based learning applied to document recognition," *Proceedings of the IEEE*, vol. 86, no. 11, pp. 2278-2324, 1998.
- [11] H. Su, S. Maji, E. Kalogerakis, E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in *Proceedings of the IEEE International Conference on Computer Vision*, Santiago, Chile, 2015, pp. 945-953.
- [12] G. E. Hinton, R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," *Science*, vol. 313, no. 5786, pp. 504-507, 2006. doi: 10.1126/science.1127647
- [13] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional neural networks," *Advances in Neural Information Processing Systems*, vol. 25, pp. 1097-1105, 2012. doi: 10.1145/3065386
- [14] M. D. Zeiler, R. Fergus, "Visualizing and understanding convolutional networks," in *Computer Vision-ECCV 2014*. Cham: Springer, pp. 818-833. doi: 10.1007/978-3-319-10590-1_53
- [15] K. He, X. Zhang, S. Ren, J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in *Computer Vision-ECCV 2014*. Cham: Springer, pp. 346-361. doi: 10.1109/TPAMI.2015.2389824
- [16] K. Simonyan and A. Zisserman, 2014 (Online). Available: https://arxiv.org/abs/1409.1556
- [17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, "Going deeper with convolutions," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, Boston, MA, 2015, pp. 1-9.
- [18] M. D. Zeiler and R. Fergus, 2013 (Online). Available: https://arxiv.org/abs/1301.3557
- [19] Z. Zhong, L. Jin, Z. Feng, "Multi-font printed Chinese character recognition using multi-pooling convolutional neural network," in *Proceedings of 2015 13th International Conference on Document Analysis and Recognition (ICDAR)*, Tunis, Tunisia, 2015, pp. 96-100.
- [20] C. Y. Lee, P. W. Gallagher, Z. Tu, "Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree," in *Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS)*, Cadiz, Spain, 2016, pp. 464-472.
- [21] M. Zouina, B. Outtaj, "A novel lightweight URL phishing detection system using SVM and similarity index," *Human-centric Computing and Information Sciences*, vol. 7, no. 17, 2017. doi: 10.1186/s13673-017-0098-1
- [22] The Princeton ModelNet (Online). Available: http://modelnet.cs.princeton.edu
- [23] McGill 3D Shape Benchmark (Online). Available: http://www.cim.mcgill.ca/~shape/benchMark
- [24] A. Vedaldi, K. Lenc, "Matconvnet: Convolutional neural networks for matlab," in *Proceedings of the 23rd ACM International Conference on Multimedia*, Brisbane, Australia, 2015, pp. 689-692.
- [25] D. Y. Chen, X. P. Tian, Y. T. Shen, M. Ouhyoung, "On visual similarity based 3D model retrieval," *Computer Graphics Forum*, vol. 22, no. 3, pp. 223-232, 2003. doi: 10.1111/1467-8659.00669
- [26] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao, "3D ShapeNets: a deep representation for volumetric shapes," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, Boston, MA, 2015, pp. 1912-1920.
- [27] L. Wan, T. T. Wong, C. S. Leung, "Isocube spherical mapping," *Journal of Computer-Aided Design & Computer Graphics*, vol. 20, no. 8, pp. 978-985, 2008.
- [28] H. Tabia, H. Laga, D. Picard, P. H. Gosselin, "Covariance descriptors for 3D shape matching and retrieval," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, Columbus, OH, 2014, pp. 4185-4192.
- [29] A. Agathos, I. Pratikakis, P. Papadakis, S. J. Perantonis, P. N. Azariadis, N. S. Sapidis, "Retrieval of 3D articulated objects using a graph-based representation," in *Proceedings of the Eurographics Workshop on 3D Object Retrieval (3DOR)*, Munich, Germany, 2009, pp. 29-36.
- [30] H. Tabia, D. Picard, H. Laga, P. H. Gosselin, "Compact vectors of locally aggregated tensors for 3D shape retrieval," in *Proceedings of the Eurographics Workshop on 3D Object Retrieval (3DOR)*, Girona, Spain, 2013, pp. 17-24.
- [31] P. Papadakis, I. Pratikakis, T. Theoharis, G. Passalis, S. Perantonis, "3D object retrieval using an efficient and compact hybrid shape descriptor," in *Proceedings of the Eurographics Workshop on 3D Object Retrieval (3DOR)*, Crete, Greece, 2008, pp. 9-16.