Article Information
Corresponding Author: Ai-jun Xu* (372157342@qq.com)
Xin-mei Wu*, School of Information Engineering, Zhejiang Agriculture and Forestry University, Hangzhou, China; Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Hangzhou, China; Key Laboratory of State Forestry and Grassland Administration on Forestry Sensing Technology and Intelligent Equipment, Hangzhou, China, 956617584@qq.com
Fang-li Guan**, State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China, ajxu_zafu@163.com
Ai-jun Xu*, School of Information Engineering, Zhejiang Agriculture and Forestry University, Hangzhou, China; Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Hangzhou, China; Key Laboratory of State Forestry and Grassland Administration on Forestry Sensing Technology and Intelligent Equipment, Hangzhou, China, 372157342@qq.com
Received: July 30 2018
Revision received: December 11 2018
Accepted: December 30 2018
Published (Print): February 29 2020
Published (Electronic): February 29 2020
1. Introduction
As a key parameter in object positioning, distance measurement has been widely studied in many areas, such as 3D reconstruction and new military technology for high-technique weapons [1-4]. Traditional ranging methods, such as tapes and total stations, are time- and labor-intensive and inefficient. With the development of laser radar and machine vision, non-contact measurement methods have emerged [5,6]. These methods are mainly divided into active and passive ranging [7-9]. Laser scanning is one of the mainstream active ranging methods [10,11]. It has high measurement accuracy and can be used to describe the 3D structure of an object [12]. However, this kind of active measurement instrument is of limited use to the general public, who are not experts in this field: it requires expert knowledge, which restricts its use in daily practice. Passive ranging can also be realized with machine vision, which estimates distance and obtains object size through image pixel information and camera imaging principles [13,14]. It has the advantages of rich image information and low cost. Machine vision measurement includes both monocular and binocular vision measurements [15,16]. Early image information extraction methods were mostly based on the binocular stereo vision principle or on camera motion information, and required multiple images to extract depth [17-19]. In contrast, the monocular vision method does not require strict hardware conditions during image acquisition and allows for device integration. Recently, research on information extraction based on monocular vision has progressed gradually. Liu et al. [20] designed a method to extract a depth map from video based on non-parametric fusion of multiple cues, combining the image contour, geometrical perspective, and the space-time correlation among contours to estimate a more accurate video depth. The depth information of the whole image could be obtained from monocular depth cues, and the algorithm had a simple structure. However, its application may be limited because it needs prior information such as the scene structure of the image.
Calibration methods based on a monocular vision system, which involve the camera's intrinsic and extrinsic parameters, can also be used to obtain depth information [21-23]. Combined with a camera projection model, camera calibration can be used to study the conversion relationship between the image coordinate system and the world coordinate system. This method requires more than three checkerboard images in different orientations, and records the corresponding coordinates of each point in the world coordinate system and the image coordinate system. Thus, calibration has a great influence on measurement accuracy. Wu et al. [24] established a mathematical model to fit the mapping relationship between object distance and pixel, and used this relationship to extract depth. The accuracy of this method may be affected by long-distance measurement and data fitting. Huang et al. [25] proposed a method to obtain depth information by detecting the corner points of a vertical checkerboard image and establishing the mapping relationship between the ordinate pixel and the actual imaging angle. Because different cameras have different intrinsic parameters, the model established by this method had poor applicability and could not calculate the target distance in an arbitrary direction.
Based on the above analysis of depth extraction and distance measurement methods, given the target contour, we present a method for depth estimation and passive ranging. To investigate the mapping relationship between the ordinate pixel of an image point and the actual imaging angle of its corresponding object point, we do the following work. First, we combine the corner detection method proposed by Andreas Geiger with the cornerSubPix() function provided by OpenCV to extract sub-pixel corners. Then, Pearson correlation analysis is used to verify the relationship between the actual imaging angle of the object point and the ordinate pixel of the corresponding image point for different models of cameras and rotation angles. Experimental results show that, given the same abscissas, the ordinates of the image points are linearly related to their actual imaging angles. According to this principle, the actual imaging angles and ordinate pixels of special conjugate points are substituted into an assumed linear function to calculate its constant coefficients, and a depth extraction model suitable for different models of smartphones is established. Furthermore, by substituting the intrinsic parameters of the camera and the ordinate pixel of the target point into the camera calibration model, we can calculate the depth of any image point. Finally, the vertical distance from the target object to the optical axis of the camera is calculated by the principle of the camera stereo imaging system, and the distance from the target object to the camera is calculated according to the Pythagorean theorem. The main difference between our study and those mentioned above is that depth can be measured with a smartphone, which is portable and flexible; this makes the method practical for daily work. The research is also of great significance for autonomous obstacle avoidance and path planning of unmanned vehicles on horizontal roads, remote monitoring of unmanned sweeping vehicles, and automatic measurement of tree factors in forestry resource surveys.
2. Principle of Passive Ranging based on Monocular Vision System
The image is collected by a smartphone camera. To calculate the distance from any point on the horizontal ground to the camera, we first use the Pearson correlation analysis method to show that, given the same abscissas, the ordinates of the image points are linearly related to their actual imaging angles. Then, we establish a depth extraction model by assuming a linear function and substituting the actual imaging angles and ordinate pixels of the special conjugate points into the function. Furthermore, by substituting the intrinsic parameters of the camera obtained from camera calibration and the ordinate pixel into that model, we can calculate the depth of the target object. According to the camera imaging principle, the vertical distance from any point to the optical axis can then be calculated. Finally, the Pythagorean theorem is used to derive the distance from any point to the camera plane.
Fig. 1. Projection geometry model of image acquisition.
The projection geometry model of image acquisition is shown in Fig. 1, where θ denotes half of the camera's field of view, f denotes the focal length, and h denotes the height of the camera. The camera rotation angle β is derived from the gravity sensor of the smartphone, α denotes the object point's actual imaging angle, and D denotes the depth of the target object. As shown in Fig. 1, ignoring nonlinear distortion, the depth of any target object can be derived:
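[TeX:] $$D=h \tan \alpha \qquad (1)$$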
Fig. 2. Diagram of each coordinate system in the pinhole model.
Fig. 2 shows the relationship of each coordinate system in the pinhole model. To calculate the distance from any point to the camera plane, in addition to the depth calculated above, we also need the vertical distance [TeX:] $$T_{x}$$ from the target object to the optical axis (in Fig. 2, [TeX:] $$T_{x}$$ denotes the distance from the target object to its corresponding virtual object placed on the optical axis). Then [TeX:] $$T_{x}$$ can be expressed as:
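[TeX:] $$T_{x}=\frac{D \cdot d \cdot d_{x}}{f} \qquad (2)$$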
where d denotes the parallax between the target object and its corresponding virtual object in the image plane. According to the Pythagorean theorem, we can calculate the distance L from the target object to the camera:
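[TeX:] $$L=\sqrt{D^{2}+T_{x}^{2}} \qquad (3)$$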
3. Checkerboard Design and Corner Detection
3.1 Design of Checkerboard
When detecting and extracting corners, perspective transformation may lead to inaccurate corner detection and extraction results. To counter this problem, we improve the traditional checkerboard [26] by fixing the cell width and increasing the cell length. Empirically, this checkerboard is found to be sufficient for corner detection over a wide range of perspective transformations.
We extract the corners of a traditional checkerboard, whose cells have equal length and width, tiled horizontally on the ground, and analyze the relationship between the distances of each two adjacent rows and the ordinate pixels of the corners in the same line. Then we can calculate the cell length increment according to this relationship. Any two adjacent rows of the new checkerboard have an equal pixel difference in the image (the image is acquired when the camera rotation angle equals 0°). This checkerboard improves the accuracy of long-distance corner extraction under a wide range of perspective transformations.
In order to calculate the length increment between two adjacent rows, we design six groups of experiments and extract corners from a traditional checkerboard that contains [TeX:] $$45 \times 45$$ mm cells. Then we calculate the actual distance per unit pixel between two adjacent rows in the world coordinate system. To keep the ordinate pixel difference of any two adjacent rows the same, the length of each cell [TeX:] $$y_{i}$$ in the new checkerboard is as shown in Table 1. Let [TeX:] $$x_{i}$$ be the distance from the ith row of corners in the traditional checkerboard to the camera; then the length difference [TeX:] $$\Delta d_{i}$$ of two adjacent rows can be expressed as:
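[TeX:] $$\Delta d_{i}=y_{i+1}-y_{i} \qquad (4)$$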
Suppose the relationship between [TeX:] $$y_{i} \text { and } x_{i} \text { is } f(x)$$; according to formula (4), we can get:
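[TeX:] $$\Delta d_{i}=f\left(x_{i+1}\right)-f\left(x_{i}\right) \approx f^{\prime}(x)\left(x_{i+1}-x_{i}\right) \qquad (5)$$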
Table 1. Computed length of each cell
According to Pearson correlation analysis, there is a highly significant linear correlation between the length of each cell in the new checkerboard and the distance from the ith row of corners in the traditional checkerboard to the camera [TeX:] $$(p<0.01),$$ with a correlation coefficient r of 0.975. The least squares method is used to estimate the derivative of [TeX:] $$f(x), \text { giving } f^{\prime}(x)=0.262.$$
Therefore, when the checkerboard's first row contains [TeX:] $$d \times d \mathrm{~mm}$$ cells, the width of the remaining rows is fixed, and the length difference [TeX:] $$\Delta d$$ between two adjacent rows is [TeX:] $$0.262 \times d \mathrm{~mm}.$$ The new checkerboard is shown in Fig. 3. Corners of this checkerboard can be accurately extracted, and the influence of the perspective transformation is reasonably avoided.
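The layout of the new checkerboard can be sketched as follows. This is a minimal illustration assuming a 45 mm first-row cell and the fitted slope of 0.262; the row count is an arbitrary assumption:

```cpp
#include <cstdio>

// Hedged sketch of the new checkerboard layout (Section 3.1): the cell
// width d is fixed, while the cell length grows by 0.262*d per row, so
// adjacent rows keep an equal ordinate pixel difference in the image.
int main() {
    const double d = 45.0;       // cell width and first-row length (mm); assumption
    const double slope = 0.262;  // fitted derivative f'(x) from Section 3.1
    double length = d;
    for (int row = 1; row <= 10; ++row) {  // 10 rows chosen for illustration
        std::printf("row %2d: %.1f mm x %.1f mm\n", row, d, length);
        length += slope * d;     // constant increment per row, Delta d = 0.262*d
    }
    return 0;
}
```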
Fig. 4. Implementation process of the corner detection algorithm.
3.2 Corner Detection Algorithm
When taking photos on horizontal ground, traditional corner detection algorithms, such as those of Harris and Stephens [27] and Shi [28], are not robust to the resulting perspective transformations. Additionally, these methods fail to detect corners when the smartphone rotates at a large angle. Therefore, we optimize Geiger's corner detection method [29] to extract sub-pixel corners. The implementation process of the corner detection algorithm is shown in Fig. 4.
The algorithm does not require the sizes of the cells or the checkerboard when detecting corners, and it is robust when extracting corners from images with high distortion. The corner extraction results are shown in Table 2.
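A minimal sketch of the sub-pixel refinement step follows, using OpenCV's built-in findChessboardCorners() as a stand-in for Geiger's detector; the file name and pattern size are illustrative assumptions:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat gray = cv::imread("checkerboard.jpg", cv::IMREAD_GRAYSCALE);
    cv::Size patternSize(8, 6);  // inner corners per row/column; assumption
    std::vector<cv::Point2f> corners;
    // Stand-in for Geiger's detector: OpenCV's built-in chessboard detector.
    bool found = cv::findChessboardCorners(gray, patternSize, corners);
    if (found) {
        // Refine the initial corner estimates to sub-pixel accuracy,
        // as done with cornerSubPix() in Section 3.2.
        cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS +
                                          cv::TermCriteria::COUNT, 30, 0.01));
    }
    return found ? 0 : 1;
}
```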
Table 2. Results of sub-pixel corner detection
4. Passive Ranging Model based on Monocular Vision
4.1 Correlation Analysis
Three smartphones, from Xiaomi, Huawei, and Apple (iPhone), are selected to analyze the relationship among the actual imaging angle α of an object point, the ordinate v of the image point, and the rotation angle β of the camera. The camera rotation angles are set to [TeX:] $$-10^{\circ}, 0^{\circ}, 10^{\circ}, 20^{\circ}, 30^{\circ},$$ respectively. The corner detection algorithm described in Section 3 is used to extract pixels, and SPSS version 22 is used to perform regression analysis on these data. The results are shown in Fig. 5: Fig. 5(a) shows the relationship between the ordinate pixels and actual imaging angles for three different models of smartphones when [TeX:] $$\beta=10^{\circ};$$ Fig. 5(b) shows the relationship between ordinate pixel values and imaging angles for different camera rotation angles.
As can be seen from Fig. 5, the actual imaging angle of the object point decreases as the ordinate pixel of the corresponding image point increases, and the relationship between ordinate pixel and actual imaging angle differs across rotation angles and smartphones. Additionally, given the same abscissas, the ordinates of the image points are linearly related to their actual imaging angles, where [TeX:] $$p<0.01$$ and the correlation coefficient [TeX:] $$r \geq 0.99.$$
Fig. 5. Relationships of image ordinate pixels and actual imaging angles: (a) for three different models of smartphones and (b) for different camera rotation angles.
4.2 Passive Ranging Method
4.2.1 Camera intrinsic parameters acquisition
In photogrammetry, to determine the projection transformation between the coordinate systems in the pinhole model, it is necessary to use camera parameters to construct a projection geometric model. We combine Zhang's calibration method [30] with a camera calibration model containing a nonlinear distortion term to calibrate the smartphone camera. This corrects nonlinear distortions and acquires the camera's intrinsic parameters.
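A minimal calibration sketch is given below, under the assumption that OpenCV's calibrateCamera(), which implements Zhang's method with a nonlinear distortion model, is an acceptable stand-in for this step; the board geometry and file pattern are illustrative:

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::Size patternSize(8, 6);     // inner-corner grid; assumption
    const float square = 45.0f;     // cell size in mm; assumption
    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    std::vector<cv::String> files;
    cv::glob("calib_*.jpg", files); // >= 3 views in different orientations
    cv::Size imageSize;
    for (const auto& f : files) {
        cv::Mat gray = cv::imread(f, cv::IMREAD_GRAYSCALE);
        imageSize = gray.size();
        std::vector<cv::Point2f> corners;
        if (!cv::findChessboardCorners(gray, patternSize, corners)) continue;
        cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS +
                                          cv::TermCriteria::COUNT, 30, 0.01));
        // Planar board: world coordinates of the corners with Z = 0.
        std::vector<cv::Point3f> obj;
        for (int i = 0; i < patternSize.height; ++i)
            for (int j = 0; j < patternSize.width; ++j)
                obj.emplace_back(j * square, i * square, 0.0f);
        objectPoints.push_back(obj);
        imagePoints.push_back(corners);
    }
    cv::Mat K, dist;                // intrinsic matrix and distortion terms
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, imageSize, K, dist,
                        rvecs, tvecs);
    std::cout << "K = " << K << "\ndist = " << dist << std::endl;
    return 0;
}
```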
According to the pinhole camera imaging principle, an image point's coordinates in the image plane coordinate system and in the pixel coordinate system are related as follows:
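[TeX:] $$u=\frac{x}{d_{x}}+u_{0}, \quad v=\frac{y}{d_{y}}+v_{0} \qquad (6)$$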
where [TeX:] $$d_{x}, d_{y}(\text { unit: } \mathrm{mm})$$ denote the width and height of a pixel on the image plane, respectively. Since a pixel projected on the image plane is a rectangle, its length and width cannot be kept consistent, so [TeX:] $$d_{x}$$ is not equal to [TeX:] $$d_{y}$$. [TeX:] $$\left(u_{0}, v_{0}\right)$$ denotes the origin o of the image plane coordinate system in the pixel coordinate system. In the camera coordinate system, a point [TeX:] $$P_{c}\left(X_{c}, Y_{c}, Z_{c}\right)$$ is projected onto the image coordinate system at [TeX:] $$(x, y, f).$$ The image plane is perpendicular to the optical axis, and the distance from the origin to the image plane is f. According to the principle of similar triangles, we get:
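[TeX:] $$x=\frac{f X_{c}}{Z_{c}}, \quad y=\frac{f Y_{c}}{Z_{c}} \qquad (7)$$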
The transformation from the world coordinate system [TeX:] $$P_{W}\left(X_{W}, Y_{W}, Z_{W}\right)$$ to the camera coordinate system [TeX:] $$P_{c}$$ is a rigid body motion consisting of translation and rotation. Thus, from the world coordinate system to the camera coordinate system:
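[TeX:] $$\begin{bmatrix} X_{c} \\ Y_{c} \\ Z_{c} \end{bmatrix}=\boldsymbol{R}\begin{bmatrix} X_{W} \\ Y_{W} \\ Z_{W} \end{bmatrix}+\boldsymbol{T} \qquad (8)$$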
Combining Eqs. (6) to (8), the relationship between the coordinate systems can be expressed in homogeneous coordinates and matrix form as:
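[TeX:] $$Z_{c}\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}=\boldsymbol{M}_{\mathrm{int}} \boldsymbol{M}_{\mathrm{ext}}\begin{bmatrix} X_{W} \\ Y_{W} \\ Z_{W} \\ 1 \end{bmatrix}, \quad \boldsymbol{M}_{\mathrm{int}}=\begin{bmatrix} f / d_{x} & 0 & u_{0} \\ 0 & f / d_{y} & v_{0} \\ 0 & 0 & 1 \end{bmatrix}, \quad \boldsymbol{M}_{\mathrm{ext}}=\left[\boldsymbol{R} \ \boldsymbol{T}\right] \qquad (9)$$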
where [TeX:] $$\boldsymbol{M}_{\mathrm{int}}$$ denotes the camera intrinsic parameter matrix and [TeX:] $$\boldsymbol{M}_{\mathrm{ext}}$$ the extrinsic parameter matrix. The camera extrinsic parameters include the rotation matrix R and the translation matrix T.
4.2.2 Depth extraction model
For different models of smartphones and camera rotation angles, the image points' ordinates and the actual imaging angles of the corresponding object points are strongly negatively linearly related. Thus, we get:
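[TeX:] $$\alpha=k v+b, \quad k<0 \qquad (10)$$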
The constant coefficients k and b are related to the camera rotation angle β. The camera projection geometric model is shown in Fig. 6. As can be seen from Fig. 6, when an object point is projected at the bottom of the image, its imaging angle α takes the minimum value [TeX:] $$90^{\circ}-\theta-\beta,$$ while v takes the effective number of pixels [TeX:] $$v_{\max}$$ in the column coordinate of the image sensor. Then, we have:
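[TeX:] $$\alpha_{\min }=90^{\circ}-\theta-\beta=k v_{\max }+b \qquad (11)$$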
Fig. 6. Projection geometry model of shoot: (a) projection geometry model of shoot with FOV above the horizontal line and (b) projection geometry model of shoot with FOV below the horizontal line.
When [TeX:] $$\alpha_{\min }+2 \theta>90^{\circ},$$ the field of view (FOV) of the camera extends above the horizontal line, and the projection geometry model of shoot is shown in Fig. 6(a): α takes the maximum value [TeX:] $$90^{\circ}$$ as v approaches [TeX:] $$v_{0}-f \tan \beta$$. If the camera rotates counterclockwise, α and v take the same values. Additionally, when [TeX:] $$\alpha_{\min }+2 \theta<90^{\circ},$$ the FOV is below the horizon, and the projection geometry model of shoot is shown in Fig. 6(b): the maximum actual imaging angle is [TeX:] $$\alpha_{\max }=90^{\circ}-\beta+\theta, \text { at } v=0$$. Therefore, substituting into formula (10) results in:
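With f expressed in pixel units, solving for the coefficients in the two cases gives:

[TeX:] $$k=\begin{cases}\frac{\alpha_{\min }-90^{\circ}}{v_{\max }-\left(v_{0}-f \tan \beta\right)}, & \alpha_{\min }+2 \theta>90^{\circ} \\ -\frac{2 \theta}{v_{\max }}, & \alpha_{\min }+2 \theta<90^{\circ}\end{cases} \quad b=\begin{cases}90^{\circ}-k\left(v_{0}-f \tan \beta\right), & \alpha_{\min }+2 \theta>90^{\circ} \\ 90^{\circ}+\theta-\beta, & \alpha_{\min }+2 \theta<90^{\circ}\end{cases} \qquad (12)$$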
According to the construction principle of the pinhole camera, the tangent of θ is equal to half the length of the camera's CMOS or CCD image sensor [TeX:] $$L_{\mathrm{CMOS}}$$ divided by the camera focal length f. The physical units are converted into pixel units to calculate θ:
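[TeX:] $$\tan \theta=\frac{L_{\mathrm{CMOS}}}{2 f}=\frac{v_{\max } / 2}{f_{y}} \qquad (13)$$

where [TeX:] $$f_{y}$$ is taken as the focal length in pixel units along the image ordinate.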
Therefore, combining (10)~(13), [TeX:] $$F(\alpha, \beta)$$ can be obtained:
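[TeX:] $$\alpha=k v+b+\delta \qquad (14)$$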
The imaging principle of a smartphone camera lens is pinhole imaging, in which the object point, the image point, and the camera optical center lie on one line. However, because of manufacturing error, it is actually not an ideal linear model, which leads to nonlinear distortion of the image; [TeX:] $$\delta$$ in Eq. (14) is the corresponding distortion parameter.
Then, the depth extraction model can be established by combining formulas (14) and (1):
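[TeX:] $$D=h \tan (k v+b+\delta) \qquad (15)$$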
4.2.3 Distance measurement
Based on the depth of the target object derived above, we also need to calculate the vertical distance [TeX:] $$T_{x}$$ from the target object to the optical axis. Fig. 7 is a schematic diagram of a camera stereo imaging system, where point P denotes the camera position and line segment AB is parallel to the image plane. Let the coordinates of A be (X, Y, Z) in the camera coordinate system, and the coordinates of point B be [TeX:] $$\left(X+T_{x}, Y, Z\right)$$. Points A and B are projected onto the image plane at [TeX:] $$A^{\prime}\left(x_{l}, y_{l}\right) \text { and } B^{\prime}\left(x_{r}, y_{r}\right).$$ According to formula (7):
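[TeX:] $$x_{l}=\frac{f X}{Z}, \quad x_{r}=\frac{f\left(X+T_{x}\right)}{Z} \qquad (16)$$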
Combining Eqs. (6) and (16), the horizontal parallax d of the two points [TeX:] $$A^{\prime} \text { and } B^{\prime},$$ which share the same Y coordinate and depth, can be expressed as:
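[TeX:] $$d=\frac{x_{r}-x_{l}}{d_{x}}=\frac{f T_{x}}{Z d_{x}} \qquad (17)$$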
Therefore, given the camera focal length f, the image center point [TeX:] $$\left(u_{0}, v_{0}\right),$$ the physical size [TeX:] $$d_{x}$$ of each pixel along the x-axis of the image plane, and the depth of the target object, the vertical distance [TeX:] $$T_{x}$$ from the target object to the optical axis can be calculated:
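[TeX:] $$T_{x}=\frac{Z d d_{x}}{f}=\frac{D d d_{x}}{f} \qquad (18)$$

where Z is identified with the depth D recovered above.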
According to formula (3), we can obtain the distance L from the target object to the projection point of the camera:
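[TeX:] $$L=\sqrt{D^{2}+T_{x}^{2}} \qquad (19)$$

The whole pipeline of Eqs. (10)–(13), (1), (18), and (3) can be summarized in a short sketch. This is a hedged illustration that ignores the distortion term δ; all parameter names are assumptions (fx, fy, u0, v0 from calibration, vMax the pixel count along the image ordinate, h the camera height, beta the rotation angle in radians):

```cpp
#include <cmath>

// Hedged sketch of the ranging pipeline (Sections 4.2.2-4.2.3),
// ignoring the distortion term delta.
struct RangeResult { double depth, tx, dist; };

RangeResult rangePoint(double u, double v, double h, double beta,
                       double fx, double fy, double u0, double v0,
                       double vMax) {
    const double PI = std::acos(-1.0);
    // Half field of view from the sensor size in pixel units (Eq. (13)).
    double theta = std::atan((vMax / 2.0) / fy);
    double alphaMin = PI / 2.0 - theta - beta;  // object at image bottom
    double k, b;
    if (alphaMin + 2.0 * theta > PI / 2.0) {
        // FOV extends above the horizon (Fig. 6(a)): alpha -> 90 deg
        // as v approaches v0 - f*tan(beta).
        double vHorizon = v0 - fy * std::tan(beta);
        k = (alphaMin - PI / 2.0) / (vMax - vHorizon);
        b = PI / 2.0 - k * vHorizon;
    } else {
        // FOV entirely below the horizon (Fig. 6(b)):
        // alphaMax = 90 deg - beta + theta at v = 0.
        b = PI / 2.0 + theta - beta;
        k = -2.0 * theta / vMax;
    }
    double alpha = k * v + b;                       // Eq. (10)
    double depth = h * std::tan(alpha);             // Eq. (1)
    // Parallax to the virtual point on the optical axis, d = u - u0 (pixels);
    // Tx = depth * d * dx / f = depth * d / fx     (Eq. (18)).
    double tx = depth * (u - u0) / fx;
    double dist = std::sqrt(depth * depth + tx * tx);  // Eq. (3)
    return {depth, tx, dist};
}
```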
Fig. 7. Principle of the camera stereoscopic imaging system.
5. Experiment Result and Discussion
To verify the feasibility and accuracy of the passive ranging method, we conducted experiments using a Xiaomi 3 (MI 3) smartphone. Java combined with C++ was used to write a passive ranging application for smartphones. After the application was written and debugged according to the above method, the accuracy of the depth extraction model and of the passive ranging were verified separately in the laboratory and in a natural environment.
The intrinsic parameters of the camera are [TeX:] $$f_{x}=3486.5637, u_{0}=1569.0383, f_{y}=3497.4652, v_{0}=2107.98988,$$ and the image resolution is [TeX:] $$3120 \times 4208$$. Substituting these parameters into the model gives the camera-specific depth extraction model:
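As a hedged worked example (taking [TeX:] $$v_{\max }=4208$$ and [TeX:] $$f_{y}$$ as the focal length in pixel units, and ignoring the distortion term δ), Eq. (13) gives:

[TeX:] $$\tan \theta=\frac{4208 / 2}{3497.4652} \approx 0.602, \quad \theta \approx 31.0^{\circ}$$

Substituting θ, [TeX:] $$v_{0},$$ and [TeX:] $$v_{\max }$$ into Eqs. (12) and (15) then yields, for [TeX:] $$\beta=0^{\circ},$$ approximately [TeX:] $$D \approx h \tan \left(121.1^{\circ}-0.0148 v\right).$$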
Table 4. Data of depth measurement
5.1 Ranging in Laboratory
In experiment 1, the camera rotation angle [TeX:] $$\beta \text { was } 0^{\circ}$$. In group [TeX:] $$I_{1},$$ the height of the camera [TeX:] $$h_{1}$$ was 305 mm; in group [TeX:] $$I_{2},$$ the height of the camera [TeX:] $$h_{2}$$ was 285 mm. The corner pixels were extracted, and their actual imaging angles and depths were calculated based on the depth extraction model and the ordinate pixels. The experimental data are shown in Table 4. The true depth was measured with a tape. The actual imaging angle of a corner can be calculated from its tangent, which equals the actual depth divided by the camera height. The relative error was obtained by dividing the absolute error (the difference between the calculated depth and the true depth) by the true depth.
From Table 4, we can conclude that the relative error of the depth calculated by the depth extraction model does not exceed 3%. The average relative error of depth is 0.93% at distances from 0.5 to 2.6 m. The errors of the depth extracted by the depth extraction model may be related to many factors, such as the accuracy of the image processing algorithm and different light conditions. In addition, due to the nonlinear distortion of the camera lens, the closer the target object is to the optical axis of the camera, the smaller the image distortion error and the more accurate the measurement, and vice versa. However, Table 4 shows that, within a certain range, the measurement error is random and is acceptable for our subsequent tree DBH and height measurement work.
In experiment 2, the camera rotation angles of experimental groups [TeX:] $$I_{1}, I_{2}, I_{3}, I_{4}, I_{5} \text { were }-10^{\circ}, 0^{\circ}, 10^{\circ}, 20^{\circ} \text { and } 30^{\circ},$$ respectively, and the height of the camera [TeX:] $$h_{1}$$ was 408 mm. We also calculated the root mean square of the relative error (rRMS) of the depth D, the vertical distance [TeX:] $$T_{x},$$ and the distance L under different camera rotation angles. The experimental data are shown in Table 5.
Table 5. Root mean square of the relative error of the measured values for different camera rotation angles
The experimental results show that when the camera rotates counterclockwise, the rRMS values of the depth D, the vertical distance [TeX:] $$T_{x},$$ and the distance L are relatively large; otherwise, they are smaller. This is because once the smartphone camera is rotated counterclockwise, the imaged ground region lies far from the centre of the image and closer to the bottom of the image, where the nonlinear distortion is larger. It is therefore beneficial to measurement accuracy to collect images with the smartphone rotated clockwise. The measurement error is also affected by the height of the camera, the accuracy of the camera intrinsic parameters, and other factors.
5.2 Ranging in a Natural Environment
To verify the accuracy of the passive ranging method in a natural scene, we took five images with the smartphone camera, each containing three target objects. In experiment 3, the camera rotation angle was [TeX:] $$0^{\circ},$$ and the height of the camera h was 1285 mm. The experimental data are shown in Table 6. The experimental results show that the relative errors of this method were no more than 6%, and its average relative error was 1.71% within a range of 3–10 m. Sheng et al. [31] developed an underwater binocular vision ranging system with an average relative error of 2.34%, and Zou and Yuan [32] achieved a relative error of less than 10% for passive ranging based on monocular vision. Therefore, compared with other passive ranging methods based on machine vision, this method has relatively high measurement accuracy. Our method is not as accurate as that reported by Huang et al. [25] (a relative error of less than 3%); however, unlike that method, ours does not need to fit a separate linear relation for every camera model, camera rotation angle, or camera height.
The accuracy of this passive ranging method is directly determined by the accuracy of the depth extraction model and of the [TeX:] $$T_{x}$$ measurement.
Table 6. Measurement accuracy of target object distance
6. Conclusion and Future Work
In this paper, we present a depth extraction model and passive ranging method based on a monocular vision system using a smartphone. First, we use an optimized corner extraction algorithm to detect and extract the sub-pixel corners of a checkerboard with a fixed cell width and increasing cell length, and investigate the linear relationship between the actual imaging angle of the object point and the ordinate pixel of the corresponding image point for different camera rotation angles. It is verified that, given the same abscissas, the ordinates of the image points are linearly related to their actual imaging angles [TeX:] $$(p<0.01, r \geq 0.99).$$ Therefore, by assuming a linear function and substituting the actual imaging angles and ordinate pixels of the special conjugate points (maximum and minimum values) into the linear relationship function, we establish a depth extraction model suitable for various smartphones. Furthermore, an improved camera calibration model with a nonlinear distortion term is used to obtain the distortion parameters and intrinsic parameters of the camera, and the intrinsic parameters are used to calculate the depth of the target object. According to the principle of the camera stereo imaging system, we calculate the vertical distance from the target object to the camera optical axis, and obtain the distance by the Pythagorean theorem. To verify the accuracy of the model, we conduct two sets of experiments covering close and long-distance ranging in the laboratory and in a natural environment. The experimental results show that the average relative error of the depth measurement is 0.937% when the distance is within 0.5–2.6 m, and the relative error of the distance measurement is 1.71% when the distance is 3–10 m. Therefore, this method measures distance with high accuracy.
Compared with other passive ranging methods, this method uses a smartphone to measure distance and extract depth, which is convenient, portable, and easy to use in daily practice. It does not require a large calibration site and avoids errors caused by data fitting. In addition, the intrinsic parameters need to be obtained by camera calibration only once; afterwards, the distance from the target object to the camera can be calculated from a single image. It does not require any calibration objects or objects of known dimensions to be placed in the measured scene. However, when the target object is far away from the camera, perspective transformations reduce the detection accuracy of its contour, and the measurement accuracy may also be affected. To solve this problem, in the next step we will apply deep learning methods to detect and extract more precise object contours. Moreover, this technique can further serve as the basis for measuring an object's height and width. Therefore, in future work we will also use this method to measure the 3D information of an object.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 31670641, The research of tree's height and DBH measurement method based on the intelligent mobile terminal) and the Zhejiang Science and Technology Key R&D Program (No. 2018C02013).