**An Automatic Urban Function District Division Method Based on Big Data Analysis of POI**

Hao Guo*, Haiqing Liu**, Shengli Wang*** and Yu Zhang**

## Abstract

**Abstract:** Along with the rapid development of the economy, cities have expanded rapidly, leading to the formation of different types of urban function districts (UFDs), such as central business, residential, and industrial districts. Recognizing the spatial distribution of these districts is important for managing the evolving role of urban planning and for developing reliable urban planning programs. In this paper, we propose an automatic UFD division method based on big data analysis of point of interest (POI) data. Because POI data are unevenly distributed in geographic space, a dichotomy-based data retrieval method is used to improve the efficiency of the data crawling process. Further, a POI spatial feature analysis method based on the mean shift algorithm is proposed, in which data points with similar attributive characteristics are clustered to form the function districts. The proposed method was tested in an actual urban scenario, where the division result agreed with the practical situation at a similarity index of 88.4%, indicating a reasonable UFD division.

**Keywords:** Big Data Analysis, Dichotomy Method, Mean Shift Algorithm, POI Data, Urban Function District

## 1. Introduction

With the accelerating urbanization process, cities have expanded, inducing several problems, such as increased traffic congestion and pollution. To solve these problems, reasonable urban planning should be considered before undertaking urban development. The urban function district (UFD) is the most important concept in urban planning for describing urban structures. Specifically, a UFD is a geographic area in which the land use function, intensity, direction, and price are consistent, and the intensive use potential has similar features. UFD division is therefore important for managing the evolving role of urban planning and for developing reliable urban planning programs [1,2].

Conventionally, UFD division is mainly based on investigative methods, such as spot surveys or questionnaires [3]. In recent years, newly developed technologies, such as remote sensing [4,5] and unmanned aerial vehicle (UAV) photography [6,7], have also been used for the task. However, these methods require vast amounts of survey data and consume significant manpower and time. In addition, the division results are affected by the personal experience of the planner and other subjective factors. Furthermore, as urbanization accelerates, it is difficult for planners to find newly formed function districts without obvious characteristics, rendering the traditional investigation methods unsuitable.

As a worldwide technology trend, big data analysis has been widely used in several fields and involves the discovery of useful implicit information in large datasets using data extraction, machine learning, statistics, and visualization. Compared with the traditional methods, big data analysis can provide a highly objective, reasonable, and visualized UFD division result. In recent years, several scholars have applied this new technology to UFD division. Based on the literature review, recent UFD division methods based on big data analysis can be classified into three types: (1) methods based on cellular signaling data, (2) methods based on remote sensing images and point of interest (POI) data, and (3) methods based on vehicle trajectories, such as the probe data of transit vehicles and taxis.

Researchers have achieved reasonable success with these methods in recent years. For example, using cellular signaling data, Yan et al. [8] presented an urban functional area division method based on GIS clustering analysis. The method was verified using actual data from Changchun City, China, and the results showed excellent performance. Zhang et al. [9] presented a hierarchical semantic cognition (HSC) method to classify functional zones in Beijing. The proposed method relies on geographic cognition and considers visual features, object categories, spatial object patterns, zone functions, and hierarchical relations. It can produce a high overall accuracy of 90.8%. In [10], the authors divided the urban function area with a DBSCAN clustering method using the trajectory data of 4,000 taxis. In this method, the passengers’ pick-up and drop-off states are extracted from the trajectory for the division, and the total accuracy can reach 95%.

In the aforementioned works, the most commonly used methods for big data analysis are clustering algorithms, such as K-means and K-medoids. These algorithms require the number of clusters and the initial clustering centers to be presumed in advance, and different initial centers lead to different division results [11]. In urban scenarios, UFDs may change with growing urbanization, so the clustering centers are generally unknown, especially in newly developed city areas. In addition, certain data resources, such as remote sensing images, are difficult to acquire and process. Cellular signaling data involve individual privacy and, for legal reasons, are not open to the public. These shortcomings render the traditional methods inadequate for achieving high-efficiency, low-cost, and dynamic UFD division.

In this paper, an automatic UFD division method is proposed based on POI data analysis. To improve the data retrieval efficiency in case of an unbalanced distribution of data points in a geographic space, the dichotomy method was used to optimize the data extraction process by crawling on the Internet. Based on this, the UFD was divided using the mean shift method. The overall work was tested in an actual urban scenario, and the results demonstrate the credible performance of the proposed method.

The remainder of this paper is organized as follows. In Section 2, the dichotomy-based POI data extraction method is presented. In Section 3, the UFD division method using the mean shift algorithm is proposed. The case study, along with the data extraction performance and UFD division results, is presented in Section 4. In Section 5, we conclude the paper.

## 2. POI Data Extraction Method based on Dichotomy

##### 2.1 Introduction of the POI Data

A POI is a specific point location that people may find useful or interesting, such as a store, hotel, campsite, fuel station, or toponym with humanistic significance [12]. The positional information for the POI data is acquired using precision positioning equipment or map orienting. The data contain vast quantities of spatial, attributive, and other useful information, such as comments about a shop and travel photos. This information provides new data support for automatic UFD division [13]. In this study, the POI data were acquired entirely using a web crawler. The acquired POI data were divided into 10 categories and 68 sub-categories. The overall data distribution and the categories of the POI data are presented in Fig. 1 and Table 1, respectively.

Each data item represents the name, longitude, latitude, administrative region, and other information, as shown in Table 2.

##### 2.2 POI Big Data Extraction based on Dichotomy Method

When crawling for POI data on the web, a rectangular geographic region, expressed by four coordinates, was taken as the input. In this region, the maximum tolerant crawling number of data was limited owing to the performance of the interface. In an actual scenario, the map structure is not regular, and the POI samples are not distributed uniformly. Hence, it is difficult to guarantee that the actual number of POI data in the selected rectangular geographic region exactly matches the maximum tolerant crawling number. If the configuration of a geographic region is too small or the POI samples are sparsely distributed, less than the maximum quantity of POI data is acquired for each crawling step. Consequently, the complexity of the data extraction algorithm will increase to obtain the fully sampled POI data. On the contrary, if the configuration of a geographic region is very large or the POI samples are densely distributed, the POI data may exceed the maximum tolerant crawling number. In this case, a few samples will not be successfully extracted, resulting in an incomplete data sample.

To solve the aforementioned problems, a POI data extraction optimization method based on dichotomy was proposed [14]. In this study, we constructed a two-dimensional coordinate system and defined the rectangular geographic region by four points, as shown in Eq. (1) and Fig. 2.

##### (1)

[TeX:] $$R=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{1}\right),\left(x_{1}, y_{2}\right),\left(x_{2}, y_{2}\right)\right\}$$

In Fig. 2, each node in the rectangle A-B-C-D denotes a full record of the POI data. In this study, [TeX:] $$N_{\max }$$ represented the maximum number of POI data for each step of the crawling operation. Initially, the rectangle was configured to completely cover the entire geographic area for analysis. In this case, the number of POI data significantly exceeded [TeX:] $$N_{\max }$$ and the rectangle was further divided to satisfy the data extraction demand. In the initial rectangle, the midpoints of the opposite sides were line-connected to divide the rectangle into quarters. The entire geographic area can be further expressed as a total of four sub-areas, as shown in Eq. (2).

##### (2)

[TeX:] $$R=\left\{\begin{array}{l} R_{1}=\left\{\left(x_{1}, y_{1}\right),\left(x_{m}, y_{1}\right),\left(x_{m}, y_{m}\right),\left(x_{1}, y_{m}\right)\right\} \\ R_{2}=\left\{\left(x_{m}, y_{1}\right),\left(x_{2}, y_{1}\right),\left(x_{2}, y_{m}\right),\left(x_{m}, y_{m}\right)\right\} \\ R_{3}=\left\{\left(x_{1}, y_{m}\right),\left(x_{m}, y_{m}\right),\left(x_{m}, y_{2}\right),\left(x_{1}, y_{2}\right)\right\} \\ R_{4}=\left\{\left(x_{m}, y_{m}\right),\left(x_{2}, y_{m}\right),\left(x_{2}, y_{2}\right),\left(x_{m}, y_{2}\right)\right\} \end{array}\right.$$

Here [TeX:] $$x_{m}=\frac{x_{1}+x_{2}}{2}$$ and [TeX:] $$y_{m}=\frac{y_{1}+y_{2}}{2}$$ are the midpoint coordinates of the rectangle sides. Further, each sub-area was traversed to check whether the actual number of POI nodes [TeX:] $$N_{R_{i}}$$ exceeded the maximum tolerant crawling number [TeX:] $$N_{\max }.$$ If [TeX:] $$N_{R_{i}} \leq N_{\max },$$ the data crawling operation was executed, and the POI data in the sub-area [TeX:] $$R_{i}$$ were extracted. Otherwise, the rectangular sub-area was further divided into quarters based on the dichotomy method. When the entire geographic area to be analyzed was covered, the data extraction process was considered complete, and the data were stored for further analysis. The design flowchart of the algorithm is presented in Fig. 3.
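The subdivision procedure can be illustrated with a minimal Python sketch. It is not the crawler used in this study: the function names (`split_into_quarters`, `crawl`), the `count_pois` and `fetch_pois` callbacks standing in for the map interface, the half-open rectangle convention, and the `min_size` guard are all assumptions made for illustration, and the POI set is synthetic.

```python
import random
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2
Point = Tuple[float, float]


def split_into_quarters(rect: Rect) -> List[Rect]:
    """Connect the midpoints of opposite sides to obtain four sub-rectangles."""
    x1, y1, x2, y2 = rect
    xm, ym = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return [(x1, y1, xm, ym), (xm, y1, x2, ym),
            (x1, ym, xm, y2), (xm, ym, x2, y2)]


def crawl(rect: Rect,
          count_pois: Callable[[Rect], int],
          fetch_pois: Callable[[Rect], List[Point]],
          n_max: int,
          min_size: float = 1e-9) -> List[Point]:
    """Subdivide until every fetched rectangle holds at most n_max POIs."""
    pending = [rect]
    collected: List[Point] = []
    while pending:
        current = pending.pop()
        x1, y1, x2, y2 = current
        # Guard (an assumption, not part of the described method): stop
        # splitting tiny cells so that many POIs sharing one coordinate
        # cannot cause endless subdivision.
        too_small = (x2 - x1) <= min_size or (y2 - y1) <= min_size
        if count_pois(current) <= n_max or too_small:
            collected.extend(fetch_pois(current))
        else:
            pending.extend(split_into_quarters(current))
    return collected


# Synthetic, unevenly distributed POIs: a sparse background plus a dense core.
random.seed(0)
points: List[Point] = [(random.random(), random.random()) for _ in range(200)]
points += [(0.4 + 0.1 * random.random(), 0.4 + 0.1 * random.random())
           for _ in range(800)]


def inside(p: Point, r: Rect) -> bool:
    # Half-open cells, so a point on a shared edge belongs to exactly one cell.
    return r[0] <= p[0] < r[2] and r[1] <= p[1] < r[3]


def count_pois(r: Rect) -> int:
    return sum(inside(p, r) for p in points)


def fetch_pois(r: Rect) -> List[Point]:
    return [p for p in points if inside(p, r)]


pois = crawl((0.0, 0.0, 1.0, 1.0), count_pois, fetch_pois, n_max=50)
print(len(pois), "POIs retrieved out of", len(points))
```

Dense areas are split repeatedly while sparse areas are fetched as large cells, which is the behavior that makes the method suitable for unevenly distributed POI data.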

## 3. UFD Division Method based on Mean Shift Algorithm

Mean shift is a non-parametric and mode-seeking feature-space analysis method widely used in computer vision and image processing applications [15]. In the iterative procedure, each data point shifts to the average of its neighborhood based on the location distribution and attributive characteristics. A sample mean shift vector is described below.

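Consider n data points [TeX:] $$x_{i}, i=1, \ldots, n,$$ in a d-dimensional Euclidean space [TeX:] $$R^{d}.$$ The sample mean at x is calculated using Eq. (3).

##### (3)

[TeX:] $$M_{h}(x)=\frac{1}{k} \sum_{x_{i} \in S_{h}}\left(x_{i}-x\right)$$

Here [TeX:] $$S_{h}$$ is a generalized d-dimensional sphere with radius h, as shown by Eq. (4):

##### (4)

[TeX:] $$S_{h}(x)=\left\{y:(y-x)^{T}(y-x) \leq h^{2}\right\}$$

k in Eq. (3) denotes the number of the n data points located in [TeX:] $$S_{h}$$.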

Considering the weights of the distance between nodes and the linking of a node itself, the sample mean can be expanded as given in Eq. (5).

##### (5)

[TeX:] $$M_{h}(x) \equiv \frac{\sum_{i=1}^{n} G_{H}\left(\frac{x_{i}-x}{h}\right) w\left(x_{i}\right)\left(x_{i}-x\right)}{\sum_{i=1}^{n} G_{H}\left(\frac{x_{i}-x}{h}\right) w\left(x_{i}\right)}$$

In Eq. (5), [TeX:] $$w\left(x_{i}\right)$$ is the weight function, and [TeX:] $$G_{H}$$ is expressed by Eq. (6).

##### (6)

[TeX:] $$G_{H}\left(\frac{x_{i}-x}{h}\right)=|H|^{-1 / 2} G\left(H^{-1 / 2}\left(\frac{x_{i}-x}{h}\right)\right)$$

where [TeX:] $$G(x)$$ is the kernel and H is a positive definite symmetric matrix, that is, the bandwidth matrix.

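The mean shift algorithm commonly uses two kernels: the uniform kernel, as shown by Eq. (7),

##### (7)

[TeX:] $$F(x)=\left\{\begin{array}{ll} 1 & \text { if }\|x\|<\lambda \\ 0 & \text { if }\|x\| \geq \lambda \end{array}\right.$$

and the truncated Gaussian kernel, as shown by Eq. (8).

##### (8)

[TeX:] $$G(x)=\left\{\begin{array}{ll} e^{-\beta\|x\|^{2}} & \text { if }\|x\|<\lambda \\ 0 & \text { if }\|x\| \geq \lambda \end{array}\right.$$

where [TeX:] $$\lambda$$ is the truncation radius and [TeX:] $$\beta$$ controls how quickly the Gaussian decays.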

The value variations of the two kernels are presented in Fig. 4.

In UFD division, the geographical distance between two POIs presents a positive correlation with the possibility that they may be gathered together and classified into the same UFD. Hence, the shorter the distance from the clustering center, the larger the weight value that should be assigned. To achieve this, the truncated Gaussian kernel was selected for building the mean shift model in this study.
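The difference between the two kernels can be made concrete with a short Python sketch. The truncated Gaussian is written here in its usual form, a Gaussian profile cut off at radius λ. The decay parameter `beta`, the function names, and the sample values are assumptions chosen for illustration, not values taken from this study.

```python
import math


def uniform_kernel(norm: float, lam: float) -> float:
    """Uniform kernel: every point inside radius lam counts equally."""
    return 1.0 if norm < lam else 0.0


def truncated_gaussian_kernel(norm: float, lam: float, beta: float = 1.0) -> float:
    """Truncated Gaussian: weight decays with distance, zero beyond lam.

    beta is a hypothetical decay parameter used only for this illustration.
    """
    return math.exp(-beta * norm * norm) if norm < lam else 0.0


# Compare the two kernels at a few distances from the clustering center.
for distance in (0.0, 0.5, 0.9, 1.5):
    print(distance,
          uniform_kernel(distance, lam=1.0),
          round(truncated_gaussian_kernel(distance, lam=1.0), 4))
```

Inside the truncation radius the uniform kernel gives a near and a far POI the same influence, whereas the truncated Gaussian gives the nearer POI a larger one, which is the property motivating its selection here.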

As presented in Table 1, urban POIs can be classified into 10 broad categories. However, no quantitative description is available for the attribute information, which has a decisive effect on UFD division. In [16], the authors proposed a method to extract hierarchical landmarks from urban POI data according to their significant attributes. In that work, a significance measure model comprising three vectors, namely, public cognition degree, urban centrality degree, and characteristic attribute value, was constructed by analyzing the factors influencing the significance of POI objects from public cognition, spatial distribution, and individual characteristics. In this study, we refer to the aforementioned conclusions to build the UFD division model. The weight values [TeX:] $$w\left(x_{i}\right)$$ for different types of POI were assigned as shown in Table 3.

Given kernel [TeX:] $$G(x)$$ and weight [TeX:] $$w(x),$$ Eq. (5) can be further expressed as Eq. (9).

##### (9)

[TeX:] $$M_{h}(x)=\frac{\sum_{i=1}^{n} G\left(\frac{x_{i}-x}{h}\right) w\left(x_{i}\right) x_{i}}{\sum_{i=1}^{n} G\left(\frac{x_{i}-x}{h}\right) w\left(x_{i}\right)}-x$$

Let:

##### (10)

[TeX:] $$m_{h}(x)=\frac{\sum_{i=1}^{n} G\left(\frac{x_{i}-x}{h}\right) w\left(x_{i}\right) x_{i}}{\sum_{i=1}^{n} G\left(\frac{x_{i}-x}{h}\right) w\left(x_{i}\right)}$$

The UFD division method based on the mean shift algorithm comprises the following steps:

**Initialization:** Each POI data node in the whole sample is defined as the initial point and minimum tolerance error is set as the convergence condition of the mean shift algorithm.

**Step 1:** Calculate [TeX:] $$m_{h}(x).$$

**Step 2:** Assign [TeX:] $$m_{h}(x) \text { to } x.$$

**Step 3:** If the shift [TeX:] $$\left\|m_{h}(x)-x\right\|$$ measured before the assignment is smaller than [TeX:] $$\varepsilon,$$ terminate the iteration. Otherwise, return to Step 1.

**Step 4:** Take each POI node as the initial point and execute the preceding steps. Finally, the centers of the UFDs will be obtained.
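The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the kernel is a truncated Gaussian with hypothetical parameters `lam` and `beta`, the weights are made-up values rather than those of Table 3, and the rule that merges nearby converged points into one center (`merge_tol`) is an added assumption, since the paper does not describe that step.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kernel(u: float, lam: float = 1.0, beta: float = 1.0) -> float:
    """Truncated Gaussian evaluated at the scaled distance u = ||x_i - x|| / h."""
    return math.exp(-beta * u * u) if u < lam else 0.0


def weighted_mean(x: Point, points: Sequence[Point],
                  weights: Sequence[float], h: float) -> Point:
    """m_h(x): kernel- and weight-averaged position of the neighbors of x."""
    num_x = num_y = den = 0.0
    for (px, py), w in zip(points, weights):
        g = kernel(math.hypot(px - x[0], py - x[1]) / h) * w
        num_x += g * px
        num_y += g * py
        den += g
    if den == 0.0:  # no neighbor inside the kernel support: stay in place
        return x
    return (num_x / den, num_y / den)


def mean_shift(start: Point, points: Sequence[Point], weights: Sequence[float],
               h: float, eps: float = 1e-6, max_iter: int = 500) -> Point:
    """Steps 1 to 3: shift x to m_h(x) until the shift is smaller than eps."""
    x = start
    for _ in range(max_iter):
        m = weighted_mean(x, points, weights, h)        # Step 1
        shift = math.hypot(m[0] - x[0], m[1] - x[1])
        x = m                                           # Step 2
        if shift < eps:                                 # Step 3
            break
    return x


def find_centers(points: Sequence[Point], weights: Sequence[float], h: float,
                 merge_tol: float) -> Tuple[List[Point], List[int]]:
    """Step 4: start from every POI, then merge converged points that coincide."""
    centers: List[Point] = []
    labels: List[int] = []
    for p in points:
        mode = mean_shift(p, points, weights, h)
        for idx, c in enumerate(centers):
            if math.hypot(mode[0] - c[0], mode[1] - c[1]) < merge_tol:
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return centers, labels


# Two small synthetic groups of POIs; the first POI carries a larger weight.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
          (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5)]
weights = [3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
centers, labels = find_centers(points, weights, h=2.0, merge_tol=0.1)
print(centers)
print(labels)
```

No number of clusters is supplied: the two centers emerge from the data, and the heavier POI pulls the first center toward itself, which is how the attribute weights influence the division.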

## 4. Case Study

##### 4.1 Performance of Dichotomy-based Data Extraction

To verify the performance of the proposed dichotomy method for POI data retrieval, the traditional equality division method, which divides the entire geographical area into several equal sub-areas, was selected as a contrast. The complexity of the two methods was calculated for two cases under the same sample data conditions. For Case 1, the two methods were used to extract all the data points in the case sample, and the required searches, which denote the algorithm complexity, were compared. For Case 2, the two methods were used to perform the same steps, and the final tally of extracted POI data points was compared. The results are presented in Table 4.

It is evident from Table 4 that the proposed dichotomy method reduces the number of steps required to extract all the data points in the given sample by 40% compared with the traditional equality division method. Moreover, with the same number of steps (228), the traditional equality division method obtained only 9,146 data points, whereas the proposed dichotomy method extracted 12,615, an increase of about 38% (equivalently, the equality division method retrieved about 27% fewer points). We conclude that the dichotomy method applied to POI data retrieval reduces the data extraction complexity and improves the data extraction efficiency.

##### 4.2 UFD Division Result Analysis

The case sample was crawled from the Internet and contained a total of 10 million nodes. All these points were plotted on a map using JavaScript, as shown in Fig. 5. For an intuitive presentation of the points, those with different attributes were assigned different colors.

The results of the UFD division are presented in Fig. 6. All 10 million nodes were classified into 48 regions with different color expressions. It is obvious that our division results show evidence of geographical agglomeration features. The division result also matches the UFD distribution in the official urban master plan (from 2011 to 2020), which is published by the city government and is quite authoritative (Fig. 7).

To further verify the performance of the proposed method, a quantitative similarity index was used to describe the rationality of the divided UFDs. The similarity index was calculated using Eq. (11):

##### (11)

[TeX:] $$S=\frac{\sum_{i=1}^{n} x_{i}}{\sum_{i=1}^{n} X_{i}} \times 100 \%$$

Here, n is the number of UFDs, [TeX:] $$X_{i}$$ is the full score of the similarity of nodes in each UFD, and [TeX:] $$x_{i}$$ is the actual similarity mark [17].

We selected 23 UFDs from a total of 48 to verify the performance. The selected 23 UFDs contained a higher number of nodes and covered a relatively large geographical area, as shown in Table 5. In the table, the compliance score denotes the conformity of the nodes in the same UFD. The highest score is 3, which represents complete conformity. Scores 2, 1, and 0 denote comparative conformity, comparative inconformity, and complete inconformity, respectively.

We can observe from the table that the number of UFDs with complete conformity is 16, comparative conformity is 6, and comparative inconformity is 1. Different UFDs have different compliance scores. For UFDs with a high density of POI data samples, the division results coincide well with the actual scenarios. However, for certain UFDs, such as Lancun Town in Table 5, the compliance score is quite low. This is because only a few POI samples were generally extracted from the Internet for these UFDs. In such cases, the final clustering centers are most vulnerable to the distribution of these limited samples and can induce deviations. In total, the similarity index was 88.41%, referring to Eq. (11). In conclusion, the division results suitably match the actual scenario.
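The reported index can be checked with a few lines of Python. The sketch assumes that the similarity index is the sum of the actual compliance scores divided by the sum of the full scores (3 per UFD); the function name `similarity_index` is hypothetical, and the score counts are those stated above (16 UFDs with score 3, 6 with score 2, and 1 with score 1).

```python
from typing import Sequence


def similarity_index(scores: Sequence[int], full_score: int = 3) -> float:
    """Ratio of the summed actual scores to the summed full scores."""
    return sum(scores) / (full_score * len(scores))


# Compliance scores of the 23 selected UFDs, as counted in the text.
scores = [3] * 16 + [2] * 6 + [1] * 1
index = similarity_index(scores)
print(f"{index:.2%}")  # 88.41%
```

Under this assumption, the 23 UFDs score 61 out of a possible 69 points, which reproduces the reported 88.41%.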

## 5. Conclusion

Currently, UFD division relies on field and questionnaire surveys. This type of work is time-consuming and labor-intensive. Moreover, the division of functional areas is strongly affected by subjective factors, and as urbanization accelerates, urban functional areas become more complex and diverse, increasing the difficulty of field investigation. The method presented in this paper automatically identifies functional areas by clustering POI data with the mean shift algorithm. The required data are simple to acquire and do not demand heavy manpower or material resources. In the case study, the resulting districts closely matched the actual functional areas, with a similarity index of 88.41%. The proposed method can provide information support for city decision-makers to rationally formulate urban planning strategies and address problems arising in the urbanization process.

## Biography

##### Haiqing Liu

https://orcid.org/0000-0003-4094-7541

He received the bachelor’s degree in automation from Central South University, China, in 2008, and the Ph.D. in system engineering from Shandong University, China, in 2015. He is a lecturer in Shandong University of Science and Technology. From 2015 to 2017, he worked as a Post-Doctoral Researcher at the post-doctoral work station of Hisense Group, China. His current research interests include traffic engineering and control, cooperative vehicle infrastructure system and traffic intelligent perception.

##### Shengli Wang

https://orcid.org/0000-0002-8852-8621

He received the Ph.D. degree in instrument science and technology from Southeast University, China, in 2013. He is an associate professor in Shandong University of Science and Technology. From 2015 to 2017, and is the chief engineer of Shandong Astro-compass Information Technology Co. Ltd. His current research interests include traffic engineering and control, cooperative vehicle infrastructure system and co-location.

## References

- 1 J. Wang, C. Li, Z. Xiong, Z. Shan, "Survey of data-centric smart city," *Journal of Computer Research and Development*, vol. 51, no. 2, pp. 239-259, 2014.
- 2 H. W. Zhao, "The application of information surveying and mapping technology in national land survey," *Construction & Design For Project*, vol. 2019, no. 5, pp. 72-78, 2019.
- 3 Y. Li, S. Geng, X. Zhang, H. Zhang, "Study of thermal comfort in underground construction based on field measurements and questionnaires in China," *Building and Environment*, vol. 116, pp. 45-54, 2017.
- 4 Q. Yin, S. Y. Zhu, C. L. Gong, "Remote sensing analysis of the relationships between daytime ground bright temperature and land-use types of city: with Shanghai as an example," *Journal of Infrared and Millimeter Waves*, vol. 28, no. 2, pp. 133-136, 2009.
- 5 Z. Chen, T. C. Hutchinson, "Probabilistic urban structural damage classification using bitemporal satellite images," *Earthquake Spectra*, vol. 26, no. 1, pp. 87-109, 2010.
- 6 Y. D. Eo, S. J. Moon, B. K. Lee, B. W. Park, "Analysis of 3D city image updating techniques using terrestrial photogrammetry," *International Journal of Digital Content Technology and its Applications*, vol. 6, no. 17, pp. 520-531, 2012.
- 7 D. J. Shane, M. A. Rufo, M. D. Berkemeier, J. A. Alberts, "Autonomous urban reconnaissance ingress system (AURIS): providing a tactically relevant autonomous door-opening kit for unmanned ground vehicles," in *Proceedings of SPIE 8387: Unmanned Systems Technology XIV*, Bellingham, WA: International Society for Optics and Photonics, 2012.
- 8 Q. Yan, C. Li, C. Chen, F. Luo, "Characteristics of activity space and community differentiation in Changchun: a study using mobile phone signaling data," *Human Geography*, vol. 33, no. 6, pp. 35-43, 2018.
- 9 X. Zhang, S. Du, Q. Wang, "Hierarchical semantic cognition for urban functional zones with VHR satellite images and POI data," *ISPRS Journal of Photogrammetry and Remote Sensing*, vol. 132, pp. 170-184, 2017.
- 10 G. Pan, G. Qi, Z. Wu, D. Zhang, S. Li, "Land-use classification using taxi GPS traces," *IEEE Transactions on Intelligent Transportation Systems*, vol. 14, no. 1, pp. 113-123, 2013. doi: 10.1109/TITS.2012.2209201
- 11 H. Zhang, R. Wang, B. Chen, Y. Hou, D. Qu, "Dynamic identification of urban functional areas and visual analysis of time-varying patterns based on trajectory data and POIs," *Journal of Computer Aided Design & Computer Graphics*, vol. 30, no. 9, pp. 1728-1740, 2018.
- 12 Y. Wan, R. Wang, "Research on POI automatic classification assisted by comment information," *Journal of Geomatics*, vol. 43, no. 5, pp. 120-123, 2018.
- 13 X. Guan, Y. Zeng, "Research progress and trends of parallel processing, analysis, and mining of big spatiotemporal data," *Progress in Geography*, vol. 37, no. 10, pp. 1314-1327, 2018.
- 14 X. Chen, Y. Li, M. Lu, J. Lu, W. Chen, "Implementation of K-means algorithm based on dichotomy," *Radio Communications Technology*, vol. 43, no. 6, pp. 37-40, 2017.
- 15 K. Fukunaga, L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," *IEEE Transactions on Information Theory*, vol. 21, no. 1, pp. 32-40, 1975. doi: 10.1109/TIT.1975.1055330
- 16 W. Zhao, Q. Li, B. Li, "Extracting hierarchical landmarks from urban POI data," *Journal of Remote Sensing (Yaogan Xuebao)*, vol. 15, no. 5, pp. 973-988, 2011.
- 17 Y. Kang, Y. Wang, Z. Xia, J. Chi, M. Jiao, Z. W. Wei, "Identification and classification of Wuhan urban districts based on POI," *Journal of Geomatics*, vol. 43, no. 1, pp. 81-85, 2018.