Article Information
Corresponding Author: Qing-Cui Hu* , hqc70681@163.com
Miao-Miao Liu*, Dept. of Computer Science, School of Computer and Information Technology, Northeast Petroleum University, Daqing, Heilongjiang, China, liumiaomiao82@163.com
Qing-Cui Hu*, Dept. of Computer Science, School of Computer and Information Technology, Northeast Petroleum University, Daqing, Heilongjiang, China, hqc70681@163.com
Jing-Feng Guo**, Dept. of Computer Science and Engineering, College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei, China, jfguo@ysu.edu.cn
Jing Chen, Dept. of Computer Science and Engineering, College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei, China, 57178461@qq.com
Received: November 7 2020
Accepted: December 4 2020
Published (Print): April 30 2021
Published (Electronic): April 30 2021
1. Introduction
Nowadays, users typically take to online social media to express their opinions, which can inherently be both positive and negative. So, this kind of social networks can be represented by signed social networks. Namely, signed networks are social networks with positive and negative sign attributes. The signs “+” and “–”, respectively indicate the positive relationships (such as support) and the negative relationships (such as opposition) between entities. The research of prediction in signed networks is of great significance for understanding the interaction between positive and negative relations, the formation principle and evolution mechanism of signs, and the analysis of network structure balance. It has important application value in node classification, attitude prediction, recommendation systems in the information field, and the analysis of intermolecular promotion or inhibition in the biological field.
For the prediction research in signed networks, most of the existing algorithms can only complete the prediction of the missing sign. Few of them can achieve the double tasks of link prediction and sign prediction. Though the research of Liu et al. [1] has realized both the sign prediction and link prediction in signed networks, the accuracy of the algorithm depended on the value of the influence factor of the adjustable step size to a certain extent. Besides, relevant researches on mainstream link prediction methods based on the similarity show that, when the step size is greater than 3, the computational complexity of the algorithm increases greatly while its prediction accuracy is not significantly improved. Given the above problems, in order to achieve a better balance between prediction accuracy and computational complexity, a novel algorithm PSSN_TLG (prediction in signed social networks based on tightness of local and global, hereinafter referred to as TLG) is proposed aiming to complete both the link prediction and sign prediction. Based on the structural balance theory, the local link tightness and global link tightness are defined respectively by using the structure information of paths with a step size of 2 and 3 between the two nodes. Then the total similarity is obtained through fusing the above two results to perform the link and sign prediction.
2. Related Work
This paper focuses on the link prediction and sign prediction in undirected signed networks. At present, most of the related works mainly focused on sign prediction. These methods can be roughly divided into supervised learning and unsupervised learning. The prediction method with supervised learning is mainly based on the existing sign attributes of links to select structural features and use decision trees and other classifiers to complete sign prediction. According to the different information used in the selection of network structure features, they are further divided into sign prediction algorithms based on structural balance theory and sign prediction algorithms based on user interaction behavior and context information [2]. For example, in [3], the authors mined the characteristics of unlabeled links and improved the sign prediction accuracy through migration learning of four network features such as node degree, structural balance, status, and potential state. The prediction method with unsupervised learning mainly uses the network structure attributes and the interactive information between nodes to complete sign prediction. They could be roughly divided into two categories, namely, sign prediction based on matrix decomposition or matrix filling and sign prediction based on the node similarity. The representative research achievements are as follows. Su and Song [4] proposed a low-rank matrix decomposition model with offset, which introduced the sign of out-edge and in-edge of neighbor nodes as offset information to improve the accuracy of sign prediction. Shen et al. [5] mainly focused on the prediction of negative links and proposed a framework based on projected non-negative matrix decomposition through unsupervised learning embedded in network structure and user attributes. Although the sign prediction algorithms based on supervised learning and matrix can achieve high prediction accuracy, they usually have high computational complexity, and these models are hard to evaluate. Moreover, due to the poor prediction effect for sparse networks, they are not conducive to the extensive promotion in practical applications. Therefore, the algorithm based on node similarity is still the mainstream link prediction method, and the representative works are as follows. Girdhar et al. [6] proposed a link prediction model based on local and global information, and they distinguished the connection strength between entities based on the fuzzy computing model of trust and distrust. Chen et al. [7] used the set pair theory to treat the sign network as an identity-discrepancy-reverse system and proposed a sign prediction algorithm by fusing the certainty and uncertainty relations as well as local and global information. Zhu and Ma [8] proposed a highly symmetrical quadrilateral structure by analyzing the sign generation mechanism. Based on the statistical characteristics of local structure, the similarities and divergences of node pairs, and the structural characteristics reflecting the positive and negative attitudes of nodes were extracted. Then the sign prediction was completed.
In addition to the above methods, related scholars have also studied the link prediction in sparse signed networks. For the problem of the sparseness of signed link data in signed networks, i.e., only a small percentage of signed links are given and the number of negative links is much smaller than that of positive links, Beigi et al. [9] proposed a novel signed link prediction model which enabled the empirical exploration of user personalities via social media data. Based on psychology theories, they extracted additional information about users’ personalities such as optimism and pessimism that can help determine their propensity in establishing positive and negative links to compensate for data sparsity. Generally, in social networks, users tend to establish positive (or negative) links with those whose generated content we frequently positively (or negatively) interact with online. Based on the verification of these assumptions, a framework for solving the link and interaction polarity prediction problem in signed networks was proposed [10] by understanding the correlation between these two types of opinions from both a local and global perspective. The algorithm was demonstrated to be helpful for both the data sparsity and cold-start problems found in signed networks. In recent years, some scholars have studied the link representation and prediction methods based on convolution neural networks and recurrent neural networks with deep learning mechanism. The ordered node sequence is constructed by using the local topological structure between nodes, and the matrix representation of potential links is generated by node vector expression. Finally, the multi-layer implicit relations of node pairs in the node sequence are extracted based on the neural network operations and used to realize link prediction. However, most of these algorithms focus on link prediction in traditional social networks, and few studies on link prediction and sign prediction for signed networks are found.
3. Problem Statement
3.1 Theoretical Foundation
An undirected and unweighted signed network is usually formalized as G=(V, E, S), in which, [TeX:] $$V=\left\{v_{1}, v_{2}, \ldots, v_{n}\right\}$$represents the node-set, [TeX:] $$\mathrm{E}=\left\{e(i, j) \mid v_{i,} v_{i} \in V, i \neq j\right\}$$ represents the edge-set, and [TeX:] $$S=\{\operatorname{sign}(i, j)\left.\mid v_{i}, v_{j} \in V, i \neq j\right\}$$ represents the set of signs with values as follows
The structural balance theory provides a theoretical basis for the analysis of undirected signed networks, as shown in Fig. 1. According to this theory, a loop consisting of k edges (k≥3) is structurally balanced if and only if the product of the sign of all edges is positive. Relevant studies show that the number of balanced rings is far greater than that of unbalanced rings, and with the dynamic development of the network, unbalanced structures will evolve towards balanced structures. Therefore, the structural balance theory has been widely used in the sign prediction of undirected signed networks.
Sketch of the structural balance theory.
3.2 Problem Definition
The prediction research in signed networks includes two aspects. On the one hand, it is necessary to analyze the probability of the establishment of links between two nodes that have not connected, namely, the link prediction problem. It is generally believed that the higher the similarity between the two nodes, the greater the possibility of establishing a link between them. On the other hand, it is necessary to analyze the sign type of new links or existing links without sign attributes, namely, the sign prediction problem. Generally, the algorithm should strive to ensure that the ring where the predicted link is located can maximize the structural balance of the network as much as possible. Therefore, in the paper, we define the similarity between two nodes based on the characteristics of network topology and the structural balance theory aiming to achieve the two tasks of link prediction and sign prediction.
The goal of the algorithm can be expressed as follows. Given G= (V, E, S), [TeX:] $$\forall v_{i}, v_{j} \in V$$, we aim to predict the possibility of establishing a link between the node [TeX:] $$v_{i}$$ and [TeX:] $$v_{j}$$ where e(i, j)=0 and the corresponding sign type of the link. The algorithm also predicts the missing sign attribute of existing links where [TeX:] $${e(i, j)}=1$$. As shown in Fig. 2, the solid line represents the known edge, and the dotted line represents the unknown edge (which does not exist or exists but has not yet been observed). Nodes of [TeX:] $$v_{1}$$ and [TeX:] $$v_{10}$$ are not directly connected at present, but due to the existence of common neighbors, [TeX:] $$v_{1}$$ can reach [TeX:] $$v_{10}$$ through multiple paths. That is to say, the probability of establishing a link between them is very high. However, due to the prediction results of sign(1,10) based on each balanced ring are different which can be positive or negative, the algorithm should effectively integrate the local and global structural features that affect the similarity of the two nodes to define the total similarity based on the concept of the k-balanced ring. And its ultimate goal is to get the probability of establishing a link between [TeX:] $$v_{1}$$ and [TeX:] $$v_{10}$$ and to predict the corresponding sign type, namely, sign(1,10).
Diagram of problem definition of the TLG algorithm.
4. Proposed Algorithm
4.1 Core Ideas and Related Definitions
According to the local structure information of the social network graph, the classical CN-index took the number of first-order common neighbors of the node pair into account, and the Jaccard index considered the number of non-common neighbor nodes. While the AA index and RA index considered the degree of the first-order common neighbors. The four indicators respectively focused on one of the factors that affect the similarity, which ignored the influence of other local features on the similarity of the node pair. Herein, using sim(x,y) represents the similarity value of the node pair <[TeX:] $\mathrm{V}_{\mathrm{x}}$$$, [TeX:] $$\mathrm{V}_{\mathrm{y}}$$>. According to CN and Jaccard indexes, sim(1,10) in Fig. 3(a) is greater than sim(1,10) in Fig. 3(b). According to AA and RA indexes, sim(1,10) in Fig. 3(a) is greater than sim(1,10) in Fig. 3(c).
Diagram of similarity definition based on local tightness.
Definition 4.1 (Local link tightness). To improve the prediction accuracy, the TLG algorithm considers the degree of the two nodes, their first-order common neighbors and non-common neighbors, the sign of the edge, and other local structural characteristics. [TeX:] $$\forall v_{x}, v_{y} \in V$$ and sign(x,y)=0, based on the concept of the structural balanced ring, the local link tightness of the node pair is defined by using the information of paths connecting the two nodes with a step size of 2, which is recorded as TLSim(x, y). In this paper, k(x) represents the degree of the node [TeX:] $$v_{x}$$, [TeX:] $$N_{1}(x)$$ and [TeX:] $$N_{2}(x)$$, respectively, represent the first-order and second-order neighbor sets of node [TeX:] $$v_{x}$$, and [TeX:] $$\left|N_{1}(x)\right|$$ represents the size of the set [TeX:] $$N_{1}(x)$$.
Definition 4.2 (Global link tightness). According to the global structure information of the network, the LP and Katz indexes consider the influence of the path with a step size of 2 and 3 or all paths on the similarity respectively and assign weight decay factors to paths with different step size. But the LP index ignores the impact of local information on the similarity. While Katz index needs to calculate all path information, which will lead to high complexity. Given this, to achieve a better balance between accuracy and complexity, the TLG algorithm takes the influence of structure information of path with a step size of 3 connecting the two nodes on the similarity into account. Firstly, the link strength of two directly connected nodes is defined. Then, based on the concept of the structural balanced ring, the global link tightness is defined by the link strength of three groups of intermediate transmission node pairs on the path that connects the two nodes with the step size of 3, written as TGSim(x, y).
In the definition of TGSim(x,y), [TeX:] $$l_{k}$$ is the Kth path with a step size of 3 connecting [TeX:] $$v_{x}$$ and [TeX:] $$v_{y}$$, namely, [TeX:] $$l_{k}=v_{x} e\left(x, z_{k}^{\prime}\right) v_{z_{t}} e\left(z_{k}^{\prime}, z_{k}^{\prime}\right) v_{z_{k}} e\left(z_{k}^{\prime}, y\right) v_{y}$$. While [TeX:] $$v_{z_{k}^{\prime}}$$ and [TeX:] $$v_{z_{k}}$$ are the two intermediate nodes on the path [TeX:] $$l_{\mathrm{k}}$$. That is to say, [TeX:] $$v_{\varepsilon_{k}} \in N_{1}(x) \cap N_{2}(y), v_{\varepsilon_{k}} \in N_{1}(y) \cap N_{2}(x), \quad e\left(x, z_{k}^{\prime}\right)=1, \quad e\left(z_{k}^{\prime}, z_{k}^{\prime}\right)=1, e\left(z_{k}^{\prime}, y\right)=1$$.
Definition 4.3 (Total similarity between the two nodes). The total similarity between two unconnected nodes is defined as the sum of the local and global link tightness of the two nodes, which is denoted as TLGSim(x, y), as follows. |TLGSim(x, y)| represents the possibility of the node [TeX:] $$v_{x}$$ and [TeX:] $$v_{y}$$ to establish a link, and the prediction result of sign(x, y) is the same as the sign type of TLGSim(x, y).
4.2 Description and Implementation
Algorithm TLG Input: G= (V, E, S), which describes the static snapshot graph of an undirected signed network. Output: the Top-k links that are the most likely to establish in G, their corresponding signs, and the missing sign attributes of existing edges in G.
[TeX:] $$\forall v_{x}, v_{y} \in V$$, if e(x, y)=0 or e(x, y)=1 but sign(x, y)=0, perform the following operations.
1) Read the dataset file and obtain the adjacency matrix and adjacency table of the graph G. 2) Find all of the node pair <[TeX:] $$v_{x},v_{y}$$>, get the adjacency lists of [TeX:] $$v_{y}$$ and [TeX:] $$v_{y}$$, and calculate TLSim(x, y). 3) Get all the second-order neighbors of [TeX:] $$v_{x}$$ and [TeX:] $$v_{y}$$, and calculate TGSim(x, y). 4) Calculate TLGSim(x, y). 5) Do the prediction. In terms of the prediction of the missing sign of the existing edge, if TLGSim(x, y)>0, then sign(x,y)=+1. While if TLGSim(x, y)<0, sign(x, y)=-1. In terms of the prediction of unknown links, we sort |TLGSim(x, y)| in descending order, and output the node pairs corresponding to the Top-k values, which have the highest probability of establishing links. And the prediction result of sign(x, y) corresponding to the unknown link is the same as the sign type of the value of TLGSim(x, y).
5. Experiments
Six representative datasets are selected, and the training set and test set are divided by the 10-fold cross-validation method. The improved AUC′ and Accuracy′ are used as evaluation indexes. And the results of the proposed algorithm are compared with CN (common neighbor), ICN (improved common neighbor), and PSNBS (prediction in signed networks based on balance and similarity) algorithm [1].
5.1 Dataset
Three classical large-scale real datasets and three small datasets are used for the experiment. The basic information is shown in Table 1. Among them, the FEC and CRA network are two simulation datasets, and the Gahuku Gama Sub-tribes signed network, written as GGS, is a real dataset.
Basic characteristics of the dataset
5.2 Evaluation Index
The classical index AUC [2] is suitable for the evaluation of link prediction accuracy in traditional social networks with only positive links. However, the total similarity score between two nodes in the TLG algorithm may be positive or negative. Its absolute value represents the probability of the two nodes to establish a link, and it is positive or negative represents the sign type of the predicted link. Therefore, in our experiments, the classical AUC index is adjusted to obtain a new index, written as AUC’, which can correctly evaluate the link prediction results in signed networks. Its definition is shown as follows. In the experiment, the total similarity of each node pair corresponding to the edge selected randomly from the test set and the edge selected from the non-existent edge-set is calculated. And the comparison is done only when the signs of the two similarity values are the same. If the absolute value of the total similarity between the two nodes corresponding to the edge selected from the test set is greater than that of the edge selected from the nonexistent edge-set, [TeX:] $$\bar{n}^{\prime}$$' will plus 1. If the absolute values of the two are equal, [TeX:] $\bar{n}^{\prime \prime}$$$ will plus 1. If the signs of the two are different, the calculation will be canceled, and we will choose edges again. In each group of experiments, it should be ensured that there are 20,000 times that the signs of the total similarity value corresponding to the two edges selected are the same. That is to say, the value of [TeX:] $$\bar{n}$$ is 20,000.
For the sign prediction in signed networks, a series of evaluation indexes can be obtained, such as TP, FP, TN, FN, Recall, Precision, Accuracy, F1-score, etc. The sign prediction in signed networks needs to evaluate the comprehensive prediction accuracy of positive and negative signs. Relevant researches have shown that in most real signed networks the ratio of the number of positive links to the number of negative links is greater than 3:1. That is to say, the probability that positive links are selected in the experiment is much higher than that of negative links. Therefore, we improved the index of Accuracy, and the index Accuracy' is obtained by adjusting the above indexes to give a weight of 1 for sign prediction results of positive edges and a weight of 0.5 for sign prediction results of negative edges, which is used as the comprehensive evaluation index, as follows.
5.3 Experimental Results and Analysis
(1) Analysis of link prediction accuracy: The experimental results of the link prediction accuracy of the TLG algorithm based on the AUC′ evaluation index are shown in Fig. 4. Prediction accuracy and the average value of ten independent experiments which is abbreviated as AVG are obtained. The number of links that are effectively predicted and compared in each independent experiment is 20,000, namely, [TeX:] $$\bar{n}$$=20,000 . As can be seen from Fig. 4, the TLG algorithm shows high link prediction accuracy for the three large real signed networks and the small simulation network CRA (as shown in Figs. 5 and 6). These four datasets are signed networks with conventional topology in which the number of positive links is much larger than that of negative links and the degree distribution of nodes has no particularity.
Prediction results of TLG based on AUC′.
Degree distribution of CRA network.
For the GGS network, the accuracy of the proposed algorithm is lower than that of the first four datasets. This network describes the real political alliance and enmity among 16 sub-tribes as shown in Fig. 7. The proportion of positive links and negative links is 1:1, and the degree distribution of 16 nodes is shown in Fig. 8. For the dataset with the same number of positive and negative links, the prediction accuracy of the TLG algorithm can still reach 67%, which shows its good robustness.
Degree distribution of GGS network.
For the FEC network, the accuracy of the TLG algorithm remains 0.5. The topology and node degree distribution of the dataset are shown in Figs. 9 and 10. It can be seen that 28 nodes are divided into two types with the same degree distribution. When calculating AUC′, the topological structure of the links obtained from the test set and the nonexistent edge set is almost the same in most cases. In other words, the probability that the degree distribution of the node pairs corresponding to these two links is not the same is [TeX:] $$\mathrm{C}_{24}^{2} \mathrm{C}_{4}^{2} / \mathrm{C}_{28}^{2} \mathrm{C}_{26}^{2}$$, which equals 0.0135. Namely, [TeX:] $$\bar{n}^{\prime} \approx 0$$ and [TeX:] $$\bar{n}^{\prime \prime} \approx \bar{n}$$. Therefore, the value of AUC′ should be 0.5. The experimental results also verify the correctness of the TLG algorithm.
. Degree distribution of FEC network.
(2) Analysis of sign prediction accuracy: Similarly, the sign prediction accuracy of the proposed algorithm is verified on six datasets. The experimental results are shown in Table 2 and Fig. 11.
Experimental results of sign prediction of TLG
Prediction results of TLG based on Accuracy′.
(3) Comparison with other algorithms: Experimental comparisons are carried out with the CN-Predict, ICN-Predict and the PSNBS algorithm. The AUC′, Accuracy′, and AUC [1] are used as evaluation indexes. Results are shown in Fig. 12, Fig. 13, and Fig. 14, respectively. Note that the results of the PSNBS algorithm are the highest accuracy when the step size influence factor [TeX:] $$\lambda$$ is assigned the optimal value. Results show that the TLG algorithm does not need to preset the influence factor, and it can achieve a good prediction effect for both conventional large-scale signed networks and small-scale signed networks with the special topology. Moreover, its accuracy is higher than other algorithms.
Link prediction results based on AUC′.
. Sign prediction results based on Accuracy′.
Comparison results based on AUC [ 1].
(4) The Top-k recommended links: According to the TLG algorithm, the Top-100 recommended links are given for the first three large-scale networks, as shown in Figs. 15–17. Meanwhile, the top-50 recommended links are also given for the two small-scale datasets, as shown in Fig. 18. In these four figures, the serial number of the node pairs and their similarities are displayed. And the prediction of positive links and negative links are respectively represented by different colors.
Besides, the recommendation results of links based on the TLG and PSNBS algorithm are compared on the two small datasets, as shown in Fig. 19. Results of the proposed algorithm are consistent with that of the PSNBS algorithm, which further verifies the correctness and effectiveness of the TLG algorithm.
Recommended links for Epinions.
Recommended links for Slashdot.
Recommended links for Wikipedia.
Recommended links for GGS and CRA
Comparison of results of recommended links.
6. Conclusion
A link prediction algorithm for signed networks is proposed. It effectively obtains the local and global structural features that affect the similarity of nodes and can achieve the dual goals of link prediction and sign prediction. Experimental results show its better prediction effect than other methods. However, for large-scale signed networks, how to reduce the computational complexity is a key problem. Besides, the proposed algorithm takes the static snapshot of the signed network as the research object while the real social network is dynamic. Therefore, in the next research work, perhaps we should integrate deep learning methods to mine more structural properties to complete the link prediction of signed networks.
Acknowledgement
This work was supported by the Natural Science Foundation of China (No. 42002138 and 61871465), Natural Science Foundation of Heilongjiang Province (No. LH2019F042 and LH2020F003), Youth Science Foundation of Northeast Petroleum University (No. 2018QNQ-01), and Science & Technology Program of Hebei (No. 20310301D).