## Xinpan Yuan* , Songlin Wang* , Lanjun Wan* and Chengyuan Zhang**## |

Input：SA, SB, 〖 u〗_0 Required variables: S=[ ] is similar element set ; NS=[ ] is non-similar set of elements, maxsim=0;Output: u=[] |

1: For ai in SA do 2: For bi in SB do 3: if cos(ai ,bi) > maxsim 4: maxsim = cos(ai ,bi); 5: End for 6: if maxsim >〖 μ〗_0 then 7: S.add(ai) 8: u.add(maxsim) 9: else 10: NS.add(ai) 11 u.add(0); 12: End for |

In the context of big data, the text-similarity performance test system requires an efficient sentence similarity algorithm, the time complexity of the SSF focuses on the optimization of calculating similar elements.

To reduce the calculation of a double cycle to one cycle, a further approach is to construct an index of [TeX:] $$$S_{B}$ for each vector $a_{i}$ in $S_{A}$$$.

According to experience, the dimension of the word vector is generally 200–300 dimensions to get better results, using the open-source project FAISS to do this index job.

For a vector [TeX:] $$$a_{i}$ in $S_{A},$ we search for the vector $b_{j}$$$ with the highest similarity in the temporary [TeX:] $$index $_{i}$$$, so that the process requires only one similarity calculation.

The n times calculations of similar elements [TeX:] $$$<a_{i}, b_{j}>$$$ are reduced to vector searching, thereby reducing the execution time of SSF.

But there is a flaw that when every time similar elements of a sentence are calculated, a temporary index needs to be built once, and the index cannot be reused.

The method of a temporary index is called the SSFT algorithm (SSF Temporary Index), as shown in

In order to solve the problem that the index can’t be reused, we off-line establish an index of potential similar elements in the process of word vector training. We could search for the index to complete the calculation of similar elements without having to repeatedly create a temporary index.

The steps for creating the index of potential similar elements:

Step 1: Establishing an index for all the word vector set by trained word vector model.

Step 2: Traversing any vector v to search the index to get a return set. In this set, the potential similar elements have abstained with the similarity is greater than the threshold [TeX:] $$$\mathcal{U}_{0}$$$, in similarity descending order.

Step 3: The physic index of potential similar elements could be implemented by a Huffman tree.

According to the hierarchical Softmax strategy in word2vec, an original word2vec Huffman tree constructed on the basis of the word frequency, and each node (except the root node) represents a word and its corresponding vector.

We try to replace the vector with potential similar elements. So each node of the tree represents a word and its corresponding potential similar elements instead, as shown in Fig. 2. This method via the global index of potential similar elements is called to be SSFP, as shown in Table 2.

Table 2.

Input：SA, SB, 〖 u〗_0 Required variables: S=[ ] is similar element set ; NS=[ ] is non-similar set of elements, P=[] is potential similar elements; maxsim=0; Output: u=[] |

1: For ai in SA do 2: P[]= searchHuffman(ai); 3: For bj in SB do 4: For pk in P[] do 5: if bi .equals(pk.vector) 6: S.add(ai) 7: u.add(pk.sim) 8: Break to loop ai+1; 9: End for 10: End for 11: NS.add(ai) 12: u.add(0); 13: End for |

Suppose that the number of sentence pairs to be similarly calculated is L, the average number of word vectors of the word vector set [TeX:] $$$S_{A}$$$ is m, and the average number of word vectors of the word vector set [TeX:] $$$S_{B}$$$ is n, d represents the dimension of the vector.

(1) In SSFN, whether all elements which constitute similar elements ([TeX:] $$u_{i}$$) are calculated once by using formula (6), and the time consumption of calculation is determined by the number of vector dimension, the time complexity of SSFN is O(Lmn×Tcos), where Tcos is the time of [TeX:] $$\cos (\theta)$$ between vector [TeX:] $$$a_{i}$ and $b_{i}$$$, and equals d. so the time complexity of SSFN is O(Lmn×d).

(2) In SSFT, in order to reduce the number of similar element calculations in SSFN, the method of constructing an index is used, the index is equivalent to fuzzy search, and then the similar element is calculated to determine whether it constitutes a similar element, the time complexity of SSFT is O (LmTindex +Lnlog(m)d), where Tindex is the time to index each word vector, and log(m) is the number of times to look up in the index. Since the Tindex value is small that LmTindex can be ignored, compared with Lnlog(m)d. the time complexity of SSFT approximately equals to O(Ln×log(m)d).

Suppose d (the number of dimensions) is 300, m (the number of word vectors of the word vector set [TeX:] $$$S_{A}$$$) is 50, and K (the total number of the dictionary) is 160,000. We can deduce that the time of Tcos is seen like 300 times of CPU operation, the time of searching Huffman tree log(K) is seen like 400 times of CPU operation. So the time proportion of SSFP:SSFT:SSFN equals to 400:2100:15000. So the time complexity of SSF is shown in Table 3.

The data set of training word vector model is Wikipedia Chinese corpus [20] which requires data cleansing, traditional and simplified conversion, and so on. The amount of Wikipedia Chinese corpus downloaded is about 1.12 GB and 4 million articles. The model of training word vector is the combination of the Skip-gram model and hierarchical Softmax strategy, and the code of training word vector is the word vector toolkit [21]. In general, the threshold value of calculating similar elements is [TeX:] $$$\mu_{0}=0.5$$$, the dimension number of word vector is 200.

The algorithms involved in the time complexity comparison experiment are SSFN, SSFT, and SSFP, since SSFN, SSFT and SSFP only calculate the similarity element in different ways, the error of the similarity calculation results of the three is zero, and the similarity accuracy is consistent. SSFN, WMD and WJ algorithms were selected to participate in the experiment of precision comparison. An overview of the algorithms involved in the accuracy comparison is shown in Table 4.

Table 4.

Algorithm | Brief description | Ref. |
---|---|---|

WMD | Calculating the moving distance of word vector | [22] |

SSF | Calculating the similarity by similar elements and no-similar elements | [23] |

WJ | Take the weight of words into account in the similarity calculation | [24] |

The experimental dataset is divided into the following three categories:

- News: from Guangming Daily, People's Daily, and other News articles (400 news), the average length of news is 500 words;

- Wikis: from Wikipedia and other encyclopedia texts (400 wikis), the average length of the wiki is 2000 words;

- Papers: from papers and periodicals, etc. (400 papers), the average length of a paper is 20,000 words.

Each kind of data set which is randomly selected from the three categories is divided into four groups, each group contains 100 pieces. The first group has 100 titles of 10 words, the second group has 100 titles of 25 words, the third group has 100 titles of 35 words, and the fourth group has 100 titles of 50 words (Table 5).

Table 5.

Datest | |||
---|---|---|---|

News | Wikis | Papers | |

Number of data | 400 | 400 | 400 |

Average number of words | 500 | 2,000 | 20,000 |

Average number of sentences | 20 | 80 | 800 |

Group | Group 1 (100 titles of 10 words), Group 2 (100 titles of 25 words), Group 3 (100 titles of 35 words), Group 4 (100 titles of 50 words) |

The first is the accuracy of similarity calculation comparison, the essence of WMD is to calculate the distance between texts, and WJ and SSF are to calculate the similarity directly. The distance of the text and similarity can swap, the smaller the distance, the greater the similarity, the greater the distance, the smaller the similarity. In order to make a reasonable evaluation, we used sentence similarity calculation in [2] as an experimental method.

As shown in Fig. 3, the experimental process is divided into two stages: calculation and result comparison:

1. Calculation step: Traversing the similarity (or distance) between the title sentence and the Si in the sentence set S.

2. Results comparison step: The sentence with the highest similarity (or the smallest distance) compared with the artificially marked sentence.

First of all, a graduate student of Chinese language literature is invited to manually mark the most similar sentences which are the most similar to the title in each type of data set. The algorithm then calculates the sentences that are most similar to the title. For example, we take the title as the original sentence, to calculate the minimum distance with the original sentence on WMD, to calculate the most similar to the original sentence on WJ and SSF. Finally, the accuracy of the algorithm is examined by comparing the difference of accuracy between the results of the algorithm and the result of artificial markers.

In each class of data sets, 100 articles are selected according to the length of the title 10, 25, 35, and 50 words and then divided into 4 groups. Each algorithm is carried out in each group of data sets, and then the average accuracy of each algorithm in each group is calculated. Three kinds of datasets are selected and 4 groups, each algorithm calculates the average accuracy of 12 times, and there are 3 algorithms involved in the comparison of similarity accuracy. In total, the average accuracy was calculated to 36 times.

Time performance is evaluated by the average cost time comparison of algorithms (SSFN, SSFT, and SSFP) hits in each group, there are 4 algorithms involved in the comparison of query time. In total, the average query time was calculated to 48 times.

The comparison of the average accuracy on algorithms (WMD, SSF, and WJ) is shown in Fig. 4. There are the following discussions:

(1) The average accuracy of each algorithm is different under the same header length. The general trend is that as the number of sentences increases, the average accuracy tends to decrease. And the accuracy comparison of the 3 datasets shows that the larger the dataset, the smaller the accuracy. This is normal because the increase in data size will always increase the error hit, leading to a decrease accuracy.

(2) WJ's accuracy reduces with the length increased, since WJ only considers the influence of the word’s weight, the similarity of words and other factors on the overall similarity of sentences, and it couldn’t effectively excavate the deep semantic information of the sentence and lacks the penalty term, so it has not solved the similarity of the long sentence.

(3) SSF's accuracy is always higher than WMD in the five groups of each data set. We try to explain the results with the theoretical analysis:

- The negative effect of penalty item: SSF’s penalty items composed of noise words, single words, and non-similar elements. SSF takes full account of the harmful or undesirable influences of penalty terms on the overall similarity of sentences. However, WMD does not consider the negative effects of penalty terms on the calculation of sentence similarity.

- Not satisfying the exchange rule: WMD satisfies the exchange rules. Taking the title sentence as the standard, SSF compares the text sentence to the title sentence. When the length of the two sentences is basically the same, the two algorithms do not show any significant difference. When two sentences differ greatly in length, the SSF algorithm will show great advantages.

In conclusion, SSF takes the first sentence as the standard, would be ideal for two sentences with large length differences and partial content similarity, good at calculating "short-long" sentence similarity. At the same time, SSF takes full account of the negative effect of penalty terms on the overall sentence similarity, the accuracy of similarity calculation of "long-long" sentences is improved, which also makes SSF have high accuracy.

This experiment is a comparative experiment of the average time-consuming index of SSFN, SSFT, and SSFP under three sets of data sets, and the results measured in the experimental data sets are shown in Fig. 5. There are the following discussions:

(1) SSFN takes the most time because SSFN needs mn times of similar element calculation. SSFT builds temp index to lookup to complete the calculation of similar elements, takes less time than SSFN. The disadvantage of SSFT is that every time the similarity of a sentence is calculated, an index should be set up, the index cannot be reused. Using the idea of finding similar elements instead of calculating them, the difference of SSFP is that SSFP establishes the global index for all potential similar element of word vector by training the word vector model meanwhile. So the global index could be reused, SSFP takes less time than SSFT.

(2) As the length of the title increases in the same data set, the average time spent by each algorithm increases. This is because the number of times the similar elements are calculated in long sentences is more than that of short sentences, so it takes more time to calculate the similarity of long sentences than short sentences. At the same time, as the number of sentences included in the data set increases, the average time-consuming time of each algorithm tends to increase. For example, the Paper set contains more sentences than News and Wiki, so the average time taken in the group of the same title length is high.

This paper presented and analyzed some possible approaches for sentence similarity functions. We proposed the SSF that uses word2vector as the system elements to calculate the similarity of sentences. The accuracy results show our novel SSF generally performs superior. The time complexity analysis and time experimental evaluation show the query time performance of SSFP is optimal.

We are grateful to the support of research fund of Hunan Province (2018JJ2099, 19C0558, CX1911), Hunan open fund (Project Similarity detection cloud platform; Clothing material aggregation and prediction), Hunan Education Reform fund (The whole process performance management and evaluation system of innovation and entrepreneurship center).

He received his Ph.D. degree in Computer Science from Hunan University, in 2016. He is currently an assistant professor of Computer Science at Hunan University of Technology. His research interests include high-performance computing, parallel computing, heterogeneous computing and parallel programming model.

He received PhD degree in computer science from the University of New South Wales. Currently, he is a lecturer in School of Information Science and Engineering of Central South University, China. His main research interests include information retrieval, query processing on spatial data and multimedia data.

- 1 M. U. Devi, G. M. Gandhi, "Query expansion on the role of word and sentence similarity for domain ontology driven fuzzy retrieval systems,"
*Journal of Computational and Theoretical Nanoscience*, vol. 14, no. 6, pp. 2612-2619, 2017.doi:[[[10.1166/jctn.2017.6548]]] - 2
*W. Yin, K. Kann, M. Yu, and H. Schutze, 2017;*, https://arxiv.org/abs/1702.01923 - 3 D. Zhang, T. He, Y. Liu, S. Lin, J. A. Stankovic, "A carpooling recommendation system for taxicab services,"
*IEEE Transactions on Emerging Topics in Computing*, vol. 2, no. 3, pp. 254-266, 2014.doi:[[[10.1109/TETC.2014.2356493]]] - 4 J. R. Lin, Z. Z. Hu, J. P. Zhang, F. Q. Yu, "A natural‐language‐based approach to intelligent data retrieval and representation for cloud BIM,"
*Computer‐Aided Civil and Infrastructure Engineering*, vol. 31, no. 1, pp. 18-33, 2016.custom:[[[-]]] - 5
*A. Prakash, S. A. Hasan, K. Lee, V. Datla, A. Qadir, J. Liu, and O. Farri, 2016;*, https://arxiv.org/abs/1610.03098 - 6 M. A. Boudia, A. Rahmani, M. E. Rahmani, A. Djebbar, H. A. Bouarara, F. Kabli, M. Guandouz, M., "Hybridization between scoring technique and similarity technique for automatic summarization by extraction,"
*International Journal of Organizational and Collective Intelligence*, vol. 6, no. 1, pp. 1-14, 2016.doi:[[[10.4018/IJOCI.2016010101]]] - 7 P. W. McBurney, C. McMillan, "Automatic source code summarization of context for Java methods,"
*IEEE Transactions on Software Engineering*, vol. 42, no. 2, pp. 103-119, 2015.doi:[[[10.1109/TSE.2015.2465386]]] - 8
*S. K. Bharti and K. S. Babu, 2017;*, https://arxiv.org/abs/1704.03242 - 9 Y. C. Lee, C. M. Eastman, W. Solihin, "An ontology-based approach for developing data exchange requirements and model views of building information modeling,"
*Advanced Engineering Informatics*, vol. 30, no. 3, pp. 354-367, 2016.doi:[[[10.1016/j.aei.2016.04.008]]] - 10 J. Muralikumar, S. A. Seelan, N. Vijayakumar, V. Balasubramanian, "A statistical approach for modeling inter-document semantic relationships in digital libraries,"
*Journal of Intelligent Information Systems*, vol. 48, no. 3, pp. 477-498, 2017.doi:[[[10.1007/s10844-016-0423-6]]] - 11 G. Zhou, J. Zhao, T. He, W. Wu, "An empirical study of topic-sensitive probabilistic model for expert finding in question answer communities,"
*Knowledge-Based Systems*, vol. 66, pp. 136-145, 2014.doi:[[[10.1016/j.knosys.2014.04.032]]] - 12 S. Guo, D. Xing, "Sentence similarity calculation based on word vector and its application research,"
*Modern Electronics Technique*, vol. 39, no. 13, pp. 99-102, 2016.custom:[[[-]]] - 13 F. Li, J. Hou, R. Zeng, C. Ling, "Research on multi-feature sentence similarity computing method with word embedding,"
*Journal of Frontiers of Computer Science and Technology*, vol. 11, no. 4, pp. 608-618, 2017.custom:[[[-]]] - 14 S. Arora, Y. Liang, T. Ma, "A simple but tough-to-beat baseline for sentence embeddings," in
*Proceedings of the 5th International Conference on Learning Representations (ICLR)*, Toulon, France, 2017;custom:[[[-]]] - 15 Y. Mrabet, H. Kilicoglu, D. Demner-Fushman, "TextFlow: a text similarity measure based on continuous sequences," in
*Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics*, Vancouver, Canada, 2017;pp. 763-772. custom:[[[-]]] - 16 S. Xu, "Research and implementation of paraphrasing recognition technology for question-and-answer system,"
*Harbin Institute of TechnologyHarbin, China*, 2009.custom:[[[-]]] - 17 M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, "From word embeddings to document distances," in
*Proceedings of the 32nd International Conference on Machine Learning*, Lille, France, 2015;pp. 957-966. custom:[[[-]]] - 18 Y. Guan, X. Wang, Q. Wang, "A new measurement of systematic similarity,"
*IEEE Transactions on SystemsMan, and Cybernetics-Part A: Systems and Humans*, vol. 38, no. 4, pp. 743-758, 2008.doi:[[[10.1109/TSMCA.2008.918611]]] - 19 Y. Guan, X. Wang, Q. Wang, "Measurement of system similarity," in
*Proceedings of China Computational Linguistics Conference (CCL)*, Nanjing, China, 2005;pp. 341-347. custom:[[[-]]] - 20
*Wikimedia Chinese corpus (Online). Available:*, https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 - 21
*Word2VEC_java (Online). Available:*, https://github.com/NLPchina/Word2VEC_java - 22
*wmd4j is a Java library for calculating Word Mover's Distance (WMD) (Online). Available:*, https://github.com/crtomirmajer/wmd4j - 23
*Word Sentence Similarity Code (Online). Available:*, https://download.csdn.net/download/u011001835/9849524 - 24
*Word2VEC (Online). Available:*, https://github.com/jsksxs360/Word2Vec