## Thanh Ho and Tran Duy Thanh

Table 1.

Feature vector | T-0 | T-1 | T-2 | T-3 | T-4 | T-5 | T-6 | User
---|---|---|---|---|---|---|---|---
$\overrightarrow{v_{1}}$ | 0.64444 | 0.34545 | 0.46826 | 0 | 0.33721 | 0 | 0 | Mr.tajkjd
$\overrightarrow{v_{2}}$ | 0.30435 | 0.44565 | 0.33333 | 0.30435 | 0.33333 | 0.52941 | 0 | dsvantan
$\overrightarrow{v_{3}}$ | 0.39601 | 0.48718 | 0 | 0.35484 | 0 | 0.38462 | 0 | nguyen.nhi.334491
$\overrightarrow{v_{4}}$ | 0.34694 | 0.40741 | 0 | 0.39227 | 0 | 0.36000 | 0 | trang.harry.7
$\overrightarrow{v_{5}}$ | 0 | 0.35135 | 0 | 0.41935 | 0 | 0.31429 | 0 | anna.vy.334
$\overrightarrow{v_{6}}$ | 0 | 0.36000 | 0 | 0.33333 | 0 | 0.44828 | 0.40741 | haianh.nguyen.52012
$\overrightarrow{v_{7}}$ | 0.48718 | 0.32431 | 0 | 0 | 0 | 0 | 0.31034 | quyvan.pham.54
$\overrightarrow{v_{8}}$ | 0.40741 | 0.31034 | 0 | 0 | 0 | 0.41772 | 0 | su.heo.1656
$\overrightarrow{v_{9}}$ | 0.35135 | 0.33333 | 0.40741 | 0 | 0.30923 | 0.34545 | 0 | phuc.hanh.9678
$\overrightarrow{v_{10}}$ | 0.64557 | 0.90000 | 0.34884 | 0.58974 | 0.33354 | 0 | 0.77465 | GiámDốcTaiChinh TiềmNăng

A total of 921,310 messages (posts and their comments) and 121,349 user accounts were collected from social networks between 2016 and 2019. For each message, this research uses the user's ID (issued when the user joined the website), the user's name, the message content, the message sender and receiver, and the time factor. After the TART model has been trained on these data, the result is a set of interested topic vectors for each period.

Table 1 presents a representative set of ten interested topic vectors over seven topics (T-0 to T-6) for ten participants in January 2019. Each vector has seven components, and each component gives the user's level of interest in the corresponding topic. The data in Table 1 are thus a sample of the interested topic vectors of social network users, as produced by the TART model.

Let $C_{i}$ be a cluster of the Kohonen output layer. $C_{i}$ is formed by calculating the distance from each input vector to the weight vector of that cluster; following the Kohonen method, the input vector is then assigned to the cluster with the smallest distance. As a result, each neuron in the output layer corresponds to one cluster (community), that is, a set of objects with attributes (users, interested topics).
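
The assignment step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: Euclidean distance is an assumption (the text only says "distance"), the three weight vectors are hypothetical, and `nearest_neuron` is an illustrative name. The two input vectors are the first two rows of Table 1.

```python
import math
from collections import Counter

def nearest_neuron(vector, weights):
    """Return the index of the weight vector closest to `vector`.

    Euclidean distance is an assumption; the text only says "distance".
    """
    distances = [math.dist(vector, w) for w in weights]
    return distances.index(min(distances))

# Interest vectors v1 and v2 from Table 1 (topics T-0 to T-6).
vectors = [
    [0.64444, 0.34545, 0.46826, 0.0, 0.33721, 0.0, 0.0],
    [0.30435, 0.44565, 0.33333, 0.30435, 0.33333, 0.52941, 0.0],
]

# Hypothetical weight vectors for three output neurons (not from the paper).
weights = [
    [0.6, 0.3, 0.5, 0.0, 0.3, 0.0, 0.0],
    [0.3, 0.4, 0.3, 0.3, 0.3, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

# Each user is assigned to the neuron (community) with the smallest distance.
assignments = [nearest_neuron(v, weights) for v in vectors]

# Users per neuron: the quantity a shaded view of the output layer would encode.
users_per_neuron = Counter(assignments)
```

In a trained map the weight vectors would come from the Kohonen learning procedure; here they are fixed by hand only to show how the smallest-distance rule turns user vectors into community memberships.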

- The Kohonen output layer size: $14 \times 14$ (196 neurons).

- Each input vector has 15 elements corresponding to 15 topics.

- Time: January 2019.

- The number of users participating in January 2019: 2,244.

- Test result 1: the number of discovered communities is 41.

Each neuron in the Kohonen output layer in Fig. 2 is shaded according to the number of users in the corresponding community: darker neurons contain more users than lighter ones, and an empty neuron means that no community exists at that position. Each community is characterized by two traits: its topics of interest and its number of users. For example, in Fig. 2, community 13 at neuron 101 has 61 users who are interested in ten topics (the topics of community 13 are listed in Fig. 3). Among these topics, “Recruitment and Employment - Tuyển dụng việc làm” has the highest probability, 0.3401, followed by “School security – An ninh học đường” with a probability of 0.04454. “General education – Giáo dục” and “Union activities – Hoạt động đoàn hội” have probabilities of 0.02928 and 0.01975, respectively. The lowest probability, 0.00504, belongs to the “Study abroad – Du học” topic.

Fig. 4 shows the results of community discovery, including the users involved in each community and the topics that interest them. The users of community 13 are interested in several topics across a variety of sectors. Fig. 5 presents the communities discovered for interested topic 11 in January 2019. The “Project – Dự án” topic, which is related to education, attracts many communities.

The table in Fig. 6 shows the weight vectors of the communities for each topic. One striking point is that community 2 has the highest probability for multiple topics from T-1 to T-15, while the values for community 8 are moderately lower. The probabilities of communities 3, 7, and 10 do not differ markedly across the 15 topics, with figures ranging from 0.00103 to 0.17376. However, community 10 at the weight of T-13 has the smallest probability of 0.00437.

In addition to the Precision, Recall, and F1 scores, this article uses the root mean square standard deviation (RMSSTD) and R-squared (RS) values [26,27] to compare the clustering method proposed in the paper with the K-Medoids algorithm. The RMSSTD value measures the quality of the clustering algorithm and is given by formula (2). A lower RMSSTD value indicates a better clustering.

In formula (2), $k$ is the number of clusters, $p$ is the number of independent variables in the dataset, $\bar{x}_{ij}$ is the mean of the data for variable $j$ in cluster $i$, and $n_{ij}$ is the number of data values for variable $j$ in cluster $i$.
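
The display for formula (2) is missing from this copy. A standard form of RMSSTD, as defined by Halkidi et al. [26,27] and written with the symbols above, would be the following. The symbol $x_{a}$ (the $a$-th data value of variable $j$ in cluster $i$) is assumed notation introduced here for illustration.

$$RMSSTD = \sqrt{\frac{\sum_{i=1}^{k} \sum_{j=1}^{p} \sum_{a=1}^{n_{ij}} \left(x_{a} - \bar{x}_{ij}\right)^{2}}{\sum_{i=1}^{k} \sum_{j=1}^{p} \left(n_{ij} - 1\right)}}$$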

The average RMSSTD is calculated over 1,000 runs for each dataset. Formula (3) calculates the average value of RMSSTD:

The RS value measures how strongly the data objects differ between clusters and how similar they are within a cluster. If the RS value is 0, there is no difference between clusters. In contrast, if the RS value is 1, the clustering result is optimal. The RS value is calculated using formulas (4), (5), and (6):

In formulas (4) to (6), $SS_{t}$ is the total sum of squares of distances over all variables, $SS_{w}$ is the sum of squares of distances between the data objects within the same cluster, $k$ is the number of clusters, $p$ is the number of independent variables in the dataset, $\bar{x}_{ij}$ is the mean of the data for variable $j$ in cluster $i$, and $n_{ij}$ is the number of data values for variable $j$ in cluster $i$.
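
The displays for formulas (4) to (6) are missing from this copy. A standard form of the RS index, as defined by Halkidi et al. [26,27] and written with the symbols above, would be the following. The symbols $x_{a}$, $\bar{x}_{j}$, and $n_{j}$ (a data value, the mean of variable $j$ over the whole dataset, and the number of values of variable $j$) are assumed notation, and which expression carries which of the three equation numbers is not known.

$$RS = \frac{SS_{t} - SS_{w}}{SS_{t}}$$

$$SS_{t} = \sum_{j=1}^{p} \sum_{a=1}^{n_{j}} \left(x_{a} - \bar{x}_{j}\right)^{2}$$

$$SS_{w} = \sum_{i=1}^{k} \sum_{j=1}^{p} \sum_{a=1}^{n_{ij}} \left(x_{a} - \bar{x}_{ij}\right)^{2}$$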

The average RS value is calculated over 1,000 runs for each dataset, using formula (7).
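
Both indices can be computed from a labeled dataset as in the sketch below. It assumes the standard definitions from Halkidi et al. [26,27], in which RMSSTD is the square root of the pooled within-cluster variance and RS is the share of the total sum of squares explained by the clustering. The function names and the four toy points are hypothetical and are not taken from the paper's data.

```python
import math

def sum_of_squares(points):
    """Sum, over all variables, of squared deviations from the per-variable mean."""
    n, p = len(points), len(points[0])
    means = [sum(pt[j] for pt in points) / n for j in range(p)]
    return sum((pt[j] - means[j]) ** 2 for pt in points for j in range(p))

def rmsstd_and_rs(points, labels):
    """Return (RMSSTD, RS) for points grouped by their cluster labels.

    Assumes at least one cluster with two or more points and non-constant data.
    """
    clusters = {}
    for pt, label in zip(points, labels):
        clusters.setdefault(label, []).append(pt)
    p = len(points[0])
    ss_w = sum(sum_of_squares(members) for members in clusters.values())
    ss_t = sum_of_squares(points)
    # Denominator of RMSSTD: sum over clusters and variables of (n_ij - 1).
    dof = sum(p * (len(members) - 1) for members in clusters.values())
    return math.sqrt(ss_w / dof), (ss_t - ss_w) / ss_t

# Hypothetical toy data: two well-separated clusters in two variables.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
rmsstd, rs = rmsstd_and_rs(points, labels)
```

For this toy data the within-cluster sum of squares is 4 and the total is 104, so RMSSTD is 1.0 and RS is 100/104, about 0.96. Averaging over repeated runs then amounts to taking the arithmetic mean of the per-run values.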

**Evaluation by RMSSTD and RS values**

The dataset, which consists of the vector sets produced by the TART model (Table 1), is used with these evaluation measures to test the clustering methods and obtain the average values of RMSSTD and RS. Each test was repeated 1,000 times to obtain stable, reliable results. The number of clusters k was also varied to provide more points of comparison between the methods.

Table 2 shows the average RMSSTD and RS values. For every k, the Kohonen neural network method has a lower RMSSTD value than the K-Medoids method, which means that it performs better than the K-Medoids algorithm.

In this experiment, the two clustering algorithms are compared using the RMSSTD and RS values on the actual dataset from the TART model results. The calculation shows that the Kohonen neural network yields lower RMSSTD values and higher RS values than K-Medoids, which indicates that it is the better of the two algorithms. This can be explained by the fact that the datasets used in this research do not include noise or outlier data.

Table 2.

Cluster (k) | RMSSTD (Kohonen) | RMSSTD (K-Medoids) | RS (Kohonen) | RS (K-Medoids)
---|---|---|---|---
2 | 0.56032 | 0.67832 | 0.67632 | 0.61231
3 | 0.65235 | 0.76234 | 0.68932 | 0.62311
4 | 0.57642 | 0.65231 | 0.74350 | 0.65634
5 | 0.54324 | 0.58932 | 0.72341 | 0.66549
6 | 0.46352 | 0.49812 | 0.79831 | 0.75410
7 | 0.49482 | 0.57321 | 0.87322 | 0.81209
8 | 0.41521 | 0.46421 | 0.84321 | 0.78619

**Evaluation by Precision, Recall, and F1-score**

The precision between two clusters, denoted P, reflects the accuracy of the clustering and is calculated using formula (8). It is the ratio of the number of correctly clustered messages to the number of messages in cluster $k_{i}$. If P = 1, then all messages in cluster $k_{i}$ are among the messages of cluster $m_{i}$. In formula (8), a is the common part of the two compared clusters b and c [28].

Recall [28] between two clusters $m_{i}$ and $k_{i}$ is denoted R and calculated by formula (9). If R = 1, all messages in cluster $m_{i}$ belong to cluster $k_{i}$:

Combining precision with recall yields the F1 score [27]:
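
The displays for formulas (8) to (10) are missing from this copy. One reading that is consistent with the statements above about P = 1 and R = 1 treats each cluster as a set of messages. The set notation is an assumption introduced here and may differ from the paper's own a, b, c notation.

$$P = \frac{\left|k_{i} \cap m_{i}\right|}{\left|k_{i}\right|}, \qquad R = \frac{\left|k_{i} \cap m_{i}\right|}{\left|m_{i}\right|}, \qquad F_{1} = \frac{2 P R}{P + R}$$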

Following Brew and im Walde [29], the evaluation proceeds as follows. First, for each cluster in the clustering result, the system calculates the F1-score against every manually collected cluster. Next, the highest F1-score is picked out and the corresponding cluster is removed. The calculation then continues with the remaining clusters. The higher the total F1-score, the more accurate the clustering method. Table 3 gives the F1-scores for March 2019 and April 2019, with m = 5 manual clusters and k = 6 Kohonen clusters.
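
The matching procedure can be sketched as follows, with clusters represented as sets of message IDs. Two points are assumptions: precision and recall are taken as the overlap divided by the size of the found and manual cluster, respectively, and once a pair is matched both of its clusters are removed (the text does not say whether one or both are dropped). The function names and the example clusters are hypothetical.

```python
def f1_score(found, manual):
    """F1 between a discovered cluster and a manual cluster (sets of message IDs)."""
    common = len(found & manual)
    if common == 0:
        return 0.0
    precision = common / len(found)   # share of the found cluster that is correct
    recall = common / len(manual)     # share of the manual cluster that was found
    return 2 * precision * recall / (precision + recall)

def total_best_f1(found_clusters, manual_clusters):
    """Greedy matching: take the highest remaining F1, remove that pair, repeat."""
    scored_pairs = sorted(
        (
            (f1_score(found, manual), i, j)
            for i, found in enumerate(found_clusters)
            for j, manual in enumerate(manual_clusters)
        ),
        reverse=True,
    )
    used_found, used_manual, total = set(), set(), 0.0
    for score, i, j in scored_pairs:
        if i not in used_found and j not in used_manual:
            used_found.add(i)
            used_manual.add(j)
            total += score
    return total

# Hypothetical example: two discovered clusters and two manual clusters.
found_clusters = [{1, 2, 3}, {4, 5}]
manual_clusters = [{1, 2}, {3, 4, 5}]
total = total_best_f1(found_clusters, manual_clusters)
```

Here each discovered cluster matches one manual cluster with an F1-score of 0.8, so the total is 1.6 out of a possible 2.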

The total MAX values of the F1-score in Table 3 are 3.77 out of 5 for March 2019 and 4.08 out of 5 for April 2019. These totals are high, which supports the effectiveness of the community discovery (clustering) method proposed in this article, which combines the Kohonen neural network with the TART topic model.

Table 3.

Kohonen (k) / Manual (m) | Mar $m_{0}$ | Mar $m_{1}$ | Mar $m_{2}$ | Mar $m_{3}$ | Mar $m_{4}$ | Apr $m_{0}$ | Apr $m_{1}$ | Apr $m_{2}$ | Apr $m_{3}$ | Apr $m_{4}$
---|---|---|---|---|---|---|---|---|---|---
$k_{0}$ | 0.52 | 0.46 | 0.32 | 0.72 | 0.78 | 0.43 | 0.67 | 0.76 | 0.47 | 0.78
$k_{1}$ | 0.85 | 0.71 | 0.19 | 0.54 | 0.32 | 0.84 | 0.43 | 0.47 | 0.39 | 0.00
$k_{2}$ | 0.00 | 0.65 | 0.58 | 0.81 | 0.00 | 0.45 | 0.79 | 0.34 | 0.85 | 0.35
$k_{3}$ | 0.79 | 0.00 | 0.72 | 0.23 | 0.54 | 0.72 | 0.00 | 0.00 | 0.52 | 0.62
$k_{4}$ | 0.56 | 0.42 | 0.16 | 0.00 | 0.82 | 0.29 | 0.78 | 0.82 | 0.63 | 0.48
$k_{5}$ | 0.52 | 0.76 | 0.00 | 0.29 | 0.21 | 0.00 | 0.45 | 0.21 | 0.00 | 0.31
MAX | 0.85 | 0.76 | 0.72 | 0.62 | 0.82 | 0.84 | 0.79 | 0.82 | 0.85 | 0.78

Mar = March 2019; Apr = April 2019.
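
As a worked check of the arithmetic, the April 2019 total quoted in the text can be reproduced from the April 2019 columns of Table 3 by taking the largest F1-score in each manual-cluster column and summing the results. The variable names are illustrative, and reading the MAX row as a per-column maximum is an assumption based on the April 2019 values.

```python
# F1-scores for April 2019 from Table 3: rows k0..k5, columns m0..m4.
april_f1 = [
    [0.43, 0.67, 0.76, 0.47, 0.78],  # k0
    [0.84, 0.43, 0.47, 0.39, 0.00],  # k1
    [0.45, 0.79, 0.34, 0.85, 0.35],  # k2
    [0.72, 0.00, 0.00, 0.52, 0.62],  # k3
    [0.29, 0.78, 0.82, 0.63, 0.48],  # k4
    [0.00, 0.45, 0.21, 0.00, 0.31],  # k5
]

# Largest F1-score per manual cluster (one value per column).
column_max = [max(column) for column in zip(*april_f1)]

# Sum of the column maxima, to be compared with the best possible total of 5.
total = sum(column_max)
```

The column maxima are 0.84, 0.79, 0.82, 0.85, and 0.78, which sum to 4.08.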

This study makes three important and practical scientific contributions to user experience and community discovery.

First, the topic model was applied to social network analysis to discover topics from messages on social networks. The paper proposes a method that combines the topic model with labeling based on topic taxonomy. This method serves as the foundation for further research on the discovery, content analysis, and labeling to offer fresh insights into users’ experiences through social networks.

Second, this article shows how the TART model can be applied to assess the role of the individual’s interest in a topic based on a temporal factor. This model plays an essential role in finding the relationship among individuals on social media within the topic model. The model’s output is a set of vectors that consist of individuals’ traits on social networks.

Third, a method was constructed and developed to discover the community’s interests using the TART topic model. This method helps to identify the groups of users who have the same topic of interest but whose degree of interest varies across topics for each period. In addition, the Kohonen neural network is trained to discover the community of users interested in each topic. This proposal is called the community discovery method based on the TART topic model and clustering method. In particular, the method of community discovery distributes topics by community, specific topics of interest, and their probabilities. The results of community discovery are visualized in the Kohonen output layer.

In future work, we will concentrate on analyzing how a community's topics spread on social networks, with the aim of determining the path and the source of information. Further, we can build time systems (containing the overlap property) to analyze online social networks over different periods with the topic model and a big data solution. A model for discovering customer behavior and experience in the tourism sector, built on a big data platform and the topic model, is another important direction.

He received the M.S. degree in computer science from the University of Information Technology, VNU-HCM, Vietnam, in 2009 and the Ph.D. degree in computer science from the University of Information Technology, VNU-HCM, Vietnam. Dr. Ho is currently a lecturer in the Faculty of Information Systems, University of Economics and Law, VNU-HCM, Vietnam. His research interests are data mining, data analytics, business intelligence, social network analysis, and big data.

He received his M.S. degree in computer science from the University of Information Technology, VNU-HCM, Vietnam. Mr. Tran is currently a lecturer in the Faculty of Information Systems, University of Economics and Law, VNU-HCM, Vietnam. His research interests are social network analysis, big data, AI, and robotics.

- 1 C. C. Aggarwal, *Social Network Data Analytics*. Boston, MA: Springer, 2011.
- 2 L. Berkani, S. Belkacem, M. Ouafi, and A. Guessoum, "Recommendation of users in social networks: A semantic and social based classification approach," *Expert Systems*, no. e12634, 2020. doi:10.1111/exsy.12634
- 3 C. C. Aggarwal and K. Subbian, "Event detection in social streams," in *Proceedings of the 2012 SIAM International Conference on Data Mining*, Anaheim, CA, 2012, pp. 624-635.
- 4 C. Li, W. K. Cheung, Y. Ye, X. Zhang, D. Chu, and X. Li, "The author-topic-community model for author interest profiling and community discovery," *Knowledge and Information Systems*, vol. 44, no. 2, pp. 359-383, 2015. doi:10.1007/s10115-014-0764-9
- 5 D. Zhou, I. Councill, H. Zha, and C. L. Giles, "Discovering temporal communities from social network documents," in *Proceedings of the 7th IEEE International Conference on Data Mining (ICDM)*, Omaha, NE, 2007, pp. 745-750.
- 6 N. Pathak, C. DeLong, K. Erickson, and A. Banerjee, "Social topic models for community extraction," Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 2008.
- 7 X. Wang, N. Mohanty, and A. McCallum, "Group and topic discovery from relations and their attributes," *Advances in Neural Information Processing Systems*, vol. 18, pp. 1449-1456, 2006.
- 8 X. Wang, N. Mohanty, and A. McCallum, "Group and topic discovery from relations and their attributes," *Advances in Neural Information Processing Systems*, vol. 18, pp. 1449-1456, 2006.
- 9 A. Beykikhoshk, O. Arandjelovic, D. Phung, and S. Venkatesh, "Discovering topic structures of a temporally evolving document corpus," *Knowledge and Information Systems*, vol. 55, no. 3, pp. 599-632, 2018. doi:10.1007/s10115-017-1095-4
- 10 L. C. Freeman, "Visualizing social networks," *Journal of Social Structure*, 2000 [Online]. Available: https://www.cmu.edu/joss/content/articles/volume1/Freeman.html
- 11 H. H. Kim and H. Y. Rhee, "An ontology-based labeling of influential topics using topic network analysis," *Journal of Information Processing Systems*, vol. 15, no. 5, pp. 1096-1107, 2019.
- 12 Z. Yin, L. Cao, Q. Gu, and J. Han, "Latent community topic analysis: Integration of community discovery with topic modeling," *ACM Transactions on Intelligent Systems and Technology (TIST)*, vol. 3, no. 4, pp. 1-21, 2012. doi:10.1145/2337542.2337548
- 13 T. Ho and P. Do, "Analyzing the changes in online community based on topic model and self-organizing map," *International Journal of Advanced Computer Science and Applications (IJACSA)*, vol. 6, no. 7, pp. 100-108, 2015.
- 14 D. M. Sharma and M. M. Baig, 2015 [Online]. Available: https://www.researchgate.net/profile/Durgesh_Sharma8/publication/325120893_Using_Data_Mining_For_Prediction_A_Conceptual_Analysis/links/5ef35b3d92851c35353ba7c4/Using-Data-Mining-For-Prediction-A-Conceptual-Analysis.pdf
- 15 H. Fani, F. Zarrinkalam, X. Zhao, Y. Feng, E. Bagheri, and W. Du, 2015 [Online]. Available: https://arxiv.org/abs/1509.04227
- 16 M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths, "Probabilistic author-topic models for information discovery," in *Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, Seattle, WA, 2004, pp. 306-315.
- 17 T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, "Detecting communities and their evolutions in dynamic social networks: a Bayesian approach," *Machine Learning*, vol. 82, no. 2, pp. 157-189, 2011.
- 18 T. Griffiths, 2002 [Online]. Available: https://people.cs.umass.edu/~wallach/courses/s11/cmpsci791ss/readings/griffiths02gibbs.pdf
- 19 J. Singh and A. K. Singh, *Annals of Mathematics and Artificial Intelligence*, 2020. https://doi.org/10.1007/s10472-020-09709-z
- 20 T. Ho and P. Do, "Social network analysis based on topic model with temporal factor," *International Journal of Knowledge and Systems Science (IJKSS)*, vol. 9, no. 1, pp. 82-97, 2018.
- 21 H. A. Abdelbary, A. M. ElKorany, and R. Bahgat, "Utilizing deep learning for content-based community detection," in *Proceedings of 2014 Science and Information Conference*, London, UK, 2014, pp. 777-784.
- 22 T. Kohonen, "Self-organized formation of topologically correct feature maps," *Biological Cybernetics*, vol. 43, no. 1, pp. 59-69, 1982.
- 23 S. Haykin, *Neural Networks: A Comprehensive Foundation*, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999, pp. 443-465.
- 24 T. Kohonen, *Self-Organization and Associative Memory*. Berlin: Springer, 1984.
- 25 T. Joachims, "Transductive inference for text classification using support vector machines," in *Proceedings of the 16th International Conference on Machine Learning (ICML)*, Bled, Slovenia, 1999, pp. 200-209.
- 26 M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "Cluster validity methods: part I," *ACM SIGMOD Record*, vol. 31, no. 2, pp. 40-45, 2002. doi:10.1145/565117.565124
- 27 M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "Clustering validity checking methods: Part II," *ACM SIGMOD Record*, vol. 31, no. 3, pp. 19-27, 2002. doi:10.1145/601858.601862
- 28 T. Fawcett, "An introduction to ROC analysis," *Pattern Recognition Letters*, vol. 27, no. 8, pp. 861-874, 2006. doi:10.1016/j.patrec.2005.10.010
- 29 C. Brew and S. S. im Walde, "Spectral clustering for German verbs," in *Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, Philadelphia, PA, 2002, pp. 117-124.