Research on Risk Assessment of Internal Audit Information in Enterprises through Data Mining

Lu Xia; Tiantian Wang

doi:10.3745/JIPS.04.0361

ISSN: 2092-805X

Volume 21, No 6 (2025), pp. 575 - 584

Abstract: This paper initially segmented the audit information samples using a clustering algorithm. Subsequently, a backpropagation neural network (BPNN) algorithm enhanced by the beetle antennae search (BAS) algorithm was used for risk assessment. Financial report data was crawled from listed companies for a case analysis to assess the impact of the number of clustering centers on the clustering algorithm and the effect of the activation function type on the improved BPNN algorithm. Additionally, the audit information risk identification performance of the support vector machine (SVM), traditional BPNN, sparse autoencoder-BPNN, and the improved BPNN algorithm was compared. The findings revealed that the clustering algorithm demonstrated optimal sample division performance when utilizing two clustering centers. Moreover, the improved BPNN algorithm exhibited superior performance under the sigmoid activation function, outperforming both SVM and traditional BPNN algorithms.

Keywords: Audit Information, Data Mining, Deep Learning, Risk Identification

1. Introduction

Enterprises engage in various fiscal, financial, and business management activities as they develop, and keeping these activities healthy is crucial for enterprise growth [1]. To ensure the safety of economic activities within an enterprise, it is important not only for the company to establish reasonable regulations but also for its economic activities to be supervised [2]. Audit activities involve examining a supervised unit's fiscal, financial, and business management activities and related information using authorized methods and tools in compliance with laws and regulations [3]. In the current era of rapid information technology advancement, internal auditing has become an important component of enterprise risk management and internal control. Traditional auditing methods often rely on expert experience and manual analysis, which can be subjective and inefficient. With the emergence of big data technology, the application of data mining to assessing internal audit information risks has gained increasing attention [4]. Using models and algorithms, data mining can screen, extract, and analyze vast amounts of data to help corporate audit departments identify potential risks, thereby enhancing the efficiency and accuracy of audit work. Danchenko et al. [5] analyzed the results of domestic and foreign researchers in the field of IT audit project management and used the Ishikawa cause-and-effect diagram method to determine the factors that increase the time and cost of IT audit projects. Hao and Qiu [6] employed the random forest (RF) algorithm to construct a classification and identification model for audit risks. They found that it had high prediction accuracy and improved robustness, which can enhance the risk resistance capacity of Chinese listed companies. Semenets [7] proposed scientifically grounded suggestions for enhancing enterprise risk assessment methods within internal auditing systems and also developed an algorithm for identifying internal audit risk in the entity's controlled environment. This study used a clustering algorithm to initially segment audit information samples and applied the backpropagation neural network (BPNN) algorithm enhanced by the beetle antennae search (BAS) algorithm for risk assessment. Subsequently, the financial report data of listed companies was crawled for a case study.

2. Risk Assessment of Audit Information using Data Mining Technology

As a distinctive technical tool within information technology, data mining technology can significantly contribute to the risk assessment of audit information [8]. Within audit projects, data mining technology enables auditors to extract essential information from vast data, facilitating accurate risk assessment of audit information.

2.1 Clustering Algorithms

The clustering algorithm is a commonly used data mining technique, and its primary role is to classify big data according to the degree of similarity. It is an unsupervised classification model [9]. The steps for clustering and classifying the risk of audit information are shown below.

Step 1: The number of category items for clustering, i.e., K, is set according to the demand. Each category item randomly generates a clustering center, and K clustering centers are generated.

Step 2: According to the principle of proximity allocation, each audit information sample is allocated to the category item of its nearest clustering center [10]. Proximity is measured by the Euclidean distance:

(1)

[TeX:] $$d_{a, k}=\sqrt{\sum_{i=1}^o\left(a_i-k_i\right)^2}$$

where [TeX:] $$d_{a, k}$$ is the Euclidean distance between audit information sample a and clustering center k, [TeX:] $$a_i \text{ and } k_i$$ are the i-th feature components of a and k, i.e., their values for the i-th audit risk assessment indicator, and o is the number of indicators.

Step 3: After the audit information samples have been allocated, the mean value method is used to recalculate the new clustering center for each category, i.e., the mean value of each feature indicator in the category is used as the feature of the new clustering center.

Step 4: Return to Step 2 until the clustering center converges to stability.

Step 5: The audit information samples in each category are labeled.
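Steps 1 to 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the initial centers are drawn from the samples themselves (the text only says they are generated randomly), the names `euclidean` and `kmeans` are hypothetical, and the six two-indicator samples are made up for the demonstration.

```python
import math
import random

def euclidean(a, k):
    """Eq. (1): Euclidean distance between sample a and clustering center k."""
    return math.sqrt(sum((ai - ki) ** 2 for ai, ki in zip(a, k)))

def kmeans(samples, n_clusters, max_iter=100, seed=0):
    """Cluster samples into n_clusters categories following Steps 1-5."""
    rng = random.Random(seed)
    # Step 1: draw K distinct samples as the initial clustering centers
    # (assumption: the paper does not say how the random centers are drawn).
    centers = [list(s) for s in rng.sample(samples, n_clusters)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Step 2: allocate every sample to its nearest clustering center.
        labels = [min(range(n_clusters), key=lambda j: euclidean(s, centers[j]))
                  for s in samples]
        # Step 3: the new center is the per-indicator mean of its members.
        new_centers = []
        for j in range(n_clusters):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        # Step 4: stop once the centers have stabilized.
        if all(euclidean(c, n) < 1e-9 for c, n in zip(centers, new_centers)):
            break
        centers = new_centers
    # Step 5: labels[i] is the category assigned to samples[i].
    return labels, centers

# Made-up toy data: two well-separated groups of two-indicator samples.
toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
       [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = kmeans(toy, n_clusters=2)
```

The category numbers returned in `labels` are arbitrary; deciding which cluster corresponds to high risk still requires inspecting the cluster centers.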

The advantage of the clustering algorithm in classifying audit information samples is that it can categorize samples without pre-training. The classification emerges from the iterative process, with no need to determine the general tendency of each sample beforehand, and the simplicity of the principle allows for high classification speeds. However, the clustering algorithm only partitions the samples currently at hand; as a result, the final classification outcomes primarily reflect relative differences between samples rather than fully capturing the actual risk levels. Moreover, a new audit information sample cannot be classified directly; instead, it must be incorporated into the original sample set, and the clustering iteration must be run again. Therefore, the clustering algorithm is more suitable for classifying a fixed set of samples and has insufficient classification ability when faced with unknown new samples.

2.2 Neural Network Algorithms

For the predictive classification of audit information samples whose risk categories are unknown, neural network algorithms from deep learning can be utilized. These algorithms classify new samples using the laws fitted during training [11]. Typically, the category labels of training samples are assigned manually. However, to ensure that the neural network can fit hidden patterns as fully as possible during training, a sufficient number of training samples is necessary, and manual labeling at that scale is inefficient and subjective. Therefore, this paper employs the clustering algorithm to segment the sample set and assign category labels.

Subsequently, the BPNN algorithm is trained using the samples labeled with categories. During the training process, the feature indicators of the samples are first fed into the hidden layer of the algorithm for forward computation. The results are then compared with the sample labels, and the weight parameters in the hidden layer are adjusted based on the differences observed. Usually, the gradient descent method is employed to adjust the weight parameters. However, this method easily converges to a local optimum because of the influence of the initial weights and learning rate [12]. To address this limitation, this paper introduces the BAS algorithm to optimize the weight parameters of the BPNN algorithm.

The BAS algorithm simulates the foraging behavior of the beetle, i.e., moving towards the most pungent odor based on the difference in odor intensity detected by its left and right antennae. In this study, the “coordinates” of the “beetle antennae” represent a parameter configuration of the neural network algorithm. The reciprocal of the forward computation error of the neural network algorithm at the “antennae coordinates” signifies the intensity of the odor sensed by the “antennae” [13]. According to the intensity of the odor sensed by the left and right “antennae”, the iterative movement of the beetle is adjusted in the direction of the most pungent odor. The position coordinates of the beetle body are the actual parameter scheme of the neural network algorithm. The training process for the improved BPNN algorithm is as follows.

Step 1: The training samples are segmented and labeled with risk categories using the clustering algorithm.

Step 2: Parameters are initialized.

Step 3: The initial position of the “beetle” is set according to the initial structural parameters of the BPNN algorithm (each parameter to be optimized is used as the positional coordinates of the “beetle”).

Step 4: The intensity of the odor felt by the left and right “antennae” is calculated according to the position of the “beetle.” The formula is:

(2)

[TeX:] $$\left\{\begin{array}{l} w_l=w+d_0 \frac{d i r}{2} \\ w_r=w-d_0 \frac{d i r}{2} \\ F_l=\frac{1}{\left|f\left(w_l, x\right)-y\right|} \\ F_r=\frac{1}{\left|f\left(w_r, x\right)-y\right|} \end{array}\right.$$

where w is the position of the “beetle,” which is the actual structural parameter scheme of the BPNN algorithm, [TeX:] $$w_l \text { and } w_r$$ are the positions of the left and right antennae of the “beetle,” which represent the two temporary structural parameter schemes of the BPNN algorithm used to guide the iterative direction of the “beetle,” [TeX:] $$d_0$$ denotes the distance between the left and right antennae, dir is the normalized random vector, [TeX:] $$F_l \text { and } F_r$$ denote the intensity of the odor sensed at the positions of the left and right antennae, x denotes the set of samples, y denotes the category label set corresponding to x, and [TeX:] $$f\left(w_l, x\right) \text { and } f\left(w_r, x\right)$$ denote the forward computation of the BPNN algorithm with [TeX:] $$w_l \text { and } w_r$$ substituted as its parameters.

Step 5: The position of the “beetle” is iterated based on the intensity of the odor perceived by the left and right antennae:

(3)

[TeX:] $$w^{\prime}= \begin{cases}w+\text { step } \cdot \operatorname{dir} \cdot\left(w_l-w_r\right) & F_l \lt F_r \\ w-\text { step } \cdot \operatorname{dir} \cdot\left(w_l-w_r\right) & F_l \gt F_r\end{cases}$$

where [TeX:] $$w^{\prime}$$ is the iterated position of the beetle and step is the iterative moving step length of the beetle.

Step 6: The iterated position of the “beetle” is substituted as the structural parameter scheme of the BPNN algorithm. The training samples are input for forward computation, and the difference between the computed result and the sample label is calculated. Whether the difference has converged is then determined [14]. If it has converged to stability, the training ends; otherwise, the process returns to Step 4 to continue the iteration.
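The search loop of Steps 3 to 6 can be sketched as follows. This is a simplified illustration, not the authors' code. The update step uses the common sign-based form of BAS, stepping along dir toward whichever antenna senses the stronger odor as the prose describes; it is an assumption rather than a transcription of Eq. (3). A single sigmoid neuron on made-up data stands in for the BPNN forward computation f(w, x), and the names `bas_search` and `toy_error` are hypothetical. The defaults for `d0`, `step`, and `max_iter` are the values reported in Section 3.3.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Normalized random direction (dir in Eq. 2)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]

def bas_search(error_fn, w0, d0=0.1, step=0.01, max_iter=500, seed=0, eps=1e-12):
    """Beetle antennae search over a parameter vector w (the beetle position)."""
    rng = random.Random(seed)
    w = list(w0)                      # Step 3: initial beetle position
    best_w, best_err = list(w), error_fn(w)
    for _ in range(max_iter):
        direction = random_unit_vector(len(w), rng)
        # Step 4 (Eq. 2): antenna positions and the odor sensed at each one.
        w_l = [wi + d0 * di / 2 for wi, di in zip(w, direction)]
        w_r = [wi - d0 * di / 2 for wi, di in zip(w, direction)]
        f_l = 1.0 / (error_fn(w_l) + eps)   # eps avoids division by zero (assumption)
        f_r = 1.0 / (error_fn(w_r) + eps)
        # Step 5: step toward the antenna with the stronger odor (assumed sign rule).
        sign = 1.0 if f_l > f_r else -1.0
        w = [wi + sign * step * di for wi, di in zip(w, direction)]
        # Step 6: evaluate the new position and remember the best one seen.
        err = error_fn(w)
        if err < best_err:
            best_w, best_err = list(w), err
    return best_w, best_err

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up stand-in for the network: one sigmoid neuron with weight w[0] and bias w[1].
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

def toy_error(w):
    """Mean absolute error |f(w, x) - y| over the toy samples."""
    return sum(abs(sigmoid(w[0] * x + w[1]) - y) for x, y in zip(xs, ys)) / len(xs)

initial_err = toy_error([0.0, 0.0])
best_w, best_err = bas_search(toy_error, [0.0, 0.0])
```

For a real network, `w` would be the flattened weight vector and `error_fn` would run the forward computation over the labeled training samples. The fixed iteration budget here replaces the convergence check of Step 6.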

3. Case Analysis

3.1 Experimental Data

To facilitate the collection of data for enterprise internal audit risk assessment, this paper collected relevant data from listed companies. Listed companies were chosen because they are obliged to disclose financial reports to the stock market, and these reports include audit opinions and the indicators used for assessing audit information risk. Financial statement information from A-share companies between 2020 and 2023 was extracted, and the statement data of companies in the financial and insurance industry, companies with abnormal reports, and companies that had been delisted were excluded. Subsequently, to evaluate the classification performance of the clustering algorithm and the improved BPNN algorithm, the audit information risk levels of the collected samples were labeled. The labeling criteria are as follows. Financial report samples receiving standard unqualified opinions or unqualified opinions with an emphasis but penalized by the China Securities Regulatory Commission were categorized as high audit information risk; all other samples were classified as low audit information risk.

3.2 Risk Assessment Indicators for Enterprise Audit Information

The indicators utilized for evaluating the risk of corporate audit information are classified into financial and non-financial categories, as depicted in Table 1. Financial indicators encompass solvency, profitability, operational capacity, and development capacity, which can be quantitatively measured. Non-financial indicators encompass corporate management, regulatory environment, and internal control, which are unrelated to the enterprise's financial standing but serve as assessments of the internal management practices within the organization.

3.3 Experimental Setup

The parameters of the improved BPNN algorithm were set as follows. The input layer contained 19 nodes, the hidden layer consisted of 128 nodes, and the output layer comprised two nodes. The distance between the left and right antennae of the “beetle” was set to 0.1, the moving step length of the “beetle” was set to 0.01, and the maximum number of training iterations was set to 500.

In the experimental phase, the clustering algorithm's performance was initially evaluated under two, three, four, and five clustering centers. Subsequently, the impact of different activation function types in the hidden layer of the improved BPNN algorithm on its performance was examined. Three activation function types were considered: tanh, sigmoid, and ReLU.
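To make the setup concrete, the sketch below builds a forward pass with the layer sizes given above (19 inputs, 128 hidden nodes, 2 outputs) and a selectable hidden-layer activation. Everything else is an assumption for illustration: the uniform weight initialization, the zero biases, the softmax on the output layer, and the names `init_weights` and `forward` do not come from the paper.

```python
import math
import random

# The three hidden-layer activation functions compared in the experiment.
ACTIVATIONS = {
    "tanh": math.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "relu": lambda z: max(0.0, z),
}

def init_weights(n_in=19, n_hidden=128, n_out=2, seed=0):
    """Random small weights and zero biases (initialization scheme is assumed)."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
        "w2": layer(n_out, n_hidden), "b2": [0.0] * n_out,
    }

def forward(params, x, activation="sigmoid"):
    """Forward computation: input -> hidden (activation) -> output (softmax)."""
    act = ACTIVATIONS[activation]
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(params["w1"], params["b1"])]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(params["w2"], params["b2"])]
    # Softmax over the two output nodes (assumed output normalization).
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

params = init_weights()
x = [0.5] * 19  # one made-up sample with 19 normalized indicator values
probs = {name: forward(params, x, name) for name in ACTIVATIONS}
```

Each entry of `probs` holds two class scores for the same sample, so the effect of swapping the activation function can be compared directly with all weights held fixed.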

Table 1.

Indicators for evaluating projects

Type of indicator	Level 1 indicator	Level 2 indicator
Financial indicator	Solvency	Current assets ratio
		Cash ratio
		Asset-liability ratio
	Profitability	Return on net assets
		Net interest rate on total assets
		Stock return rate
	Operating ability	Cargo turnover ratio
		Total asset turnover ratio
		Accounts receivable turnover ratio
	Development capacity	Year-on-year revenue growth
		Stock net asset growth rate
		Total asset growth rate
Non-financial indicator	Company management	Shareholding concentration indicator
		Whether or not two positions are combined
		Number of independent directors
		Regulatory shareholding ratio
	Regulatory environment	State-owned or not
	Regulatory environment	Audit opinion from the previous year
	Internal control	Number of directors

3.4 Evaluation Criteria

The clustering algorithm was measured using the silhouette coefficient [15], whose computational formula is:

(4)

[TeX:] $$s=\frac{\sum_{i=1}^N \frac{b(i)-a(i)}{\max (a(i), b(i))}}{N}$$

where a(i) is the average distance between sample i and the other samples in its own cluster, b(i) is the average distance between sample i and the samples in the nearest neighboring cluster, N is the number of samples, and s is the silhouette coefficient.
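A direct, unoptimized computation of Eq. (4) might look like the sketch below. It assumes Euclidean distance and at least two clusters, and it lets a sample that is alone in its cluster contribute zero (a common convention, not stated in the text). The function name and the toy data are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def silhouette_coefficient(samples, labels):
    """Eq. (4): mean of (b(i) - a(i)) / max(a(i), b(i)) over all N samples."""
    n = len(samples)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        # Mean distance from sample i to the members of every cluster,
        # excluding sample i itself.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(euclidean(samples[i], samples[j])
                                   for j in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            continue  # singleton cluster: contributes 0 (assumed convention)
        a = mean_dist[own]                                        # a(i)
        b = min(d for c, d in mean_dist.items() if c != own)      # b(i)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / n

# Made-up toy data: two well-separated groups.
toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
       [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
good = silhouette_coefficient(toy, [0, 0, 0, 1, 1, 1])  # natural grouping
bad = silhouette_coefficient(toy, [0, 1, 0, 1, 0, 1])   # mixed-up grouping
```

A value of s close to 1 indicates compact, well-separated clusters, which is why the natural grouping scores far higher than the mixed-up one.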

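The risk identification algorithms were measured using the precision, recall rate, and F value, which are calculated as:

(5)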

[TeX:] $$\left\{\begin{array}{l} P=\frac{T P}{T P+F P} \\ R=\frac{T P}{T P+F N} \\ F=\frac{2 \cdot P \cdot R}{P+R} \end{array}\right.$$

where P denotes the precision, R denotes the recall rate, F is a combined evaluation of the precision and recall rate, TP is the number of risky samples that are predicted as risky, FP is the number of non-risky samples that are predicted as risky, and FN is the number of risky samples that are predicted as non-risky.
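As a worked example of Eq. (5), the sketch below computes P, R, and F for the six enterprises listed in Table 5, treating the expert result as the true label and "High risk" as the risky class (both are assumptions made for this illustration). Six samples are far too few to reproduce the figures in Table 4; the point is only to show how the counts enter the formulas. The function name is hypothetical.

```python
def precision_recall_f(truth, predicted, positive="High risk"):
    """Eq. (5), with TP, FP, and FN counted against the reference labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

H, L = "High risk", "Low risk"
# Rows of Table 5, in the order of enterprises 1, 12, 15, 26, 38, 59.
experts = [L, L, H, H, L, H]
table5 = {
    "SVM":              [H, H, H, L, L, L],
    "Traditional BPNN": [L, H, H, L, L, H],
    "BAS-BPNN":         [L, L, H, H, L, H],
    "SAE-BPNN":         [L, L, H, L, H, H],
}
scores = {name: precision_recall_f(experts, pred) for name, pred in table5.items()}
```

On this subset, BAS-BPNN scores 1 on all three measures, the traditional BPNN and SAE-BPNN each score 2/3, and the SVM scores 1/3.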

3.5 Experimental Results

The clustering algorithm's clustering performance was initially assessed under varying numbers of clustering centers, and their silhouette coefficients are presented in Table 2. The table illustrates that as the number of clustering centers increased, the silhouette coefficient of the clusters decreased. The results suggested that the clustering algorithm achieved optimal effectiveness in clustering audit information samples when the number of clustering centers was set to 2, i.e., the audit information sample should be divided into high-risk and low-risk samples.

The impacts of three different types of activation functions on the improved BPNN algorithm are presented in Table 3. The table illustrates that the performance of the improved BPNN algorithm in identifying and classifying audit information risk varied under the three activation functions. When the activation function was set to sigmoid, the improved BPNN algorithm demonstrated the most effective performance.

Table 2.

Silhouette coefficients of the clustering algorithm under different numbers of clustering centers

Number of clustering centers	2	3	4	5
Silhouette coefficient	0.68	0.51	0.42	0.33

Table 3.

Effect of different activation functions on the improved BPNN algorithm

	Precision	Recall rate	F
Tanh	0.84	0.83	0.83
Sigmoid	0.97	0.96	0.96
ReLU	0.82	0.84	0.83

Table 4.

Performance comparison of four different audit information risk identification algorithms

	Precision	Recall rate	F
SVM	0.68	0.69	0.68
Traditional BPNN	0.84	0.83	0.83
BAS-BPNN	0.97	0.96	0.96
SAE-BPNN	0.86	0.85	0.86

The support vector machine (SVM), traditional BPNN, sparse autoencoder (SAE)-BPNN, and BAS-optimized BPNN algorithms were tested and compared (Table 4). The table indicates that, among the four algorithms, the BPNN algorithm improved by BAS demonstrated the highest performance in recognizing the risk of audit information, followed by the SAE-BPNN and traditional BPNN algorithms. The SVM algorithm showed the poorest performance. Meanwhile, ten experts were invited to audit the audit information samples manually. Because the test set was large, the efficiency of expert manual auditing was limited; moreover, owing to space restrictions, only some of the samples were selected for testing and comparison (Table 5). As shown in Table 5, only the audit results of the BAS-BPNN algorithm were consistent with the results of the expert manual audits. The audit results of the traditional BPNN and SAE-BPNN algorithms each differed from the expert results on two of the six listed samples, and the SVM algorithm differed the most, on four of the six.

Table 5.

Risk identification results of four audit risk identification algorithms and expert manual audit

Enterprise number	Experts	SVM	Traditional BPNN	BAS-BPNN	SAE-BPNN
1	Low risk	High risk	Low risk	Low risk	Low risk
12	Low risk	High risk	High risk	Low risk	Low risk
15	High risk	High risk	High risk	High risk	High risk
26	High risk	Low risk	Low risk	High risk	Low risk
38	Low risk	Low risk	Low risk	Low risk	High risk
59	High risk	Low risk	High risk	High risk	High risk

The SVM algorithm, despite using labeled training samples for supervised training, separates the classes with a hyperplane determined by its support vectors. This is a linear division law, which has difficulty fitting the nonlinear laws of risk differences between audit samples. In contrast, as a deep learning technology, the traditional BPNN algorithm utilizes the activation function in the hidden layer and multiple neural nodes to fit nonlinear laws, resulting in more effective risk identification. In the improved BPNN algorithm, the BAS algorithm replaced the gradient descent method of the traditional BPNN algorithm for weight parameter adjustment. With the random direction guidance provided by the two antennae of the “beetle,” the BAS algorithm kept the weight parameters of the BPNN algorithm from falling into local optima, thus achieving the best performance in risk identification for audit information.

This paper used the BPNN algorithm to identify internal audit risks in enterprises and improved it with the BAS algorithm to enhance the accuracy of risk identification. The BPNN algorithm is a relatively basic deep learning algorithm, so it leaves considerable room for extension. For example, the way parameters are adjusted during training can be improved, or the algorithm can be combined with other algorithms to obtain deeper features, so that it can fit nonlinear laws more accurately. Deep learning algorithms can deeply explore the hidden patterns in big data, which is what allows them to identify internal audit risks here. However, exploring these hidden patterns also means that some private information will be dug out. Therefore, attention should be paid to data desensitization when deep learning algorithms are used to process data.

4. Conclusion

In this paper, the audit information samples were initially divided using a clustering algorithm, followed by risk assessment conducted using the BPNN algorithm optimized by the BAS algorithm. Subsequently, a case analysis was carried out by extracting financial report data from listed enterprises. During the analysis, the performance of the clustering algorithm was evaluated first, and then the impact of activation function types on the improved BPNN algorithm was examined. Finally, the performance of the SVM, traditional BPNN, SAE-BPNN, and improved BPNN algorithms was compared. The clustering algorithm demonstrated optimal sample clustering performance when the number of clustering centers was set to 2. The improved BPNN algorithm achieved the best performance in identifying audit information risks when the activation function was set to sigmoid. The BPNN algorithm enhanced by the BAS algorithm exhibited superior performance in identifying audit information risks, followed by the SAE-BPNN and traditional BPNN algorithms, and the SVM algorithm performed the least effectively. Moreover, the audit results of the BAS-BPNN algorithm fit the manual audit results of the experts best, and those of the SVM algorithm fit them worst.

Conflict of Interest

The authors declare that they have no competing interests.

Funding

This study was supported by Anhui Province Humanities and Social Sciences Key Project "Research on Key Technologies and Application Scenarios of Intelligent Accounting" (No. 2024AH052439).

Biography

Lu Xia

https://orcid.org/0009-0003-3474-0525

She graduated from Anhui University in 2015. She is working at Anhui Sanlian University. She is engaged in the research of accounting and financial accounting.

Biography

Tiantian Wang

https://orcid.org/0009-0004-4194-3491

He was born in June 1989 and graduated from Anhui University of Technology in 2013. He is working at Anhui Sanlian University. He is engaged in the research and application of intelligent accounting.

References

1 A. Dewi, Y. Latief, and L. Sagita, "Activity and risk identification in audit process on integrated management system to increase performance efficiency of construction services organization in Indonesia," IOP Conference Series: Earth and Environmental Science, vol. 426, article no. 012014, 2020. https://doi.org/10.1088/1755-1315/426/1/012014
2 D. K. Chronopoulos, L. M. Rempoutsika, and J. O. S. Wilson, "Audit committee oversight and bank financial reporting quality," Journal of Business Finance and Accounting, vol. 51, no. 1-2, pp. 657-687, 2024. https://doi.org/10.1111/jbfa.12738
3 R. D. Frank, "Risk in trustworthy digital repository audit and certification," Archival Science, vol. 22, pp. 43-73, 2022. https://doi.org/10.1007/s10502-021-09366-z
4 S. Xu, "Model for evaluating the commercial banks financial risk with interval grey uncertain linguistic variables," Journal of Intelligent & Fuzzy Systems, vol. 28, no. 2, pp. 767-773, 2015. https://doi.org/10.3233/IFS-141358
5 E. Danchenko, V. Alba, R. Berezensky, and O. Savina, "Identification and risk analysis of it-audit projects," Bulletin of NTU "KhPI" Strategic Management, Portfolio, Program and Project Management, vol. 1, no. 3, pp. 24-31, 2021. https://doi.org/10.20998/2413-3000.2021.3.4
6 Y. Hao and F. Qiu, "Research on the application of DM technology with RF in enterprise financial audit," Mobile Information Systems, vol. 2022, article no. 4051469, 2022. https://doi.org/10.1155/2022/4051469
7 A. Semenets, "Formalization of theoretical tools for risks of internal audit system identification," Fundamental and Applied Researches in Practice of Leading Scientific Schools, vol. 31, no. 1, pp. 182-189, 2019.
8 C. Grisse and T. Nitschka, "On financial risk and the safe haven characteristics of Swiss franc exchange rates," Journal of Empirical Finance, vol. 32, pp. 153-164, 2015. https://doi.org/10.1016/j.jempfin.2015.03.006
9 S. R. Cardoso, A. P. Barbosa-Povoa, and S. Relvas, "Integrating financial risk measures into the design and planning of closed-loop supply chains," Computers & Chemical Engineering, vol. 85, pp. 105-123, 2016. https://doi.org/10.1016/j.compchemeng.2015.10.012
10 L. Kumar, A. Jindal, and N. R. Velaga, "Financial risk assessment and modelling of PPP based Indian highway infrastructure projects," Transport Policy, vol. 62, pp. 2-11, 2018. https://doi.org/10.1016/j.tranpol.2017.03.010
11 H. Wan, Q. Yu, J. Ding, and K. Liu, "Students' behavior analysis under the Sakai LMS," in Proceedings of 2017 IEEE 6th International Conference on Teaching, Assessment, and Learning for Engineering (TALE), Hong Kong, 2017, pp. 250-255. https://doi.org/10.1109/TALE.2017.8252342
12 M. Cui, "Big data medical behavior analysis based on machine learning and wireless sensors," Neural Computing & Applications, vol. 34, no. 12, pp. 9413-9427, 2022. https://doi.org/10.1007/s00521-021-06369-w
13 L. Liu, B. Zhao, and Y. Rao, "On the cognitive load of online learners with multi-level data mining," International Journal of Information and Communication Technology Education (IJICTE), vol. 18, no. 2, 115, 2022. https://doi.org/10.4018/ijicte.314225
14 I. Saric-Grgic, A. Grubisic, L. Seric, and T. J. Robinson, "Student clustering based on learning behavior data in the intelligent tutoring system," International Journal of Distance Education Technologies (IJDET), vol. 18, no. 2, pp. 73-89, 2020. https://doi.org/10.4018/IJDET.2020040105
15 S. Heng, "A new intelligent optimization network online learning behavior in multimedia big data environment," International Journal of Mobile Computing and Multimedia Communications, vol. 8, no. 3, pp. 21-31, 2017. https://doi.org/10.4018/IJMCMC.2017070102

Received: March 5 2025

Revision received: June 10 2025

Accepted: June 24 2025

Published (Print): December 31 2025

Published (Electronic): December 31 2025

Corresponding Author: Lu Xia, luxxial@hotmail.com

Lu Xia, Anhui Sanlian University, Hefei, Anhui, China, luxxial@hotmail.com

Tiantian Wang, Anhui Sanlian University, Hefei, Anhui, China, wtttianw@outlook.com