1. Introduction
1.1 Overview
Variance is one widely used parameter for describing the characteristics of a signal. With the development of information technology, the speed and frequency of communication is increasing and the needing for high speed signal analysis and processing is keeping up. Besides, various kinds of large-scale and fine-grained data in different fields such as medical, transportation, education, energy, aerospace, etc., have been generated and collected, and needing processing efficiently.
Some common basic statistics need to be calculated during data analysis to describe some basic related statistical characteristics of the data. Among them, the mean and variance of data [1] are two commonly used basic statistics. The former reflects the information of the overall center position, and the later reflects the overall dispersion. Furthermore, there are other commonly used statistics, such as sample standard deviation, sample skewness, sample kurtosis, and so on, which are also widely used in different aspects of society.
1.2 Literature Review
According to the brief review of different kinds of literature, the variance and the mean value of a signal is commonly analyzed in practical problems. For example, sample variance is identified in nonlinear systems [2], automatic detection [3,4], financial and corporate decision-making, investment [5-11], analysis of variance [12-15], image and pattern recognition [16-19], linguistic analysis [20-22] and other fields.
It can be seen that mean and the variance value plays an extremely important role in data analysis. However, in most cases, the variance is calculated by theoretical definition. Not only does this approach require access to the entire data sequence for each computation, but it also requires more storage resources. However, when calculating the variance of time-varying data with large data capacity and real-time updates of data, it not only wastes the storage resources of the computer but also fails to output the calculated results in time. Therefore, many improved algorithms for variance calculation have been proposed.
Chen and Gao [23] took the derivative of each parameter in the k-order origin moment of the usual discrete distribution, and obtained the unified recursive formula for the k-order origin moment of binomial distribution, Poisson distribution and geometric distribution through identical deformation. Chen [24] also used the derivative method to derive the recursive formula of the k-order origin moment of binomial distribution, Poisson distribution, and geometric distribution with differential form, and gave the recursive formula of the k-order origin moment of the general form of these three distributions. Similarly, Chen and Ma [25], with the help of the derivation strategy, proposed a new formula for calculating k-order central moment in general form, according to the recursive formula of the k-order central moment in the differential form of binomial distribution, Poisson distribution, and geometric distribution. By constructing a special class of polynomials and series, Chen [26] proposed a new unified formula for calculating the k-order central moment of binomial distribution, Poisson distribution, and geometric distribution. In the study of Fan [27], he introduced in detail the recursive method to calculate the sample mean, variance, and covariance after the change of sample size.
Based on filtering and smoothing, Liu and Sun [28] proposed an efficient method for calculating the variance of non-zero mean data. Deng et al. [29] first proposed a one-step recursion method of variance by studying the classical variance algorithm. The algorithm can efficiently calculate the variance of the data sequence whose observed data capacity changes rapidly. However, some special data sequences have a fixed length, and the observed values change in real-time, such as satellite timing system and realtime data analysis [30-33]. At this time, the classical definition algorithm and the improved algorithm proposed in the above literature also have some shortcomings in dealing with this problem.
For the above problem, the existing literature has not proposed an effective calculation method. Therefore, the main work of this paper is as follows. Firstly, the recursion method proposed in [29] is briefly reviewed. Then, a one-step recursive formula is proposed to dynamically calculate the variance of this special sequence. Secondly, based on one-step recursion formula, it is extended to a more general case. Namely, the one-step recursion formula is extended to multi-step recursion. Then, through numerical experiments, the one-step recursive formula and multi-step recursive formula are compared with the algorithm proposed in [29] for calculation error and calculation time. At last, this paper is summarized.
2. Variance Recursion Formula with the Variable Sequence Length
In this section, we prove the one-step recurrence formula proposed by Deng et al. [29] so as to introduce the variance recursion formula.
Definition 2.1. Let [TeX:] $$F_{n}=\left(f_{1,} f_{2,} f_{3}, \cdots f_{n-1,} f_{n}\right)$$ be a variable-length data sequence, [TeX:] $$\bar{F}_{n}$$ is the mean of the sequence, and its formula is
Theorem 2.1. Let [TeX:] $$F_{n} \text { and } \bar{F}_{n}$$ be as described in Definition 2.1, from which the recursive expression of [TeX:] $$\bar{F}_{n} \text { and } \bar{F}_{n-1}$$ can be obtained as
Proof.
Theorem 2.2. Let the variance definition formula be [TeX:] $$V_{n}=\frac{1}{n-1} \sum_{i=1}^{n}\left(f_{i}-\bar{F}_{n}\right)^{2},$$ and then we can get the variance recursion formula with the variable sequence length
Proof. It can be obtained from the calculation formula of the definition of variance
Formula (4) is expanded to obtain
Combined with the mean recurrence formula (3), the above Eq. (6) can be written as
Finally, by combining Eqs. (7) and (5), the recurrence formula of variance with variable sequence length can be obtained
According to the Eqs. (3) and (8), it is possible to quickly calculate the variance and mean of a numerical sequence whose length is constantly changing, which saves computing time and computer storage resources to a large extent.
3. Variance Recursion Formula Based on Sliding Window Mechanism
In some cases, there are data updates with invariant sequence length, but at that time, the variance recursion method with variable length cannot effectively calculate the variance of such data. For calculating the variance of such data, a one-step recursive formula for variance is derived using the idea of sliding windows. In addition, based on this, the formula is generalized to obtain an N-step recursive formula for calculating variance. Next, we first prove a one-step recursive formula.
3.1 The Recursion Formula of One-Step
Definition 3.1. Let the observation data sequence be [TeX:] $$S=\left(s_{1}, s_{2}, \cdots, s_{n-1}, s_{n}\right),$$ then S is an n-dimensional vector. Hence, the mean of the sequence is [TeX:] $$E(S)=\mu=\frac{1}{n} \sum_{i=1}^{n} s_{i},$$ and the variance is [TeX:] $$V(S)=\frac{1}{n-1} \sum_{i=1}^{n}\left(s_{i}-\mu\right)^{2}.$$
Let [TeX:] $$S=\left(s_{1}, s_{2}, \cdots, s_{n-1}, s_{n}\right) \text { be } S_{k} \text { and } S_{k+1}$$ respectively in the [TeX:] $$k \text { and } k+1$$ states, then the variance of the sequence in the [TeX:] $$k \text { and } k+1$$ states are [TeX:] $$V_{k} \text { and } V_{k+1},$$ respectively. And the calculation expression is as follows
In the k state, let [TeX:] $$s_{j}^{*} \text { replace } s_{j}$$ in the data sequence to get the data sequence [TeX:] $$S_{k+1}=\left(s_{1}, s_{2}, \cdots, s_{n-1}, s_{n}\right)$$ in the state [TeX:] $$k+1, \text { where } s_{j}=s_{j}^{*}.$$
Theorem 3.1. Let [TeX:] $$\mu_{1, k} \text { and } \mu_{1, k+1}, V_{1, k}(S) \text { and } V_{1, k+1}(S)$$ represent the mean and variance of the onestep recursive variance algorithm for the sequence in the [TeX:] $$k \text { and } k+1$$ states, respectively. Then the following conclusions are established
Proof. It follows from formulas (9) and (10)
The difference between Eqs. (15) and (16) is obtained as follows:
When the length of the sequence is fixed, the recurrence formula of the mean is
Bringing the mean recurrence formula (15) into Eq. (14), the intermediate process is as follows:
Simplification of the formula (16) to obtain a fast-recursive formula of the variance of fixed-length sequences
Thus Theorem 3.1 is proved.
3.2 The Recursion Formula of N-Step
Since the one-step recursive algorithm in the previous section has certain limitations in its use, it can only calculate the dynamic variance under a data change, so we generalize it to the N-step recursion.
Theorem 3.2. Let [TeX:] $$S_{k} \text { and } S_{k+1}$$ be as described above, and the N-step variance recursion formula of the data sequence from [TeX:] $$k \text { to } k+1$$ is
where, [TeX:] $$\mu_{N, k} \text { and } \mu_{N, k+1}, V_{N, k} \text { and } V_{N, k+1}$$ represent the mean and variance of the N-step variance recursion algorithm for the data sequence in the [TeX:] $$k \text { and } k+1$$ states, respectively. Specifically, [TeX:] $$\mu_{N, k+1}$$ is defined as
Proof. Let the formula for calculating the mean and variance of the data sequence in the state k are as follows
Suppose that the data sequence [TeX:] $$\left(s_{1}^{*}, s_{2}^{*}, \cdots, s_{N}^{*}\right)$$ \left(s_{1}^{*}, s_{2}^{*}, \cdots, s_{N}^{*}\right) [TeX:] $$N(1 \leq N \leq n)$$ is used to replace the data sequence [TeX:] $$\left(s_{1}, s_{2}, \cdots, s_{N}\right)$$ of the same length in [TeX:] $$S_{k},$$ and the new data sequence [TeX:] $$S_{k+1}$$ in the state of [TeX:] $$k+1$$ is obtained, and the variance of the sequence is set as
It can be obtained from Eqs. (20) and (21),
The difference between Eqs. (22) and (23) is obtained
Simplify the multivariate variance recursion calculation formula by simplification Eq. (24), then
When the length of the sequence is fixed, the mean in the k state is calculated as
The mean calculation formula in the k + 1 state is
From the above Eqs. (26) and (27), the N-step mean recurrence formula can be obtained, which is calculated as follows
Bring Eq. (28) into Eq. (25) to get the N-step variance recursion calculation formula.
Thus Theorem 3.2 is proved.
4. Numerical Simulation
The main content of this section is to verify the accuracy and efficiency of the recursive formula proposed in this paper through numerical experiments. The sample data used for this numerical experiment is divided into two groups, each containing 20,000 random numbers between (-1,1).
4.1 Comparison of the One-Step Recursive Algorithm
For the random sequence used for the one-step recursive algorithm comparison, the method of random replacement is used. Randomly replace one data in the sequence each time, and then use the defined algorithm, the algorithm proposed by Deng et al. [29] and the recursive algorithm proposed in this paper to calculate the variance of the sequence. A total of 40 replacement operations are performed. Record the calculation time used each time and compare it with the definition algorithm and the Deng's algorithm. The calculation results of the one-step recursive algorithm compared with the other two algorithms are shown in Figs. 1 and 2. As can be seen from the Fig. 1, the algorithm proposed in this paper has higher time efficiency than the traditional definition algorithm and Deng's optimization algorithm. From Fig. 2, the calculation accuracy of the proposed algorithm is also higher than that of Deng's optimization algorithm.
Time efficiency comparison of three algorithms (one-step recursive algorithm).
Calculation accuracy comparison (one-step recursive algorithm).
4.2 Comparison of the N-Step Recursive Algorithm
Similar to the previous section, this section mainly studies the comparison of three different algorithms when the N-step of the data sequence is updated. For the random sequence used to compare the N-step recursive algorithm, the number of data replaced at each time in the sequence satisfies an arithmetic sequence with 500 as the first term and 500 as the tolerance. Then, the classical definition formula, the recursive formula proposed by Deng, and the newly proposed N-step recursive formula are used to calculate the variance of this dynamically changing random sequence. And the time consumption and calculation error of the three algorithms are recorded in each calculation. The calculation results of the N-step recursive algorithm and comparison with the other two algorithms are given in Figs. 3 and 4. As can be seen from the figures, when the sequence length is fixed, and the data is dynamically updated and replaced multiple times, the time efficiency and calculation accuracy of the N-step recursive algorithm are better than the definition algorithm and the improved algorithm of Deng et al. [29].
This numerical experiment clearly shows that the recursive formula proposed in this paper has advantages that other algorithms do not have when calculating the variance of the time-varying data series with a constant sample size. By introducing the idea of sliding windows, the efficiency of calculating the variance of such data series is greatly improved.
Time efficiency comparison of three algorithms (N-step recursive algorithm).
Calculation accuracy comparison (N-step recursive algorithm).
5. Conclusion
For a sequence data with fixed length and fast updating of data, the traditional algorithm using classical definition will cost large time and resource consumption due to O(n) time complexity and frequent access to the computer disk, which may cause performance degradation of the whole system. However, the N- step sliding recursion formula proposed in this paper does not need to read the entire data sequence from the computer disk each time after the data is updated. Instead, the variance of the updated sequence can be directly calculated based on the results of the previous calculation and the current updated data, which shortened the calculation time from O(n) to O(1) time complexity. It can not only improve the time efficiency but also greatly save the storage resources of the computer, making the data analysis real-time and low-cost. In addition, this type of recursion can be extended to the calculation of kurtosis and skews, or the calculation of higher-order origin moments and central moments, which demonstrates the potential value for time-consumption data analysis or high speed signal processing.
Acknowledgement
This paper has been supported by Sichuan Science and Technology Program (No. 2020YFS0318, 2019YFS0155, 2019YFS0146, 2020YFG0430, 2020YFS0307), the Educational Reform Research Project of Southwest University of Science and Technology (No. 15xnzd05), Undergraduate Innovation Fund Project Accurate Funding Special Project by Southwest University of Science and Technology (No. JZ19-057).