Article Information
Corresponding Author: Lin Luo*, lin.l.csc@gmail.com
Ke Mu*, Dept. of Information and Control Engineering, Liaoning Shihua University, Fushun, China, muku@lnpu.edu.cn
Lin Luo*, Dept. of Information and Control Engineering, Liaoning Shihua University, Fushun, China, lin.l.csc@gmail.com
Qiao Wang*, Dept. of Information and Control Engineering, Liaoning Shihua University, Fushun, China, 63526336@qq.com
Fushun Mao**, Synthetic Detergent Factory of Fushun Petrochemical Company, China National Petroleum Corporation, Fushun, China, 12hgh@sina.com
Received: November 18, 2020
Revision received: December 28, 2020
Accepted: January 6, 2021
Published (Print): April 30, 2021
Published (Electronic): April 30, 2021
1. Introduction
The development of the industrial Internet of Things (IoT) and modern measuring instruments makes industrial records from numerous measurement variables available [1-3]. As a key component of the modern industrial system, data-driven process monitoring is commonly used to ensure plant safety, reduce production costs, and improve product quality.
Multivariate techniques based on latent variable (LV) methods have been successfully applied to process monitoring [4-7], including principal component analysis (PCA), canonical correlation analysis (CCA), and slow feature analysis (SFA), among others. In general, a latent space in the LV model is explored to reveal the low-dimensional inherent structure of the original measured variables, while its complementary residual space captures noise and outliers. Once the model has been determined, multivariate statistical process control (MSPC) charts built on these two spaces, referred to as the [TeX:] $$\mathrm{T}^{2}$$ and SPE control charts, are required to detect faults. However, the original multivariate techniques share several drawbacks: a large amount of data is required to generalize well, and parameter selection is difficult for their nonlinear extensions.
As a branch of machine learning, deep networks have become powerful tools for effectively handling large-scale data and learning deep representations [8-11], which greatly impact the final results. Recently, deep learning has been applied to process monitoring, including the deep belief network (DBN) [9], stacked sparse auto-encoder (SAE) [12], and recurrent neural network (RNN) [13]. Although the application of deep learning to monitoring process conditions is still developing, it often provides more useful insights than traditional shallow methods. For example, Luo et al. [10] studied an adaptive monitoring strategy with a tensor factorization layer merged into a deep neural network. They extracted fault-sensitive characteristics with tensor representations, which enables efficient knowledge sharing across layers. However, it remains challenging to preserve the process's dynamic information, which is important for long-term real-time scenarios. A typical deep learning model recently applied to fault diagnosis is the long short-term memory (LSTM) framework [14], whose feature extraction suitably models the process dynamics through recurrent feedback. A major limitation of existing LSTM models for the chemical process, however, is that local information is hardly incorporated into the posterior model. To improve the generalization capability, the local temporal dependencies should be preserved across different time steps.
Motivated by the above observations, this paper proposes an attention augmented network for online fault detection and classification. In the proposed deep network, an extra layer is fused with an attention mechanism so that the layer preserves the dynamic nature of the time series and can be applied in an online scenario. Furthermore, a batch normalization procedure is designed to reduce the internal covariate shift of the LSTM. In contrast to conventional shallow method-based fault diagnosis, where feature extraction and classification are generally independent of each other, the proposed method is trained in an end-to-end manner, which makes the model interpretable and the feature representation learnable simultaneously. Experimental results on the Tennessee Eastman (TE) benchmark process show that the proposed network can highlight the contribution of different temporal information, which facilitates further analysis of industrial fault features.
The remainder of the paper is organized as follows. Section 2 briefly reviews the RNN-based process monitoring method. In Section 3, the fault diagnosis model based on the proposed attention augmented network is put forward, together with the design of the network structure and the fault diagnosis procedure. In Section 4, comprehensive comparisons between the attention augmented network-based fault diagnosis method and existing strategies are carried out on the TE benchmark process. Finally, concluding remarks are drawn in Section 5.
2. Related Works on Fault Diagnosis Using RNN
Assume that a multivariate time series with N samples and D dimensions is defined as [TeX:] $$\mathbf{X}_{k} \in\mathbb{R}^{D \times \Delta t}$$, k=1,⋯,N, where each [TeX:] $$\mathbf{X}_{k}$$ contains a sequence of ∆t sampling points. The input data, defined as [TeX:] $$\mathbf{X}=\left[\mathbf{x}_{1}, \cdots, \mathbf{x}_{T}\right] \in \mathbb{R}^{D \times \Delta t}$$, is fed into the input layer, where T is the number of time steps in a sequence. In the hidden layers, an RNN [1], [13] maintains a sequence of hidden states [TeX:] $$\mathbf{h}_{\Delta t}$$ for each time step ∆t,

[TeX:] $$\mathbf{h}_{\Delta t}=\tanh \left(\mathbf{W} \mathbf{h}_{\Delta t-1}+\mathbf{U} \mathbf{x}_{\Delta t}\right)$$

where tanh(∙) is the hyperbolic tangent function, [TeX:] $$\mathbf{W} \in \mathbb{R}^{D_{h} \times D_{h}}$$ is the recurrent weight matrix to be estimated, and [TeX:] $$\mathbf{U} \in \mathbb{R}^{D_{h} \times D}$$ is the projection matrix. Note that [TeX:] $$D_{h}$$ is the number of neurons in each hidden layer, whose value must be pre-determined. A prediction [TeX:] $$\mathbf{y}_{\Delta t}$$ is then made by applying the softmax operation to the hidden state with an output weight matrix, so that [TeX:] $$\mathbf{Y}=\left[\mathbf{y}_{1}, \cdots, \mathbf{y}_{T}\right] \in \mathbb{R}^{D_{h} \times T}$$ collects the outputs over all time steps.
One of the major issues with RNNs, the vanishing gradient problem, has been observed in many applications. A typical LSTM network tackles this problem by generating an associated sequence of outputs [TeX:] $$\mathbf{y}_{\Delta t}$$ through three gates and a memory cell. The computation at each time step is as follows,

[TeX:] $$\mathbf{u}_{\Delta t}=\sigma\left(\mathbf{W}^{u}\left[\mathbf{h}_{\Delta t-1} ; \mathbf{x}_{\Delta t}\right]\right), \quad \mathbf{f}_{\Delta t}=\sigma\left(\mathbf{W}^{f}\left[\mathbf{h}_{\Delta t-1} ; \mathbf{x}_{\Delta t}\right]\right), \quad \mathbf{o}_{\Delta t}=\sigma\left(\mathbf{W}^{o}\left[\mathbf{h}_{\Delta t-1} ; \mathbf{x}_{\Delta t}\right]\right),$$

[TeX:] $$\tilde{\mathbf{c}}_{\Delta t}=\tanh \left(\mathbf{W}^{c}\left[\mathbf{h}_{\Delta t-1} ; \mathbf{x}_{\Delta t}\right]\right), \quad \mathbf{c}_{\Delta t}=\mathbf{f}_{\Delta t} \odot \mathbf{c}_{\Delta t-1}+\mathbf{u}_{\Delta t} \odot \tilde{\mathbf{c}}_{\Delta t}, \quad \mathbf{h}_{\Delta t}=\mathbf{o}_{\Delta t} \odot \tanh \left(\mathbf{c}_{\Delta t}\right),$$

where [TeX:] $$\sigma(\cdot)$$ is the sigmoid function and the symbol [TeX:] $$\odot$$ denotes elementwise multiplication. [TeX:] $$\mathbf{W}^{u}$$, [TeX:] $$\mathbf{W}^{f}$$, and [TeX:] $$\mathbf{W}^{o}$$ are the weight matrices of the input, forget, and output gates, respectively, and [TeX:] $$\mathbf{W}^{c}$$ is the weight matrix of the memory cell.
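For illustration, the following minimal NumPy sketch implements one LSTM time step under the above formulation; the function and variable names are ours, bias terms are omitted to match the bias-free notation, and each weight matrix acts on the concatenation of the previous hidden state and the current input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wu, Wf, Wo, Wc):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}; x_t]
    u = sigmoid(Wu @ z)                 # input (update) gate
    f = sigmoid(Wf @ z)                 # forget gate
    o = sigmoid(Wo @ z)                 # output gate
    c_tilde = np.tanh(Wc @ z)           # candidate memory cell
    c_t = f * c_prev + u * c_tilde      # elementwise gate combination
    h_t = o * np.tanh(c_t)              # new hidden state
    return h_t, c_t
```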
3. Temporal Attention Augmented Network for Fault Diagnosis
3.1 Temporal Attention Augmented Layer
Although the layer learns independent temporal dependencies along each mode, the difficulty with long-term dependencies still arises in the LSTM: the signals about these dependencies tend to be hidden by even small fluctuations. This means that squashing the local information of the entire sequence into a single vector poses a potential bottleneck for improving LSTM performance. Inspired by the incorporation of position information into sequence-to-sequence learning [15], an attention augmented layer is proposed to overcome this short-term limitation. Specifically, a context vector [TeX:] $$\mathbf{c}_{\Delta t}$$ is generated from the sequence of hidden states as a weighted sum of these states [TeX:] $$\mathbf{h}_{k}$$, k=1,⋯,T, at positions k,

[TeX:] $$\mathbf{c}_{\Delta t}=\sum_{k=1}^{T} \alpha_{\Delta t, k} \mathbf{h}_{k},$$

where [TeX:] $$\alpha_{\Delta t, k}$$ is the weight of each hidden state, given by

[TeX:] $$\alpha_{\Delta t, k}=\frac{\exp \left(e_{\Delta t, k}\right)}{\sum_{j=1}^{T} \exp \left(e_{\Delta t, j}\right)},$$

where the alignment score [TeX:] $$e_{\Delta t, k}$$ is learned by the following equation,

[TeX:] $$e_{\Delta t, k}=\mathbf{v}_{\alpha}^{T} \tanh \left(\mathbf{W}^{\alpha} \mathbf{h}_{\Delta t-1}+\mathbf{U}^{\alpha} \mathbf{h}_{k}\right),$$

where [TeX:] $$\mathbf{v}_{\alpha}^{T}$$ is a learnable row vector, and [TeX:] $$\mathbf{W}^{\alpha}$$ and [TeX:] $$\mathbf{U}^{\alpha}$$ are learnable weight matrices. The vector [TeX:] $$\mathbf{v}_{\alpha}^{T}$$ and the matrices [TeX:] $$\mathbf{W}^{\alpha}$$ and [TeX:] $$\mathbf{U}^{\alpha}$$ can be learned as a two-layer multi-layer perceptron without bias.
Using the hidden state [TeX:] $$\mathbf{h}_{k}$$ and the state [TeX:] $$\mathbf{h}_{\Delta t-1}$$ from the recurrent unit in the decoder module at time [TeX:] $$\Delta t-1$$, the alignment model scores how well the inputs around position k match the output at position [TeX:] $$\Delta t$$. The softmax function in the weight computation makes the model produce generated vectors [TeX:] $$\mathbf{c}_{\Delta t}$$ that each attend to a specific component of the input sequence. To represent the overall information of the sequence, multiple hops of attention need to be performed so that the multiple generated vectors [TeX:] $$\mathbf{c}_{\Delta t}$$ focus on different parts of the sequence. Graphical illustrations of the classical LSTM and the proposed model are shown in Fig. 1.
Fig. 1. Illustration of the deep network architecture with (a) LSTM and (b) the proposed temporal attention augmented layer, respectively.
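For concreteness, a minimal NumPy sketch of one hop of the attention computation above is given below; the names and shapes are illustrative, and v_a, W_a, and U_a stand in for [TeX:] $$\mathbf{v}_{\alpha}^{T}$$, [TeX:] $$\mathbf{W}^{\alpha}$$, and [TeX:] $$\mathbf{U}^{\alpha}$$.

```python
import numpy as np

def temporal_attention(h_seq, h_prev, v_a, W_a, U_a):
    """One hop of temporal attention over encoder hidden states.

    h_seq : (T, D_h) array of the hidden states h_1, ..., h_T
    h_prev: (D_h,) array, the previous decoder state h_{t-1}
    Returns the context vector c_t and the attention weights alpha.
    """
    # Alignment scores e_{t,k} = v_a^T tanh(W_a h_{t-1} + U_a h_k)
    scores = np.array([v_a @ np.tanh(W_a @ h_prev + U_a @ h_k) for h_k in h_seq])
    # Softmax over the time axis yields the weights alpha_{t,k}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector: weighted sum of the hidden states
    c_t = alpha @ h_seq
    return c_t, alpha
```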
3.2 Fault Diagnosis with Temporal Attention Augmented Network
Given the previous hidden state [TeX:] $$\mathbf{h}_{\Delta t-1}$$, the previous output [TeX:] $$\mathbf{y}_{\Delta t-1}$$, and the generated context vector [TeX:] $$\mathbf{c}_{\Delta t}$$, the output of the last hidden layer has the following form,

[TeX:] $$\mathbf{h}_{\Delta t}=f\left(\mathbf{h}_{\Delta t-1}, \mathbf{y}_{\Delta t-1}, \mathbf{c}_{\Delta t}\right),$$

and the softmax layer calculates the conditional probability of each output neuron corresponding to the health conditions of the industrial system.
Fault detection is a classification task that indicates which condition the system belongs to. Two different data sets are used to train the attention augmented network: one contains operation data from normal operation, and the other is collected under an abnormal condition. Along with accuracy, two other criteria commonly used to assess model performance, the fault detection rate (FDR) and the false alarm rate (FAR), are defined as follows:

[TeX:] $$\mathrm{FDR}=\frac{\text{No. of faulty samples detected as faulty}}{\text{total No. of faulty samples}} \times 100 \%,$$

[TeX:] $$\mathrm{FAR}=\frac{\text{No. of normal samples detected as faulty}}{\text{total No. of normal samples}} \times 100 \%.$$
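A minimal sketch of these two criteria, assuming binary labels in which 1 marks a faulty sample and 0 a normal one (the function name is ours):

```python
import numpy as np

def detection_rates(y_true, y_pred):
    """Compute FDR and FAR (in percent) from binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fdr = 100.0 * np.mean(y_pred[y_true == 1] == 1)  # faults correctly flagged
    far = 100.0 * np.mean(y_pred[y_true == 0] == 1)  # normal samples falsely flagged
    return fdr, far
```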
4. Experiment on Tennessee Eastman Process
In this section, the performance of the proposed method is evaluated on the TE process, a benchmark for process modeling and monitoring. A brief description of the TE process is provided first, and feature engineering is used to improve the performance of the subsequent deep networks. To evaluate the performance of the traditional LSTM network, LSTM with batch normalization (BN-LSTM), and the attention augmented network (AAN), a series of experiments are then conducted on fault detection and classification with the multivariate TE sequential data.
4.1 Process Description
The TE process has been extensively explored in the process monitoring and control communities as a source of available datasets for comparing various process control and monitoring techniques. The process contains five major units: a reboiled stripper, a cooling condenser, a flash separator, an exothermic two-phase reactor, and a recycling compressor. A total of 52 measurements are available, of which 41 are process variables and 11 are manipulated variables, and a set of 20 programmed fault modes is defined in [16].
For normal operation, each data set contains a simulation run of 25 hours with a sampling interval of 3 minutes, i.e., 500 samples. For faulty operation, each test data set for one fault mode (introduced at the 160th sample) consists of 960 samples. All samples were normalized to zero mean and unit variance.
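A minimal sketch of this preprocessing step, assuming the scaling statistics are estimated from the training data and reused for the test data:

```python
import numpy as np

def standardize(train, test):
    """Scale both sets to zero mean and unit variance using the
    statistics estimated from the training data."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma
```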
4.2 Feature Selection
Feature selection is one of the core steps in fault detection and classification and directly impacts model performance. To identify nonlinear feature interactions and reliably extract relevant features, feature importance can be estimated automatically by a gradient boosting machine (GBM) implemented with LightGBM. The importance score of an individual decision tree is calculated from the number of split points that improve the area under the curve (AUC), and the feature importance is then averaged over all decision trees within the model. The training procedure is repeated 10 times to reduce the variance of the resulting scores.
Fig. 2. Features sorted according to cumulative importance in (a) IDV15 and (b) IDV17, respectively.
Fig. 3. Cumulative feature importance versus the number of features in (a) IDV15 and (b) IDV17, respectively.
Fig. 2 shows the 20 most important features in IDV15 and IDV17 on a normalized scale where the importances sum to 1. The cumulative feature importance can then be used to choose the number of features: a threshold identifies the number of features required to reach a specified cumulative importance. We set the threshold to 0.8 in the experiments, meaning the selected features account for 80% of the total importance. For example, Fig. 3 shows that 23 and 28 features contribute to the specified cumulative importance in IDV15 and IDV17, respectively.
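The selection procedure described above can be sketched as follows. This is a minimal illustration rather than the exact implementation: all names are ours, the LightGBM hyperparameters are left at their defaults, and the default split-count importance approximates the AUC-based splitting criterion described in the text.

```python
import numpy as np
import lightgbm as lgb

def select_features(X, y, n_runs=10, threshold=0.8):
    """Rank features by averaged LightGBM importance and keep the
    smallest set reaching the cumulative-importance threshold."""
    importances = np.zeros(X.shape[1])
    for seed in range(n_runs):                      # repeat to reduce variance
        model = lgb.LGBMClassifier(objective="binary", random_state=seed)
        model.fit(X, y)
        importances += model.feature_importances_   # split-based importance
    importances /= importances.sum()                # normalize so scores sum to 1
    order = np.argsort(importances)[::-1]           # sort descending
    cumulative = np.cumsum(importances[order])
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    return order[:n_keep]                           # indices of selected features
```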
4.3 Effects of Temporal Instances
In an encoder-decoder architecture such as LSTM, where the information of the entire sequence is squashed into a single vector, the local information at individual time instances is hardly incorporated into the posterior sequence. This problem may degrade the efficiency of the encoder-decoder architecture. The attention augmented mechanism solves it by introducing additional weights that carry information surrounding a particular time instance in the past sequence.
In the following experiments, the proposed AAN, the classical LSTM, and BN-LSTM were implemented in Python with the TensorFlow backend. The input layer of every network used a sigmoid activation function. The networks were initialized by Xavier initialization [17] to ensure the signals do not vanish, and Adam was selected as the optimizer during training. The candidate structures and parameters for these methods are listed in Table 1, where the network structure gives the number of neurons in the input, hidden, and output layers. As for regularization, dropout with a rate of 0.5 was applied to the output of all hidden layers.
Table 1. The candidate structures and parameters for LSTM, BN-LSTM, and AAN
a Input activation is sigmoid, hidden activation is tanh.
The dimension of the inputs corresponded to the number of features required to reach 80% cumulative importance. In total, all configurations were trained for a maximum of 40 epochs with a mini-batch size of 32 samples. To evaluate the encoder-decoder structure and the attention-based model on local temporal representation, we constructed three baseline configurations with [TeX:] $$\Delta t$$={350,400,450}. Each configuration was repeated 20 times.
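The following Keras sketch shows one way to assemble a baseline under the stated settings (sigmoid input activation, Xavier/Glorot initialization, dropout of 0.5, Adam, 40 epochs, mini-batches of 32); the layer widths are illustrative, and batch normalization is applied to the LSTM output rather than inside the recurrent cell, which simplifies the BN-LSTM design described above.

```python
import tensorflow as tf

def build_bn_lstm(n_features, time_steps, n_classes, n_hidden=64):
    """A minimal sketch of a BN-LSTM-style baseline configuration."""
    model = tf.keras.Sequential([
        # Input layer with sigmoid activation and Xavier (Glorot) initialization
        tf.keras.layers.Dense(n_hidden, activation="sigmoid",
                              kernel_initializer="glorot_uniform",
                              input_shape=(time_steps, n_features)),
        tf.keras.layers.LSTM(n_hidden),          # recurrent hidden layer
        tf.keras.layers.BatchNormalization(),    # BN on the LSTM output
        tf.keras.layers.Dropout(0.5),            # dropout on hidden output
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training follows the described schedule:
# model.fit(X_train, y_train, epochs=40, batch_size=32)
```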
Fig. 4 reports the experimental results for IDV15 and IDV17 under the three baseline configurations. As shown in Fig. 4(a), the attention augmented mechanism clearly outperforms the competing models as the number of temporal instances increases, e.g., [TeX:] $$\Delta t$$=450. Inspecting the box-plot in Fig. 4(b) shows that the more instances are used in training, the more stable and accurate the attention augmented mechanism's representation of the sequence becomes. Despite its batch normalization, the BN-LSTM model is inferior to the proposed one, because the information in the attention layer is able to seize the temporal features across time steps.
Fig. 4. Box-plot of prediction accuracy under three different baseline configurations with [TeX:] $$\Delta t$$={350,400,450}, repeated 20 times, in (a) IDV15 and (b) IDV17, respectively. The solid and dashed lines are the median and mean of the resulting accuracy, respectively. The circles denote outlier points.
Besides, the attention mechanism in fault detection and classification offers opportunities for interpreting and visualizing the contribution of the temporal instances being attended to. An additional layer with the same number of output parameters as the input layer is applied to observe how each of the [TeX:] $$\Delta t$$={20,50} events in the input instances contributes to the decision function; see Fig. 5.
Fig. 5. Contribution of the temporal instances being attended to in (a) IDV15 and (b) IDV17 with [TeX:] $$\Delta t$$={20,50}, respectively.
The average attention paid to each temporal instance during the training process is visualized. A significant attention value means that the decoder pays much attention to the corresponding instance when producing the next state.
4.4 Fault Detection Results
When validation data was available in the offline modeling phase, 20 datasets were merged, each containing 480 normal samples and an extra 800 samples collected under one fault mode, as the validation data set. The proposed AAN, BN-LSTM, DBN with a Gaussian activation function [8], and deep artificial neural networks (DANN) [18] were constructed for a comparative analysis of fault detection performance. The results of the different methods are summarized in Table 2. The AAN model shows the best overall fault detection rate among the four methods. It can be seen that AAN provides a lower misclassification rate than the other three methods for faults IDV2, IDV8, IDV9, IDV11, IDV13, IDV15, IDV18, and IDV20.
Moreover, the methods show similar performance for the other fault IDs. The improved classification accuracy of AAN lies in the fact that the attention weights retain the long-term dependencies at each time step: the temporal attention can determine the local hidden state by referring to the previous states across all time steps. However, for IDV5, IDV16, and IDV19, better fault detection rates are achieved by the DANN method, while DBN performed better for IDV17. Our deep network has not been completely optimized in terms of time-length selection, so there remains room for improvement on the fault types mentioned above.
Furthermore, the classification rates of all the faults are provided for the proposed method and other deep networks, e.g., the hierarchical neural network (HNN) [19], stacked SAE [12], and DANN. The results are illustrated in Fig. 6. It can be seen that almost all the samples are classified correctly by AAN. From the simulation results, it can be concluded that AAN has superior performance in fault detection and classification, owing to the introduction of the temporal attention mechanism.
Table 2. Fault detection rates of different data-driven methods
Fig. 6. Fault classification rates of different deep network models.
5. Conclusion
In this paper, we proposed a fault detection and diagnosis scheme based on a deep network in which a temporal attention mechanism is designed at the network layer. Owing to this mechanism, the proposed scheme has the following notable features. The AAN training procedure is integrated in an end-to-end manner, making it possible to update the parameters of feature extraction and fault classification synchronously; as a result, feature extraction relying on handcrafted operations is significantly reduced. AAN explicitly considers the importance and contribution of each temporal instance and allows further analysis of the time instances of interest. Moreover, AAN adaptively analyzes the dynamic information of the industrial process through the use of LSTM. Case studies on the TE process demonstrated that the AAN-based approach outperforms conventional classification methods and enhances the interpretability of the hidden states' features. A promising direction for future work is to address batch process monitoring problems with the attention augmented network. Furthermore, the information contained in the temporal and spatial domains should be shared across layers to enable efficient and general knowledge transfer; hence, the design of shared layers is a further direction.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 61703191), the Foundation of Liaoning Educational Committee (No. L2017LQN028), and the Scientific Research Foundation of Liaoning Shihua University (No. 2017XJJ-012).