2. Method of Burmese Sentiment Analysis Based on Transfer Learning
As shown in Fig. 1, training the sentiment classification model in Burmese can be divided into three steps.
(1) Pre-training the sentiment classification model in English.
(2) Training the sentiment classification model in Burmese with the parameters of the English sentiment classification model, incorporating the features of Burmese into the semantic space of English to obtain a Burmese sentiment classifier, and
(3) Using labeled Burmese data to tune the Burmese model.
Using the word-vector conversion tool word2vec, each English sentence is expressed as word vectors, establishing the vector form of the input to the convolutional neural network (CNN). The CNN extracts the features of a sentence to obtain an effective feature representation; the extracted features are max-pooled to retain the most valuable part, and the softmax layer outputs the sentiment class according to its probability. The English sentiment classifier was improved by pre-training the model. The Burmese word vectors were integrated into the semantic space of English through mapping. For a mapped Burmese sentence, the parameters between the corresponding pairs of network layers are shared in the same way, and a sentiment classifier for Burmese is obtained. Finally, we use Burmese data with sentiment labels to tune the model.
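As a concrete illustration of the input step, the sketch below builds the word-vector matrix for a sentence with the gensim implementation of word2vec; the corpus file name, vector size, and helper function are illustrative assumptions rather than details from the original method.

from gensim.models import Word2Vec
import numpy as np

# Train word vectors on the tokenized English corpus.
# The corpus path and hyperparameters are assumptions.
sentences = [line.split() for line in open("english_corpus.txt", encoding="utf-8")]
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=1)

def sentence_matrix(sentence):
    # Stack the word vectors row-wise to form the [CW_1, ..., CW_n]
    # matrix that is fed to the CNN.
    return np.stack([w2v.wv[w] for w in sentence.split() if w in w2v.wv])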
Li et al. [14] trained a neural network using a combination of data from a main language and an auxiliary language. In contrast to this method, all layers of our cross-lingual deep learning model are shared by building a mapping between the English and Burmese bilingual word vectors. This means that the features of Burmese are incorporated into English. Because the two languages then have a certain similarity, information can be shared to improve the accuracy of Burmese sentiment classification.
Fig. 1. Overall architecture of our model.
2.1 Pre-training English Sentiment Classification Model
2.1.1 Extracting features with the convolutional layer of the English model
Because a CNN can capture the contextual features of vocabulary, it helps obtain effective feature representations. For natural language processing tasks, however, the input is not the pixels of an image but a sentence represented by a matrix. The convolution operation extracts each local feature of the target matrix and combines the feature vectors to obtain the feature vector of the target matrix. In pre-training the English network, the input is an English sentence [TeX:] $$X$$ characterized as a sentence vector matrix [TeX:] $$\left[C W_1, C W_2, \ldots, C W_n\right]$$ consisting of the word vectors of the sentence. Each row in the matrix represents an English word vector [TeX:] $$CW$$, and [TeX:] $$n$$ represents the number of words in the sentence. The vector representation can be obtained by combining contextual information with new English sentences. As in the method proposed by Wang et al. [16], our convolution operation contains filters [TeX:] $$W$$ that cause the [TeX:] $$CW$$ vectors to generate a new feature [TeX:] $$Z$$:
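(a plausible reconstruction, assuming the standard text-CNN convolution over a sliding window of [TeX:] $$h$$ word vectors; the window size [TeX:] $$h$$, bias [TeX:] $$b_{j}$$, and nonlinearity [TeX:] $$f$$ are assumptions)

[TeX:] $$Z_{i}=f\left(W_{j} \cdot C W_{i: i+h-1}+b_{j}\right), \quad Z=\left[Z_{1}, Z_{2}, \ldots, Z_{n-h+1}\right]$$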
Here, [TeX:] $$X_{i}$$ is the ith input sentence matrix (the ith instance), and [TeX:] $$W_{j}$$ is the [TeX:] $$j$$-th filter in the convolution operation [TeX:] $$(1 \leq j \leq 30)$$. When the Burmese features are later extracted, all filters [TeX:] $$W$$ share these parameters, which significantly reduces the number of parameters in the learning process. After passing through the filter [TeX:] $$W$$, the corresponding feature output [TeX:] $$Z$$ is obtained. To obtain the most useful information from the feature vector [TeX:] $$Z$$, we perform a max-pooling operation on [TeX:] $$Z$$:
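(a plausible reconstruction, assuming standard max-over-time pooling; identifying the pooled output with the vector [TeX:] $$T$$ used below is an assumption)

[TeX:] $$T=\max _{i} Z_{i}$$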
The feature vector of an English sentence synthesized so far is a linear vector. To learn more complex features, we designed a nonlinear layer and selected the rectified linear unit (ReLU) as the activation function. When training the English sentiment classification model, the sigmoid function makes network training unstable if the randomly initialized network weights are too large, whereas the ReLU activation function effectively prevents the weights from becoming too large or too small. The activation function can be written as
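(a plausible reconstruction consistent with the description below; the bias [TeX:] $$b_{y}$$ is an assumption)

[TeX:] $$g=\operatorname{ReLU}\left(W_{y} T+b_{y}\right)$$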
Here, [TeX:] $$W_{y}$$ is a linear transformation matrix that maps the vector [TeX:] $$T$$ to the hidden layer; the ReLU activation function is then applied to obtain [TeX:] $$g$$, which denotes the higher-level characteristics of English sentences.
2.1.2 Attention mechanism
Following convolution, an attention mechanism is used to capture feature information of differing importance and thereby improve classification accuracy [15]. The attention representation used here is computed as
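(a plausible reconstruction, assuming a feed-forward scorer followed by softmax normalization; the exact original form is an assumption)

[TeX:] $$s_{i}=f u n\left(x_{i j}\right), \quad a_{i}=\frac{\exp \left(s_{i}\right)}{\sum_{k} \exp \left(s_{k}\right)}$$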
where [TeX:] $$x_{i j}$$ represents a sentence, [TeX:] $$U_{i}$$ represents the label corresponding to this sentence, [TeX:] $$fun$$ represents a feed-forward network with one hidden layer, and [TeX:] $$s_{i}$$ and [TeX:] $$a_{i}$$ represent the importance of the corresponding words in the text.
2.1.3 Output layer of English sentiment classification
To estimate the classification of the sentiment expressed in each input English sentence [TeX:] $$X$$, the predicted output of the softmax layer is:
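(a plausible reconstruction consistent with the description below)

[TeX:] $$o=\operatorname{softmax}\left(W_{p}\left(g \otimes a_{i}\right)\right)$$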
where [TeX:] $$W_{p}$$ is a linear transformation matrix, the vector [TeX:] $$g$$ and the attention representation [TeX:] $$a_{i}$$ are fully connected and mapped to the output layer, and [TeX:] $$\otimes$$ indicates a fully connected operation. Each output [TeX:] $$o$$ is the sentiment score of the input English sentence vector matrix [TeX:] $$X$$, and there are two possible predictions, 0 and 1. If the score is 0, the relevant English sentence expresses negative sentiment; if the score is 1, it expresses positive sentiment.
2.1.4 Defining the loss of the English sentiment classification model
Finally, the probabilities of the positive and negative classes are obtained through the softmax layer, and the class with the highest probability is used as the label for English sentiment classification.
The final label [TeX:] $$U_{c}$$ is obtained according to the calculated probability. If the positive sentiment probability is greater than the negative one, [TeX:] $$U_{c}$$ is a positive sentiment, and vice versa. For English sentiment classification, cross-entropy is used as the loss function.
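(a plausible reconstruction, assuming the standard cross-entropy between the predicted score and the label)

[TeX:] $$\operatorname{loss}=-\sum \bar{U}_{c} \log U_{c}$$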
where [TeX:] $$U_{c}$$ is the sentiment score of the model and [TeX:] $$\bar{U}_c$$ is the label of the relevant sentence. By backpropagating the loss of the model, all its parameters are updated so that they fit the English sentiment classification data more closely. If cross-entropy alone is used as the loss function, the parameter values may become too large or too small during updates. Therefore, an L2 norm constraint on the model parameters was imposed here by adding a regularization term. The parameters of the model include the English sentence vector [TeX:] $$x$$ input to the model and the weight matrices [TeX:] $$W_j, W_y, W_p$$. The loss in classifying sentiments in English can then be expressed as
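(a plausible reconstruction; the regularization coefficient [TeX:] $$\lambda$$ is an assumption)

[TeX:] $$\operatorname{loss}=-\sum \bar{U}_{c} \log U_{c}+\lambda\left(\|x\|_{2}^{2}+\left\|W_{j}\right\|_{2}^{2}+\left\|W_{y}\right\|_{2}^{2}+\left\|W_{p}\right\|_{2}^{2}\right)$$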
A stochastic gradient descent algorithm was used to solve the model and minimize the loss of the English sentiment classification model. Once the model converged, its parameters [TeX:] $$W_j, W_y, W_p$$ for English sentiment analysis were obtained and fixed as [TeX:] $$W_{c j}, W_{c y}, W_{c p}$$. These fixed parameters were then used in the model for classifying sentiments in Burmese.
2.2 Training the Burmese Classification Model by Fusing English Sentiment Classification Features
The fixed English-model parameters are used as the initialization parameters for sentence-level sentiment classification in Burmese. Using the mapping relationship between English and Burmese, Burmese is mapped onto the semantic space of English, and the features of English sentiment classification compensate for the lack of features in Burmese. Finally, the model parameters are updated using the loss.
2.2.1 Bilingual vector representation
The English-Burmese bilingual word-vector mapping and the bilingual sentence mapping between English and Burmese were used to establish the relationship between sentences in the two languages. This reduces the difference between the languages and avoids performance degradation during feature transfer. The mapping can also supply information that is absent from Burmese sentiment classification. A Burmese sentence input to the model consists of words; the word vector of each Burmese word is [TeX:] $$M_{w}$$, and the target sentence matrix is [TeX:] $$\left[M_{W 1}, M_{W 2}, \ldots, M_{W S}\right]$$.
The crucial step in learning the bilingual word-vector mapping is to establish a mapping relationship between words through a bilingual dictionary. For words that do not appear in the dictionary, target words are then found according to the constructed mapping relationship. In our model, Mikolov's method [17] was used in a nested loop: each time the loop is executed, the dictionary is updated and used to retrain the mapping relationship, until the model converges.
In Fig. 2, [TeX:] $$X$$ represents the distribution of Burmese word vectors in the Burmese semantic space, and [TeX:] $$Z$$ represents the distribution of English word vectors in the English semantic space. Using an initial English-Burmese dictionary, the spatial distances between pairs of translated words in the dictionary were minimized, and a transformation matrix [TeX:] $$W$$ was learned. The Burmese word vectors were then mapped into the English semantic space using [TeX:] $$W$$, and the dictionary was supplemented. Using this new dictionary, the spatial distances between the translated word pairs were minimized again, and the transformation matrix [TeX:] $$W$$ was relearned to further expand the dictionary. The iteration stops when the dictionary no longer grows between successive iterations.
Fig. 2. Establishing the bilingual word-vector mapping.
A self-learning English-Burmese training framework is proposed in this study.
Input: monolingual word vectors trained for the two languages on their respective corpora. [TeX:] $$X$$ is the source language, [TeX:] $$Z$$ is the target language, and [TeX:] $$D$$ is the bilingual dictionary. The process is as follows.
a) Repeat (through the iterative process, the dictionary is constantly expanded):
b) Train the spatial mapping matrix [TeX:] $$W$$ on [TeX:] $$(X, Z, D)$$.
c) Expand the dictionary [TeX:] $$D$$ using [TeX:] $$(X, Z, W)$$.
d) Until the model converges.
e) Evaluate the dictionary [TeX:] $$D$$.
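A minimal sketch of this framework, assuming the word vectors are numpy arrays, the mapping is learned as the orthogonal least-squares (Procrustes) solution over the current dictionary pairs, and the dictionary is expanded by nearest-neighbor search; all function and variable names are illustrative.

import numpy as np

def learn_mapping(X, Z, pairs):
    # Orthogonal Procrustes solution to min_W ||X[src] @ W - Z[tgt]||
    # over the current dictionary pairs (src, tgt).
    src, tgt = map(list, zip(*pairs))
    U, _, Vt = np.linalg.svd(X[src].T @ Z[tgt])
    return U @ Vt

def expand_dictionary(X, Z, W):
    # Map each source word into the target space and take its nearest
    # neighbor by cosine similarity as the induced translation.
    XW = X @ W
    XW /= np.linalg.norm(XW, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    nn = (XW @ Zn.T).argmax(axis=1)
    return set(zip(range(len(X)), nn.tolist()))

def self_learning(X, Z, seed_dict, max_iter=50):
    pairs = set(seed_dict)
    for _ in range(max_iter):
        W = learn_mapping(X, Z, pairs)
        expanded = pairs | expand_dictionary(X, Z, W)
        if expanded == pairs:  # dictionary stopped growing: converged
            break
        pairs = expanded
    return W, pairs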
2.2.2 Extracting features of Burmese sentences through convolution layers
The convolution operation extracts each local feature of the target matrix and combines the feature vectors to obtain the feature vector of the target matrix. In training the Burmese network, the input is a Burmese sentence [TeX:] $$X_B$$ characterized by a sentence vector matrix [TeX:] $$\left[M_{W 1}, M_{W 2}, \ldots, M_{W s}\right]$$ consisting of the word vectors of the sentence. Each row in the matrix represents a Burmese word vector [TeX:] $$M_{W}$$, and [TeX:] $$s$$ represents the number of words in the sentence. The vector representation can be obtained by combining the contextual information of the syllables into new Burmese sentences. Using the same model as for English, the parameters of the convolution filters [TeX:] $$W_{cj}$$ were used to convolve the sentence vectors of Burmese and extract their features, generating a new feature vector [TeX:] $$Z_{Bs}$$:
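(a plausible reconstruction mirroring the English convolution, with the fixed English filters; the bias [TeX:] $$b_{j}$$ and nonlinearity [TeX:] $$f$$ are assumptions)

[TeX:] $$Z_{B s}=f\left(W_{c j} \cdot X_{B i}+b_{j}\right)$$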
where [TeX:] $$X_{Bi}$$ is the ith Burmese sentence input to the model, and [TeX:] $$W_{cj}$$ represents the well-trained parameters of the English model. Parameters already trained on the English model are not changed here; the vectors of Burmese sentences are convolved with them to extract features. The same parameter range as in English was used [TeX:] $$(1 \leq j \leq 30)$$. Once the sentences pass through the filter [TeX:] $$W_{cj}$$, a corresponding feature output [TeX:] $$Z_{Bs}$$ is obtained. To obtain the most useful information from the feature vector [TeX:] $$Z_{Bs}$$, we perform the same max-pooling operation as in English:
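(a plausible reconstruction, mirroring the English max-over-time pooling)

[TeX:] $$m_{B s}=\max _{s} Z_{B s}$$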
For the most valuable information, [TeX:] $$m_{Bs}$$, the ReLU activation function used for sentiment classification in English was employed:
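(a plausible reconstruction consistent with the description below; the bias [TeX:] $$b_{y}$$ is an assumption)

[TeX:] $$g=\operatorname{ReLU}\left(W_{c y} m_{B s}+b_{y}\right)$$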
where [TeX:] $$W_{cy}$$ is the linear transformation matrix. The vector [TeX:] $$m_{Bs}$$ is mapped to the hidden layer, and the ReLU activation function is used to obtain [TeX:] $$g$$, which represents the higher-level characteristics of Burmese sentences.
2.2.3 Attention mechanism
Following the convolution operation, the attention mechanism is used to capture feature information of differing importance and thereby improve classification accuracy. The attention representation is computed as
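(a plausible reconstruction, of the same form assumed in Section 2.1.2)

[TeX:] $$s_{i}=f u n\left(x_{i j}\right), \quad a_{i}=\frac{\exp \left(s_{i}\right)}{\sum_{k} \exp \left(s_{k}\right)}$$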
Here, [TeX:] $$x_{ij}$$ represents a sentence, [TeX:] $$U_{i}$$ represents the label corresponding to this sentence, [TeX:] $$fun$$ represents a feed-forward network with one hidden layer, and [TeX:] $$s_{i}$$ and [TeX:] $$a_{i}$$ represent the importance of the corresponding words in the text.
2.2.4 Updating model parameters through Burmese sentiment classification loss
The extracted Burmese features pass through the softmax layer, and the model assigns a score for each sentiment category.
Finally, the probabilities of the positive and negative classes are obtained through the softmax layer, and the class with the highest probability is chosen as the label for sentiment classification in Burmese.
The final label [TeX:] $$U_{B}$$ was obtained based on the probability. If the positive sentiment probability is greater than the negative one, [TeX:] $$U_{B}$$ is a positive sentiment, and vice versa. As with sentiment analysis in English, cross-entropy was used as the loss function.
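(a plausible reconstruction, mirroring the English loss)

[TeX:] $$\operatorname{loss}=-\sum \bar{U}_{B} \log U_{B}$$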
In training the classifier for Burmese, the parameters trained in the semantic space of the English sentiment model were used as the initial parameters. The parameters of this model are updated backward according to the loss. The resulting model was applied to sentiment analysis of Burmese text.
2.2.5 Model tuning
After Burmese is mapped to English, the result has the semantic features of English, but with deviations. Therefore, a small-scale Burmese training set was used to obtain the loss in the sentiment classification of Burmese, and the model was solved by minimizing this loss. When updating the Burmese sentiment classification parameters [TeX:] $$W_{B j}, W_{B y}, W_{B p}$$, the constraint in formula (19) prevents the model from fitting the Burmese sentiment features arbitrarily closely. Because the set of annotations in Burmese is only a small part of the data, this constraint is imposed to avoid overfitting. The Burmese sentences mapped into the semantic space of English have semantic features similar to those of English, but with subtle differences. By learning these differences, the model can perform better in classifying sentiments in Burmese.
Finally, the loss of sentiment classification in Burmese can be expressed as:
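(a plausible reconstruction, mirroring the regularized English loss; the coefficient [TeX:] $$\lambda$$ is an assumption)

[TeX:] $$\operatorname{loss}_{B}=-\sum \bar{U}_{B} \log U_{B}+\lambda\left(\left\|W_{B j}\right\|_{2}^{2}+\left\|W_{B y}\right\|_{2}^{2}+\left\|W_{B p}\right\|_{2}^{2}\right)$$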
Negative transfer is an important factor affecting the performance of the model. In this study, negatively transferred data were mainly evaluated and corrected manually. In constructing the English-Burmese parallel data, professional manual data correction and annotation were adopted.
3. Experiment
3.1 Experimental Data
The labeled corpus used for sentiment classification in English was obtained from English sentiment analysis data. As shown in Table 1, 50,000 English sentences were used, with polarity labels indicating their semantic tendencies. The data for Burmese sentiment classification were obtained from a manually constructed labeled dataset consisting of 15,000 English-Burmese parallel sentence pairs. Examples of the constructed English-Burmese parallel sentence pairs are provided in Table 2.
3.2 Experimental Methods and Evaluation Indicators
Burmese is resource-poor in terms of labeled datasets, and there is no public dictionary of sentiment-related words. This study used feature transfer (Att-CNN-Trans) [18] to exploit the advantages of an English corpus, specifically English sentiment analysis, for sentiment classification in Burmese, thereby compensating for the scarcity of Burmese corpora.
To verify the effectiveness of the proposed method, comparative experiments were designed:
(1) Traditional SVM [19] and linear regression (LR) [20] were used for comparison with the proposed method. CNN, LSTM, BiLSTM, and fastText [21] were trained on 10,500 labeled Burmese sentences; these models did not use an English labeled set to pre-train the model and did not map Burmese sentences to English.
(2) Att-BiLSTM [22] was trained on the 10,500 labeled Burmese sentences.
(3) CNN-Trans was trained on the 10,500 labeled Burmese sentences. A labeled English dataset was used to pre-train the model, and Burmese sentences were mapped to English; however, no attention mechanism was used.
(4) Att-CNN was trained on the 10,500 labeled Burmese sentences. An attention mechanism was used, but the model was not pre-trained on the English labeled set, and Burmese sentences were not mapped to English.
Table 2. Examples of English-Burmese sentences
In our experiments, we followed the standard evaluation indicators. The accuracy was calculated as follows:
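(assuming the standard definition in terms of true/false positives and negatives)

[TeX:] $$\text { Accuracy }=\frac{T P+T N}{T P+T N+F P+F N}$$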
3.3 Hyperparameter Setting and Training
For the network structure used here, ReLU was used as the activation function, and multiple sets of convolution kernels were used for training. Filter windows of sizes three, four, and five were used, with 100 convolution units per filter. The hidden layer contained 300 units, the output layer was classified by softmax, and dropout was used during training to prevent overfitting. Finally, the stochastic gradient descent algorithm was used to update the weights.
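A minimal PyTorch sketch of this configuration; the embedding dimension, dropout rate, vocabulary size, and learning rate are not specified in the text and are assumptions.

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    # Filter windows of sizes 3, 4, and 5 with 100 feature maps each,
    # a 300-unit hidden layer, dropout, and a softmax output layer.
    def __init__(self, vocab_size, emb_dim=300, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, 100, kernel_size=k) for k in (3, 4, 5))
        self.hidden = nn.Linear(3 * 100, 300)
        self.dropout = nn.Dropout(0.5)      # dropout rate is an assumption
        self.out = nn.Linear(300, num_classes)

    def forward(self, x):                   # x: (batch, seq_len) word ids
        e = self.embed(x).transpose(1, 2)   # (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        h = torch.relu(self.hidden(torch.cat(pooled, dim=1)))
        return self.out(self.dropout(h))    # softmax applied via the loss

model = TextCNN(vocab_size=20000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # SGD, as in the text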
3.4 Experimental Results and Analysis
Experiment 1: The results of five-fold cross-validation.
To evaluate the effect of the proposed model, all the data in the experiment were divided equally into five parts: in each fold, one part was selected as the test corpus, and the other four parts were used as the training corpus. The results are listed in Table 3; the average accuracy across folds is 73.72%, which reflects the performance of the proposed model.
Table 3. Results of five-fold cross-validation
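A sketch of this protocol using scikit-learn's KFold; the LinearSVC classifier and random data below are placeholders for the actual model and corpus.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Placeholder feature matrix and binary sentiment labels.
X, y = np.random.randn(200, 50), np.random.randint(0, 2, 200)
accs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    clf = LinearSVC().fit(X[train_idx], y[train_idx])   # train on four parts
    accs.append(clf.score(X[test_idx], y[test_idx]))    # test on the fifth
print(f"average accuracy over five folds: {np.mean(accs):.2%}")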
Experiment 2: Traditional machine learning methods and deep learning models for sentiment classification on the same test set.
As shown in Table 4, BiLSTM neural networks were more accurate than single-layer LSTM, indicating that the use of contextual information and consideration of time series can yield better solutions to the problem of classifying sentiments in text.
The CNN model was not as effective as the LSTM network for sentiment analysis. A comparison of the results shows that CNN can be used to analyze text information, in addition to its benefits in image processing. fastText had the lowest accuracy, but the model is simple and fast to train.
Table 4. Experimental results of traditional machine learning methods and deep learning models on Burmese sentiment classification
Experiment 3: Ablation study
In the ablation experiment, we used CNN, attention, and BiLSTM mechanisms to perform sentiment analysis on Burmese. The specific comparison results are shown in Fig. 3.
As shown in Fig. 3, the Att-BiLSTM neural network model was not as effective as the Att-CNN neural network when applied to sentiment analysis. A comparison of the results shows that CNN is better at capturing local features than BiLSTM, and local features are important for the sentiment analysis of sentences. The Att-CNN-Trans model used here introduced the attention mechanism into the convolutional neural network, whereas the CNN-Trans model did not. The attention mechanism can target sentiment-related features among the extracted textual features, which is why the model yielded better classification performance than CNN-Trans on the validation and test sets. The Att-CNN-Trans model used transfer learning in the CNN-based attention mechanism, whereas the Att-CNN model did not use an English labeled set to pre-train the model and did not map Burmese sentences to English; the accuracy was thus improved because Burmese sentiment classification lacks a labeled corpus. For transfer learning, cross-lingual sentiment features are learned by sharing the neural network layer parameters of the English sentiment analysis model, which can assist in the classification of sentiments in Burmese.