## Pum-Mo Ryu*

| Example 1 | |
| --- | --- |
| Tweet mention (Korean) | 주가가 (ju-ga-ga) 떨어져서 (tteol-eo-jyeo-seo) 짜증나 (jja-jeung-na) |
| English translation | It’s annoying because the stock price is falling. |
| POS-tagged | 주가/nc+가/jc 떨어지/pv+어서/ec 짜증나/pv+아/ef |
| Sentiment analysis | Polarity: NEGATIVE; Clue expression: |
| Example 2 | |
| Tweet mention (Korean) | 삼성 (sam-sung) 주가 (ju-ga) 오르네 (o-reu-ne) ^^ |
| English translation | Samsung’s stock price goes up ^^ |
| POS-tagged | 삼성/nc 주가/nc 오르/pv+네/ef ^^/s |
| Sentiment analysis | Polarity: POSITIVE; Clue expression: ^^/s |

In this section, we define the data model used in prediction models, illustrate the basic prediction models, describe a CCF analysis to select data that are highly correlated with the unemployment rate, and finally show the result of fitting the selected data with the prediction models.

4.1 Data Model

The data models used in this paper are as follows:

• *Unemployment Rate Index* (*UI*): The monthly unemployment rate index of Korea. We denote the unemployment rate in the *t*-th month as

• *Google Index* (*GI*): The monthly Google search frequency of a keyword, as provided by Google Trends. We denote the Google Index of keyword *w* in the *t*-th month as

• *Social Keyword Index* (*SKI*): The monthly social media frequency of a keyword extracted from social media content. We denote the Social Keyword Index of keyword *w* in the *t*-th month in medium *m* as *f_{w,m,t}*.

• *Social Sentiment Index* (*SSI*): The monthly social media sentiment frequency of a keyword extracted from social media content. We denote the Social Sentiment Index of keyword *w* in the *t*-th month in media
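As a minimal sketch of how a frequency index such as *f_{w,m,t}* could be tabulated, the following Python snippet counts, for each keyword, medium, and month, the number of documents that mention the keyword. The `Doc` record, the naive substring match, and the use of raw document counts without normalization are assumptions made for illustration; the text above does not specify these details.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    """Hypothetical record for one social media document."""
    medium: str   # e.g. "news", "blogs", "tweets"
    posted: date  # posting date
    text: str     # raw content


def social_keyword_index(docs, keywords):
    """Return a Counter mapping (keyword, medium, "YYYY-MM") to a document count."""
    index = Counter()
    for doc in docs:
        month = f"{doc.posted.year:04d}-{doc.posted.month:02d}"
        for w in keywords:
            if w in doc.text:  # naive substring match (assumption)
                index[(w, doc.medium, month)] += 1
    return index


# Toy documents: the two tweets reuse the example sentences of this paper;
# the dates and the news headline are invented for illustration.
docs = [
    Doc("tweets", date(2013, 9, 3), "주가가 떨어져서 짜증나"),
    Doc("tweets", date(2013, 9, 17), "삼성 주가 오르네 ^^"),
    Doc("news", date(2013, 10, 1), "청년실업률 상승"),
]
ski = social_keyword_index(docs, ["주가", "청년실업률"])
```

A sentiment index could be tabulated in the same way by adding the polarity label to the key, provided a sentiment analyzer supplies that label for each mention.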

As one of the most useful methodologies for analyzing a time series, the autoregressive integrated moving average (ARIMA) model offers great flexibility in analyzing various time series and gives accurate forecasts [ 13 ]. The ARIMA model with seasonal terms (SARIMA) can be written as follows:

where $\phi_p(B)$, $\theta_q(B)$, $\Phi_P(B^s)$, and $\Theta_Q(B^s)$ are as follows:

$$\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p$$

$$\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$$

$$\Phi_P(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \cdots - \Phi_P B^{Ps}$$

$$\Theta_Q(B^s) = 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \cdots - \Theta_Q B^{Qs}$$

where $B$ is the backshift (lag) operator (e.g., $B^b Z_t = Z_{t-b}$) and $e_t$ is white noise, $WN(0, \sigma_e^2)$. For example, the ARIMA(*p*, *q*) model can be simply described as follows:

where * y _{ t } * is the unemployment rate of
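For reference, a common textbook (Box-Jenkins) way of writing these models is sketched below. The differencing orders $d$ and $D$, the seasonal period $s$, and the constant $c$ follow the usual convention and are assumptions here, not necessarily the exact notation of this paper's own equations. The multiplicative seasonal model is typically written as

$$\phi_p(B)\,\Phi_P(B^s)\,(1-B)^d\,(1-B^s)^D Z_t = \theta_q(B)\,\Theta_Q(B^s)\,e_t$$

Without seasonal terms and without differencing, expanding the two non-seasonal polynomials for a series $y_t$ gives

$$y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + e_t - \theta_1 e_{t-1} - \cdots - \theta_q e_{t-q}$$

The minus signs on the moving-average terms follow from the definition $\theta_q(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$ used above.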

As an extended version of the ARIMA model, the ARIMAX model also includes other independent (predictor) variables. The ARIMAX model is similar to a multivariate regression model but allows taking advantage of the autocorrelation that may be present in the residuals of the regression to improve the accuracy of a forecast. An ARIMAX model simply adds the covariate on the right-hand side of ARIMA as follows:

where * x _{ t } * is a covariate at time t and is its coefficient. For brevity, we use only a single covariate in the model above, but more than two covariates can be contained in the model as an additive type. The ARIMAX model can be considered a special case of the transfer function model. Just like the ARIMA model, the ARIMAX model without a seasonal factor—including more than

One of the advantages of the ARIMAX model over ARIMA is that it uses the information in the covariate series. In practice, however, choosing the lag *d* is not easy, especially when more than two covariates are contained in the model. A simple and useful model that incorporates both the historical information and the covariate information is the (seasonal) autoregressive model with exogenous variables (ARX). The ARX model is a linear difference equation model that relates the input to the output. By increasing the number of exogenous input terms, we can better approximate the observed dynamics of the system. The ARX model is defined as follows:

where *y_{t-i}* (*i* = 1, …, 12) is the historical time series of lag
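The ARX idea can be sketched as an ordinary least-squares regression of the target series on its own lags and on lagged exogenous series. The following pure-Python snippet is an illustration under stated assumptions, not the fitting procedure used in this paper: the function names, the intercept term, the lags passed in, and the normal-equations solver are all hypothetical choices.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_arx(y, exog, ar_lags, exog_lags):
    """Least-squares fit of the assumed form
        y[t] = c + sum_i a_i * y[t - i] + sum_j b_j * exog[j][t - d_j] + e[t]

    ar_lags   : autoregressive lags i (e.g. range(1, 13) for monthly data)
    exog_lags : one lag d_j >= 0 per exogenous series in `exog`
    Returns the coefficients as [c, a_..., b_...].
    """
    start = max(list(ar_lags) + list(exog_lags))
    rows, targets = [], []
    for t in range(start, len(y)):
        row = [1.0]                                   # intercept
        row += [y[t - i] for i in ar_lags]            # lagged target values
        row += [x[t - d] for x, d in zip(exog, exog_lags)]  # lagged covariates
        rows.append(row)
        targets.append(y[t])
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict_next(y, exog, ar_lags, exog_lags, beta):
    """One-step-ahead forecast for t = len(y); every exog lag must be >= 1
    unless the exogenous series already extends to time t."""
    t = len(y)
    feats = [1.0] + [y[t - i] for i in ar_lags]
    feats += [x[t - d] for x, d in zip(exog, exog_lags)]
    return sum(b * f for b, f in zip(beta, feats))
```

With monthly data, calling `fit_arx(y, [x1, x2], range(1, 13), [3, 1])` would use twelve autoregressive lags and two covariates lagged by 3 and 1 months. With only 24 months of data such a model has few degrees of freedom, so a smaller lag set may be preferable in practice.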

Next, we describe the procedure used to select keywords whose trends are highly correlated with the unemployment rate, based on the cross-correlation function (CCF). We collected keywords related to the unemployment rate from 10 people, each of whom submitted 100 keywords. The resulting set consists of 622 keywords. We extracted the GI, SKI, and SSI of each keyword and compared the indices with the UI using the CCF implementation in R (http://www.r-project.org/).

The problem we are considering is the description and modeling of the relationship between two time series, such as (GI( * w * ), UI) pair, (SKI( * w * ), UI) pair, or (SSI( * w * ), UI) pair. In the relationship between two time series ( * y _{ t } * and

We selected the keywords whose CCF shows high correlation with the UI and whose time lag is between 0 and −4. The selected keywords were used as covariates in the prediction models.
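The selection step can be sketched in Python as follows. The lag convention (lag *k* relates x[t+k] to y[t], so a negative lag means the keyword index moves first) is the one documented for R's `ccf(x, y)`; the threshold `min_abs_r` and the function names are hypothetical, since no numeric cutoff is stated here.

```python
from math import sqrt


def ccf(x, y, lag):
    """Sample cross-correlation between x[t + lag] and y[t].

    A negative lag means x leads y. Uses full-series means and variances
    and divides the cross-covariance by n, as the classical estimator does.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("series must have equal length")
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - mx) ** 2 for v in x) / n)
    sy = sqrt(sum((v - my) ** 2 for v in y) / n)
    cov = sum(
        (x[t + lag] - mx) * (y[t] - my)
        for t in range(n)
        if 0 <= t + lag < n
    ) / n
    return cov / (sx * sy)


def best_leading_lag(x, y, lags=range(-4, 1)):
    """Return (lag, r) with the largest |r| among the candidate lags."""
    return max(((k, ccf(x, y, k)) for k in lags), key=lambda kr: abs(kr[1]))


def select_keywords(indices, ui, min_abs_r=0.5):
    """Keep keywords whose best lag in [-4, 0] has |r| >= min_abs_r.

    `indices` maps keyword -> monthly index series (GI, SKI, or SSI);
    `min_abs_r` is a hypothetical threshold.
    """
    selected = {}
    for w, series in indices.items():
        k, r = best_leading_lag(series, ui)
        if abs(r) >= min_abs_r:
            selected[w] = (k, r)
    return selected
```

The returned lag of each selected keyword can then be used to decide how many months earlier its index enters a prediction model as a covariate.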

4.4 Model Fitting and Prediction

We fitted the ARIMA, ARIMAX, and ARX models using the UI, GI, and social indices (SI; i.e., SKI and SSI), respectively, with the data selected by the CCF analysis. We tested all possible combinations of keywords and selected the best model in each model category as follows:

• *Model_U*: The ARIMA model (Eq. (3)) based on the UI alone, without GI, SKI, or SSI as exogenous variables. The fitted model is shown in Eq. (7):

• *Model_G*: The ARIMAX model (Eq. (5)) based on the UI with GI as an exogenous variable. The UI of the previous month and the GI for *청년실업률* (*cheong-nyeon-sil-eop-ryul*) (youth unemployment rate) and *해고* (*hae-go*) (dismissal) are used to fit the model. If we know the GI for *청년실업률* (*cheong-nyeon-sil-eop-ryul*) 3 months earlier and for *해고* (*hae-go*) 1 month earlier, we can predict the unemployment rate for the current month using Eq. (8).

• *Model_K*: The ARX model (Eq. (6)) based on the UI with SKI as an exogenous variable. We fit three models based on different media types: news, blogs, and tweets. Eq. (9) shows a prediction model fitted with the SKI of tweets. The frequencies of *물가* (*mul-ga*) (price) and *인플레이션* (*in-peul-le-i-syeon*) (inflation) in the tweets are used.

• *Model_S*: The ARX model (Eq. (6)) based on the UI with SSI as an exogenous variable. This model also consists of three models based on the media types: news, blogs, and tweets. Eq. (10) shows a fitted model in which the SSIs of three keywords are used: *실직* (*sil-jik*) (unemployment), *알바* (*al-ba*) (part-time job), and *주가* (*ju-ga*) (stock price). The lag and the sentiment differ across the three keywords.

*Model_U* and *Model_G* are the baselines in our experiment. As our proposed models, *Model_K* and *Model_S* test the usefulness of social media content in predicting the unemployment rate.

We built the prediction models using 24 months of data, from September 2011 to August 2013, and tested the models using four months of data, from September 2013 to December 2013. The data have a sufficient time span to model the seasonal characteristics of the unemployment rate. *Model_U*, *Model_G*, and the three types of *Model_K* and *Model_S* were compared based on their goodness-of-fit (GOF) and prediction accuracy. GOF was measured on the two years of training data, and prediction accuracy was measured on the four months of test data. We also compared our models with the forecast of Trading Economics (http://www.tradingeconomics.com), a commercial economic data provider that publishes monthly unemployment rate predictions for each country.

The GOF and prediction accuracy of models were evaluated using the well-known prediction metrics of mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) as in Eqs. (11), (12), and (13), respectively. Let * y _{ i } * and
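The three metrics have standard definitions, which the following Python sketch implements. The argument names `actual` and `predicted` are illustrative choices, and MAPE is expressed in percent, which is consistent with the magnitudes reported in Tables 2 and 3.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up four-month example, not data from this paper.
actual = [3.0, 3.2, 2.8, 3.0]
predicted = [3.1, 3.0, 2.9, 3.3]
scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

Because RMSE squares the errors before averaging, it penalizes occasional large misses more heavily than MAE, which is why a model can rank differently under the two metrics.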

In terms of GOF (Table 2), the MAPE of all models lies between 2.801 and 5.629; that is, the fitted values deviate from the actual unemployment rate by less than 6% on average. *Model_U* showed the highest error among all models on all metrics, which means that the exogenous variables (GI and the social indices SKI and SSI) improved the fit of *Model_G*, *Model_K*, and *Model_S*. In particular, the social-media-based models *Model_K* and *Model_S* showed lower errors than *Model_G*, the only exception being the MAPE of *Model_K* (tweets). We can infer that social media content is more effective than GI in capturing the social mood on the topic of labor. *Model_S* showed the lowest RMSE and produced the most stable results, without sharp error spikes; it reduced the RMSE by 48.5% compared to *Model_G*. *Model_K* (blogs) showed the lowest MAE and MAPE among all models: compared to *Model_G*, it reduced the MAE and MAPE by 42.0% and 41.0%, respectively.

Table 2.

| Model | RMSE | Improvement in RMSE (%) | MAE | Improvement in MAE (%) | MAPE | Improvement in MAPE (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Model_U (baseline 2) | 0.279 | - | 0.192 | - | 5.629 | - |
| Model_G (baseline 3) | 0.222 | - | 0.158 | - | 4.748 | - |
| Model_K (news) | 0.165 | 25.6 | 0.138 | 12.6 | 4.230 | 10.9 |
| Model_K (blogs) | 0.137 | 38.0 | 0.092 | 42.0 | 2.801 | 41.0 |
| Model_K (tweets) | 0.201 | 9.2 | 0.154 | 2.3 | 4.810 | -1.3 |
| Model_S | 0.114 | 48.5 | 0.099 | 37.2 | 3.182 | 33.0 |

Table 3.

| Model | RMSE | Improvement in RMSE (%) | MAE | Improvement in MAE (%) | MAPE | Improvement in MAPE (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Trading economics (baseline 1) | 0.415 | - | 0.375 | - | 13.664 | - |
| Model_U (baseline 2) | 0.317 | - | 0.298 | - | 10.832 | - |
| Model_G (baseline 3) | 0.433 | - | 0.403 | - | 14.586 | - |
| Model_K (news) | 0.358 | 17.3 | 0.342 | 15.2 | 12.349 | 15.3 |
| Model_K (blogs) | 0.448 | -3.5 | 0.390 | 3.3 | 14.245 | 2.3 |
| Model_K (tweets) | 0.318 | 26.5 | 0.291 | 27.8 | 10.513 | 27.9 |
| Model_S | 0.946 | -118.6 | 0.879 | -118.0 | 31.603 | -116.7 |
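The improvement columns are consistent with a relative error reduction against *Model_G*, that is, (baseline error − model error) / baseline error × 100. This formula is an inference from the reported numbers rather than a definition stated in the text; recomputing it from the rounded table entries reproduces the reported percentages to within a few tenths of a point. A short Python sketch using the Table 3 values for *Model_K* (tweets):

```python
def improvement_pct(baseline_error, model_error):
    """Relative error reduction versus a baseline, in percent (assumed formula)."""
    return (baseline_error - model_error) / baseline_error * 100.0


# Table 3 values: Model_G (baseline 3) versus Model_K (tweets).
model_g = {"RMSE": 0.433, "MAE": 0.403, "MAPE": 14.586}
model_k_tweets = {"RMSE": 0.318, "MAE": 0.291, "MAPE": 10.513}

gains = {
    metric: round(improvement_pct(model_g[metric], model_k_tweets[metric]), 1)
    for metric in model_g
}
```

A negative value, as for *Model_S* in Table 3, means the model's error is larger than that of the baseline.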

In terms of prediction accuracy (Table 3), all three variants of *Model_K* outperformed *Model_G* on the MAE and MAPE metrics. In particular, *Model_K* (tweets) showed the lowest MAE and MAPE among all models, reducing the errors by 27.8% and 27.9%, respectively, compared to *Model_G*. This suggests that the frequency of the keywords extracted from tweets is closely related to the unemployment rate index. Note, however, that *Model_S* showed a higher error rate than all other forecasts, unlike in the GOF evaluation. We attribute this relatively low performance mainly to the limited accuracy of the sentiment analysis; its coverage and precision should therefore be increased before the technique is applied to a real-world application, and future research is needed to improve it. The short forecast period also makes an objective assessment difficult. *Model_K* (news) and *Model_K* (tweets) performed better than the Trading Economics forecast, which suggests that our models could be applied to a commercial service. In prediction, news and tweets performed better than blogs, possibly because news and tweets reflect actual social phenomena more directly.

Fig. 3 shows the actual unemployment rate together with the fitted and predicted results of the models. The first 24 months of each graph are the fitted results, and the last four months are the predicted results. Overall, the fitted graphs show reasonable results, but some of the predicted graphs show irregular patterns. Note, however, that the values predicted by *Model_K* (news) and *Model_K* (tweets) show the same pattern (down-up-down-up) as the UI data over the prediction period. Blogs are good for model fitting but show lower prediction accuracy than news or tweets, as shown in Table 3. The prediction accuracy should be tracked over a larger number of months to determine the characteristics of each social media type.

We presented several models to predict the unemployment rate based on social media analysis. We showed the effectiveness of social media in analyzing and quantifying social moods by applying the data to the unemployment prediction models. Our models achieved better results than the GI-based model and the simple time-series model. We will apply social media analysis to predict other social indices such as the consumer price index or the consumer sentiment index. Because such indices are tightly coupled with public life, a large number of mentions regarding these indices are expected to be posted on social media. To apply such analysis, sentiment analysis for Korean should be improved.

He received the B.S. degree in computer engineering from Kyungpook National University, Daegu, South Korea, in 1995 and the M.S. degree in computer engineering from POSTECH, Pohang, Korea, in 1997. He received the Ph.D. degree in computer science from KAIST, Daejeon, Korea, in 2009. He is currently an associate professor in the Department of ICT & Language Processing, School of Southeast Asian Studies, Busan University of Foreign Studies, Busan, Korea. His research interests include natural language processing, text mining, knowledge engineering, and question answering.

- 1 N. Askitas, K. F. Zimmermann, "Google econometrics and unemployment forecasting," *Applied Economics Quarterly*, vol. 55, no. 2, pp. 107-120, 2009. doi:10.2139/ssrn.1480251
- 2 F. D’Amuri, J. Marcucci, "'Google it!' Forecasting the US unemployment rate with a Google job search index," *FEEM Working Paper No. 31*, 2010. doi:10.2139/ssrn.1594132
- 3 J. Pavlicek, L. Kristoufek, "Nowcasting unemployment rates with google searches: evidence from the visegrad group countries," *PLoS One*, vol. 10, article no. e0127084, 2015. doi:10.1371/journal.pone.0127084
- 4 P. S. Dodds, K. D. Harris, I. M. Kloumann, C. A. Bliss, C. M. Danforth, "Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter," *PLoS One*, vol. 6, article no. e26752, 2011. doi:10.1371/journal.pone.0026752
- 5 United Nations Global Pulse, "Using social media to add depth to unemployment statistics," *UN Global Pulse White Paper*, 2011.
- 6 V. Lampos, N. Cristianini, "Nowcasting events from the social web with statistical learning," *ACM Transactions on Intelligent Systems and Technology*, vol. 3, article no. 72, 2012. doi:10.1145/2337542.2337557
- 7 A. Signorini, A. M. Segre, P. M. Polgreen, "The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic," *PLoS One*, vol. 6, article no. e19467, 2011. doi:10.1371/journal.pone.0019467
- 8 S. Lim, C. Lee, P. M. Ryu, H. Kim, S. K. Park, D. Ra, "Domain‐adaptation technique for semantic role labeling with structural learning," *ETRI Journal*, vol. 36, no. 3, pp. 429-438, 2014. doi:10.4218/etrij.14.0113.0645
- 9 L. Velikovich, S. Blair-Goldensohn, K. Hannan, R. McDonald, "The viability of web-derived polarity lexicons," in *Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics*. Stroudsburg, PA: Association for Computational Linguistics, 2010, pp. 777-785. https://dl.acm.org/citation.cfm?id=1858118
- 10 K. J. Lee, J. E. Kim, B. H. Yun, "Extracting multiword sentiment expressions by using a domain‐specific corpus and a seed lexicon," *ETRI Journal*, vol. 35, no. 5, pp. 838-848, 2013. doi:10.4218/etrij.13.0113.0093
- 11 C. Strapparava, A. Valitutti, "WordNet affect: an affective extension of WordNet," in *Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC)*, Lisbon, Portugal, 2004, pp. 1083-1086. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.4281
- 12 A. Esuli, F. Sebastiani, "SentiWordNet: a publicly available lexical resource for opinion mining," in *Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC)*, Genoa, Italy, 2006, pp. 417-422. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.7217
- 13 G. E. P. Box, G. M. Jenkins, *Time Series Analysis: Forecasting and Control*. Englewood Cliffs, NJ: Prentice Hall, 1976.