1. Introduction
The postal catalog of newspapers and magazines reflects the publication status of each year. The catalog is a bibliographic overview of various newspapers and magazines published throughout the country. It plays a vital role in regulating the publication, distribution, and publicity of newspapers and magazines. As newspapers and magazines have been transformed and reformed, today's catalogs contain more information. The concise catalog [1] contains substantial amounts of information, including postal code, name, publishing office, issue type, subscription type, the minimum number of consecutive issues, retail unit price, subscription unit price, monthly price, yearly price, issuing mode, layout, number of openings, introduction, postal signs, product categories, best-selling status, bundle specifications, distri¬bution restrictions, remarks, and approval date.
The existing literature on condensed catalogs focuses on two aspects: ordering strategy [2,3] and pricing strategy [4-6] for specific types of newspapers and magazines. The shortcoming of the current research is that it stops at the descriptive statistical analysis of prices and recommends pricing strategies from the marketization perspective while ignoring the internal factors of newspaper and magazine popularity. This paper aims to analyze the critical information in the concise catalog of newspapers and magazines, explore the factors influencing the best-selling status attainment, and use the relevant influencing factors to predict popularity. The findings would benefit the newspaper industry and enhance market competitiveness.
A flowchart of the contents of this paper is shown in Fig. 1.
Flowchart of research content.
As illustrated in Fig. 1, the rest of this paper is arranged as follows. Section 2 discusses the development trends of newspapers and magazines. Section 3 describes data exploration, data cleaning, data trans¬formation, text analysis, and feature construction through correlation and clustering analyses. Section 4 describes the use of classification models to explore factors that influence sales. Section 5 presents several competitive strategies for newspapers and magazines for the media. Finally, Section 6 presents the conclusions and limitations of the study.
2. Background
2.1 Current Status
The development of China's print media has been declining recently, and the prospects of print media are pessimistic [7]. New media such as the Internet, television, and mobile newspapers can quickly deliver information to a larger audience. Hence, their strong development momentum has placed considerable pressure on the survival of traditional newspapers. Problems such as the declining number of readers and rising costs have forced more publishing firms to change business strategies, for example, by compressing publication print volumes and increasingly focusing on new media.
Since 2005, newspaper and magazine distribution has experienced certain adjustments and fluctu¬ations. The market value of newspapers has been declining, and many newspapers now need government intervention to survive [8]. In the current transition period, newspapers and magazines have been operated as a group while the capital scale of publishing, printing, and distribution groups is gradually increasing. By the end of 2017, 125 publishing and media groups were in China, including 47 newspaper groups [9]. Currently, capital is flowing into the media market at an accelerated rate while media houses seek to go public. Meanwhile, other scenarios exist, such as listings, mergers and acquisitions, and restructuring. Newspapers and magazines are subdivided into regions, industries, and groups of people. To go global, foreign media enter the local market through copyright cooperation and domestic means.
2.2 Development Trend
According to the “News and Publication Industry Analysis Report” [10] released by the China Press, Publication, Radio, and Television Network, the government has financed the development of the newspaper industry since 2016 to consolidate the critical position and role of mainstream newspapers as central in dispensing news and shaping public opinion [11]. In 2019, the decline in the total print volume of newspapers and magazines was halted, and profits began increasing. The average selling price of newspapers and magazines has increased while the average number of prints has decreased.
China's media integration has entered an era of transformation from a formal “merge” to an all-around “integration.” As a benchmark for the change and integration of the newspaper industry, the “two micro and one end” (Weibo, WeChat, client) platform has advanced significantly and become the core force of the new media in the newspaper industry. However, the print run of various newspapers has continued to shrink. For example, Life Services and Digest has declined strongly, while Professional Digest and Reader's Digest have declined slightly. The number of prints in various magazines has also declined, with the most apparent decrease occurring in literature and art. By contrast, the reduction in culture and edu¬cation publications has eased. From the perspective of the content structure of newspapers, the proportion of comprehensive newspapers, life service newspapers, and abstract newspapers has continued to decline—the proportion of professional newspapers and magazines, such as teaching assistant news¬papers, has increased. Additionally, the percentage of newspapers targeting older people and women has also increased. From the perspective of the product structure of magazines, the proportion of philosophy and social sciences, culture, and education has continued to decline, and the ratio of comprehensive magazines has increased moderately. Finally, the share of nature with arts, natural sciences, and technology has continued to decline.
3. Data Analysis
This section presents a descriptive analysis of critical information from concise catalogs and data pre-processing operations for feature selection and construction.
3.1 Data Preprocessing
The original 2021 postal catalog contains 1,478 newspapers and 7,415 magazines. As not all of the information is directly accessible, some of the information requires pre-processing, such as removing outliers, filling missing values, and converting data types.
3.1.1 Outlier removal
An outlier is a single value in a sample that deviates considerably from the rest of the observations in the model to which it belongs. The empirical rule states that 99.7% of normally distributed data lies within three standard deviations of the mean. An outlier is a measured value that differs from the mean by more than three standard deviations. The purpose of a study typically determines how to set the outliers to be removed from the data because outliers may introduce significant bias to the results of the analysis [12].
The primary distribution of the subscription unit prices of newspapers and magazines is presented in Table 1.
Statistical summary of unit prices of newspapers and magazines
The maximum value deviated considerably from the median and mode values, as specific types were not excluded. Therefore, the annual magazines and monthly newspapers were removed first. The subsequent statistical summary of the unit prices of newspapers and magazines is presented in Table 2.
Statistical summary of unit prices of newspapers and magazines after removing specific types
According to the empirical rule, the highest value for newspapers was 1.78 + 3 × 1.39 = 5.95, and the highest value for magazines was 24.73 + 3 × 30.42 = 116. Compared with Table 1, Table 3 shows that 273 newspapers and 832 magazines were removed as outliers.
Statistical summary of unit prices of filtered newspapers and magazines
3.1.2 Data transformation
We created numerical circulation indicators based on the categories. For example, the months were converted to a range of 1 to 12, the weeks to 1 to 52, and the days to 1 to 365.
3.1.3 Handling missing values
The missing values were handled by removing rows or columns with null values. However, there were 829 (187 newspapers + 642 magazines) blanks in the introduction of newspapers and magazines. Thus, another approach was to fill the blanks with the corresponding names.
3.1.4 Selection of best-selling indicators
The best-selling newspapers and magazines issued by China Post were selected by the Newspapers and Magazines Bureau of China Post Group Corporation from the tens of thousands of newspapers and magazines published and distributed nationwide. They were determined by calculating scores according to the number of recommendations, issue volume, year-on-year increase, and turnover. The selections were divided into six categories: current politics and finance, cultural integration, elderly health, family life, youth education, and fashion.
3.2 Descriptive Statistic
Distribution characteristics mainly analyze the frequency distribution, central type tendency, discrete trend skewness, and kurtosis. After removing outliers, 6,583 magazines and 1,205 newspapers were left. Newspapers and magazines differ considerably in price; hence, both categories were analyzed separately. The frequency distribution of newspapers is listed below:
The frequency distribution of magazines is listed below:
The publication frequency of newspapers was mainly weekly, and magazines were mainly monthly and bimonthly. A descriptive statistical analysis was performed on newspapers and magazines; the results are presented in Tables 4 and 5.
Descriptive statistics of newspapers
Descriptive statistics of magazines
Table 4 shows that the dispersion of the unit price was small; the yearly cost and publishing frequency exhibited a platykurtic distribution, while the unit price and the number of pages exhibited a leptokurtic distribution, all of which were right-skewed-distributed.
Table 5 shows that the publishing frequency dispersion of magazines was minor; they were all leptokurtic-distributed. The publishing frequency was left-skewed, and the rest were right-skewed.
3.3 Word Count and Keyword Statistics based on Brief Introduction and Product Categories
The newspaper and magazine types were grouped from different angles. The word count analysis of the “Introduction” and “product category” information approximately classified their types. The “Intro¬duction” uses different words and expressions to describe various newspapers and magazines briefly. For example, the “People's Daily” is described as “the organ of the Central Committee of the Communist Party of China.” The description of “China TV News” is “announcement and promotion of CCTV programs, while taking into account the satellite programs of various provincial stations, reporting the situation on the screen and behind the scenes and the latest developments in the film and television industry at home and abroad, and providing extended services related to programs.”
The top 15 keywords from the brief introduction of newspapers are listed below. In addition to news reports, newspapers currently focus on student-related content, and there has been a recent trend of an expanded distribution of these types.
The top 15 keywords from the brief introduction of magazines are listed below. The content of the magazines still mainly reflects domestic and international achievements as well as new trends in science and technology.
The Product Category descriptive words are not selected from the product structure of the hierarchical and content structure of magazines or newspapers. Moreover, the product category descriptions are not detailed. Therefore, overlap and confusion are inevitable in the classes. For example, university magazines are divided into finance, science and technology, agriculture, forestry, animal husbandry, medicine, philosophy, and natural sciences. Similarly, finance and economics have many subdivisions, such as finance, accounting, statistics, banking, securities, and insurance.
The top 15 keywords from the product category of newspapers are listed below. The most-occurred had an occurrence rate of 407/1,205 = 0.3378. The proportion of newspapers in market segments such as teenagers, senior citizen and women was high.
The top 15 keywords from the magazine product category are listed below. The highest-frequency word "Education" appeared in 1,038 magazines. Compared with the product categories of newspapers, those of magazines have more subdivisions, with education and colleges accounting for a large proportion. Medical and health research is gradually advancing, whereas geography and politics research is relatively rare.
3.4 Feature Selection and Generation
This section describes the steps involved in feature construction and selection. First, two new features, product category, and profile category, were constructed through word frequency and cluster analyses; then, the impact factors related to best-selling were selected through correlation analysis.
3.4.1 Word frequency matrix
In the experiment, word segmentation was performed on each newspaper and magazine, and the text type data were converted into a numerical type to construct feature variables. The “jieba” (结巴) library [13] for Chinese word segmentation and the function CountVectorizer() were used to vectorize the text. Each word obtained after text vectorization represented a different numbered feature. The resultant database still required a manual screening of the words that needed to be eliminated. Certain phrases with similar meanings were also merged. Keywords in the introduction generated 3,140 words from newspapers and 11,380 words from magazines, whereas keywords in the product category generated 71 words from newspapers and 180 words from magazines.
3.4.2 Cluster analysis
After building the word frequency matrix, each newspaper and magazine was converted into a containing only 0 and 1 second. Next, the K-means clustering algorithm was used for cluster analysis. The K-means algorithm divides N instances into K disjoint "sample clusters" so that each point belongs to the cluster corresponding to its nearest mean. The algorithm reassigns objects to the clusters closest to the updated standard through an iterative relocation process within a close spatial range [14]. K-means clustering can efficiently cluster data and is often used as a pre-processing step for other algorithms.
Newspapers were divided into five categories according to their content: comprehensive, professional, life services, readers, and abstracts. Magazines were divided into five categories from the product structure: philosophy–social sciences, culture–education, literature–arts, natural sciences–technology, and general—the number of types K was set to 5. Here, the t-distributed neighborhood embedding nonlinear dimensionality reduction algorithm [15] was used to reduce the dimensionality of the data before implementing the K-means algorithm to ensure that the points in the generated clusters were relatively concentrated.
The categories can also be manually categorized according to the content structure of newspapers and the product structure of magazines. As the process is cumbersome and lacks clear standards, automatic clustering methods were used in this study.
3.4.3 Correlation Analysis
After removing outliers, 181 best-selling magazines and 51 best-selling newspapers remained. A high correlation was found to exist between the best-selling status and unit price, yearly price, the number of pages, and publishing frequency. Figs. 2 and 3 illustrate the results of the correlation analysis using the Pearson correlation coefficient [16] to measure the information from newspapers and magazines, respectively.
Correlation between factors of newspapers.
Correlation between factors of magazines.
The popularity of newspapers was positively correlated with the yearly price, publishing frequency, and the number of pages, and the popularity had a weak negative correlation with the product cluster. The unit price was positively correlated with the yearly price and number of pages but negatively correlated with the publishing frequency.
For magazines, best-selling status was positively related to publishing frequency. The unit price, the number of pages, and the introduction cluster correlated more with best-selling status than the yearly price and product cluster. The unit price was positively correlated with the yearly price and the number of pages but negatively correlated with publishing frequency.
3.4.4 Feature selection
In addition to correlation analysis, we chose the unit price, publishing frequency, number of pages, and product clustering as the influencing factors of whether a newspaper sells well. Similarly, we chose the annual price, publishing frequency, number of pages, and introduction clustering as the influencing factors of whether a journal sells well. The correlation between the two clustering categories and whether they were famous was low. On the one hand, the automatic clustering method was defective; on the other hand, the introduction of the relevant text was too concise and could not contain too much information. Regardless, both were selected as the candidate factors.
4. Classification and Prediction
This section describes the building and training of a machine learning model for predicting whether newspapers and magazines sell well.
4.1 Data Processing and Model Building
4.1.1 Feature and target variables extraction
The target variable was set to “sell well” or “not,” and the independent feature variables were yearly price, publishing frequency, and product category.
4.1.2 Training and test sets division
The sample proportion of the target variable was highly unbalanced. The ratio of best-selling magazines was 181/6,583 = 2.75%, and the percentage of best-selling newspapers was 51/1,205 = 4.23%. In this case, it was necessary to construct the dataset by adopting the minority class sample oversampling or the majority class sample undersampling method. The synthetic minority oversampling technique (SMOTE) [17] was used to analyze the best-selling samples of the minority. The technique artificially synthesized new models according to the best-selling pieces to augment the dataset. Undersampling divided the majority class samples into equal parts with the number of instances of the minority class and formed the dataset in batches with the minority samples. The size of the dataset constructed by the undersampling method was limited, significantly impacting the experimental results. Hence, the SMOTE oversampling method was selected for the investigation. Owing to the limited sample, the k-fold cross-validation was used to evaluate the machine learning model. The dataset was shuffled randomly and split into five groups.
4.1.3 Classification Models Building
Three classification machine learning methods were used: support vector machine (SVM) [18], multilayer perceptron (MLP) neural network [19], and decision trees (DTs) [20] for model building.
When a training set is extensive or has many features, the linear kernel function (SVM without kernel function) is often used. MLP is a multilayer feedforward network model composed of multiple node layers. The MLP neural network overcomes the inability of the perceptron to identify inseparable linear data, can continuously optimize the network model through adaptive learning, and has strong computing power. DTs are constructed using a heuristic method called recursive partitioning.
4.2 Predicted Classification
The results presented below are the average results of 20 training and testing runs. The prediction accuracies under different factors were tested separately.
Prediction accuracy in a single-factor case: (a) newspaper and (b) magazine.
4.2.1 Single-factor case
For newspapers and magazines, the predicted accuracies (mean and standard deviation) of each factor related to the best-selling status are presented in Fig. 4. Yearly price and unit price show better classification performance in newspapers and magazines, respectively.
4.2.2 Two-factor case
In the combination of two factors, as shown in Fig. 5, the accuracies of the price mixture (yearly prices or unit prices) or clustering-based feature mixture (product or introduction) were improved compared with those of the previous single-factor case. The combination of publishing frequency and number of pages also generally increased relative to the case of a single factor.
Prediction accuracy in the case of two factors: (a) newspaper and (b) magazine.
4.2.3 Three-factor case
Fig. 6 shows that using a combination of three factors was more accurate than using a variety of two factors or a single factor.
4.2.4 Four-factor and all-factors cases
Four and all factors were considered as features for training and testing, and the accuracies obtained are illustrated in Fig. 7. The figure shows that using a combination of the four correlated factors yielded significantly better results than those of fewer combinations. When all elements were combined, the results slightly decreased for DTs and increased for SVM and MLP, especially in the magazine test.
Prediction accuracy in the case of three factors: (a) newspaper and (b) magazine.
Prediction accuracy in the case of four and all factors.
4.2.5 Discussion
As shown in Figs. 4–7, the corresponding prediction accuracy increases with the number of influencing factors. Under the same conditions, the accuracy of using DTs is considerably higher than that of using SVM and MLP. As the data dimensions increase, the performance of SVM and MLP improves. DTs are prone to overfitting; hence, when all the factors are used, the accuracy of using DTs decreases slightly, but that of SVM and MLP considerably improves. Using a random forest model can solve the problem of overfitting when all factors are considered. In all the combinations, the price factor shows a high importance rate. Cluster-based factors are not the worst performers and can improve the corresponding accuracy when combined with other factors; however, the single-factor performance is not outstanding. The main reason for the poor performance is that the data are too sparse during clustering.
Overall, we should focus on different combinations of factors for newspapers and magazines. For newspapers, yearly price and product category should be focused on, whereas unit price and introduction should be focused on magazines. Nearly 15.5% of the missing newspaper introduction values are replaced by the newspaper's name, resulting in an insignificant impact on newspaper sales. Meanwhile, an extremely broad product categorization of magazines results in a less significant effect on magazine sales.
5. Competitive Strategy
With the continuous deepening of industrialization, enterprise conglomeration, and marketization of news media, the newspaper industry must continuously improve its core competitiveness. In the new era of a rapid expansion of online media at minimal cost, the newspaper industry must explore and research connotations to pursue the coordination of “quality” and “quantity.”
In terms of content, it is necessary to emphasize social benefits and implement newspapers and magazines to be close to reality, life, and the masses, and to improve the system and mechanism that integrate social and economic benefits.
In terms of price, the principle of healthy competition should be followed; malicious low-price competition and predatory pricing should be avoided, although price discrimination can be used while different prices are charged to varying audiences in other regions.
In terms of quality, the construction of high-quality content publishing and dissemination capabilities should be strengthened, and content carriers, methods, formats, systems, and mechanisms should be innovated.
In terms of positioning, the content, style, and value orientation of the newspaper industry must be audience-oriented.
6. Conclusion
This paper reports on fundamental, statistical descriptive, and predictive analyses conducted by mining the postal catalogs of newspapers and magazines. Word frequency and cluster analyses were performed on profiles and product categories, combined with correlation analysis for constructing and selecting features. Finally, three classification models were trained and used to predict factors that affect the best-selling status of newspapers and magazines. As the dataset was small, the knowledge gained through data mining was limited. Moreover, similar relevant research and corresponding experimental control were lacking. Nevertheless, the data analysis scheme used in the study was feasible and provided comple¬mentary strategies for pricing, distribution, and content for newspapers and magazines to enhance their competitiveness.