1. Introduction
A knowledge base is commonly understood as a self-serve customer service library stored in a computer system, containing information about or appropriate solutions for a specific area. Domain knowledge is a structured representation explored or mined from unstructured data. Knowledge can also be represented by rules, and a rule-based knowledge base can provide new knowledge to the users of a decision support system [1]. A deep learning approach to knowledge discovery in the power system security area is reported in [2]. Such knowledge can be structured and stored in a knowledge base. Knowledge discovery and knowledge base construction in many other areas [3] have shown that data exploration is becoming increasingly refined and that the use of information is moving closer to the practical needs of each specific domain.
Autism, also known as autism spectrum disorder (ASD), refers to a broad range of conditions characterized by challenges with social skills, repetitive behaviors, speech, and nonverbal communication. Although many studies [4] have examined it from different perspectives, we still know little about it. For example, we may want to know which proteins, metabolic processes, or genes play a decisive role in its pathogenesis and how they relate to one another. Fortunately, the large body of data provided by this research gives us a chance to investigate the disease in depth by mining the massive literature. Prevalent machine learning algorithms are bringing us closer to a complete understanding of autism. This paper focuses on knowledge of biological molecular information extracted from biomedical texts with the aid of the widely used conditional random fields (CRFs) method. Six classes of biological molecular information are considered: protein, DNA, RNA, cell line, cell component, and cell type. The resulting knowledge base can help biologists with etiological analysis and pharmacists with drug development.
2. Corpus and Preprocessing
The GENIA term annotation was provided by the GENIA Project, which was founded by Prof. Jun'ichi Tsujii and ran at the Tsujii Laboratory of the University of Tokyo from 1998 to 2012. The corpus is a collection of 1,999 biomedical abstracts [5] in the molecular biology domain, in which terms of 38 classes were annotated to help machines learn biological knowledge. We concentrated on six classes of terms: protein molecule, DNA molecule, RNA molecule, cell line, cell type, and cell component.
The original GENIA term annotation corpus is formatted as an XML file, which lets us extract the useful parts by their markup. For example, the <cons> tag indicates that the annotation that follows contains a piece of knowledge, and the <title> tag indicates that what follows is the title of the abstract.
The MEDLINE number, title, and content of each abstract, together with the six classes of annotated terms (protein molecule, DNA molecule, RNA molecule, cell line, cell type, and cell component), were extracted from the sentences with regular expressions and structured in a csv file, which can be downloaded from the website http://134.175.110.97/bioinfo/index.jsp. Each row in the file denotes an article, with 6 columns corresponding to the different kinds of information. To discover knowledge from the corpus, the csv file was further organized into samples in preparation for the learning process.
Each row is a sample consisting of a token and a tag, where the tag is the class label of the token. Because some terms contain multiple tokens, we use the traditional B-I-O representation: B marks the first token of a term, I marks a token inside a term, and O marks a token outside any term. The tags therefore comprise 13 types of labels for the 6 classes (a B and an I label per class, plus O). These samples were used to train the CRF algorithm, and the resulting model serves as the rules for knowledge discovery from the literature associated with autism.
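The extraction step can be illustrated with a minimal Python sketch. It is not the authors' program: the paper used regular expressions, whereas this sketch swaps in an XML parser because annotated terms can be nested. The `sem` attribute, the `G#...` class names, the tiny `SAMPLE` fragment, and the two-column CSV layout are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed mapping from GENIA semantic class names to the six classes used here.
WANTED = {
    "G#protein_molecule": "protein",
    "G#DNA_molecule": "DNA",
    "G#RNA_molecule": "RNA",
    "G#cell_line": "cell_line",
    "G#cell_type": "cell_type",
    "G#cell_component": "cell_component",
}

# Hypothetical, heavily simplified fragment in the style of the corpus.
SAMPLE = (
    '<article><title><sentence>Role of '
    '<cons sem="G#protein_molecule">IL-2</cons> in '
    '<cons sem="G#cell_type">T cells</cons>.</sentence></title></article>'
)

def extract_terms(xml_text):
    """Return (class, term) pairs for every <cons> element of a wanted class."""
    root = ET.fromstring(xml_text)
    pairs = []
    for cons in root.iter("cons"):
        cls = WANTED.get(cons.get("sem", ""))
        if cls is not None:
            # itertext() also collects text from nested <cons> elements.
            pairs.append((cls, "".join(cons.itertext())))
    return pairs

def to_csv(pairs):
    """Serialize the pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["class", "term"])
    writer.writerows(pairs)
    return buf.getvalue()

print(to_csv(extract_terms(SAMPLE)))
```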
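The B-I-O encoding and the count of 13 labels can be made concrete with a short Python sketch. The label spellings (for example `B-cell_type`), the example sentence, and the span format are illustrative assumptions, not taken from the corpus files.

```python
CLASSES = ["protein", "DNA", "RNA", "cell_line", "cell_type", "cell_component"]

# One B and one I label per class, plus the single O label: 6 * 2 + 1 = 13.
LABELS = ["O"] + [f"{prefix}-{cls}" for cls in CLASSES for prefix in ("B", "I")]

def bio_tags(tokens, spans):
    """Tag tokens given spans of (start, end_exclusive, class) token indices."""
    tags = ["O"] * len(tokens)
    for start, end, cls in spans:
        tags[start] = f"B-{cls}"          # first token of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{cls}"          # remaining tokens of the term
    return tags

# Hypothetical sentence with two annotated multi-token terms.
tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
spans = [(0, 2, "DNA"), (4, 6, "cell_type")]
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```

Each printed line is one training sample in the token-and-tag form described above.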
3. Methods
3.1 Mathematical Fundamentals
Conditional random fields, a kind of structured prediction method, are essentially a combination of classification and graphical modeling [6]. Much work on learning with graphical models has focused on models that explicitly model a joint probability distribution $p(y, x)$, as in Eq. (1), over outputs and inputs. However, the joint probability is difficult to model because the dimensionality of $x$ is very large and the features may have complex dependencies in many circumstances. Fortunately, CRFs, as a discriminative approach, model the conditional distribution $p(y \mid x)$ in Eq. (2) directly. In Eq. (2), $Z(x)=\sum_{y} \exp \left(\theta_{y}+\sum_{j=1}^{K} \theta_{y, j} x_{j}\right)$ is a normalization constant, and $\theta_{y}$ is a bias weight that acts like $\log p(y)$ in naïve Bayes.
Rather than using one weight vector per class, as in Eq. (2), we can use a different notation in which a single set of weights is shared across all the classes. A set of feature functions is defined as $f_{y^{\prime}, j}(y, x)=1_{\left[y^{\prime}=y\right]} x_{j}$, each of which is nonzero only for a single class. We can then use $f_{k}$ to index each feature function $f_{y^{\prime}, j}$ and $\theta_{k}$ to index its corresponding weight $\theta_{y^{\prime}, j}$. Eq. (2) can then be rewritten as Eq. (3).
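The displayed forms of Eqs. (2) and (3) can be reconstructed from the definitions of the normalization constant and the feature functions given above. The following is such a reconstruction rather than a copy of the original typesetting. For Eq. (3) to equal Eq. (2) term by term, it assumes that the bias weights are carried by additional indicator features of the form 1[y' = y], which the text does not state explicitly.

```latex
% Eq. (2): conditional distribution with one weight vector per class
p(y \mid x) = \frac{1}{Z(x)} \exp\left(\theta_{y} + \sum_{j=1}^{K} \theta_{y, j} x_{j}\right)

% Eq. (3): the same model with a single, flat-indexed set of weights
p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{k} \theta_{k} f_{k}(y, x)\right),
\qquad
Z(x) = \sum_{y} \exp\left(\sum_{k} \theta_{k} f_{k}(y, x)\right)
```

Substituting the feature functions into the sum over k leaves only the terms whose class index equals y, which recovers the exponent of Eq. (2).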
CRFs combine the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features.
3.2 Roadmap of the Method
The method for constructing the knowledge base associated with autism is outlined in Fig. 1. The GENIA corpus, as the input, has been transformed into samples by the preprocessing described above. The corpus contains 18,749 sentences comprising 508,645 tokens.
As mentioned above, feature functions are essential for the CRF algorithm to learn a model and rules from these samples. In this work, we set two types of feature functions: state transition functions $f\left(y_{i-1}, y_{i}\right)$ and state observation functions $f\left(y_{i}, x_{i}\right)$. Although long-range dependencies may slightly improve performance, we only consider the dependency between two neighboring classes, which is known as the Markov property. The samples are classified into 13 classes, so the number of transition functions is $13 \times 13 = 169$. State observation functions cover the current observation and its context. We set the following features: $f\left(y_{i}, x_{i-2}, x_{i}\right), f\left(y_{i}, x_{i-1}, x_{i}\right), f\left(y_{i}, x_{i}\right), f\left(y_{i}, x_{i}, x_{i+1}\right), f\left(y_{i}, x_{i}, x_{i+2}\right), f\left(y_{i}, x_{i-2}, x_{i-1}, x_{i}\right), f\left(y_{i}, x_{i}, x_{i+1}, x_{i+2}\right)$. Dependencies between tokens within a context window of five tokens are thus modeled in this work. The total number of feature functions after the statistics are computed is 5,519,618.
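The seven observation templates can be read as combinations of token offsets around position i. The Python sketch below expands them for one position. It is an illustration under assumptions, not the CRF++ template file used in the paper: the `T0` to `T6` prefixes, the `<PAD>` boundary symbol, and the function name are hypothetical.

```python
# Offsets relative to position i that each observation template combines,
# in the order the templates are listed in the text.
TEMPLATES = [(-2, 0), (-1, 0), (0,), (0, 1), (0, 2), (-2, -1, 0), (0, 1, 2)]

NUM_LABELS = 13
NUM_TRANSITIONS = NUM_LABELS ** 2  # one transition feature per ordered label pair

def observation_features(tokens, i, pad="<PAD>"):
    """Expand every template at position i into a feature string."""
    def token_at(j):
        # Positions outside the sentence are replaced by a padding symbol.
        return tokens[j] if 0 <= j < len(tokens) else pad

    features = []
    for template_id, offsets in enumerate(TEMPLATES):
        value = "/".join(token_at(i + offset) for offset in offsets)
        features.append(f"T{template_id}:{value}")
    return features

print(observation_features(["IL-6", "is", "increased"], 1))
print(NUM_TRANSITIONS)
```

Each distinct feature string paired with a label yields one observation feature function, which is how a corpus of this size produces millions of them.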
Fig. 1. Flow diagram of knowledge base construction associated with autism.
CRF learning is the process of using a toolkit to compute the weights of these feature functions and find the rules for knowledge discovery. The CRF++ 0.58 toolkit is a simple, open source implementation of CRFs developed for a variety of natural language processing (NLP) tasks. To evaluate the performance of the method, we split the whole corpus into a training set and a test set with four different training proportions, 0.6, 0.7, 0.8, and 0.9, and run CRF++ on each training set to learn a model. Finally, we use the CRF++ tool to learn the model over all the samples, which takes 453 iterations to converge. Fig. 2 shows how the tag error rate and the sentence error rate decrease over the first 100 iterations of the CRF++ tool. The tag error rate descends rapidly and converges to a small value with the aid of the set of feature functions. Although the sentence error rate decreases more slowly, it also converges in the final iterations. The final tag error rate is 0.0022 and the final sentence error rate is 0.03291.
Fig. 2. Decrease of the tag error rate and sentence error rate over the first 100 iterations of the CRF++ tool.
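The two error rates can be sketched in Python as follows. The definitions are assumed to be the usual ones, with the tag error rate as the fraction of mislabeled tokens and the sentence error rate as the fraction of sentences containing at least one mislabeled token. The function name and the toy data are hypothetical.

```python
def error_rates(gold, pred):
    """Return (tag error rate, sentence error rate).

    gold and pred are lists of sentences, each sentence a list of tags.
    """
    total_tags = wrong_tags = wrong_sentences = 0
    for gold_sentence, pred_sentence in zip(gold, pred):
        wrong = sum(1 for g, p in zip(gold_sentence, pred_sentence) if g != p)
        total_tags += len(gold_sentence)
        wrong_tags += wrong
        if wrong > 0:
            # A single wrong tag makes the whole sentence count as an error.
            wrong_sentences += 1
    return wrong_tags / total_tags, wrong_sentences / len(gold)

gold = [["B-protein", "O"], ["O", "O"]]
pred = [["B-protein", "O"], ["O", "B-DNA"]]
print(error_rates(gold, pred))
```

Under these definitions the sentence error rate can never be lower than the tag error rate on the same data, which is consistent with it converging more slowly.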
Table 1 shows the performance for the 12 classes (not including the O class) identified by the CRF++ tool over four test sets of different scales, with proportions of 0.4, 0.3, 0.2, and 0.1. The results show no obvious difference between the test sets, although their models are trained on different training sets. This illustrates the stability of the performance, which does not depend on the number of samples within the range of proportions tested. For the B-protein class, precision reaches 0.85 and recall 0.62 on the 0.1 test set, because B-protein is the most numerous of the classes. This demonstrates that protein identification is more reliable than that of the other classes.
The results in Table 2 show the performance for complete entities merged from the test sets. As expected, the performance is lower than that in Table 1. There is considerable room for improvement, but this does not prevent us from extracting rough knowledge from the literature. Finally, we learn the model from the entire GENIA corpus to make the training set as large as possible. The weights of each feature function are computed and stored in the model, which is used to extract knowledge from the literature associated with autism.
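Per-label precision and recall of the kind reported in Table 1 can be computed as in the Python sketch below. It assumes the standard token-level definitions. The function name and the toy tag sequences are hypothetical and do not reproduce any value from the table.

```python
def tag_precision_recall(gold, pred, label):
    """Token-level precision and recall for one label over flat tag lists."""
    true_positive = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    predicted = sum(1 for p in pred if p == label)   # tokens the model tagged with label
    actual = sum(1 for g in gold if g == label)      # tokens annotated with label
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall

gold_tags = ["B-protein", "O", "B-protein", "O"]
pred_tags = ["B-protein", "B-protein", "O", "O"]
print(tag_precision_recall(gold_tags, pred_tags, "B-protein"))
```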
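Why entity-level scores fall below token-level scores can be seen in a small Python sketch. It assumes that a complete entity counts as correct only when its boundaries and class both match exactly, which is a common convention and may differ from the authors' merging procedure. All names and data are hypothetical.

```python
def entities(tags):
    """Return the set of (start, end_exclusive, class) spans encoded by BIO tags."""
    spans = set()
    start, cls = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and start is not None and tag[2:] == cls
        if continues:
            continue
        if start is not None:
            spans.add((start, i, cls))
            start, cls = None, None
        if tag.startswith("B-"):
            start, cls = i, tag[2:]
    return spans

def entity_precision_recall(gold_tags, pred_tags):
    """Exact-match precision and recall over complete entities."""
    gold, pred = entities(gold_tags), entities(pred_tags)
    matched = len(gold & pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

gold_seq = ["B-protein", "I-protein", "O", "B-DNA"]
pred_seq = ["B-protein", "O", "O", "B-DNA"]
print(entity_precision_recall(gold_seq, pred_seq))
```

In this toy case three of the four tags are correct, yet only one of the two entities is recovered, because one wrong tag inside an entity invalidates the whole entity.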
Table 1. Performance of original results through CRF++ over the GENIA corpus
Literature associated with autism was downloaded from the PubMed website with the keyword autism; the total number of abstracts is 42,997, containing 6,684,747 tokens. This literature, together with the model learned from the GENIA corpus, was given as input to the CRF++ tool, and the annotations of the literature were predicted. Finally, the knowledge discovery programs derive the knowledge related to autism from the prediction results. The model learned from the GENIA corpus represents the rules for the 6 classes, so we can easily identify the protein, DNA, RNA, cell line, cell type, and cell component terms in the original text and count their frequencies. Table 3 lists the top ten terms of each of the 6 classes with their frequencies, and all the terms can also be found on the project website. Four classes of terms shown in Table 3, namely protein, DNA, cell type, and cell component, are highly representative from the frequency perspective. In the following discussion, we elaborate on the validation of these four classes of terms.
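Counting term frequencies from predicted tags can be sketched in Python as follows. This is an illustration, not the authors' knowledge discovery program: the function name, the input format of (token, tag) pairs, and the toy data are assumptions.

```python
from collections import Counter

def top_terms(tagged_tokens, k=10):
    """Map each class to its k most frequent terms as (term, frequency) pairs."""
    counts = {}
    current, cls = [], None
    # The trailing ("", "O") pair flushes a term that ends the sequence.
    for token, tag in list(tagged_tokens) + [("", "O")]:
        if tag.startswith("I-") and current and tag[2:] == cls:
            current.append(token)
            continue
        if current:
            counts.setdefault(cls, Counter())[" ".join(current)] += 1
        if tag.startswith("B-"):
            current, cls = [token], tag[2:]
        else:
            current, cls = [], None
    return {c: counter.most_common(k) for c, counter in counts.items()}

predicted = [
    ("IL-6", "B-protein"), ("is", "O"),
    ("X", "B-DNA"), ("chromosome", "I-DNA"), ("and", "O"),
    ("IL-6", "B-protein"),
]
print(top_terms(predicted))
```

Surface variants of the same term are counted separately by this sketch, so any merging of variants has to happen in a later normalization step.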
Table 2. Performance of complete entities merged from the GENIA corpus
4. Discussions
We concentrate on the 4 classes of terms whose frequencies are larger than 30, namely protein, DNA, cell type, and cell component. Among the proteins, interleukin 6 (IL-6) has the highest frequency, 232. This strongly suggests that IL-6 is very likely related to autism, and direct evidence can be found in the literature [7]. That article reports that IL-6 is increased in the cerebellum of the autistic brain and alters neural cell adhesion, migration, and synaptic formation. IL-6 thus plays a crucial role in the development of autism. Recent evidence shows that localized inflammation of the central nervous system (CNS) may lead to autism, and IL-6 contributes to this process. The other proteins in Table 3 can likewise be shown to be highly correlated with autism.
Among the DNAs in Table 3, we focus on the one with the top frequency, X chromosome. In fact, X chromosome is the same as chromosome X, so its frequency should be 103. The study in [7], which used a home-made X-chromosome-specific microarray covering the whole human X-chromosome at high resolution, showed that X-linked (XL) inheritance or maternal skewed X-chromosome inactivation (XCI) is associated with autism. Evidence in [8] also indicates the presence of X-linked susceptibility genes in humans with autism and concludes that the TBL1X gene on the X-chromosome may play a role in autism risk. These findings support a high correlation between autism and the X-chromosome.
Microglia, a cell type, have also been shown to be important in autism [9, 10]. Microglia are critical to the development of normal neural networks, and abnormal microglia are often present in autism. Maternal immune activation and microglial dysfunction in the developing brain have been supported by mounting evidence and are leading to potential treatment options.
Table 3. Top 10 terms of the six classes and their frequencies
The values in parentheses represent frequency.
The component nucleus, including the caudate nucleus, reticular thalamic nucleus, bed nucleus, supraoptic nucleus, and paraventricular nucleus, is a cluster of cell bodies of neurons in the central nervous system. Autism is a complex disorder of the central nervous system, and the condition has a wide range of severity along its spectrum. In addition, nucleus is annotated as a cell component in the GENIA corpus, so occurrences of nucleus in the literature can be identified precisely.
As described above, we validate the knowledge extracted from the original literature associated with autism and construct the knowledge base, which can answer at least four questions in a QA system: which proteins are most related to autism, which DNAs play an important role in the development of autism, which cell types are correlated with autism, and which cell components participate in the process leading to autism.
5. Conclusions
This work attempts to construct a knowledge base associated with autism using CRF learning, a widely used probabilistic statistical method. First, we extract 6 classes of molecular information (protein, DNA, RNA, cell line, cell type, and cell component) from the GENIA corpus and format them into samples to feed to the CRF++ tool. Second, we use the model learned from the GENIA corpus to find the 6 classes of molecular information in the literature related to autism. Third, a knowledge discovery program identifies what is most highly correlated with the development of autism and its therapy. If we construct a QA system, it can answer at least four questions: which proteins are related to autism, which DNAs play an important role in the development of autism, which cell types are correlated with autism, and which cell components participate in the process leading to autism.
Acknowledgement
This work was supported by project No. 16KJD52003 of the Jiangsu Province Education Department.