- Research article
- Open Access
Identification of genomic features in the classification of loss- and gain-of-function mutation
© Jung et al.; licensee BioMed Central Ltd. 2015
- Published: 20 May 2015
Alterations of a genome can lead to changes in protein functions. Through these genetic mutations, a protein can lose its native function (loss-of-function, LoF), or it can confer a new function (gain-of-function, GoF). However, when a mutation occurs, it is difficult to determine whether it will result in a LoF or a GoF. Therefore, in this paper, we propose a study that analyzes the genomic features of LoF and GoF instances to find features that can be used to classify LoF and GoF mutations.
To collect experimentally verified LoF and GoF mutational information, we obtained 816 LoF mutations and 474 GoF mutations from a literature text-mining process. After data-preprocessing steps, 258 LoF and 129 GoF mutations remained for further analysis. We analyzed the properties of these LoF and GoF mutations. Among the properties, we selected features that show different tendencies between the two groups and ran classifications using support vector machine, random forest, and linear logistic regression methods to confirm whether these features can identify LoF and GoF mutations.
We analyzed the properties of the LoF and GoF mutations and identified six features which have discriminative power between LoF and GoF conditions: the reference allele, the substituted allele, mutation type, mutation impact, subcellular location, and protein domain. When using the six selected features with the random forest, support vector machine, and linear logistic regression classifiers, the result showed accuracy levels of 72.23%, 71.28%, and 70.19%, respectively.
We analyzed LoF and GoF mutations and selected several properties which were different between the two classes. By implementing classifications with the selected features, it is demonstrated that the selected features have good discriminative power.
- Support Vector Machine
- Subcellular Location
- Random Forest
- Hypergeometric Test
- Fumarate Hydratase
A mutation refers to a change of the genomic sequence, which contains all of the genetic information of an organism. Because proteins are generated and regulated based on the genome sequence, alterations of the genome can lead to changes in protein functions. Through these genetic mutations, a protein can lose its native function (loss-of-function, LoF), or it can confer a new function (gain-of-function, GoF) [2–5]. For example, a mutated fumarate hydratase (FH) loses its native catalytic activity, and heterozygous point mutations in isocitrate dehydrogenase (IDH1, IDH2) confer a new metabolic enzymatic activity that produces 2-hydroxyglutarate [7, 8]. In addition, in the FGFR1 gene, GoF and LoF mutations lead to different diseases, craniosynostosis and Kallmann syndrome, respectively [9–12]. Therefore, it is important to understand the characteristics of functional mutations and to determine which mutations lead to LoF and which to GoF outcomes for identifying clinical targets.
There are many studies of mutations, including LoF and GoF mutations. MacArthur et al. carried out a systematic survey of LoF variants. They described many properties of LoF variants relative to other mutations, such as the allele frequency and the degree of association with diseases. They also showed the effects of LoF variants on phenotypes, diseases, and gene expression. However, missense mutations were excluded from their definition of LoF mutations. In our study, many mutations were missense mutations; therefore, we chose to address missense mutations. Reva et al. estimated the functional effects of missense mutations using evolutionary conservation information. Lee et al. presented the bi-directional SIFT (B-SIFT), a modified form of SIFT. The B-SIFT algorithm calculates scores of mutation alleles based on evolutionary conservation information. They used the scores to identify mutations that cause hyperactivation or gain-of-function outcomes, whereas our work uses not only the functional effects of mutations but also several other properties. Moreover, most previous studies focused on either LoF or GoF mutations alone, or on functional changes in a specific gene.
Mutation information with experimentally characterized LoF and GoF outcomes was collected from the literature, specifically from PubMed. First, we searched for all PubMed abstracts containing any of the acronyms "GOF" and "LOF" or the phrases "gain of function," "gain-of-function," "loss of function," and "loss-of-function" as keywords. Then, for each abstract, we found all related genes and the sentences containing those genes using Gene2Pubmed. Next, we tagged the mutations and their locations using tmVar, a previously published tool for extracting mutation information from the literature. tmVar tags substitutions, insertions, deletions, SNPs, and frameshift mutations in DNA and protein sequences; we also used the CRF features described in the corresponding reference.
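The keyword step described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function name `label_abstract`, the case-insensitive matching, and the whole-word matching of the acronyms are assumptions made for illustration only.

```python
import re

# Patterns for the acronyms and phrases listed in the text.
# Assumption: matching is case-insensitive, acronyms are matched as whole
# words, and the phrases may be written with hyphens or spaces.
KEYWORDS = {
    "GoF": re.compile(r"\bGOF\b|\bgain[- ]of[- ]function\b", re.IGNORECASE),
    "LoF": re.compile(r"\bLOF\b|\bloss[- ]of[- ]function\b", re.IGNORECASE),
}


def label_abstract(text):
    """Return the set of classes ('GoF', 'LoF') whose keywords occur in text."""
    return {label for label, pattern in KEYWORDS.items() if pattern.search(text)}
```

An abstract that mentions both classes receives both labels, so a downstream step would still have to decide which sentence (and which mutation) belongs to which class.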
Once the LoF and GoF mutation information was collected from the literature, the mutation data set was preprocessed for further analysis. First, we selected LoF and GoF mutations that were published after the year 2010 in order to filter out mutations that had been identified against older versions of the reference genome. Second, mutations described at the amino acid level were converted into mutations with a genomic location using the exon and intron information in Consensus CDS (CCDS, using GRCh37.p13) [16–18], so that we could observe the differences between LoF and GoF mutations at the nucleotide sequence level. Amino acid residues were also converted into 3-mer nucleotide alleles by combining the CCDS nucleotide sequence with the amino acid codon table. In addition, substitution mutations were subgrouped into missense, nonsense, and silent mutations according to the resulting amino acid residue. During this step, silent mutations were removed. After these preprocessing steps, 258 LoF mutations and 129 GoF mutations remained.
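The subgrouping of substitutions into missense, nonsense, and silent mutations can be sketched as follows. This is an illustrative sketch only: it uses the standard genetic code, and the names `CODON_TABLE` and `classify_substitution` are hypothetical rather than taken from the study.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon substitution as 'silent', 'nonsense', or 'missense'."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"    # same residue; such mutations were removed
    if alt_aa == "*":
        return "nonsense"  # substitution introduces a stop codon
    # Note: a stop codon changed to an amino acid also lands here in this sketch.
    return "missense"
```

For example, `classify_substitution("CCT", "CGT")` corresponds to a proline-to-arginine change and is classified as a missense mutation.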
In order to evaluate the significance of missense mutation effects on protein functions, the functional impact scores (FIS) of the LoF and GoF mutations were evaluated using the FIS method, which calculates the significance scores of point mutations based on evolutionary conservation of the mutation sites. We used the following information as the input of the FIS method: hg version, chromosome, mutation location, reference allele, and substituted allele.
To examine the relationships between protein domain functions and the LoF and GoF mutations, we used the Pfam database, which provides large-scale protein domain information. We found the protein domains corresponding to the LoF and GoF mutation locations and derived the distribution of protein domains for each of the two classes, with all 55,931 human genes as the background. With this information, we performed a hypergeometric test to find the protein domains with significantly different distributions.
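As an illustration of the hypergeometric test, the sketch below computes an upper-tail p-value for one domain. The text does not state exactly how the test was parameterized, so the mapping of the arguments is an assumption: `N` is the background size, `K` the number of background items carrying the domain, `n` the number of items in one mutation class, and `k` the number of those carrying the domain. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N population, K successes, n draws)."""
    total = comb(N, n)
    upper = min(K, n)  # X cannot exceed either the successes or the draws
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total
```

A small p-value for a domain in one class would suggest that the domain is over-represented among that class relative to the background; when many domains are tested, a multiple-testing correction would normally be considered as well.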
Input format example.
Tagged mutations from the literature
We obtained 14,259 gene-sentence relationships for GoF and 29,586 relationships for LoF. From these relationships, genes which did not have sentences were removed. As a result, we obtained 2,142 sentences for GoF and 4,600 sentences for LoF. Next, tmVar found 474 mutations for GoF and 816 mutations for LoF. Consequently, we obtained 474 mutations from 2,142 sentences for GoF and 816 mutations from 4,600 sentences for LoF.
Next, we extracted mutation subtypes from the LoF and GoF mutations and compared their distributions. In this work, we used six types of mutations: missense, nonsense, deletion, indel, duplication, and frameshift. Figure 3B shows the distribution of the mutation subtypes of LoF and GoF. MacArthur et al. studied LoF variants, but did not focus on missense mutations. However, our study shows that the missense mutation is the most frequent type for both LoF and GoF mutations. This indicates that missense mutations also account for an important proportion of the mutations which affect protein functions. The second most frequent type is the nonsense mutation for LoF and the deletion mutation for GoF. These results indicate that nonsense mutations usually lead to a protein with a loss of function rather than a gain of a new function.
Reference and substituted allele ratio
Classification of LoF versus GoF with selected features
By mining the literature, 14,259 gene-sentence relationships for GoF and 29,586 relationships for LoF were collected. From these, genes without sentences were removed. Thus, 2,142 sentences for GoF and 4,600 sentences for LoF remained. We then tagged the mutations and their locations from the sentences. Consequently, we obtained 474 mutations from 2,142 sentences for GoF and 816 mutations from 4,600 sentences for LoF. In addition, during the data-preprocessing step, mutations whose reference allele was not matched with a reference genome or mutations published before 2009 were removed. Hence, we analyzed 258 LoF mutations and 129 GoF mutations. As a result, we found six features which can distinguish LoF and GoF mutations: the subcellular location, the mutation subtype, the reference and substituted allele, the functional impact, and the protein domain. We used these features for classification to confirm whether or not they can identify LoF and GoF mutations. Finally, we obtained 72.23% accuracy for the random forest, 71.28% accuracy for the support vector machine, and 70.19% accuracy for the linear logistic regression methods, with AUC values of 0.7880, 0.7128, and 0.7646, respectively. We therefore conclude that the selected features can contribute to the identification of LoF and GoF mutations.
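For readers less familiar with the two reported metrics, the following minimal sketch shows how accuracy and AUC can be computed from a classifier's outputs. It is not the evaluation code of the study; the function names and the 0/1 label encoding are assumptions for illustration, and the AUC is computed with the pairwise-ranking definition (ties count as half).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive (label 1) outscores a random negative (label 0)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Each positive/negative pair contributes 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the two classes are imbalanced (258 LoF versus 129 GoF mutations), AUC is a useful complement to accuracy: it depends on the ranking of the scores rather than on a single decision threshold.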
Since the LoF and GoF mutation data were derived from the literature, the number of mutations was limited and was not sufficient to capture the overall tendencies of LoF and GoF mutations. In addition, although we selected mutations that were published after the year 2010, some mutations did not match the reference genome. In this work, we studied the properties of LoF and GoF mutations, and we expect that this study can contribute to a better understanding of the effects of mutations on biological systems. By analyzing the associations between mutations and protein functions, and how the affected proteins influence biological pathways, the biological mechanisms leading from mutations to system-level effects can be clarified.
This work was supported by GIST College's 2014 GUP Research Fund and by the "Systems Biology Infrastructure Establishment Grant" provided by the Gwangju Institute of Science & Technology in 2014.
Publication costs for the article were sourced from GIST College's 2014 GUP Research Fund and from the "Systems Biology Infrastructure Establishment Grant" provided by the Gwangju Institute of Science & Technology in 2014.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 15 Supplement 1, 2015: Proceedings of the ACM Eighth International Workshop on Data and Text Mining in Biomedical Informatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedinformdecismak/supplements/15/S1.
- Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, Xie M, Zhang Q, McMichael JF, Wyczalkowski MA, et al: Mutational landscape and significance across 12 major cancer types. Nature. 2013, 502 (7471): 333-339. 10.1038/nature12634.
- Benjannet S, Hamelin J, Chretien M, Seidah NG: Loss- and gain-of-function PCSK9 variants: cleavage specificity, dominant negative effects, and low density lipoprotein receptor (LDLR) degradation. The Journal of biological chemistry. 2012, 287 (40): 33745-33755. 10.1074/jbc.M112.399725.
- Lee W, Zhang Y, Mukhyala K, Lazarus RA, Zhang Z: Bi-directional SIFT predicts a subset of activating mutations. PloS one. 2009, 4 (12): e8311. 10.1371/journal.pone.0008311.
- MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, et al: A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012, 335 (6070): 823-828. 10.1126/science.1215040.
- Strano S, Dell'Orso S, Mongiovi AM, Monti O, Lapi E, Di Agostino S, Fontemaggi G, Blandino G: Mutant p53 proteins: between loss and gain of function. Head & neck. 2007, 29 (5): 488-496. 10.1002/hed.20531.
- Xiao M, Yang H, Xu W, Ma S, Lin H, Zhu H, Liu L, Liu Y, Yang C, Xu Y, et al: Inhibition of alpha-KG-dependent histone and DNA demethylases by fumarate and succinate that are accumulated in mutations of FH and SDH tumor suppressors. Genes & development. 2012, 26 (12): 1326-1338. 10.1101/gad.191056.112.
- Lu C, Ward PS, Kapoor GS, Rohle D, Turcan S, Abdel-Wahab O, Edwards CR, Khanin R, Figueroa ME, Melnick A, et al: IDH mutation impairs histone demethylation and results in a block to cell differentiation. Nature. 2012, 483 (7390): 474-478. 10.1038/nature10860.
- Xu W, Yang H, Liu Y, Yang Y, Wang P, Kim SH, Ito S, Yang C, Wang P, Xiao MT, et al: Oncometabolite 2-hydroxyglutarate is a competitive inhibitor of alpha-ketoglutarate-dependent dioxygenases. Cancer cell. 2011, 19 (1): 17-30. 10.1016/j.ccr.2010.12.014.
- Dode C, Levilliers J, Dupont JM, De Paepe A, Le Du N, Soussi-Yanicostas N, Coimbra RS, Delmaghani S, Compain-Nouaille S, Baverel F, et al: Loss-of-function mutations in FGFR1 cause autosomal dominant Kallmann syndrome. Nature genetics. 2003, 33 (4): 463-465. 10.1038/ng1122.
- Hu Y, Bouloux PM: Novel insights in FGFR1 regulation: lessons from Kallmann syndrome. Trends in endocrinology and metabolism: TEM. 2010, 21 (6): 385-393. 10.1016/j.tem.2010.01.004.
- Ibrahimi OA, Zhang F, Eliseenkova AV, Linhardt RJ, Mohammadi M: Proline to arginine mutations in FGF receptors 1 and 3 result in Pfeiffer and Muenke craniosynostosis syndromes through enhancement of FGF binding affinity. Human molecular genetics. 2004, 13 (1): 69-78.
- Zhou YX, Xu X, Chen L, Li C, Brodie SG, Deng CX: A Pro250Arg substitution in mouse Fgfr1 causes increased expression of Cbfa1 and premature fusion of calvarial sutures. Human molecular genetics. 2000, 9 (13): 2001-2008. 10.1093/hmg/9.13.2001.
- Reva B, Antipin Y, Sander C: Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic acids research. 2011, 39 (17): e118. 10.1093/nar/gkr407.
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic acids research. 2007, 35 (Database): D26-31. 10.1093/nar/gkl993.
- Wei CH, Harris BR, Kao HY, Lu Z: tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics. 2013, 29 (11): 1433-1439. 10.1093/bioinformatics/btt156.
- Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, et al: Current status and new features of the Consensus Coding Sequence database. Nucleic acids research. 2014, 42 (Database): D865-872.
- Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, et al: Tracking and coordinating an international curation effort for the CCDS Project. Database : the journal of biological databases and curation. 2012, 2012: bas008.
- Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, et al: The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome research. 2009, 19 (7): 1316-1323. 10.1101/gr.080531.108.
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al: Pfam: the protein families database. Nucleic acids research. 2014, 42 (Database): D222-230.
- Bakheet TM, Doig AJ: Properties and identification of human protein drug targets. Bioinformatics. 2009, 25 (4): 451-457. 10.1093/bioinformatics/btp002.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009, 11 (1): 10-18. 10.1145/1656274.1656278.
- UniProt C: Activities at the Universal Protein Resource (UniProt). Nucleic acids research. 2014, 42 (Database): D191-198.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.