Extraction Method of New Features of DNA Sequence and Its Application in Recombination Spots Identification
Cheng Lirong (程丽荣), Zhao Xiqiang (赵熙强)
Abstract:
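To improve prediction performance in recombination spot identification, this paper proposes a new feature extraction method. Two new groups of features representing DNA sequences are obtained, one from 3-gram vectors encoded by a Word2Vec model and one from DNA properties. These are combined with existing features (obtained with a FastText model) to represent each DNA sequence. A support vector machine is used as the classification algorithm, and 5-fold cross-validation is performed on a benchmark dataset. The results show that the proposed method identifies recombination spots with a sensitivity of 93.88%, a specificity of 95.08%, an accuracy of 94.54%, and a Matthews correlation coefficient of 0.8902, all better than existing methods. The proposed method offers a new approach to extracting sequence information in biology.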
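As an illustration only, the following minimal sketch shows how a DNA sequence might be split into 3-gram "words" before being passed to an embedding model, and how the four reported metrics follow from confusion-matrix counts. The function names `three_grams` and `binary_metrics`, the overlapping (stride-1) windowing, and the example counts are assumptions made for this sketch; they are not taken from the paper and do not reproduce its results.

```python
import math


def three_grams(seq):
    """Split a DNA sequence into overlapping 3-grams (stride 1 is an assumption)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical toy inputs, chosen only to exercise the functions.
words = three_grams("ATGCGT")
scores = binary_metrics(tp=9, fn=1, tn=8, fp=2)
print(words)
print(scores)
```

In a fuller pipeline, the 3-gram lists would serve as the "sentences" for training a word-embedding model, and the counts would come from the held-out fold of each cross-validation round.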
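Keywords: DNA sequence; recombination spots; Word2Vec model; word vector; 3-gram; dinucleotide properties; support vector machine
Foundation: Supported by the National Natural Science Foundation of China (11271341)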