Journal of Guangdong University of Technology, 2023, Vol. 40, Issue (01): 1-9. doi: 10.12052/gdutxb.220055

Prediction Method of Gene Methylation Sites Based on LSTM with Compound Coding Characteristics

Liu Dong-ning, Wang Zi-qi, Zeng Yan-jiao, Wen Fu-yan, Wang Yang

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2022-03-23 Online: 2023-01-25 Published: 2023-01-12
  • Corresponding author: Wang Yang (b. 1988), male, assistant researcher, Ph.D.; main research interest: bioinformatics. E-mail: cswangyang@aliyun.com
  • About the author: Liu Dong-ning (b. 1979), male, professor, Ph.D.; main research interests: databases and collaborative computing. E-mail: liudn@gdut.edu.cn
  • Funding:
    National Natural Science Foundation of China, General Program (62072120)

Abstract: DNA N6-methyladenine (6-mA) methylation is one of the most important epigenetic modification markers. Aberrant 6-mA sites can affect gene expression and lead to serious diseases, so predicting 6-mA sites is of great significance for understanding pathogenesis and for treating disease. This paper proposes a long short-term memory (LSTM) neural network that uses a compound feature encoding built from the K-mer and one-hot methods to predict gene methylation sites. First, K-mer encoding increases the information carried by each character of the gene sequence. Second, one-hot encoding expands the K-mer-encoded character sequence into a composite encoding matrix. This matrix increases the dimensions and types of features that the LSTM model can extract from gene sequence data, which improves its performance on gene sequences. Cross-validation experiments show that the proposed method achieves an accuracy of 93.7% on public benchmark datasets, with sensitivity, specificity and Matthews correlation coefficient of 93.0%, 94.5% and 0.875, respectively, all better than existing 6-mA prediction methods. On gene datasets from six other species, the area under the receiver operating characteristic curve (AUC) ranges from 0.9055 to 0.9262, which indicates that the method is applicable to methylation site prediction in animals, plants and microorganisms. The method was also applied to predict every base position of the rice gene sequence NC_029258.1, and the predictions were checked against four different online prediction tools. Between 86% and 96% of the potential methylation sites predicted by the method were also supported by these tools, indicating that the predictions are reliable and that the method can be applied to the prediction and analysis of methylation sites in gene sequences.

Key words: methylation site prediction, deep learning, long short-term memory network, compound features
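
The compound encoding described in the abstract (K-mer tokenisation followed by one-hot expansion) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the value k = 3, the stride of 1, the all-zero handling of ambiguous bases, and the function names `kmer_vocabulary` and `composite_encode` are all assumptions made for illustration, since the abstract does not state these details.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"


def kmer_vocabulary(k):
    """Map every possible k-mer over ACGT to a column index (hypothetical helper)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def composite_encode(sequence, k=3):
    """K-mer tokenisation (stride 1, assumed) followed by one-hot expansion.

    Returns a matrix of shape (len(sequence) - k + 1, 4 ** k); each row is
    one k-mer and could serve as one time step of an LSTM input.
    """
    sequence = sequence.upper()
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    vocab = kmer_vocabulary(k)
    n_tokens = len(sequence) - k + 1
    matrix = np.zeros((n_tokens, len(vocab)), dtype=np.float32)
    for row in range(n_tokens):
        token = sequence[row:row + k]
        col = vocab.get(token)
        if col is not None:  # k-mers containing other symbols (e.g. N) stay all-zero
            matrix[row, col] = 1.0
    return matrix


encoded = composite_encode("ACGTAGCA", k=3)
print(encoded.shape)  # (6, 64)
```

With k = 3, each row has 64 columns instead of the 4 columns of plain one-hot encoding, which is one way to read the abstract's claim that the composite matrix exposes more feature dimensions to the LSTM.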

CLC number: TP301.6