Journal of Systems Engineering and Electronics, 2010, Vol. 32, Issue (6): 1318-1324. doi: 10.3969/j.issn.1001-506X.2010.06.043

• Software, Algorithms and Simulation •

Association multi-classification algorithm based on protein secondary structure sequence

YANG Bing-ru (杨炳儒), ZHOU Zhun (周谆), HOU Wei (侯伟)

  1. Information Engineering School, Univ. of Science and Technology Beijing, Beijing 100083, China
  • Online: 2010-06-28 Published: 2010-01-03

Abstract:

Protein secondary structure prediction is widely recognized as a hard, internationally studied problem in bioinformatics. Building on extended research into the knowledge discovery theory based on inner cognitive mechanism (KDTICM) and on the knowledge discovery in database* (KDD*) model, this paper proposes SAC (structural association classification), a multi-classification algorithm based on structure sequences that can effectively address protein secondary structure prediction. By refining the knowledge base with a threshold (a support threshold in the Chinese abstract, a confidence threshold in the published English one), the algorithm reaches a prediction accuracy above 85%. With SAC as its kernel, a protein secondary structure prediction model, the compound pyramid model, is built. Experiments show prediction accuracy above 80% on the RS126, CB513 and ILP datasets, which the authors report as matching or exceeding known domestic and international mainstream levels.
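
The abstract does not spell out how SAC works, so the following is only a generic sketch of association-based classification with threshold pruning, not the authors' algorithm. It assumes, purely for illustration, that rule antecedents are fixed-width residue windows and that rule consequents are three-state labels (H, E, C) for the window's central residue. The names `mine_rules`, `predict`, `window`, `min_support`, `min_confidence` and `default` are hypothetical.

```python
from collections import Counter, defaultdict


def mine_rules(sequences, labels, window=3, min_support=2, min_confidence=0.6):
    """Mine 'residue window -> state' rules and prune them by thresholds.

    Assumption: support is the absolute number of times a window occurs
    with its majority state; confidence is that count divided by all
    occurrences of the window. The paper's definitions may differ.
    """
    half = window // 2
    counts = defaultdict(Counter)
    for seq, lab in zip(sequences, labels):
        # Slide a window over the sequence; the label of the central
        # residue is the class the window votes for.
        for i in range(half, len(seq) - half):
            counts[seq[i - half:i + half + 1]][lab[i]] += 1

    rules = {}
    for pattern, by_state in counts.items():
        state, hits = by_state.most_common(1)[0]
        confidence = hits / sum(by_state.values())
        # Keep only rules that pass both thresholds (knowledge-base pruning).
        if hits >= min_support and confidence >= min_confidence:
            rules[pattern] = (state, hits, confidence)
    return rules


def predict(seq, rules, window=3, default="C"):
    """Label each residue with its matching rule's state, else `default`."""
    half = window // 2
    out = []
    for i in range(len(seq)):
        inside = half <= i < len(seq) - half
        rule = rules.get(seq[i - half:i + half + 1]) if inside else None
        out.append(rule[0] if rule else default)
    return "".join(out)


# Toy data, invented for illustration only.
train_seqs = ["AAAGG", "AAAGA"]
train_labs = ["HHHCC", "HHHCE"]
rules = mine_rules(train_seqs, train_labs)
prediction = predict("AAAGG", rules)
```

Raising `min_support` or `min_confidence` shrinks the rule set, which is the general trade-off behind threshold-based refinement: fewer but more reliable rules, at the cost of more residues falling back to the default state.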