Systems Engineering and Electronics


Computation of document similarity based on metadata and domain concept tree

ZHANG Peiyun1,2, CHEN Enhong2, XIE Rongjian3, CONG Xiuwen1, HUANG Bo4   

  (1. School of Mathematics and Computer Science, Anhui Normal University, Wuhu 241003, China;
    2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China;
    3. School of Management, University of Science and Technology of China, Hefei 230026, China;
    4. School of Computer Science & Technology, Nanjing University of Science and Technology, Nanjing 210094, China)
  • Online: 2014-03-24  Published: 2010-01-03

Abstract:

With the rapid development of network and information technology, a large number of electronic documents have appeared on the network, and computing the similarity between documents is an important means of document processing. For large-scale document collections, the vector space model (VSM) is usually used for document representation, but it suffers from high dimensionality and a lack of semantic similarity. An improved method for calculating document similarity is proposed. Representative metadata feature vectors are selected from the large feature space, which reduces the dimension of the vector space. A domain concept tree is constructed, and an algorithm for computing document similarity is designed on top of it. To improve the algorithm's performance on semantic similarity, synonymous concepts, which are widespread within domains, are also processed. The experimental results show that the proposed method improves the performance of document similarity computation through dimensionality reduction and concept similarity computation.
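
The abstract does not give the paper's formulas, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes a small hypothetical concept tree (`PARENT`), a hypothetical synonym table (`SYNONYMS`), and a Wu-Palmer-style depth measure swapped in for the unspecified concept similarity. Each document is represented by a short list of metadata concepts instead of a full term vector.

```python
# Hypothetical domain concept tree: child -> parent (the root has parent None).
PARENT = {
    "computing": None,
    "machine learning": "computing",
    "databases": "computing",
    "neural network": "machine learning",
    "clustering": "machine learning",
}

# Hypothetical synonym table mapping variants to one canonical concept.
SYNONYMS = {"neural net": "neural network", "ml": "machine learning"}


def canonical(term):
    """Lower-case a term and replace it by its canonical concept, if any."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def path_to_root(concept):
    """Return the list [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def concept_similarity(a, b):
    """Wu-Palmer-style score: 2*depth(lca) / (depth(a) + depth(b)), root depth 1.

    This measure is an assumption; the abstract does not state the formula.
    """
    a, b = canonical(a), canonical(b)
    if a not in PARENT or b not in PARENT:
        # Concepts outside the tree only match themselves.
        return 1.0 if a == b else 0.0
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # First node on a's path that is also an ancestor of b (lowest common ancestor).
    lca = next(c for c in path_a if c in ancestors_b)
    return 2 * len(path_to_root(lca)) / (len(path_a) + len(path_b))


def document_similarity(doc_a, doc_b):
    """Symmetric average of best-match concept similarities between two documents."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src, dst):
        return sum(max(concept_similarity(s, d) for d in dst) for s in src) / len(src)

    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2


if __name__ == "__main__":
    print(document_similarity(["neural net", "clustering"], ["ML", "databases"]))
```

Because each document is reduced to a handful of metadata concepts, the comparison cost depends on the number of selected concepts and not on the vocabulary size, and synonyms such as "neural net" and "neural network" score as identical even though a plain term-matching VSM would treat them as unrelated.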
