Life and Environmental Science Laboratory, Murakami Seminar

We presented at the 11th Joint Conference on Informatics in Biology, Medicine and Pharmacology (IIBMP2022)!

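Satsuki Ueda (first-year graduate student) gave a poster presentation titled “Analysis of the relationship between the distributed representations of protein sequences and their structures” at the 11th Joint Conference on Informatics in Biology, Medicine and Pharmacology (IIBMP2022).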

【Abstract】 In recent years, distributed representations of protein sequences have been applied in various studies, and the effectiveness of this natural language processing technique has been reported. However, how accurately such representations capture the structural or functional characteristics of proteins has not been analyzed in detail. We therefore analyze the relationship between protein structures and the distributed representations of protein sequences embedded by Doc2Vec, a method for embedding a sentence into a fixed-length vector. This study focuses only on human proteins. First, a Doc2Vec model is trained on non-redundant human protein sequences obtained from UniProtKB/Swiss-Prot. Next, this model is used to embed the human protein sequences classified in SCOPe into fixed-length vectors. Principal component analysis (PCA) is then performed on these vectors to analyze how the structural characteristics of the proteins are captured within them. When the high-dimensional vectors produced by the Doc2Vec model were projected into a low-dimensional space using PCA, they were confirmed to group according to the class level of SCOPe, such as the all-alpha and all-beta protein classes. We also report how changes in the number of vector dimensions and the number of training iterations affect this classification.
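Doc2Vec treats each input as a document made of word tokens, so a protein sequence has to be split into "words" before it can be embedded. The abstract does not say how the sequences were tokenized. The sketch below assumes overlapping k-mers with k = 3, one common choice, purely for illustration. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers ("words").

    Assumption: overlapping k-mers with stride 1; the study's actual
    tokenization is not stated in the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.strip().upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_corpus(sequences: Dict[str, str], k: int = 3) -> List[Tuple[str, List[str]]]:
    """Pair each sequence ID with its token list, one "document" per protein."""
    return [(seq_id, kmer_tokens(seq, k)) for seq_id, seq in sequences.items()]


# Hypothetical toy sequences; real input would be the non-redundant human
# protein sequences from UniProtKB/Swiss-Prot mentioned in the abstract.
toy_sequences = {
    "protein_A": "MKTAYIAKQR",
    "protein_B": "MSLLTEVETY",
}

corpus = build_corpus(toy_sequences, k=3)
for seq_id, tokens in corpus:
    print(seq_id, tokens[:4])
```

Each (ID, token list) pair is the kind of input a Doc2Vec implementation expects as one tagged document. The resulting fixed-length vectors would then be passed to PCA, as the abstract describes. The choice of k is one more parameter that could be varied alongside the vector dimensions and the number of training iterations.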