Searching for patterns of thermostability in proteins and defining the main features contributing to enzyme thermostability through screening, clustering, and decision tree algorithms

Ebrahimie, E.; Ebrahimi, M.

dc.contributor.author	Ebrahimie, E.
dc.contributor.author	Ebrahimi, M.
dc.contributor.author	Ebrahimi, M.
dc.date.accessioned	2010-02-09T14:11:20Z
dc.date.available	2010-02-09T14:11:20Z
dc.date.issued	2010-02-09T14:11:20Z
dc.description.abstract	Finding or making thermostable enzymes has been identified as an important goal in a number of industries. Understanding the features involved in enzyme thermostability is therefore crucial, and different approaches have been used to extract or manufacture thermostable enzymes. Herein we examined features that contribute to the thermostability of 2,946 proteins. We used various screening techniques (anomaly detection, feature selection), clustering methods (K-Means, TwoStep cluster), decision tree models (Classification and Regression Tree, CHAID, Exhaustive CHAID, QUEST, C5.0), and generalized rule induction (GRI) association models to search for patterns of thermostability and to find features that contribute to enzyme thermal stability. Arg as the N-terminal amino acid was found solely in proteins working at temperatures higher than 70 °C. Fifty-four protein features were shown to be important in feature selection modeling, and the number of peer groups with an anomaly index of 2.12 declined from 7 to 2 when the model was run using only the selected important features; however, no changes were found in the numbers of groups when K-Means and TwoStep clustering modeling was performed on datasets with or without feature selection filtering. The depth of the trees generated by the various decision tree models ranged from 4 branches (in CHAID models) to 14 branches (in the C5.0 model with 10-fold cross-validation and with feature selection of the dataset). The performance evaluation of the decision tree models tested here showed that C5.0 was the best and QUEST was the worst. We did not find any significant difference in the percentage of correctness, performance evaluation, and mean correctness of the various decision tree models when feature-selected datasets were used, but the number of peer groups in clustering models was reduced significantly (p<0.05) compared to datasets without feature selection.
In all decision tree models, the frequency of Gln was the most important feature for decision tree rule sets; moreover, in all GRI association rules (100 rules), the frequency of Gln was used in the antecedent to support the rules. The importance of Gln in protein thermostability is discussed in this paper.	en
dc.identifier.uri	http://hdl.handle.net/2003/26684
dc.identifier.uri	http://dx.doi.org/10.17877/DE290R-12747
dc.language.iso	en
dc.relation.ispartofseries	EXCLI Journal ; Vol. 8, 2009	en
dc.subject	bioinformatics	en
dc.subject	modeling	en
dc.subject	protein	en
dc.subject	thermostability	en
dc.subject.ddc	610
dc.title	Searching for patterns of thermostability in proteins and defining the main features contributing to enzyme thermostability through screening, clustering, and decision tree algorithms	en
dc.type	Text
dc.type.publicationtype	article
dcterms.accessRights	open access
eldorado.dnb.deposit	true
eldorado.dnb.zdberstkatid	2132560-1

Files

Original bundle

Name: Ebrahimi_proof_010909.pdf
Size: 227.55 KB
Format: Adobe Portable Document Format
Description: DNB

License bundle

Name: license.txt
Size: 1019 B
Format: Item-specific license agreed upon to submission

Collections

Original Articles