Searching for patterns of thermostability in proteins and defining the main features contributing to enzyme thermostability through screening, clustering, and decision tree algorithms

dc.contributor.author: Ebrahimie, E.
dc.contributor.author: Ebrahimi, M.
dc.contributor.author: Ebrahimi, M.
dc.date.accessioned: 2010-02-09T14:11:20Z
dc.date.available: 2010-02-09T14:11:20Z
dc.date.issued: 2010-02-09T14:11:20Z
dc.description.abstract: Finding or making thermostable enzymes has been identified as an important goal in a number of different industries. Understanding the features involved in enzyme thermostability is therefore crucial, and different approaches have been used to extract or manufacture thermostable enzymes. Herein we examined features that contribute to the thermostability of 2,946 proteins. We used various screening techniques (anomaly detection, feature selection), clustering methods (K-Means, TwoStep cluster), decision tree models (Classification and Regression Tree, CHAID, Exhaustive CHAID, QUEST, C5.0), and generalized rule induction (GRI) association models to search for patterns of thermostability and to find features that contribute to enzyme thermal stability. Arg as the N-terminal amino acid was found solely in proteins working at temperatures higher than 70 °C. Fifty-four protein features were shown to be important in feature selection modeling, and the number of peer groups with an anomaly index of 2.12 declined from 7 to 2 when the models were run using only the important selected features; however, no changes were found in the numbers of groups when K-Means and TwoStep clustering modeling was performed on datasets with or without feature selection filtering. The depth of the trees generated by the various decision tree models varied from 14 branches (in the C5.0 model with 10-fold cross-validation and with feature selection of the dataset) to 4 branches (in CHAID models). The performance evaluation of the decision tree models tested here showed that C5.0 was the best and QUEST was the worst. We did not find any significant difference in the percent of correctness, performance evaluation, and mean correctness of the various decision tree models when feature-selected datasets were used, but the number of peer groups in clustering models was reduced significantly (p<0.05) compared to datasets without feature selection.
In all decision tree models, the frequency of Gln was the most important feature for decision tree rule sets; moreover, in all GRI association rules (100 rules), the frequency of Gln was used in the antecedent to support the rules. The importance of Gln in protein thermostability is discussed in this paper. [en]
dc.identifier.uri: http://hdl.handle.net/2003/26684
dc.identifier.uri: http://dx.doi.org/10.17877/DE290R-12747
dc.language.iso: en
dc.relation.ispartofseries: EXCLI Journal ; Vol. 8, 2009 [en]
dc.subject: bioinformatics [en]
dc.subject: modeling [en]
dc.subject: protein [en]
dc.subject: thermostability [en]
dc.subject.ddc: 610
dc.title: Searching for patterns of thermostability in proteins and defining the main features contributing to enzyme thermostability through screening, clustering, and decision tree algorithms [en]
dc.type: Text
dc.type.publicationtype: article
dcterms.accessRights: open access
eldorado.dnb.zdberstkatid: 2132560-1

Files

Original bundle
Name: Ebrahimi_proof_010909.pdf
Size: 227.55 KB
Format: Adobe Portable Document Format
Description: DNB