Systems Biology at Pacific Northwest National Laboratory

SVM-HUSTLE

Anuj Shah, Principal Investigator

Visit the SVM-HUSTLE software page to download the software.

Figure: SVM-HUSTLE flow chart

As the amount of biological sequence data continues to grow exponentially, we face the increasing challenge of assigning function to this enormous molecular parts list. The most popular approaches to this challenge rely on the simplifying assumption that functionally similar molecules, or proteins, often have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence), which frequently make up a significant fraction of the total homolog collection for a given sequence.

Scientists at Pacific Northwest National Laboratory (PNNL) have developed a Support Vector Machine (SVM)-based tool to detect Homology Using Semi-supervised iTerative LEarning (SVM-HUSTLE) that identifies significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. Rather than building profiles or position-specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets, recruits additional sequences, and assigns a statistical measure of homology between a pair of sequences. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers that iteratively detect and refine, on the fly, patterns indicating homology.
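The iterative recruitment idea described above can be illustrated with a small, self-contained Python sketch. It is not the SVM-HUSTLE implementation: the k-mer frequency features, the Pegasos-style linear SVM trainer, the fixed score threshold, and every function and parameter name (kmer_features, train_linear_svm, iterative_recruit, n_models, threshold) are assumptions made for illustration only. The sketch trains several classifiers per round, each on a different random sample of the negative set, averages their scores, and adds high-scoring unlabeled sequences to the positive set before retraining.

```python
import random
from collections import Counter


def kmer_features(seq, k=2):
    """Map a sequence to normalized k-mer frequencies plus a constant bias feature."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    feats = {kmer: c / total for kmer, c in counts.items()}
    feats["__bias__"] = 1.0
    return feats


def score(w, x):
    """Linear decision value of sparse feature dict x under weight dict w."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def train_linear_svm(examples, epochs=100, lam=0.01, rng=None):
    """Fit a linear SVM by Pegasos-style stochastic subgradient descent on the hinge loss.

    examples: list of (feature_dict, label) pairs with label in {+1, -1}.
    """
    rng = rng or random.Random(0)
    w, t = {}, 0
    for _epoch in range(epochs):
        order = list(examples)
        rng.shuffle(order)
        for x, y in order:
            t += 1
            eta = 1.0 / (lam * t)
            violated = y * score(w, x) < 1.0  # margin check uses the pre-update weights
            for f in w:
                w[f] *= 1.0 - eta * lam  # shrink step from the L2 regularizer
            if violated:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w


def ensemble_score(models, seq):
    """Average decision value of a sequence over all classifiers in the ensemble."""
    x = kmer_features(seq)
    return sum(score(w, x) for w in models) / len(models)


def iterative_recruit(query, positives, negatives, unlabeled,
                      rounds=3, n_models=5, threshold=1.0, seed=0):
    """Self-training loop: train, recruit confident unlabeled sequences, retrain."""
    rng = random.Random(seed)
    members = [query] + list(positives)
    pool = list(unlabeled)
    models = []
    for _round in range(rounds):
        models = []
        for _m in range(n_models):
            # Each classifier sees a different random sample of the negatives.
            sample = rng.sample(negatives, max(1, len(negatives) // 2))
            data = [(kmer_features(s), 1) for s in members]
            data += [(kmer_features(s), -1) for s in sample]
            models.append(train_linear_svm(data, rng=rng))
        recruited = [s for s in pool if ensemble_score(models, s) >= threshold]
        if not recruited:
            break  # nothing new was confident enough, so stop iterating
        members.extend(recruited)
        pool = [s for s in pool if s not in recruited]
    return models, members


# Toy usage with made-up sequences (not real proteins).
toy_models, toy_members = iterative_recruit(
    query="AKAKAKAKAK",
    positives=["AKAKAKAKAA", "KAKAKAKAKA"],
    negatives=["GWGWGWGWGW", "WGWGWGWGWG", "GWGWGGWGWG", "WGWGWWGWGW"],
    unlabeled=["AKAKAAKAKA", "GWGWWGWGWG"],
)
```

In this sketch the raw averaged decision value stands in for the statistical measure of homology mentioned above; how SVM-HUSTLE actually derives that measure is not described on this page.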

Results: When compared against existing methods for identifying protein homologs (BLAST, PSI-BLAST, COMPASS, PROF_SIM, RANKPROP and their variants) on two different benchmark datasets, SVM-HUSTLE significantly outperforms each method under the most stringent ROC1 statistic, with p-values less than 1e-20. SVM-HUSTLE also yields results comparable to HHSearch but at a substantially reduced computational cost, since it does not require the construction of HMMs.
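This page does not define the ROC1 statistic. Under the definition commonly used in homology-detection benchmarks, ROC_n is the area under the ROC curve up to the n-th false positive, normalized to lie between 0 and 1, so ROC1 is the fraction of true homologs ranked above the first false positive. The sketch below assumes that definition; the function name roc_n and its input format are illustrative and are not taken from the SVM-HUSTLE software.

```python
def roc_n(ranked_labels, n=1):
    """Normalized area under the ROC curve up to the n-th false positive.

    ranked_labels: booleans ordered from best-scoring to worst-scoring hit,
    where True marks a true homolog and False marks a false positive.
    """
    total_pos = sum(1 for label in ranked_labels if label)
    if total_pos == 0 or n <= 0:
        return 0.0
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_pos in ranked_labels:
        if is_pos:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # true positives ranked above this false positive
            if fp_seen == n:
                break
    # If the list holds fewer than n false positives, treat the missing ones
    # as ranked below every true positive.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_pos)


# Two of three homologs are ranked above the first false positive.
example = roc_n([True, True, False, True], n=1)  # 2/3
```

Because a single false positive ends the count, ROC1 rewards only methods that place true homologs at the very top of the ranking, which is why the text above calls it the most stringent setting.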