Open Access Research

A comprehensive evaluation of multicategory classification methods for microbiomic data

Alexander Statnikov1,2*, Mikael Henaff1, Varun Narendra1, Kranti Konganti7, Zhiguo Li1, Liying Yang2, Zhiheng Pei2,3,5, Martin J Blaser2,4,6, Constantin F Aliferis1,3,8 and Alexander V Alekseyenko1,2*

Author Affiliations

1 Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 East 30th Street, New York, NY, USA

2 Department of Medicine, New York University School of Medicine, 550 First Ave, New York, NY, USA

3 Department of Pathology, New York University School of Medicine, 550 First Ave, New York, NY, USA

4 Department of Microbiology, New York University School of Medicine, 550 First Ave, New York, NY, USA

5 Department of Pathology and Laboratory Medicine, Department of Veterans Affairs New York Harbor Healthcare System, 423 East 23rd Street, New York, NY, USA

6 Medical Service, Department of Veterans Affairs New York Harbor Healthcare System, 423 East 23rd Street, New York, NY, USA

7 Whole Systems Genomics Initiative, Texas A&M University, Kleberg Center, Mail Stop 2470, College Station, TX, USA

8 Department of Biostatistics, Vanderbilt University, 1161 21st Ave South, Nashville, TN, USA


Microbiome 2013, 1:11 doi:10.1186/2049-2618-1-11

Published: 5 April 2013

Abstract

Background

Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data.

Results

In this work, we performed a systematic comparison of 18 major classification methods, 5 feature selection methods, and 2 accuracy metrics using 8 datasets spanning 1,802 human samples and several classification tasks: body site classification, subject classification, and diagnosis.

Conclusions

We found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data.

Keywords:
Microbiomic data; Machine learning; Classification; Feature selection