To process large-scale single-cell RNA-sequencing (scRNA-seq) data effectively without excessive distortion during dimension reduction, University of Pennsylvania researchers have developed SHARP, an ensemble random projection-based algorithm that scales to clustering 10 million cells. Comprehensive benchmarking on 17 public scRNA-seq datasets demonstrates that SHARP outperforms existing methods in both speed and accuracy. In particular, for large datasets (>40,000 cells), SHARP runs faster than competing methods while maintaining high clustering accuracy and robustness. To the best of our knowledge, SHARP is the only R-based tool that scales to clustering scRNA-seq data with 10 million cells.
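The idea behind random projection can be illustrated with a short sketch: multiplying the cells-by-genes matrix by a random matrix reduces the number of dimensions while approximately preserving pairwise distances, and repeating this with different random matrices gives an ensemble of reduced views. The example below is a minimal illustration in plain Python, not SHARP's implementation (SHARP itself is an R tool). The Gaussian projection matrix, the toy data, and all function names are assumptions made for illustration; the source does not specify the projection matrix SHARP uses.

```python
import math
import random


def random_projection_matrix(n_features, n_components, seed=0):
    """Return an n_features x n_components Gaussian random matrix.

    Entries are drawn from N(0, 1/n_components), so squared distances are
    preserved in expectation. Assumption: SHARP's actual matrix may differ.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]


def project(cells, matrix):
    """Multiply a cells x features matrix by a features x components matrix."""
    n_components = len(matrix[0])
    return [[sum(x * row[j] for x, row in zip(cell, matrix))
             for j in range(n_components)]
            for cell in cells]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_toy_cells(n_cells, n_genes, seed=1):
    """Hypothetical stand-in for an expression matrix (uniform random values)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_genes)] for _ in range(n_cells)]


if __name__ == "__main__":
    cells = make_toy_cells(n_cells=5, n_genes=200)
    original = euclidean(cells[0], cells[1])
    # An ensemble: several independent projections of the same cells.
    for k in range(3):
        matrix = random_projection_matrix(200, 100, seed=k)
        reduced = project(cells, matrix)
        ratio = euclidean(reduced[0], reduced[1]) / original
        print(f"projection {k}: distance ratio = {ratio:.3f}")
```

A distance ratio close to 1 for each projection indicates that the reduced data keeps roughly the same geometry as the original. In an ensemble approach, each projected copy would be clustered separately and the results combined afterwards.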
The framework of SHARP
(a) SHARP clusters in four steps: divide-and-conquer, random projection (RP), weighted-based meta-clustering, and similarity-based meta-clustering. (b) Running time and (c) clustering performance, measured by the adjusted Rand index (ARI; Hubert and Arabie 1985), of SHARP on 20 single-cell RNA-seq datasets with numbers of single cells ranging from 124 to 10 million (the datasets with 2 million, 5 million and 10 million cells were generated by randomly oversampling the dataset with 1.3 million single cells). For the datasets with more than 1 million cells, only SHARP could run, and only the running time is reported because ground-truth clustering labels are lacking. All results for SHARP are based on 100 runs of SHARP on each dataset. All tests except those on the datasets with more than 1 million cells were performed using a single core on an Intel Xeon CPU E5-2699 v4 @ 2.20GHz system with 500GB memory. To run the datasets with more than 1 million cells, we used 16 cores on the same system. CIDR and Phenograph were unable to produce clustering results for the datasets with more than 40,000 cells (i.e., Park, Macosko and Montoro_large).
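For readers unfamiliar with the ARI used in panel (c), the following sketch computes it from two label vectors using the standard pair-counting formula. It is a minimal plain-Python illustration, not the code used in the benchmark; the function name and the handling of degenerate cases are assumptions. An ARI of 1 means the two partitions agree exactly (up to relabeling), and values near 0 mean agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two items: treat as perfect agreement (assumption).
        return 1.0

    # Contingency counts n_ij, plus row sums a_i and column sums b_j.
    contingency = Counter(zip(labels_true, labels_pred))
    row_sums = Counter(labels_true)
    col_sums = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_sums.values())
    sum_cols = sum(comb(c, 2) for c in col_sums.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every item in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, relabeled
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that splits every true cluster
```

Because the index depends only on which items share a cluster, swapping cluster names does not change the score, which is why it suits comparing a clustering against reference labels.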