To validate our choice of preprocessing pipeline, we compared the count matrices from two commonly used scRNA-seq preprocessing pipelines: CellRanger and kallisto | bustools. The main difference between them is that CellRanger uses the conventional splice-aware aligner STAR, whereas kallisto uses pseudoalignment. Pseudoalignment is a rapid k-mer-based algorithm that uses a de Bruijn graph of the reference sequences to identify potential matches for a query sequence without aligning the whole query, and uses an expectation-maximization (EM) algorithm to resolve multi-mapped reads.
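To make the idea of pseudoalignment concrete, here is a minimal toy sketch in Python. It is not kallisto's implementation: it replaces the de Bruijn graph with a plain k-mer lookup table, treats a k-mer missing from the index as "no match", and omits strand handling and the EM step. The transcript names and sequences are made up for illustration.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read.

    No base-level alignment is computed: the result is the intersection
    of the transcript sets of the read's k-mers.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript is compatible with all k-mers seen so far
    return compatible or set()

# Hypothetical toy transcripts sharing a common prefix.
transcripts = {
    "tx1": "ACGTACGTGACC",
    "tx2": "ACGTACGTTTGA",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("ACGTACGT", index, k=5)  # compatible with both transcripts
unique = pseudoalign("ACGTGACC", index, k=5)  # compatible with tx1 only
```

A read such as `ACGTACGT`, which is compatible with more than one transcript, is the multi-mapped case that a later resolution step has to handle.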
For details on how we installed and ran each pipeline, see the following links:
CellRanger
kallisto | bustools
Results
Figure 1 shows that kallisto aligned a higher percentage of reads to the transcriptome and produced higher gene counts overall. Cells with higher gene counts are considered more informative for clustering and downstream analysis.
Note: For protein-coding genes, absolute differences of less than 6 were removed for visualization.
Figure 2 shows the differences in gene detection between the pipelines, computed by subtracting the CellRanger average count per cell from the kallisto average count per cell for each gene. Negative values indicate higher detection in CellRanger, and positive values indicate higher detection in kallisto.
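The per-gene difference described for Figure 2 can be sketched as follows. This is an illustrative reconstruction, not the code used to generate the figure: the function names, the dictionary-based input format, and the toy counts are all assumptions. The `min_abs` filter mirrors the note on protein-coding genes, where absolute differences of less than 6 were removed for visualization.

```python
def mean_count_difference(kallisto_counts, cellranger_counts):
    """Per-gene mean count per cell: kallisto minus CellRanger.

    Each input maps a gene name to a list of counts, one per cell.
    Only genes present in both inputs are compared.
    """
    diffs = {}
    for gene in kallisto_counts.keys() & cellranger_counts.keys():
        k = kallisto_counts[gene]
        c = cellranger_counts[gene]
        diffs[gene] = sum(k) / len(k) - sum(c) / len(c)
    return diffs

def drop_small_differences(diffs, min_abs):
    """Keep only genes whose absolute difference is at least min_abs."""
    return {gene: d for gene, d in diffs.items() if abs(d) >= min_abs}

# Made-up counts for two cells and three hypothetical genes.
kallisto_counts = {"geneA": [4, 6], "geneB": [0, 2], "geneC": [2, 2]}
cellranger_counts = {"geneA": [3, 5], "geneB": [1, 3], "geneC": [2, 2]}

diffs = mean_count_difference(kallisto_counts, cellranger_counts)
# Positive: higher in kallisto; negative: higher in CellRanger.
visible = drop_small_differences(diffs, min_abs=1)
```

With real data the counts would come from the two pipelines' count matrices, restricted to the genes and cell barcodes they have in common.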
Figure 3 shows the gene type categories whose counts differed notably between the two pipelines.
Gene Type | % Higher in kallisto |
---|---|
Total Genes | 55.76 |
Protein Coding | 58.86 |
lncRNA | 33.58 |
Prefix | Count |
---|---|
Olfr | 287 |
Vmn | 113 |
Slc | 9 |
Ces | 4 |
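A tally by gene-name prefix like the one above can be produced with a few lines of Python. This is a sketch under assumptions: the table does not state which gene set was counted, so the input list below is a hypothetical example of gene symbols, and the function name is ours.

```python
from collections import Counter

def count_prefixes(gene_names, prefixes):
    """Count how many gene names start with each of the given prefixes.

    Each gene is counted at most once, under the first matching prefix.
    """
    counts = Counter()
    for name in gene_names:
        for prefix in prefixes:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
    return counts

# Hypothetical input list, not the gene set behind the table above.
genes = ["Olfr1", "Olfr2", "Vmn1r1", "Slc2a1", "Ces1a", "Actb"]
prefix_counts = count_prefixes(genes, ["Olfr", "Vmn", "Slc", "Ces"])
```

Genes that match none of the prefixes, such as `Actb` here, are simply left out of the tally.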
The comparison shows that kallisto captures more genes across multiple gene type categories, making it the better choice for our data.