To validate our choice of preprocessing pipeline, we compared the count matrices produced by two commonly used scRNA-seq preprocessing pipelines: CellRanger and kallisto | bustools. The main difference between them is that CellRanger uses the conventional splice-aware aligner STAR, whereas kallisto uses pseudoalignment. Pseudoalignment is a rapid k-mer-based algorithm that uses a de Bruijn graph of the reference database to identify potential matches for a query sequence without aligning the whole query sequence; an expectation-maximization (EM) algorithm is then used to resolve multi-mapped reads.
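
The idea behind pseudoalignment can be illustrated with a toy sketch. This is not kallisto's implementation: it replaces the de Bruijn graph with a plain dictionary from k-mers to transcript names, requires every k-mer of the read to match (a strict intersection), and omits the EM step. The transcript names and sequences are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        # Intersect the running compatibility set with this k-mer's hits.
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    # A read shorter than k has no k-mers and is left unassigned.
    return compatible or set()


# Hypothetical transcripts that share a prefix and differ at the 3' end.
transcripts = {
    "tx_A": "ACGTACGTTAGC",
    "tx_B": "ACGTACGAAGGC",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("ACGTACG", index, k=5)   # read from the shared prefix
unique = pseudoalign("CGTTAGC", index, k=5)   # read from the tx_A-specific end
```

A read such as the first one, which is compatible with more than one transcript, is the multi-mapped case that the EM step is meant to resolve.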

For additional documentation on how we installed and ran these pipelines, see the following links:
CellRanger
kallisto | bustools

Results

Figure 1. QC Metric Comparison


Figure 1 shows that kallisto aligned a higher overall percentage of reads to the transcriptome and yielded higher overall gene counts. Cells with higher gene counts are generally considered more informative for clustering and downstream analysis.

Figure 2. Notable Differences in Detection


Note: For protein-coding genes, absolute differences of less than 6 were removed for visualization.

Figure 2 shows the differences in gene detection between the pipelines, computed for each gene by subtracting its average count per cell in CellRanger from its average count per cell in kallisto. Negative values indicate higher detection in CellRanger; positive values indicate higher detection in kallisto.
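
The subtraction described above can be sketched as follows. This is a minimal illustration, assuming each pipeline's output has been loaded as a cells x genes array with genes in the same column order; the function name, the `min_abs_diff` parameter, and the example values are hypothetical and are not taken from our analysis code.

```python
import numpy as np


def mean_count_difference(kallisto_counts, cellranger_counts, min_abs_diff=0.0):
    """Per-gene kallisto mean count per cell minus CellRanger mean count per cell.

    Both inputs are cells x genes arrays with genes in the same column order.
    The number of cells may differ, since each mean is taken within a pipeline.
    """
    kallisto_mean = np.asarray(kallisto_counts, dtype=float).mean(axis=0)
    cellranger_mean = np.asarray(cellranger_counts, dtype=float).mean(axis=0)
    diff = kallisto_mean - cellranger_mean
    # Mask of genes whose absolute difference reaches the plotting threshold.
    keep = np.abs(diff) >= min_abs_diff
    return diff, keep


# Made-up counts for three genes and two cells per pipeline.
kallisto = [[4, 0, 2],
            [6, 2, 2]]
cellranger = [[1, 3, 2],
              [3, 5, 2]]

diff, keep = mean_count_difference(kallisto, cellranger, min_abs_diff=1.0)
# diff > 0: higher in kallisto; diff < 0: higher in CellRanger.
```

The `keep` mask mirrors the kind of threshold mentioned in the note above, where small absolute differences are dropped before plotting.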

Figure 3. Notable Differences in Gene Type Counts


Figure 3 shows the gene type categories whose detected counts differed notably between the two pipelines.

Table 1. Gene Types with Higher Detection in kallisto

Gene Type         % Higher in kallisto
Total Genes       55.76
Protein Coding    58.86
lncRNA            33.58


Table 2. Gene Families Detected by kallisto

Prefix    Count
Olfr      287
Vmn       113
Slc       9
Ces       4
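
A tally of gene-name prefixes like the one in Table 2 could be produced along these lines. This is only a sketch under an assumption: it treats the leading run of letters in each gene symbol as the family prefix, which may not match how the table was actually built. The function name and the example gene list are illustrative.

```python
import re
from collections import Counter


def count_gene_prefixes(gene_names):
    """Count genes by their leading run of letters (e.g. 'Olfr' for 'Olfr1234')."""
    counts = Counter()
    for name in gene_names:
        match = re.match(r"[A-Za-z]+", name)
        if match:
            counts[match.group(0)] += 1
    return counts


# Illustrative gene symbols only, not the genes behind Table 2.
genes = ["Olfr1", "Olfr20", "Vmn1r5", "Vmn2r10", "Slc2a1", "Ces1d"]
prefix_counts = count_gene_prefixes(genes)

for prefix, count in prefix_counts.most_common():
    print(prefix, count)
```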

The comparison shows that kallisto is the better choice for our data: it captures more genes across multiple gene type categories.