In Supplemental Figure 8 of the pre-print, we classified single cell data from a model organism (cynomolgus monkey) for which flow-sorted datasets were generally lacking, and we did so without any additional species-specific training. Instead, we mapped homologous genes from the Macaca fascicularis genome to the human genome in the single cell data, and then performed cell type classification with Signac. We demonstrate how we mapped the gene symbols in this vignette.

Note:
* This code can be used to identify homologous genes between any two species.
* Monkey data used in Supplemental Figure 8 are available for interactive exploration in the table listed above.

This vignette shows how to map homologous gene symbols from Macaca fascicularis to the human genome.

Load the essential packages

require(biomaRt)
require(tidyverse)
require(SignacX)

After mapping the reads to the Macaca fascicularis genome, we load the gene list from the features.tsv.gz file, which is part of the output of the cellranger pipeline from 10X Genomics.

Load Macaca fascicularis genes

features.tsv <- read.delim("fls/features.tsv.gz", header = FALSE, stringsAsFactors = FALSE)
head(features.tsv)
##                   V1                 V2              V3
## 1 ENSMFAG00000044637              PGBD2 Gene Expression
## 2 ENSMFAG00000039056             ZNF692 Gene Expression
## 3 ENSMFAG00000030010             ZNF672 Gene Expression
## 4 ENSMFAG00000002737            SH3BP5L Gene Expression
## 5 ENSMFAG00000000508              LYPD8 Gene Expression
## 6 ENSMFAG00000040572 ENSMFAG00000040572 Gene Expression
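
Some features have no gene symbol, in which case the second column simply repeats the Ensembl ID (row 6 above). A minimal sketch of how such features could be flagged, assuming a small hand-typed excerpt of the table rather than the real file:

```r
# Hypothetical excerpt mirroring three of the rows of features.tsv shown above
features <- data.frame(
  V1 = c("ENSMFAG00000044637", "ENSMFAG00000000508", "ENSMFAG00000040572"),
  V2 = c("PGBD2", "LYPD8", "ENSMFAG00000040572"),
  stringsAsFactors = FALSE
)

# A feature is unnamed when its symbol column just repeats the Ensembl ID
unnamed <- features$V1 == features$V2

# Ensembl IDs of the features without a gene symbol
features$V1[unnamed]
```

These are the genes for which the cynomolgus annotation alone gives no usable symbol, which is one reason to look up the human ortholog by Ensembl ID rather than by name.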

Map homologous genes

# connect to the human and cyno gene marts in the January 2019 Ensembl archive (release 95)
human.R95 <- useMart(host = "jan2019.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
cyno.R95 <- useMart(host = "jan2019.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "mfascicularis_gene_ensembl")

# values = listeENSID: list of cynomolgus ensembl IDs to be retrieved.
listeENSID = features.tsv$V1
orthologs <- getLDS(attributes = c("ensembl_gene_id", "external_gene_name"), filters = "ensembl_gene_id", 
    values = listeENSID, mart = cyno.R95, attributesL = c("hgnc_symbol", "ensembl_gene_id"), martL = human.R95)
orthologs <- as_tibble(orthologs)
colnames(orthologs) <- c("GeneID", "cynoSymbol", "HumanSymbol", "HumanGeneID")

# keep only cynomolgus genes that map to exactly one human gene
one2one <- orthologs %>% group_by(GeneID) %>% summarise(n = n()) %>% filter(n == 1) %>% pull(GeneID)
orthologs <- orthologs %>% filter(GeneID %in% one2one)

# replace empty HumanSymbol (where there isn't a gene name for a homologous gene) with NA
orthologs <- orthologs %>% mutate(HumanSymbol = replace(HumanSymbol, HumanSymbol == "", NA))
orthologs <- orthologs %>% mutate(cynoSymbol = replace(cynoSymbol, cynoSymbol == "", NA))

idx = match(listeENSID, orthologs$GeneID)
xx = orthologs$HumanSymbol[idx]
logik = !is.na(orthologs$HumanSymbol[idx])  # sum(logik) returns 17,365 homologous genes
xx = xx[logik]
orthologs = orthologs[!is.na(orthologs$HumanSymbol), ]
# note: several human symbols are matched by more than one cynomolgus gene, so xx is not unique;
# we will aggregate them later or make them unique.
# To aggregate, where E is the sparse expression matrix with rownames set to xx:
# E = Matrix.utils::aggregate.Matrix(E, row.names(E))
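
The filtering and lookup steps can be easier to follow on a toy table. The sketch below uses base R only and made-up gene IDs and symbols (cynoA, GENE1, and so on are hypothetical, not Ensembl output); it illustrates the logic and does not replace the biomaRt query.

```r
# Toy ortholog table (hypothetical IDs and symbols)
toy <- data.frame(
  GeneID      = c("cynoA", "cynoB", "cynoB", "cynoC", "cynoD"),
  HumanSymbol = c("GENE1", "GENE2", "GENE3", "", "GENE1"),
  stringsAsFactors = FALSE
)

# Keep cynomolgus genes that appear exactly once, i.e. have a single human hit
hits <- table(toy$GeneID)
one2one_toy <- names(hits)[hits == 1]
toy <- toy[toy$GeneID %in% one2one_toy, ]

# Empty human symbols become NA
toy$HumanSymbol[toy$HumanSymbol == ""] <- NA

# Look up a human symbol for every feature, in the order of the feature list
feature_ids <- c("cynoD", "cynoA", "cynoB", "cynoC", "cynoE")
mapped <- toy$HumanSymbol[match(feature_ids, toy$GeneID)]

# Features with a usable human symbol
keep <- !is.na(mapped)
mapped[keep]
```

In this toy case cynoB is dropped because it has two human hits, cynoC has no symbol, and cynoE has no ortholog at all. Both remaining features map to GENE1, which is the kind of non-unique mapping that has to be aggregated or made unique afterwards.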

Now we have mapped homologous gene symbols across species:

head(orthologs)
##               GeneID cynoSymbol HumanSymbol     HumanGeneID
## 1 ENSMFAG00000046426        ND6      MT-ND6 ENSG00000198695
## 2 ENSMFAG00000002805      POLG2       POLG2 ENSG00000256525
## 3 ENSMFAG00000046418       COX2      MT-CO2 ENSG00000198712
## 4 ENSMFAG00000042657    SLC38A3     SLC38A3 ENSG00000188338
## 5 ENSMFAG00000042891      HMOX1       HMOX1 ENSG00000100292
## 6 ENSMFAG00000038079     SHISA9      SHISA9 ENSG00000237515

After mapping homologous genes, Signac can be used to classify the cell types.
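
Before classification, the expression matrix needs human gene symbols as row names. A minimal sketch of that step, assuming a small dense toy matrix with made-up counts and symbols; base R's rowsum stands in here for the sparse aggregation mentioned in the code comments, and would densify a real sparse matrix:

```r
# Toy count matrix: rows are cynomolgus features, columns are cells (hypothetical values)
E <- matrix(
  1:10, nrow = 5,
  dimnames = list(c("cynoD", "cynoA", "cynoB", "cynoC", "cynoE"),
                  c("cell1", "cell2"))
)

# Human symbol for each row of E, NA where no usable ortholog was found
mapped <- c("GENE1", "GENE1", NA, "GENE2", NA)
keep <- !is.na(mapped)

# Drop unmapped rows, then sum the counts of rows that share a human symbol
E_human <- rowsum(E[keep, , drop = FALSE], group = mapped[keep])
E_human
```

The result has one row per human gene symbol and the same cells as columns, which is the shape a human-trained classifier expects. Summing is one choice for duplicated symbols; making the names unique instead would keep the rows separate.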

Session Info
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.31        magrittr_2.0.1    R6_2.5.0          ragg_1.1.1       
##  [5] rlang_0.4.10      fastmap_1.1.0     stringr_1.4.0     tools_4.0.0      
##  [9] xfun_0.21         jquerylib_0.1.3   htmltools_0.5.1.1 systemfonts_1.0.1
## [13] yaml_2.2.1        assertthat_0.2.1  digest_0.6.27     rprojroot_2.0.2  
## [17] pkgdown_1.6.1     crayon_1.4.1      textshaping_0.3.1 formatR_1.7      
## [21] sass_0.3.1        fs_1.5.0          memoise_2.0.0     cachem_1.0.3     
## [25] evaluate_0.14     rmarkdown_2.7     stringi_1.5.3     compiler_4.0.0   
## [29] bslib_0.2.4       desc_1.2.0        jsonlite_1.7.2