vignettes/Crabeating_vignette.Rmd
In Supplemental Figure 8 of the pre-print, we classified single-cell data from a model organism (cynomolgus monkey, Macaca fascicularis), for which flow-sorted datasets are generally lacking, without any additional species-specific training. Instead, we mapped the Macaca fascicularis genes in the single-cell data to their human homologs, and then performed cell type classification with Signac. This vignette demonstrates how we mapped the gene symbols.
Note:
* This code can be used to identify homologous genes between any two species.
* Monkey data used in Supplemental Figure 8 are available for interactive exploration in the table listed above.
This vignette shows how to map homologous gene symbols from Macaca fascicularis to the human genome.
After mapping the reads to the Macaca fascicularis genome, we load the gene annotations (features.tsv.gz), which are part of the output of the cellranger pipeline from 10X Genomics.
features.tsv <- read.delim("fls/features.tsv.gz", header = FALSE, stringsAsFactors = FALSE)
head(features.tsv)
##                   V1                 V2              V3
## 1 ENSMFAG00000044637              PGBD2 Gene Expression
## 2 ENSMFAG00000039056             ZNF692 Gene Expression
## 3 ENSMFAG00000030010             ZNF672 Gene Expression
## 4 ENSMFAG00000002737            SH3BP5L Gene Expression
## 5 ENSMFAG00000000508              LYPD8 Gene Expression
## 6 ENSMFAG00000040572 ENSMFAG00000040572 Gene Expression
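Row 6 above shows a feature whose symbol column simply repeats its Ensembl ID, which suggests that no gene symbol was available for it. As a quick sanity check, the sketch below counts such features. It runs on a small hypothetical table (`toy.features` is made up for illustration and only mimics the three-column layout shown above), so it is an illustration rather than part of the original analysis.

```r
# Hypothetical stand-in for features.tsv (same three-column layout as above).
toy.features <- data.frame(
  V1 = c("ENSMFAG00000044637", "ENSMFAG00000000508", "ENSMFAG00000040572"),
  V2 = c("PGBD2", "LYPD8", "ENSMFAG00000040572"),
  V3 = "Gene Expression",
  stringsAsFactors = FALSE
)

# A feature is treated as unnamed when its symbol column repeats the Ensembl ID.
unnamed <- toy.features$V2 == toy.features$V1
n.unnamed <- sum(unnamed)

# Show the unnamed features.
toy.features[unnamed, ]
```

Unnamed features can still receive a human symbol through the ortholog lookup, since that lookup is keyed on the Ensembl ID in V1 rather than on the symbol in V2.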
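library(biomaRt)  # useMart(), getLDS()
library(dplyr)    # %>%, group_by(), filter(), mutate()
library(tibble)   # as_tibble()
# get human and cyno gene symbols (Ensembl release 95 archive)
human.R95 <- useMart(host = "jan2019.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
cyno.R95 <- useMart(host = "jan2019.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "mfascicularis_gene_ensembl")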
# values = listeENSID: list of cynomolgus ensembl IDs to be retrieved.
listeENSID = features.tsv$V1
orthologs <- getLDS(attributes = c("ensembl_gene_id", "external_gene_name"), filters = "ensembl_gene_id",
values = listeENSID, mart = cyno.R95, attributesL = c("hgnc_symbol", "ensembl_gene_id"), martL = human.R95)
orthologs <- as_tibble(orthologs)
colnames(orthologs) <- c("GeneID", "cynoSymbol", "HumanSymbol", "HumanGeneID")
# keep only 1:1 orthologs (cynomolgus genes that map to exactly one human gene)
one2one <- orthologs %>% group_by(GeneID) %>% summarise(n = n()) %>% filter(n == 1) %>% pull(GeneID)
orthologs <- orthologs %>% filter(GeneID %in% one2one)
# replace empty HumanSymbol (where there isn't a gene name for a homologous gene) with NA
orthologs <- orthologs %>% mutate(HumanSymbol = replace(HumanSymbol, HumanSymbol == "", NA))
orthologs <- orthologs %>% mutate(cynoSymbol = replace(cynoSymbol, cynoSymbol == "", NA))
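# align the ortholog table to the order of the features in features.tsv
idx = match(listeENSID, orthologs$GeneID)
# human symbol for each feature (NA when there is no 1:1 ortholog or no human symbol)
xx = orthologs$HumanSymbol[idx]
logik = !is.na(xx) # sum(logik) returns 17,365 homologous genes
xx = xx[logik]
# keep only the orthologs that have a human symbol
orthologs = orthologs[!is.na(orthologs$HumanSymbol), ]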
# note: several of these genes are not unique mappers; we will aggregate them later or make them unique.
# To aggregate, where E is the sparse expression matrix with rownames set to xx:
# E = Matrix.utils::aggregate.Matrix(E, row.names(E))
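To make the filtering and alignment steps above easier to follow without querying Ensembl, here is a minimal sketch of the same logic on toy data. All identifiers (`CYNO_A`, `GENE1`, `toy.orthologs`, and so on) are hypothetical, and the one-to-one filter is written in base R with `table()` instead of the dplyr pipeline, so it is an equivalent illustration rather than the code used for the pre-print.

```r
# Hypothetical ortholog table: CYNO_B maps to two human genes,
# and CYNO_C has an ortholog without a human symbol.
toy.orthologs <- data.frame(
  GeneID = c("CYNO_A", "CYNO_B", "CYNO_B", "CYNO_C"),
  HumanSymbol = c("GENE1", "GENE2", "GENE3", ""),
  stringsAsFactors = FALSE
)
# Hypothetical feature order; CYNO_D has no ortholog at all.
toy.ids <- c("CYNO_C", "CYNO_A", "CYNO_B", "CYNO_D")

# Keep cynomolgus genes that appear exactly once in the ortholog table.
counts <- table(toy.orthologs$GeneID)
keep <- names(counts)[as.vector(counts) == 1]
toy.orthologs <- toy.orthologs[toy.orthologs$GeneID %in% keep, ]

# Empty symbols become NA, then the table is aligned to the feature order.
toy.orthologs$HumanSymbol[toy.orthologs$HumanSymbol == ""] <- NA
toy.idx <- match(toy.ids, toy.orthologs$GeneID)
toy.xx <- toy.orthologs$HumanSymbol[toy.idx]

# TRUE only for features that end up with a usable human symbol.
toy.logik <- !is.na(toy.xx)
toy.xx[toy.logik]
```

In this toy case only `CYNO_A` survives: `CYNO_B` is dropped because it maps to two human genes, `CYNO_C` because its ortholog has no symbol, and `CYNO_D` because it has no ortholog.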
Now we have mapped homologous gene symbols across species:
head(orthologs)
##               GeneID cynoSymbol HumanSymbol     HumanGeneID
## 1 ENSMFAG00000046426        ND6      MT-ND6 ENSG00000198695
## 2 ENSMFAG00000002805      POLG2       POLG2 ENSG00000256525
## 3 ENSMFAG00000046418       COX2      MT-CO2 ENSG00000198712
## 4 ENSMFAG00000042657    SLC38A3     SLC38A3 ENSG00000188338
## 5 ENSMFAG00000042891      HMOX1       HMOX1 ENSG00000100292
## 6 ENSMFAG00000038079     SHISA9      SHISA9 ENSG00000237515
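The mapped symbols still have to be applied to the expression matrix, and symbols shared by several rows have to be aggregated or made unique. The sketch below shows one possible way to do this on a small dense toy matrix using base R only: `rowsum()` stands in for `Matrix.utils::aggregate.Matrix()`, and `make.unique()` shows the alternative of keeping every row. The matrix `toy.E` and the vector `toy.symbols` are hypothetical; for a real sparse matrix a sparse-aware aggregation would be preferable.

```r
# Hypothetical dense count matrix: 4 genes x 3 cells, rows in feature order.
toy.E <- matrix(
  c(1, 0, 2,
    0, 3, 1,
    4, 4, 0,
    5, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("CYNO_A", "CYNO_B", "CYNO_C", "CYNO_D"),
                  c("cell1", "cell2", "cell3"))
)
# Hypothetical human symbol per row; NA means no usable homolog,
# and two rows share the symbol GENE1.
toy.symbols <- c("GENE1", "GENE2", "GENE1", NA)

# Drop rows without a human symbol and relabel the remaining rows.
has.symbol <- !is.na(toy.symbols)
toy.E <- toy.E[has.symbol, , drop = FALSE]
rownames(toy.E) <- toy.symbols[has.symbol]

# Option 1: sum the counts of rows that share a symbol.
E.aggregated <- rowsum(toy.E, group = rownames(toy.E))

# Option 2: keep every row and make the labels unique instead.
unique.names <- make.unique(rownames(toy.E))
```

Aggregating keeps one row per human symbol, which is usually what a classifier keyed on gene symbols expects, while `make.unique()` preserves the individual rows at the cost of suffixed names such as `GENE1.1`.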
After mapping homologous genes, Signac can be used to classify the cell types.
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] knitr_1.31 magrittr_2.0.1 R6_2.5.0 ragg_1.1.1
## [5] rlang_0.4.10 fastmap_1.1.0 stringr_1.4.0 tools_4.0.0
## [9] xfun_0.21 jquerylib_0.1.3 htmltools_0.5.1.1 systemfonts_1.0.1
## [13] yaml_2.2.1 assertthat_0.2.1 digest_0.6.27 rprojroot_2.0.2
## [17] pkgdown_1.6.1 crayon_1.4.1 textshaping_0.3.1 formatR_1.7
## [21] sass_0.3.1 fs_1.5.0 memoise_2.0.0 cachem_1.0.3
## [25] evaluate_0.14 rmarkdown_2.7 stringi_1.5.3 compiler_4.0.0
## [29] bslib_0.2.4 desc_1.2.0 jsonlite_1.7.2