CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) simultaneously profiles RNA and surface protein (antibody-derived tags, ADT) from the same cells, enabling a more complete picture of cell identity than RNA alone. Surface protein levels complement transcriptomics by capturing post-transcriptional regulation and providing sharper separation for immunophenotyping.
This article shows the full CITE-seq workflow using the CBMC dataset – cord blood mononuclear cells profiled with 13 antibodies – and demonstrates how scConvert exports the multi-modal result to h5mu for downstream use in Python’s muon ecosystem.
readH5Seurat() reads the CBMC h5Seurat file and
reconstructs both the RNA and ADT assays. The ADT assay name varies
between datasets; the code below detects it automatically.
cbmc <- readH5Seurat("../cbmc_rna.h5seurat")
cat(sprintf("Cells: %d\n", ncol(cbmc)))
#> Cells: 8617
cat(sprintf("Assays: %s\n", paste(names(cbmc@assays), collapse = ", ")))
#> Assays: RNA, ADT
cat(sprintf("RNA features: %d\n", nrow(cbmc[["RNA"]])))
#> RNA features: 20501
adt_name <- names(cbmc@assays)[names(cbmc@assays) != "RNA"][1]
if (!is.na(adt_name)) {
cat(sprintf("ADT assay '%s': %d features\n", adt_name, nrow(cbmc[[adt_name]])))
}
#> ADT assay 'ADT': 10 features
We run the standard Seurat single-cell workflow on the RNA assay to obtain clusters and a UMAP layout before integrating protein data.
DefaultAssay(cbmc) <- "RNA"
set.seed(42L)
cbmc <- NormalizeData(cbmc, verbose = FALSE)
cbmc <- FindVariableFeatures(cbmc, nfeatures = 2000L, verbose = FALSE)
cbmc <- ScaleData(cbmc, verbose = FALSE)
cbmc <- RunPCA(cbmc, npcs = 30L, verbose = FALSE)
cbmc <- FindNeighbors(cbmc, dims = 1:20, verbose = FALSE)
cbmc <- FindClusters(cbmc, resolution = 0.5, verbose = FALSE)
cbmc <- RunUMAP(cbmc, dims = 1:20, verbose = FALSE)
DimPlot(cbmc, reduction = "umap", group.by = "seurat_clusters",
label = TRUE, label.size = 4, pt.size = 0.5) +
ggtitle("CBMC: RNA clusters") +
theme(plot.title = element_text(hjust = 0.5))
CBMC UMAP coloured by RNA-based Seurat clusters.
CLR (centered log-ratio) normalization is the standard approach for ADT data. It accounts for compositional effects and differences in antibody capture efficiency.
DefaultAssay(cbmc) <- adt_name
cbmc <- NormalizeData(cbmc, normalization.method = "CLR", margin = 2,
verbose = FALSE)
adt_features <- rownames(cbmc[[adt_name]])
plot_features <- head(adt_features, 4L)
FeaturePlot(cbmc, features = plot_features, ncol = 2,
reduction = "umap", pt.size = 0.3,
order = TRUE)
ADT protein levels on the RNA UMAP.
DefaultAssay(cbmc) <- "RNA"
first_adt <- plot_features[1]
DefaultAssay(cbmc) <- adt_name
VlnPlot(cbmc, features = first_adt, group.by = "seurat_clusters",
pt.size = 0) +
ggtitle(sprintf("%s protein across RNA clusters", first_adt)) +
NoLegend()
ADT expression of the first marker across RNA clusters.
Weighted Nearest Neighbor (WNN) integrates RNA and ADT by learning per-cell modality weights. Cells where protein provides additional discriminatory power receive higher ADT weight; cells where RNA is more informative are RNA-dominant. The resulting WNN UMAP captures cell state more accurately than either modality alone.
DefaultAssay(cbmc) <- "RNA"
set.seed(42L)
cbmc <- RunPCA(cbmc, verbose = FALSE)
DefaultAssay(cbmc) <- adt_name
VariableFeatures(cbmc) <- rownames(cbmc[[adt_name]])
cbmc <- ScaleData(cbmc, verbose = FALSE)
n_adt <- nrow(cbmc[[adt_name]])
adt_dims <- min(18L, n_adt - 1L)
cbmc <- RunPCA(cbmc, features = rownames(cbmc[[adt_name]]),
reduction.name = "apca", verbose = FALSE)
cbmc <- FindMultiModalNeighbors(
cbmc,
reduction.list = list("pca", "apca"),
dims.list = list(1:30, seq_len(adt_dims)),
verbose = FALSE
)
cbmc <- RunUMAP(
cbmc,
nn.name = "weighted.nn",
reduction.name = "wnn.umap",
reduction.key = "wnnUMAP_",
verbose = FALSE
)
DimPlot(cbmc, reduction = "wnn.umap", label = TRUE,
label.size = 4, pt.size = 0.5) +
ggtitle("CBMC: WNN UMAP (RNA + ADT)") +
theme(plot.title = element_text(hjust = 0.5))
WNN UMAP integrating RNA and ADT modalities.
Note: WNN UMAP integrates both modalities for a more accurate representation of cell state. Clusters that are indistinguishable by RNA alone may separate when surface protein levels are incorporated.
writeH5MU() writes the full multi-modal Seurat object to
the MuData (.h5mu) format, preserving both RNA and ADT assays in a
single file. This is the recommended format for sharing CITE-seq data
with Python collaborators.
DefaultAssay(cbmc) <- "RNA"
h5mu_path <- file.path(tempdir(), "cbmc_wnn.h5mu")
t0 <- proc.time()
writeH5MU(cbmc, h5mu_path, overwrite = TRUE)
elapsed <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("Wrote h5mu: %.2fs | %.1f MB\n",
elapsed, file.size(h5mu_path) / 1e6))
#> Wrote h5mu: 2.27s | 51.9 MB
We read the exported h5mu in Python via mudata to confirm both modalities are intact and correctly structured.
library(reticulate)
Sys.setenv(NUMBA_THREADING_LAYER = "tbb", OMP_NUM_THREADS = "1")
use_condaenv("scverse")
import mudata
import matplotlib
matplotlib.use('Agg')
mdata = mudata.read_h5mu(r.h5mu_path)
print(mdata)
#> MuData object with n_obs × n_vars = 8617 × 20511
#> obs: 'nCount_ADT', 'nFeature_ADT', 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'rna_annotations', 'protein_annotations', 'RNA_snn_res.0.5', 'seurat_clusters', 'RNA.weight', 'ADT.weight'
#> 2 modalities
#> rna: 8617 x 20501
#> var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
#> obsm: 'X_pca', 'X_umap'
#> layers: 'counts'
#> obsp: 'nn', 'snn', 'wknn', 'wsnn'
#> prot: 8617 x 10
#> var: 'var.features', 'var.features.rank'
#> obsm: 'X_apca', 'X_wnn.umap'
#> layers: 'counts'
for name, mod in mdata.mod.items():
print(f" {name}: {mod.n_obs} cells x {mod.n_vars} features")
#> rna: 8617 cells x 20501 features
#> prot: 8617 cells x 10 features
unlink(h5mu_path)