CITE-seq Workflow: RNA + Protein with Real CBMC Data

Introduction

CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) simultaneously profiles RNA and surface protein (antibody-derived tags, ADT) from the same cells, enabling a more complete picture of cell identity than RNA alone. Surface protein levels complement transcriptomics by capturing post-transcriptional regulation and providing sharper separation for immunophenotyping.

This article shows the full CITE-seq workflow using the CBMC dataset – cord blood mononuclear cells profiled with 13 antibodies – and demonstrates how scConvert exports the multi-modal result to h5mu for downstream use in Python’s muon ecosystem.

1 Load CBMC data

readH5Seurat() reads the CBMC h5Seurat file and reconstructs both the RNA and ADT assays. The ADT assay name varies between datasets; the code below detects it automatically.

cbmc <- readH5Seurat("../cbmc_rna.h5seurat")
cat(sprintf("Cells: %d\n", ncol(cbmc)))
#> Cells: 8617
cat(sprintf("Assays: %s\n", paste(names(cbmc@assays), collapse = ", ")))
#> Assays: RNA, ADT
cat(sprintf("RNA features: %d\n", nrow(cbmc[["RNA"]])))
#> RNA features: 20501

adt_name <- names(cbmc@assays)[names(cbmc@assays) != "RNA"][1]
if (!is.na(adt_name)) {
  cat(sprintf("ADT assay '%s': %d features\n", adt_name, nrow(cbmc[[adt_name]])))
}
#> ADT assay 'ADT': 10 features

2 RNA analysis pipeline

We run the standard Seurat single-cell workflow on the RNA assay to obtain clusters and a UMAP layout before integrating protein data.

DefaultAssay(cbmc) <- "RNA"
set.seed(42L)
cbmc <- NormalizeData(cbmc, verbose = FALSE)
cbmc <- FindVariableFeatures(cbmc, nfeatures = 2000L, verbose = FALSE)
cbmc <- ScaleData(cbmc, verbose = FALSE)
cbmc <- RunPCA(cbmc, npcs = 30L, verbose = FALSE)
cbmc <- FindNeighbors(cbmc, dims = 1:20, verbose = FALSE)
cbmc <- FindClusters(cbmc, resolution = 0.5, verbose = FALSE)
cbmc <- RunUMAP(cbmc, dims = 1:20, verbose = FALSE)

DimPlot(cbmc, reduction = "umap", group.by = "seurat_clusters",
        label = TRUE, label.size = 4, pt.size = 0.5) +
  ggtitle("CBMC: RNA clusters") +
  theme(plot.title = element_text(hjust = 0.5))

CBMC UMAP coloured by RNA-based Seurat clusters.

3 Protein (ADT) analysis

CLR (centered log-ratio) normalization is the standard approach for ADT data. It accounts for compositional effects and differences in antibody capture efficiency.

DefaultAssay(cbmc) <- adt_name
cbmc <- NormalizeData(cbmc, normalization.method = "CLR", margin = 2,
                      verbose = FALSE)

adt_features <- rownames(cbmc[[adt_name]])
plot_features <- head(adt_features, 4L)

FeaturePlot(cbmc, features = plot_features, ncol = 2,
            reduction = "umap", pt.size = 0.3,
            order = TRUE)

ADT protein levels on the RNA UMAP.

DefaultAssay(cbmc) <- "RNA"
first_adt <- plot_features[1]
DefaultAssay(cbmc) <- adt_name

VlnPlot(cbmc, features = first_adt, group.by = "seurat_clusters",
        pt.size = 0) +
  ggtitle(sprintf("%s protein across RNA clusters", first_adt)) +
  NoLegend()

ADT expression of the first marker across RNA clusters.

4 WNN multi-modal integration

Weighted Nearest Neighbor (WNN) integrates RNA and ADT by learning per-cell modality weights. Cells where protein provides additional discriminatory power receive higher ADT weight; cells where RNA is more informative are RNA-dominant. The resulting WNN UMAP captures cell state more accurately than either modality alone.

DefaultAssay(cbmc) <- "RNA"
set.seed(42L)
cbmc <- RunPCA(cbmc, verbose = FALSE)

DefaultAssay(cbmc) <- adt_name
VariableFeatures(cbmc) <- rownames(cbmc[[adt_name]])
cbmc <- ScaleData(cbmc, verbose = FALSE)
n_adt <- nrow(cbmc[[adt_name]])
adt_dims <- min(18L, n_adt - 1L)
cbmc <- RunPCA(cbmc, features = rownames(cbmc[[adt_name]]),
               reduction.name = "apca", verbose = FALSE)

cbmc <- FindMultiModalNeighbors(
  cbmc,
  reduction.list = list("pca", "apca"),
  dims.list      = list(1:30, seq_len(adt_dims)),
  verbose        = FALSE
)

cbmc <- RunUMAP(
  cbmc,
  nn.name       = "weighted.nn",
  reduction.name = "wnn.umap",
  reduction.key  = "wnnUMAP_",
  verbose        = FALSE
)

DimPlot(cbmc, reduction = "wnn.umap", label = TRUE,
        label.size = 4, pt.size = 0.5) +
  ggtitle("CBMC: WNN UMAP (RNA + ADT)") +
  theme(plot.title = element_text(hjust = 0.5))

WNN UMAP integrating RNA and ADT modalities.

Note: WNN UMAP integrates both modalities for a more accurate representation of cell state. Clusters that are indistinguishable by RNA alone may separate when surface protein levels are incorporated.

5 Export to h5mu

writeH5MU() writes the full multi-modal Seurat object to the MuData (.h5mu) format, preserving both RNA and ADT assays in a single file. This is the recommended format for sharing CITE-seq data with Python collaborators.

DefaultAssay(cbmc) <- "RNA"
h5mu_path <- file.path(tempdir(), "cbmc_wnn.h5mu")

t0      <- proc.time()
writeH5MU(cbmc, h5mu_path, overwrite = TRUE)
elapsed <- (proc.time() - t0)[["elapsed"]]

cat(sprintf("Wrote h5mu: %.2fs | %.1f MB\n",
            elapsed, file.size(h5mu_path) / 1e6))
#> Wrote h5mu: 2.27s | 51.9 MB

6 Python validation

We read the exported h5mu in Python via mudata to confirm both modalities are intact and correctly structured.

library(reticulate)
Sys.setenv(NUMBA_THREADING_LAYER = "tbb", OMP_NUM_THREADS = "1")
use_condaenv("scverse")

import mudata
import matplotlib
matplotlib.use('Agg')

mdata = mudata.read_h5mu(r.h5mu_path)
print(mdata)
#> MuData object with n_obs × n_vars = 8617 × 20511
#>   obs:   'nCount_ADT', 'nFeature_ADT', 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'rna_annotations', 'protein_annotations', 'RNA_snn_res.0.5', 'seurat_clusters', 'RNA.weight', 'ADT.weight'
#>   2 modalities
#>     rna: 8617 x 20501
#>       var:   'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
#>       obsm:  'X_pca', 'X_umap'
#>       layers:    'counts'
#>       obsp:  'nn', 'snn', 'wknn', 'wsnn'
#>     prot:    8617 x 10
#>       var:   'var.features', 'var.features.rank'
#>       obsm:  'X_apca', 'X_wnn.umap'
#>       layers:    'counts'
for name, mod in mdata.mod.items():
    print(f"  {name}: {mod.n_obs} cells x {mod.n_vars} features")
#>   rna: 8617 cells x 20501 features
#>   prot: 8617 cells x 10 features

7 Cleanup

unlink(h5mu_path)