CITE-seq measures RNA and surface protein (ADT, antibody-derived tags) simultaneously from the same cells, enabling joint transcriptomic and proteomic profiling. The MuData h5mu format stores both modalities in a single file, making it straightforward to exchange multimodal datasets between Seurat (R) and muon/mudata (Python) without losing either modality or the shared cell metadata.
scConvert reads and writes h5mu without any Python dependency, relying entirely on hdf5r for HDF5 I/O.
The cbmc_rna.h5seurat file contains the cord blood
mononuclear cell (CBMC) CITE-seq dataset from Stoeckius et al. (2017):
8,617 cells with 20,501 RNA features and 10 ADT surface markers.
cbmc_path <- "../cbmc_rna.h5seurat"
t0 <- proc.time()
cbmc <- readH5Seurat(cbmc_path)
elapsed <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("Loaded: %d cells in %.2fs\n", ncol(cbmc), elapsed))
#> Loaded: 8617 cells in 3.03s
cat(sprintf("Assays: %s\n", paste(names(cbmc@assays), collapse = ", ")))
#> Assays: RNA, ADT
cat(sprintf("RNA features: %d\n", nrow(cbmc[["RNA"]])))
#> RNA features: 20501
adt_name <- intersect(names(cbmc@assays), c("ADT", "CITE", "PROT", "prot"))
if (length(adt_name) > 0) {
cat(sprintf("ADT features (%s): %d\n",
adt_name[1], nrow(cbmc[[adt_name[1]]])))
cat(sprintf("ADT markers: %s\n",
paste(rownames(cbmc[[adt_name[1]]]), collapse = ", ")))
}
#> ADT features (ADT): 10
#> ADT markers: CD3, CD4, CD8, CD45RA, CD56, CD16, CD11c, CD14, CD19, CD34
Normalize RNA counts with log-normalization, select variable features, run PCA, build a neighbor graph, cluster with the Leiden algorithm, and compute a UMAP projection.
DefaultAssay(cbmc) <- "RNA"
cbmc <- NormalizeData(cbmc, verbose = FALSE)
cbmc <- FindVariableFeatures(cbmc, nfeatures = 2000, verbose = FALSE)
cbmc <- ScaleData(cbmc, verbose = FALSE)
cbmc <- RunPCA(cbmc, verbose = FALSE)
cbmc <- FindNeighbors(cbmc, dims = 1:30, verbose = FALSE)
cbmc <- FindClusters(cbmc, resolution = 0.8, verbose = FALSE)
cbmc <- RunUMAP(cbmc, dims = 1:30, verbose = FALSE)
cat(sprintf("RNA clusters: %d\n", length(unique(cbmc$seurat_clusters))))
#> RNA clusters: 20
DimPlot(cbmc, group.by = "seurat_clusters", label = TRUE, pt.size = 0.3) +
labs(title = "CBMC CITE-seq: RNA clusters") +
theme(plot.title = element_text(size = 12),
axis.text = element_text(color = "black"),
axis.title = element_text(color = "black"))
If cell-type annotations are available from the original dataset, they can be overlaid alongside cluster labels:
if ("rna_annotations" %in% colnames(cbmc@meta.data)) {
DimPlot(cbmc, group.by = "rna_annotations", label = TRUE,
pt.size = 0.3, repel = TRUE) +
labs(title = "CBMC: RNA cell-type annotations") +
theme(plot.title = element_text(size = 12),
axis.text = element_text(color = "black"),
axis.title = element_text(color = "black"),
legend.text = element_text(size = 8))
}
CLR (centered log-ratio) normalization is the recommended approach for ADT data. After normalization the protein counts are projected onto the UMAP computed from RNA to visualize protein expression in the RNA embedding space.
if (length(adt_name) > 0) {
DefaultAssay(cbmc) <- adt_name[1]
cbmc <- NormalizeData(cbmc,
normalization.method = "CLR",
margin = 2,
verbose = FALSE)
}
if (length(adt_name) > 0) {
adt_features <- rownames(cbmc[[adt_name[1]]])
# Select two markers covering distinct lineages: T cells (CD3) and monocytes (CD14)
plot_features <- intersect(c("CD3", "CD14"), adt_features)
if (length(plot_features) < 2) {
plot_features <- adt_features[seq_len(min(2L, length(adt_features)))]
}
DefaultAssay(cbmc) <- adt_name[1]
p <- FeaturePlot(cbmc, features = plot_features, ncol = 2,
reduction = "umap", min.cutoff = "q05")
print(p)
}
writeH5MU() serializes all Seurat assays as separate
modalities in a single h5mu file. Shared cell metadata, cluster labels,
and dimensionality reduction embeddings are written to the root-level
obs and obsm groups.
h5mu_path <- file.path(tempdir(), "cbmc.h5mu")
t0 <- proc.time()
writeH5MU(cbmc, h5mu_path, overwrite = TRUE)
elapsed <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("Wrote h5mu: %.2fs | %.1f MB\n",
elapsed, file.size(h5mu_path) / 1e6))
#> Wrote h5mu: 2.32s | 49.1 MB
The exported h5mu is read by mudata to confirm that both modalities are present with the correct cell and feature counts.
import mudata
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
mdata = mudata.read_h5mu(r.h5mu_path)
print(mdata)
#> MuData object with n_obs × n_vars = 8617 × 20511
#> obs: 'nCount_ADT', 'nFeature_ADT', 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'rna_annotations', 'protein_annotations', 'RNA_snn_res.0.8', 'seurat_clusters'
#> 2 modalities
#> rna: 8617 x 20501
#> var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
#> obsm: 'X_pca', 'X_umap'
#> layers: 'counts'
#> obsp: 'nn', 'snn'
#> prot: 8617 x 10
#> layers: 'counts'
print(f"\nModalities: {list(mdata.mod.keys())}")
#>
#> Modalities: ['rna', 'prot']
for name, mod in mdata.mod.items():
print(f" {name}: {mod.n_obs} cells x {mod.n_vars} features")
#> rna: 8617 cells x 20501 features
#> prot: 8617 cells x 10 features
Seurat assay names are lowercased when written to h5mu to follow the
MuData convention. The standard ADT assay is mapped to
prot.
| Seurat assay | h5mu modality | Notes |
|---|---|---|
| RNA | rna | Standard RNA assay |
| ADT | prot | Antibody-derived tags |
| ATAC | atac | Chromatin accessibility |
| Other | lowercase name | Custom modalities |
To read the h5mu back into Seurat:
cbmc_rt <- readH5MU(h5mu_path)
cat(sprintf("Round-trip: %d cells, assays: %s\n",
ncol(cbmc_rt),
paste(names(cbmc_rt@assays), collapse = ", ")))
#> Round-trip: 8617 cells, assays: ADT, RNA