Multimodal CITE-seq with MuData (h5mu): Real CBMC Data

CITE-seq measures RNA and surface protein (ADT, antibody-derived tags) simultaneously from the same cells, enabling joint transcriptomic and proteomic profiling. The MuData h5mu format stores both modalities in a single file, making it straightforward to exchange multimodal datasets between Seurat (R) and muon/mudata (Python) without losing either modality or the shared cell metadata.

scConvert reads and writes h5mu without any Python dependency, relying entirely on hdf5r for HDF5 I/O.

Load real CITE-seq data (CBMC)

The cbmc_rna.h5seurat file contains the cord blood mononuclear cell (CBMC) CITE-seq dataset from Stoeckius et al. (2017): 8,617 cells with 20,501 RNA features and 10 ADT surface markers.

cbmc_path <- "../cbmc_rna.h5seurat"

t0      <- proc.time()
cbmc    <- readH5Seurat(cbmc_path)
elapsed <- (proc.time() - t0)[["elapsed"]]

cat(sprintf("Loaded: %d cells in %.2fs\n", ncol(cbmc), elapsed))
#> Loaded: 8617 cells in 3.03s
cat(sprintf("Assays: %s\n", paste(names(cbmc@assays), collapse = ", ")))
#> Assays: RNA, ADT
cat(sprintf("RNA features: %d\n", nrow(cbmc[["RNA"]])))
#> RNA features: 20501

adt_name <- intersect(names(cbmc@assays), c("ADT", "CITE", "PROT", "prot"))
if (length(adt_name) > 0) {
  cat(sprintf("ADT features (%s): %d\n",
              adt_name[1], nrow(cbmc[[adt_name[1]]])))
  cat(sprintf("ADT markers: %s\n",
              paste(rownames(cbmc[[adt_name[1]]]), collapse = ", ")))
}
#> ADT features (ADT): 10
#> ADT markers: CD3, CD4, CD8, CD45RA, CD56, CD16, CD11c, CD14, CD19, CD34

RNA analysis

Normalize RNA counts with log-normalization, select variable features, run PCA, build a neighbor graph, cluster with the Leiden algorithm, and compute a UMAP projection.

DefaultAssay(cbmc) <- "RNA"

cbmc <- NormalizeData(cbmc, verbose = FALSE)
cbmc <- FindVariableFeatures(cbmc, nfeatures = 2000, verbose = FALSE)
cbmc <- ScaleData(cbmc, verbose = FALSE)
cbmc <- RunPCA(cbmc, verbose = FALSE)
cbmc <- FindNeighbors(cbmc, dims = 1:30, verbose = FALSE)
cbmc <- FindClusters(cbmc, resolution = 0.8, verbose = FALSE)
cbmc <- RunUMAP(cbmc, dims = 1:30, verbose = FALSE)

cat(sprintf("RNA clusters: %d\n", length(unique(cbmc$seurat_clusters))))
#> RNA clusters: 20

DimPlot(cbmc, group.by = "seurat_clusters", label = TRUE, pt.size = 0.3) +
  labs(title = "CBMC CITE-seq: RNA clusters") +
  theme(plot.title = element_text(size = 12),
        axis.text  = element_text(color = "black"),
        axis.title = element_text(color = "black"))

If cell-type annotations are available from the original dataset, they can be overlaid alongside cluster labels:

if ("rna_annotations" %in% colnames(cbmc@meta.data)) {
  DimPlot(cbmc, group.by = "rna_annotations", label = TRUE,
          pt.size = 0.3, repel = TRUE) +
    labs(title = "CBMC: RNA cell-type annotations") +
    theme(plot.title  = element_text(size = 12),
          axis.text   = element_text(color = "black"),
          axis.title  = element_text(color = "black"),
          legend.text = element_text(size = 8))
}

ADT protein expression

CLR (centered log-ratio) normalization is the recommended approach for ADT data. After normalization the protein counts are projected onto the UMAP computed from RNA to visualize protein expression in the RNA embedding space.

if (length(adt_name) > 0) {
  DefaultAssay(cbmc) <- adt_name[1]
  cbmc <- NormalizeData(cbmc,
                        normalization.method = "CLR",
                        margin = 2,
                        verbose = FALSE)
}

if (length(adt_name) > 0) {
  adt_features <- rownames(cbmc[[adt_name[1]]])

  # Select two markers covering distinct lineages: T cells (CD3) and monocytes (CD14)
  plot_features <- intersect(c("CD3", "CD14"), adt_features)
  if (length(plot_features) < 2) {
    plot_features <- adt_features[seq_len(min(2L, length(adt_features)))]
  }

  DefaultAssay(cbmc) <- adt_name[1]
  p <- FeaturePlot(cbmc, features = plot_features, ncol = 2,
                   reduction = "umap", min.cutoff = "q05")
  print(p)
}

Export to h5mu

writeH5MU() serializes all Seurat assays as separate modalities in a single h5mu file. Shared cell metadata, cluster labels, and dimensionality reduction embeddings are written to the root-level obs and obsm groups.

h5mu_path <- file.path(tempdir(), "cbmc.h5mu")

t0      <- proc.time()
writeH5MU(cbmc, h5mu_path, overwrite = TRUE)
elapsed <- (proc.time() - t0)[["elapsed"]]

cat(sprintf("Wrote h5mu: %.2fs | %.1f MB\n",
            elapsed, file.size(h5mu_path) / 1e6))
#> Wrote h5mu: 2.32s | 49.1 MB

Python validation with mudata

The exported h5mu is read by mudata to confirm that both modalities are present with the correct cell and feature counts.

import mudata
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

mdata = mudata.read_h5mu(r.h5mu_path)
print(mdata)
#> MuData object with n_obs × n_vars = 8617 × 20511
#>   obs:   'nCount_ADT', 'nFeature_ADT', 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'rna_annotations', 'protein_annotations', 'RNA_snn_res.0.8', 'seurat_clusters'
#>   2 modalities
#>     rna: 8617 x 20501
#>       var:   'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'vf_vst_counts_mean', 'vf_vst_counts_variance', 'vf_vst_counts_variance.expected', 'vf_vst_counts_variance.standardized', 'vf_vst_counts_variable', 'vf_vst_counts_rank', 'var.features', 'var.features.rank'
#>       obsm:  'X_pca', 'X_umap'
#>       layers:    'counts'
#>       obsp:  'nn', 'snn'
#>     prot:    8617 x 10
#>       layers:    'counts'
print(f"\nModalities: {list(mdata.mod.keys())}")
#> 
#> Modalities: ['rna', 'prot']
for name, mod in mdata.mod.items():
    print(f"  {name}: {mod.n_obs} cells x {mod.n_vars} features")
#>   rna: 8617 cells x 20501 features
#>   prot: 8617 cells x 10 features

Modality name mapping

Seurat assay names are lowercased when written to h5mu to follow the MuData convention. The standard ADT assay is mapped to prot.

Seurat assay	h5mu modality	Notes
RNA	rna	Standard RNA assay
ADT	prot	Antibody-derived tags
ATAC	atac	Chromatin accessibility
Other	lowercase name	Custom modalities

To read the h5mu back into Seurat:

cbmc_rt <- readH5MU(h5mu_path)
cat(sprintf("Round-trip: %d cells, assays: %s\n",
            ncol(cbmc_rt),
            paste(names(cbmc_rt@assays), collapse = ", ")))
#> Round-trip: 8617 cells, assays: ADT, RNA