Seurat <-> AnnData (h5ad): Real-Data Workflow with Python Validation

Overview

The h5ad format is the standard on-disk representation for scanpy, scvi-tools, and CELLxGENE. scConvert lets you move data between h5ad and Seurat objects without any Python dependency: all I/O goes through hdf5r.

This article walks through the full round-trip on the PBMC 3k dataset (2,700 cells, 32,738 genes) and confirms the written h5ad is readable and analysable in Python/scanpy.

1 Load h5ad into Seurat

readH5AD() reads the AnnData file and returns a Seurat object. It maps X to the data layer (normalized counts when present) and raw/X to the counts layer.

h5ad_path <- "../pbmc3k.h5ad"

t0 <- proc.time()
pbmc <- readH5AD(h5ad_path, verbose = FALSE)
elapsed <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("readH5AD: %.2fs | %d cells x %d genes\n",
            elapsed, ncol(pbmc), nrow(pbmc)))
#> readH5AD: 0.85s | 2638 cells x 13714 genes

pbmc
#> An object of class Seurat 
#> 13714 features across 2638 samples within 1 assay 
#> Active assay: RNA (13714 features, 0 variable features)
#>  2 layers present: counts, data
#>  2 dimensional reductions calculated: pca, umap
head(pbmc[[]], 4)

2 Standard Seurat analysis pipeline

We run the canonical single-cell workflow on the loaded object. The h5ad may already contain embeddings from the source; re-running the pipeline here ensures the Seurat object is fully self-consistent and demonstrates that the imported data is analytically valid.

set.seed(42L)
pbmc <- NormalizeData(pbmc, verbose = FALSE)
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000L, verbose = FALSE)
pbmc <- ScaleData(pbmc, verbose = FALSE)
pbmc <- RunPCA(pbmc, npcs = 30L, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)

DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 4) +
  ggtitle("PBMC 3k: Seurat clusters (from h5ad import)") +
  theme(plot.title = element_text(hjust = 0.5))

PBMC 3k UMAP coloured by Seurat clusters after h5ad import.

3 Write Seurat back to h5ad

writeH5AD() serialises the Seurat object to a spec-compliant AnnData file. Reductions are stored in obsm, graphs in obsp, and metadata in obs.

tmp_h5ad <- file.path(tempdir(), "pbmc3k_seurat.h5ad")

t0 <- proc.time()
writeH5AD(pbmc, filename = tmp_h5ad, overwrite = TRUE, verbose = FALSE)
elapsed <- (proc.time() - t0)[["elapsed"]]

cat(sprintf("writeH5AD: %.2fs\n", elapsed))
#> writeH5AD: 1.35s
cat(sprintf("Output: %.1f MB\n", file.size(tmp_h5ad) / 1e6))
#> Output: 13.5 MB

4 Layer mapping reference

Seurat slot	h5ad path	Notes
`counts` layer	`/raw/X`	raw integer counts
`data` layer	`/X`	normalised expression
`scale.data` layer	`/layers/scale.data`	scaled matrix if present
`meta.data`	`/obs`	cell-level annotations
`var` rownames + metadata	`/var`	feature annotations
`reductions` (e.g. PCA, UMAP)	`/obsm/X_pca`, `/obsm/X_umap`	cell embeddings
`graphs` (SNN, NN)	`/obsp/`	sparse adjacency matrices
`misc`	`/uns`	unstructured metadata

5 Python validation

We read the written h5ad back in Python/scanpy to confirm the file is correctly formed. This requires the scverse conda environment.

library(reticulate)
Sys.setenv(NUMBA_THREADING_LAYER = "tbb", OMP_NUM_THREADS = "1")
use_condaenv("scverse")

import anndata
import scanpy as sc

adata = anndata.read_h5ad(r.tmp_h5ad)
print(adata)
#> AnnData object with n_obs × n_vars = 2638 × 13714
#>     obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations', 'percent.mt', 'RNA_snn_res.0.5', 'seurat_clusters'
#>     obsm: 'X_pca', 'X_umap'
#>     obsp: 'RNA_nn', 'RNA_snn'

Run scanpy’s standard pipeline on the imported object:

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=30, random_state=42)
sc.pp.neighbors(adata, n_pcs=20, random_state=42)
sc.tl.leiden(adata, resolution=0.5, random_state=42)
sc.tl.umap(adata, random_state=42)
sc.pl.umap(adata, color="leiden", show=True)

PBMC 3k UMAP from scanpy after round-trip through scConvert.

n_clusters = len(adata.obs["leiden"].unique())
print(
    f"Python validation: {adata.n_obs} cells x {adata.n_vars} genes, "
    f"{n_clusters} clusters"
)
#> Python validation: 2638 cells x 13714 genes, 7 clusters

6 Cleanup

unlink(tmp_h5ad)