Overview

The h5ad format is the standard on-disk representation for scanpy, scvi-tools, and CELLxGENE. scConvert lets you move data between h5ad and Seurat objects without any Python dependency: all I/O goes through hdf5r.

This article walks through the full round-trip on the PBMC 3k dataset (2,700 cells, 32,738 genes) and confirms the written h5ad is readable and analysable in Python/scanpy.


1 Load h5ad into Seurat

readH5AD() reads the AnnData file and returns a Seurat object. It maps X to the data layer (normalized counts when present) and raw/X to the counts layer.

h5ad_path <- "../pbmc3k.h5ad"

t0 <- proc.time()
pbmc <- readH5AD(h5ad_path, verbose = FALSE)
elapsed <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("readH5AD: %.2fs | %d cells x %d genes\n",
            elapsed, ncol(pbmc), nrow(pbmc)))
#> readH5AD: 0.85s | 2638 cells x 13714 genes
pbmc
#> An object of class Seurat 
#> 13714 features across 2638 samples within 1 assay 
#> Active assay: RNA (13714 features, 0 variable features)
#>  2 layers present: counts, data
#>  2 dimensional reductions calculated: pca, umap
head(pbmc[[]], 4)

2 Standard Seurat analysis pipeline

We run the canonical single-cell workflow on the loaded object. The h5ad may already contain embeddings from the source; re-running the pipeline here ensures the Seurat object is fully self-consistent and demonstrates that the imported data is analytically valid.

set.seed(42L)
pbmc <- NormalizeData(pbmc, verbose = FALSE)
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000L, verbose = FALSE)
pbmc <- ScaleData(pbmc, verbose = FALSE)
pbmc <- RunPCA(pbmc, npcs = 30L, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 4) +
  ggtitle("PBMC 3k: Seurat clusters (from h5ad import)") +
  theme(plot.title = element_text(hjust = 0.5))
PBMC 3k UMAP coloured by Seurat clusters after h5ad import.

PBMC 3k UMAP coloured by Seurat clusters after h5ad import.


3 Write Seurat back to h5ad

writeH5AD() serialises the Seurat object to a spec-compliant AnnData file. Reductions are stored in obsm, graphs in obsp, and metadata in obs.

tmp_h5ad <- file.path(tempdir(), "pbmc3k_seurat.h5ad")

t0 <- proc.time()
writeH5AD(pbmc, filename = tmp_h5ad, overwrite = TRUE, verbose = FALSE)
elapsed <- (proc.time() - t0)[["elapsed"]]

cat(sprintf("writeH5AD: %.2fs\n", elapsed))
#> writeH5AD: 1.35s
cat(sprintf("Output: %.1f MB\n", file.size(tmp_h5ad) / 1e6))
#> Output: 13.5 MB

4 Layer mapping reference

Seurat slot h5ad path Notes
counts layer /raw/X raw integer counts
data layer /X normalised expression
scale.data layer /layers/scale.data scaled matrix if present
meta.data /obs cell-level annotations
var rownames + metadata /var feature annotations
reductions (e.g. PCA, UMAP) /obsm/X_pca, /obsm/X_umap cell embeddings
graphs (SNN, NN) /obsp/ sparse adjacency matrices
misc /uns unstructured metadata

5 Python validation

We read the written h5ad back in Python/scanpy to confirm the file is correctly formed. This requires the scverse conda environment.

library(reticulate)
Sys.setenv(NUMBA_THREADING_LAYER = "tbb", OMP_NUM_THREADS = "1")
use_condaenv("scverse")
import anndata
import scanpy as sc

adata = anndata.read_h5ad(r.tmp_h5ad)
print(adata)
#> AnnData object with n_obs × n_vars = 2638 × 13714
#>     obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations', 'percent.mt', 'RNA_snn_res.0.5', 'seurat_clusters'
#>     obsm: 'X_pca', 'X_umap'
#>     obsp: 'RNA_nn', 'RNA_snn'

Run scanpy’s standard pipeline on the imported object:

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=30, random_state=42)
sc.pp.neighbors(adata, n_pcs=20, random_state=42)
sc.tl.leiden(adata, resolution=0.5, random_state=42)
sc.tl.umap(adata, random_state=42)
sc.pl.umap(adata, color="leiden", show=True)
PBMC 3k UMAP from scanpy after round-trip through scConvert.

PBMC 3k UMAP from scanpy after round-trip through scConvert.

n_clusters = len(adata.obs["leiden"].unique())
print(
    f"Python validation: {adata.n_obs} cells x {adata.n_vars} genes, "
    f"{n_clusters} clusters"
)
#> Python validation: 2638 cells x 13714 genes, 7 clusters

6 Cleanup

unlink(tmp_h5ad)