The h5ad format is the standard on-disk representation for scanpy, scvi-tools, and CELLxGENE. scConvert lets you move data between h5ad and Seurat objects without any Python dependency: all I/O goes through hdf5r.
This article walks through the full round-trip on the PBMC 3k dataset (2,700 cells, 32,738 genes) and confirms the written h5ad is readable and analysable in Python/scanpy.
readH5AD() reads the AnnData file and returns a
Seurat object. It maps X to the
data layer (normalized counts when present) and
raw/X to the counts layer.
h5ad_path <- "../pbmc3k.h5ad"
t0 <- proc.time()
pbmc <- readH5AD(h5ad_path, verbose = FALSE)
elapsed <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("readH5AD: %.2fs | %d cells x %d genes\n",
elapsed, ncol(pbmc), nrow(pbmc)))
#> readH5AD: 0.85s | 2638 cells x 13714 genes
pbmc
#> An object of class Seurat
#> 13714 features across 2638 samples within 1 assay
#> Active assay: RNA (13714 features, 0 variable features)
#> 2 layers present: counts, data
#> 2 dimensional reductions calculated: pca, umap
head(pbmc[[]], 4)
We run the canonical single-cell workflow on the loaded object. The h5ad may already contain embeddings from the source; re-running the pipeline here ensures the Seurat object is fully self-consistent and demonstrates that the imported data is analytically valid.
set.seed(42L)
pbmc <- NormalizeData(pbmc, verbose = FALSE)
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000L, verbose = FALSE)
pbmc <- ScaleData(pbmc, verbose = FALSE)
pbmc <- RunPCA(pbmc, npcs = 30L, verbose = FALSE)
pbmc <- RunUMAP(pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:20, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)
DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 4) +
ggtitle("PBMC 3k: Seurat clusters (from h5ad import)") +
theme(plot.title = element_text(hjust = 0.5))
PBMC 3k UMAP coloured by Seurat clusters after h5ad import.
writeH5AD() serialises the Seurat object to a
spec-compliant AnnData file. Reductions are stored in obsm,
graphs in obsp, and metadata in obs.
tmp_h5ad <- file.path(tempdir(), "pbmc3k_seurat.h5ad")
t0 <- proc.time()
writeH5AD(pbmc, filename = tmp_h5ad, overwrite = TRUE, verbose = FALSE)
elapsed <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("writeH5AD: %.2fs\n", elapsed))
#> writeH5AD: 1.35s
cat(sprintf("Output: %.1f MB\n", file.size(tmp_h5ad) / 1e6))
#> Output: 13.5 MB
| Seurat slot | h5ad path | Notes |
|---|---|---|
counts layer |
/raw/X |
raw integer counts |
data layer |
/X |
normalised expression |
scale.data layer |
/layers/scale.data |
scaled matrix if present |
meta.data |
/obs |
cell-level annotations |
var rownames + metadata |
/var |
feature annotations |
reductions (e.g. PCA, UMAP) |
/obsm/X_pca, /obsm/X_umap |
cell embeddings |
graphs (SNN, NN) |
/obsp/ |
sparse adjacency matrices |
misc |
/uns |
unstructured metadata |
We read the written h5ad back in Python/scanpy to confirm the file is
correctly formed. This requires the scverse conda
environment.
library(reticulate)
Sys.setenv(NUMBA_THREADING_LAYER = "tbb", OMP_NUM_THREADS = "1")
use_condaenv("scverse")
import anndata
import scanpy as sc
adata = anndata.read_h5ad(r.tmp_h5ad)
print(adata)
#> AnnData object with n_obs × n_vars = 2638 × 13714
#> obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations', 'percent.mt', 'RNA_snn_res.0.5', 'seurat_clusters'
#> obsm: 'X_pca', 'X_umap'
#> obsp: 'RNA_nn', 'RNA_snn'
Run scanpy’s standard pipeline on the imported object:
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=30, random_state=42)
sc.pp.neighbors(adata, n_pcs=20, random_state=42)
sc.tl.leiden(adata, resolution=0.5, random_state=42)
sc.tl.umap(adata, random_state=42)
sc.pl.umap(adata, color="leiden", show=True)
PBMC 3k UMAP from scanpy after round-trip through scConvert.
n_clusters = len(adata.obs["leiden"].unique())
print(
f"Python validation: {adata.n_obs} cells x {adata.n_vars} genes, "
f"{n_clusters} clusters"
)
#> Python validation: 2638 cells x 13714 genes, 7 clusters
unlink(tmp_h5ad)