Zarr is a directory-based format for chunked, compressed arrays. Unlike HDF5, which packs a dataset into a single file, Zarr stores every chunk as an independent file. This makes Zarr cloud-native: individual chunks can be read directly from S3, GCS, or Azure Blob Storage without downloading the full dataset. Zarr is the storage format used by CELLxGENE, Human Cell Atlas, and scverse streaming pipelines.
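To make the "read only the chunks you need" idea concrete, here is a minimal sketch of how a reader could work out which chunk files a request touches. It assumes the Zarr v2 default of naming chunks by their grid index joined with "." (for example `0.0`, `1.0`). The function name `chunks_for_slice` and the 2638 x 50 array with 1000 x 50 chunks are hypothetical and chosen only for illustration; they are not taken from scConvert.

```python
from itertools import product

def chunks_for_slice(starts, stops, chunk_shape):
    """Return the Zarr v2 chunk keys that overlap a half-open index box.

    starts/stops give the requested range per dimension; chunk_shape is the
    chunk size per dimension. Keys use the default "." separator.
    """
    ranges = []
    for start, stop, size in zip(starts, stops, chunk_shape):
        first = start // size        # grid index of the chunk holding the first element
        last = (stop - 1) // size    # grid index of the chunk holding the last element
        ranges.append(range(first, last + 1))
    return [".".join(map(str, idx)) for idx in product(*ranges)]

# Hypothetical dense 2638 x 50 array stored in 1000 x 50 chunks:
# rows 900..1099 overlap only the first two chunks along the row axis.
keys = chunks_for_slice((900, 0), (1100, 50), (1000, 50))
print(keys)  # ['0.0', '1.0']
```

On object storage, each key would correspond to one object to fetch, so a request for 200 rows here would need two chunk downloads rather than the whole array.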
scConvert reads and writes Zarr v2 stores following the AnnData on-disk specification. Three conversion paths are available:
- scConvert() dispatcher (hub path through Seurat)
- readZarr() / writeZarr() (load/save from R)
- H5ADToZarr() / ZarrToH5AD() (streaming, no Seurat intermediate)

This article demonstrates all three paths on the PBMC 3k dataset (2,638 cells, 13,714 genes) and validates the output with Python zarr and anndata.
scConvert() detects output format from the file
extension and routes through the Seurat hub. This path preserves all
obs, var, obsm, obsp, and uns fields.
input_h5ad <- "../pbmc3k.h5ad"
zarr_path <- file.path(tempdir(), "pbmc3k.zarr")
t0 <- proc.time()
scConvert(input_h5ad, dest = zarr_path, overwrite = TRUE, verbose = FALSE)
elapsed <- (proc.time() - t0)[["elapsed"]]
n_files <- length(list.files(zarr_path, recursive = TRUE))
cat(sprintf("Converted to Zarr: %.2fs | %d chunk files\n", elapsed, n_files))
#> Converted to Zarr: 2.29s | 27 chunk files
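Each chunk file in the Zarr store holds one compressed tile of an array. Because list.files() skips dot-files by default, the count above excludes the Zarr v2 metadata files (.zgroup, .zarray, .zattrs), so it reflects only the chunking strategy applied to X, obsm embeddings, and metadata arrays.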
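A Zarr v2 directory store keeps small JSON metadata files (`.zgroup`, `.zarray`, `.zattrs`) next to the chunk files. The sketch below separates the two kinds when walking a store. It builds a tiny stand-in store in a temporary directory, with a made-up layout and an abbreviated `.zarray`, so it runs without the zarr package; the helper name `count_store_files` is hypothetical.

```python
import json
import os
import tempfile

# File names that Zarr v2 reserves for metadata rather than chunk data.
METADATA_NAMES = {".zgroup", ".zarray", ".zattrs", ".zmetadata"}

def count_store_files(store_path):
    """Count metadata files and chunk files in a Zarr v2 directory store."""
    counts = {"metadata": 0, "chunks": 0}
    for _root, _dirs, files in os.walk(store_path):
        for name in files:
            kind = "metadata" if name in METADATA_NAMES else "chunks"
            counts[kind] += 1
    return counts

# Build a tiny stand-in store (illustrative layout, not scConvert output).
store = tempfile.mkdtemp(suffix=".zarr")
array_dir = os.path.join(store, "obsm", "X_umap")
os.makedirs(array_dir)
for group_dir in (store, os.path.join(store, "obsm")):
    with open(os.path.join(group_dir, ".zgroup"), "w") as fh:
        json.dump({"zarr_format": 2}, fh)
with open(os.path.join(array_dir, ".zarray"), "w") as fh:
    # Abbreviated metadata: a real .zarray also records dtype, compressor, etc.
    json.dump({"zarr_format": 2, "shape": [4, 2], "chunks": [2, 2]}, fh)
for key in ("0.0", "1.0"):
    with open(os.path.join(array_dir, key), "wb") as fh:
        fh.write(b"\x00" * 16)  # placeholder bytes standing in for a chunk

counts = count_store_files(store)
print(counts)  # {'metadata': 3, 'chunks': 2}
```

Pointing `count_store_files()` at a real store path should give the same kind of breakdown, assuming the store uses the v2 directory layout.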
readZarr() reconstructs a Seurat object from the store,
mapping AnnData conventions back to Seurat slots (X->data layer,
raw/X->counts layer, obsm->reductions, obsp->graphs).
pbmc <- readZarr(zarr_path, verbose = FALSE)
cat(sprintf("Loaded: %d cells x %d genes\n", ncol(pbmc), nrow(pbmc)))
#> Loaded: 2638 cells x 13714 genes
cat(sprintf("Reductions: %s\n", paste(names(pbmc@reductions), collapse = ", ")))
#> Reductions: pca, umap
pbmc
#> An object of class Seurat
#> 13714 features across 2638 samples within 1 assay
#> Active assay: RNA (13714 features, 0 variable features)
#> 2 layers present: counts, data
#> 2 dimensional reductions calculated: pca, umap
head(pbmc[[]], 4)
DimPlot(
  pbmc,
  reduction = "umap",
  group.by = "seurat_annotations",
  label = TRUE,
  label.size = 3.5,
  repel = TRUE
) +
  ggtitle("PBMC 3k: cell-type annotations (from Zarr)") +
  theme(plot.title = element_text(hjust = 0.5))
PBMC 3k UMAP coloured by cell-type annotation after Zarr round-trip.
We confirm the Zarr store is readable with Python zarr and anndata.
library(reticulate)
Sys.setenv(NUMBA_THREADING_LAYER = "tbb", OMP_NUM_THREADS = "1")
use_condaenv("scverse")
import zarr
import anndata
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
store = zarr.open(r.zarr_path, mode="r")
print("Top-level Zarr groups:", list(store.keys()))
#> Top-level Zarr groups: ['obsm', 'layers', 'obsp', 'var', 'obs', 'uns', 'X']
adata = anndata.read_zarr(r.zarr_path)
print(adata)
#> AnnData object with n_obs × n_vars = 2638 × 13714
#> obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations', 'percent.mt', 'RNA_snn_res.0.5', 'seurat_clusters'
#> obsm: 'X_pca', 'X_umap'
#> layers: 'data'
#> obsp: 'connectivities', 'distances'
print(f"Validated: {adata.n_obs} cells x {adata.n_vars} genes")
#> Validated: 2638 cells x 13714 genes
print(f"Obs columns: {list(adata.obs.columns)[:5]}")
#> Obs columns: ['orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations', 'percent.mt']
print(f"Obsm keys: {list(adata.obsm.keys())}")
#> Obsm keys: ['X_pca', 'X_umap']
anndata reads the store natively, which indicates that the encoding attributes, chunk shapes, and dtype conventions are compatible with the AnnData Zarr specification.
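The encoding attributes can also be inspected by hand. The AnnData on-disk format tags each element with `encoding-type` and `encoding-version` attributes, which a Zarr v2 directory store keeps in `.zattrs` JSON files. The following is a minimal sketch using only the standard library; the helper name `collect_encodings` is hypothetical, and the two-node store it builds is a stand-in for illustration, not the store written above.

```python
import json
import os
import tempfile

def collect_encodings(store_path):
    """Map each node (path relative to the store root) to its
    (encoding-type, encoding-version) pair, read from .zattrs files."""
    found = {}
    for root, _dirs, files in os.walk(store_path):
        if ".zattrs" not in files:
            continue
        with open(os.path.join(root, ".zattrs")) as fh:
            attrs = json.load(fh)
        if "encoding-type" in attrs:
            rel = os.path.relpath(root, store_path)
            node = "/" if rel == "." else rel.replace(os.sep, "/")
            found[node] = (attrs["encoding-type"], attrs.get("encoding-version"))
    return found

# Stand-in store with a root group and an obs dataframe group.
store = tempfile.mkdtemp(suffix=".zarr")
os.makedirs(os.path.join(store, "obs"))
with open(os.path.join(store, ".zattrs"), "w") as fh:
    json.dump({"encoding-type": "anndata", "encoding-version": "0.1.0"}, fh)
with open(os.path.join(store, "obs", ".zattrs"), "w") as fh:
    json.dump({"encoding-type": "dataframe", "encoding-version": "0.2.0"}, fh)

encodings = collect_encodings(store)
for node, (etype, eversion) in sorted(encodings.items()):
    print(f"{node}: {etype} ({eversion})")
```

Running `collect_encodings()` on a real store path would list the encoding of every tagged group and array, which is a quick way to spot an element that anndata might not recognise.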
H5ADToZarr() streams fields from HDF5 directly into Zarr
chunks without constructing a Seurat object. This is the lowest-memory
conversion path and is appropriate when you need to reformat large
datasets that do not fit in R memory.
zarr2_path <- file.path(tempdir(), "pbmc3k_stream.zarr")
t0 <- proc.time()
H5ADToZarr(input_h5ad, zarr2_path, overwrite = TRUE, verbose = FALSE)
elapsed2 <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("H5ADToZarr (streaming): %.2fs\n", elapsed2))
#> H5ADToZarr (streaming): 0.86s
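The general pattern behind a streaming conversion can be sketched as a block-wise copy in which only one slice of the data is in memory at a time. This is a generic illustration and not a description of how H5ADToZarr() is implemented; the names `row_blocks`, `stream_copy`, `read_block`, and `write_block` are hypothetical, and plain Python lists stand in for the HDF5 source and the Zarr destination.

```python
def row_blocks(n_rows, block_size):
    """Yield half-open (start, stop) row ranges that cover 0..n_rows."""
    for start in range(0, n_rows, block_size):
        yield start, min(start + block_size, n_rows)

def stream_copy(read_block, write_block, n_rows, block_size):
    """Copy rows block by block; return the largest block held in memory."""
    peak = 0
    for start, stop in row_blocks(n_rows, block_size):
        block = read_block(start, stop)  # only this slice is materialised
        write_block(start, block)
        peak = max(peak, len(block))
    return peak

# In-memory stand-ins for the source and destination arrays.
source = list(range(2638))
dest = [None] * len(source)

def read_block(start, stop):
    return source[start:stop]

def write_block(start, block):
    dest[start:start + len(block)] = block

peak = stream_copy(read_block, write_block, len(source), block_size=1000)
print(peak, dest == source)  # 1000 True
```

Choosing the block size to match the destination chunk size along the row axis means each written block fills whole chunks, which avoids rewriting a chunk more than once.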
The symmetric streaming converter ZarrToH5AD() writes a
valid h5ad from any AnnData-formatted Zarr store, including stores
downloaded from cloud sources.
rt_h5ad <- file.path(tempdir(), "pbmc3k_from_zarr.h5ad")
ZarrToH5AD(zarr_path, rt_h5ad, overwrite = TRUE, verbose = FALSE)
rt <- readH5AD(rt_h5ad, verbose = FALSE)
cat(sprintf("Round-trip: %d cells x %d genes\n", ncol(rt), nrow(rt)))
#> Round-trip: 2638 cells x 13714 genes
| Path | Function | Seurat object built | Best for |
|---|---|---|---|
| h5ad -> Zarr (hub) | scConvert() | Yes | Full metadata round-trip |
| h5ad -> Zarr (stream) | H5ADToZarr() | No | Large files, low memory |
| Zarr -> h5ad (stream) | ZarrToH5AD() | No | Cloud-to-HDF5 reformat |
| h5seurat -> Zarr | H5SeuratToZarr() | No | Seurat-native streaming |
| Zarr -> h5seurat | ZarrToH5Seurat() | No | Seurat-native streaming |
unlink(zarr_path, recursive = TRUE)
unlink(zarr2_path, recursive = TRUE)
unlink(rt_h5ad)