Single-cell atlases routinely contain hundreds of thousands to millions of cells. Loading such files into R as Seurat objects can take minutes and consume multiple gigabytes of RAM. The scConvert CLI binary sidesteps this bottleneck by copying HDF5 chunks directly between formats, bypassing R object construction entirely. This vignette demonstrates end-to-end conversion of the Tabula Sapiens immune cell atlas (592K cells) and compares CLI throughput against the R API.
The Tabula Sapiens immune subset is a 592K-cell h5ad file containing T cells, B cells, myeloid cells, and other immune populations sampled across 24 human tissues.
ts_path <- "/Users/miana/Desktop/scConvert-manuscript/data/tier3/10x_3__v3/tabula_sapiens_immune.h5ad"
file_gb <- round(file.size(ts_path) / 1e9, 2)
# Read only obs to get cell count without loading the full expression matrix
obs_info <- scConvert::readH5AD_obs(ts_path)
n_cells <- nrow(obs_info)
cat(sprintf("Dataset: %d cells | File size: %.2f GB\n", n_cells, file_gb))
#> Dataset: 592317 cells | File size: 19.77 GB
scConvert ships a compiled C binary (inst/bin/scconvert) that transfers HDF5 chunks without staging data through R memory. It handles sparse matrices, dense matrices, obs/var metadata, embeddings, and graphs.
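# Locate the bundled binary; system.file() returns "" when it is not installed
cli_bin <- system.file("bin", "scconvert", package = "scConvert")
# Treat a missing or implausibly small file (10,000 bytes or less) as unavailable
has_cli <- file.exists(cli_bin) && file.info(cli_bin)$size > 10000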
cat(sprintf("CLI binary: %s (%s KB)\n",
cli_bin,
round(file.size(cli_bin) / 1024)))
#> CLI binary: (NA KB)
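The availability check above matters because an empty path and an NA size, as in the output shown, mean the binary was not found. A minimal Python sketch of the same guard follows, for pipelines that call the binary from outside R. The function name `cli_available` and the `min_bytes` parameter are illustrative assumptions, not part of scConvert; the 10,000-byte default mirrors the threshold used in the R chunk.

```python
import os
import sys


def cli_available(path, min_bytes=10_000):
    """Return True if `path` is a regular, executable file larger than min_bytes.

    Hypothetical helper: mirrors the R check `file.exists() && size > 10000`
    and additionally requires the execute permission bit.
    """
    if not path:
        # An empty path is what a failed lookup looks like.
        return False
    return (
        os.path.isfile(path)
        and os.access(path, os.X_OK)
        and os.path.getsize(path) > min_bytes
    )


# Usage: an empty or nonexistent path is rejected; the running Python
# interpreter serves here as a stand-in for a real executable.
print(cli_available(""))
print(cli_available("/nonexistent/scconvert"))
print(cli_available(sys.executable, min_bytes=0))
```

Checking availability before timing keeps a missing binary from surfacing later as a confusing error in the conversion step.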
out_h5seurat <- file.path(tempdir(), "tabula_sapiens.h5seurat")
t0 <- proc.time()
system2(cli_bin, args = c(ts_path, out_h5seurat, "--overwrite"), stdout = TRUE)
cli_time <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("CLI: %.2fs for %d cells (%.0f cells/second)\n",
cli_time, n_cells, n_cells / cli_time))
cat(sprintf("Output: %.1f MB\n", file.size(out_h5seurat) / 1e6))
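The same wall-clock measurement can be taken from a Python driver script. The sketch below times an arbitrary external command and derives a cells-per-second figure. The helpers `time_command` and `cells_per_second` are hypothetical names introduced for illustration, and the timed command is a placeholder (the Python interpreter doing nothing); in practice the argument list would be the scconvert path followed by the input file, output file, and `--overwrite`, as in the R call above.

```python
import subprocess
import sys
import time


def time_command(args):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    completed = subprocess.run(args, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode


def cells_per_second(n_cells, elapsed):
    """Throughput in cells per second; rejects non-positive durations."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return n_cells / elapsed


# Placeholder command standing in for the real CLI invocation.
elapsed, code = time_command([sys.executable, "-c", "pass"])
print(f"exit code {code}, {elapsed:.3f}s elapsed")
```

`time.perf_counter()` measures elapsed wall-clock time, which corresponds to the `elapsed` component of `proc.time()` used in the R chunk.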
The R API loads the full expression matrix into a Seurat object. Timing the same h5ad-to-h5seurat conversion through a single scConvert() call gives a like-for-like comparison with the CLI run above.
out_r <- file.path(tempdir(), "tabula_sapiens_r.h5seurat")
t0 <- proc.time()
scConvert(ts_path, dest = out_r, overwrite = TRUE)
r_time <- (proc.time() - t0)[["elapsed"]]
cat(sprintf("R API: %.1fs\n", r_time))
cat(sprintf("Speedup: %.0fx\n", r_time / cli_time))
Because the CLI output is a valid h5seurat file, you can load only the cell metadata (no expression matrix) for rapid exploration.
# Load metadata only - skips the sparse count matrix
meta_only <- readH5Seurat(out_h5seurat, assays = FALSE)
cat(sprintf("Cell types: %d unique\n",
length(unique(meta_only$cell_type))))
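# Count cells per type, largest group first
df <- sort(table(meta_only$cell_type), decreasing = TRUE)
df <- data.frame(cell_type = names(df), n = as.integer(df))
# Reverse the factor levels so the largest group is drawn at the top of the y axis
df$cell_type <- factor(df$cell_type, levels = rev(df$cell_type))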
ggplot(df[seq_len(min(15L, nrow(df))), ],
aes(x = n, y = cell_type)) +
geom_col(fill = "#4E79A7") +
labs(x = "Cells", y = NULL,
title = "Top cell types (Tabula Sapiens immune)") +
theme_classic(base_size = 11) +
theme(axis.text = element_text(color = "black"),
axis.title = element_text(color = "black"))
The original h5ad file is read directly with anndata to confirm that its cell count matches the 592,317 cells read from obs above, and to report the number of genes and cell types.
import anndata
adata = anndata.read_h5ad(r.ts_path)
print(f"Tabula Sapiens: {adata.n_obs} cells x {adata.n_vars} genes")
#> Tabula Sapiens: 592317 cells x 60606 genes
print(f"Cell types: {adata.obs['cell_type'].nunique()}")
#> Cell types: 45
The table below places the Tabula Sapiens result in context across dataset sizes. Values for the three smaller datasets are approximate; the Tabula Sapiens row is filled from n_cells, cli_time, and r_time as measured above.
perf <- data.frame(
Dataset = c("pbmc_small", "PBMC 3k", "CRC (real)", "Tabula Sapiens immune"),
Cells = c("230", "2,700", "~50K", format(n_cells, big.mark = ",")),
CLI_time = c("~0.02s", "~0.04s", "~0.3s",
sprintf("%.2fs", cli_time)),
R_API_time = c("~0.3s", "~0.6s", "~8s",
sprintf("%.1fs", r_time)),
Speedup = c("~15x", "~15x", "~27x",
sprintf("~%.0fx", r_time / cli_time))
)
knitr::kable(perf,
col.names = c("Dataset", "Cells", "CLI time",
"R API time", "Speedup"),
align = c("l", "r", "r", "r", "r"))
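The Speedup column is the R API time divided by the CLI time, rounded to a whole number. The short Python sketch below re-derives that column for the three reference rows from the approximate times given in the table. The helper name `speedup_label` is an illustrative assumption, and the inputs are the rounded table values rather than fresh measurements.

```python
# Approximate timings copied from the reference rows of the table:
# (dataset, CLI seconds, R API seconds).
reference_rows = [
    ("pbmc_small", 0.02, 0.3),
    ("PBMC 3k", 0.04, 0.6),
    ("CRC (real)", 0.3, 8.0),
]


def speedup_label(cli_seconds, r_seconds):
    """Format the R-API-to-CLI time ratio the way the table does, e.g. '~15x'."""
    if cli_seconds <= 0:
        raise ValueError("cli_seconds must be positive")
    return f"~{round(r_seconds / cli_seconds)}x"


labels = {name: speedup_label(cli, r) for name, cli, r in reference_rows}
for name, label in labels.items():
    print(f"{name}: {label}")
```

The results (~15x, ~15x, ~27x) agree with the Speedup column, so that column is consistent with the listed times.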
# Remove the temporary output files written by the CLI and R API runs
unlink(c(out_h5seurat, out_r))