Changelog
scConvert 0.3.0 (development)
New features
- readZarr() index slicing: obs_idx and var_idx push down to chunk fetches. The caller supplies integer indices (or a logical mask) for cells / features; only the chunks containing those indices are pulled from the store. The pushdown covers obs/var cell-and-feature index reads, obs/var metadata columns, obsm embeddings, obsp graphs, varp matrices, dense X / layers (row + column chunk selection), and sparse X in CSR/CSC via indptr (a small pre-read locates the data ranges for the requested rows/cols; only those blocks are fetched). CSR + column slicing and CSC + row slicing fall back to an in-memory subset of the row/col axis they cannot push down efficiently.
- readZarr() selection API: skip whole AnnData groups you don’t need. New arguments layers, obsm, obsp, varm, varp, uns, and include_x let callers narrow what the reader pulls from the store. Each accepts NULL (read all, current behavior), character(0) (drop the whole group), or a character vector of item names (read only those). include_x = FALSE skips the main expression matrix entirely and returns a Seurat object with a zero-entry placeholder. On remote S3/GCS stores this directly translates to fewer HTTP GETs: a metadata-only read of a 50 GB AnnData-Zarr now fetches only the obs columns, not the X chunks. (Per-cell / per-gene index slicing – obs_idx, var_idx – is the next step and is not in this release.)
- readZarr() for remote URLs is now per-chunk lazy. A new storage abstraction (R/ZarrStore.R) replaces the previous download-then-read mirror. Each chunk is fetched on demand via HTTP GET, with an optional disk cache keyed by URL hash so subsequent calls reuse local copies. The 8 low-level zarr-read helpers in R/AnnDataEncoding.R now accept either a string path or a store object; the 70 existing call sites in the local-fs conversion path are unchanged.
- writeZarr() gains pluggable compression via compressor=. Accepts "zstd", "zlib" (alias "gzip"), "blosc", "none", or an explicit list spec; NULL (default) auto-selects Zstd when a Zstd codec package is on the search path, otherwise zlib (current behavior). .zarr_compress and .zarr_decompress now know the Zstd codec. Caveat: as of 2026 no maintained CRAN Zstd-bytes wrapper exists for R (zstdlite was removed in 2024 for policy violation); the auto default therefore falls through to zlib for most users. The codec framework is wired in so that the default flips automatically when an ecosystem provider is installed. Users who want Zstd today can install zstdlite from source. Blosc remains available via the optional blosc package.
- readSOMA() documented for CELLxGENE Census workflows. The existing cloud-URI support (passed through to tiledbsoma::SOMAExperimentOpen, which uses S3 byte-range requests internally) is now demonstrated with a realistic Census slice in ?readSOMA, with a pointer to the cellxgene.census R package for release-version pinning. An opt-in integration test (SCCONVERT_TEST_SOMA=true) exercises the path end to end. No code change: this codifies and verifies behavior that was already present.
- readZarr() accepts s3:// and gs:// URLs. Public, anonymous buckets only: SigV4 signing for private S3 is not supported. The remote store is mirrored to a local directory before reading (uses the existing local-fs zarr reader). When cache = TRUE (default), the download is kept under tools::R_user_dir("scConvert", "cache") and reused on subsequent calls with the same URL; set cache = FALSE for a tempdir that is discarded with the R session. Requires httr and xml2 in Suggests (installed on demand).
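The chunk pushdown and the indptr pre-read described for readZarr() above can be illustrated with a small, language-neutral sketch. The Python below is not scConvert's R implementation: the function names chunks_for_indices and csr_row_ranges, the chunk length, and the toy indptr are all hypothetical, and indices are 0-based as in the on-disk layout.

```python
def chunks_for_indices(indices, chunk_len):
    """Return the sorted chunk numbers that hold the requested 0-based indices."""
    return sorted({i // chunk_len for i in indices})


def csr_row_ranges(indptr, rows):
    """Map each requested row to its [start, end) range in the CSR data/indices arrays."""
    return [(indptr[r], indptr[r + 1]) for r in rows]


# Hypothetical store: 10 cells split into chunks of 4 cells (chunks 0, 1, 2).
obs_idx = [1, 2, 9]
needed = chunks_for_indices(obs_idx, chunk_len=4)
# Only chunks 0 and 2 hold the requested cells; chunk 1 is never fetched.

# Hypothetical CSR indptr for 4 rows holding 2, 0, 3, and 1 stored values.
indptr = [0, 2, 2, 5, 6]
ranges = csr_row_ranges(indptr, rows=[0, 2])
# Each (start, end) pair names the slice of data/indices to fetch for that row.
```

The same indptr lookup cannot serve the opposite axis (columns of a CSR matrix), which is consistent with the in-memory fallback the entry describes for CSR + column and CSC + row slicing.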
Architectural changes
- Cross-format converters no longer detour through .h5seurat. H5MUToH5AD, H5ADToH5MU, LoomToH5AD, and H5ADToLoom previously routed through a temporary .h5seurat file with no streaming benefit – both branches materialized the Seurat object via readH5Seurat(tmp) or readH5AD(source) anyway. The stream parameter has been removed from these four functions; each now reads the source into a Seurat object and writes the destination directly. LoomToH5MU and H5MUToLoom keep their .h5seurat intermediate, which is genuine HDF5 chunk streaming and avoids materializing large objects.
- .h5ad_loader no longer writes a temporary .h5seurat. The hub h5ad loader used by scConvert() for cross-format conversions (e.g. scConvert("x.h5ad", "x.rds")) now calls readH5AD() directly instead of converting h5ad to h5seurat first. The standalone readH5AD() was already direct; only hub-mediated paths went via h5seurat. Removes one full HDF5 read+write cycle per cross-format conversion from h5ad.
scConvert 0.2.1
New features
- C CLI binary now ships on Windows. Previous releases built and bundled scconvert (Linux + macOS) only, and Windows users silently fell through to the slower R streaming path. The package’s CI now compiles a Windows scconvert.exe against MSYS2’s ucrt64 toolchain and bundles the four required runtime DLLs (hdf5.dll, zlib1.dll, libgcc_s_seh-1.dll, libwinpthread-1.dll) alongside the exe in inst/bin/. Windows resolves implicit-load DLLs from the exe’s own directory first, so users do not need MSYS2 or any HDF5 installation. Adds roughly 7 MB to the installed package on Windows.
Bug fixes
- readH5AD() no longer aborts on close_all() errors. The Windows CRAN binary of hdf5r 1.3.12 bundles HDF5 1.12.1, which refuses to close a file when any leaf-object ID is still held by the reference counter. The error propagated out of on.exit(), surfaced as scConvert_cli(h5ad -> rds) returning FALSE, and broke a handful of Windows test expectations. The three close_all() sites in R/LoadH5AD.R are now wrapped in tryCatch(..., error = function(e) NULL), matching the pattern already used in R/WriteH5AD.R. The read has fully completed by the time close_all() runs, so cleanup is best-effort.
- .h5ad_loader and readH5Seurat.character close_all wrapped. The previous fix covered readH5AD() itself, but the hub h5ad loader (R/zzz.R:.h5ad_loader) and the h5Seurat character-method reader (R/LoadH5Seurat.R:readH5Seurat.character) still had unwrapped close_all() calls. The hub h5ad loader is what powers scConvert(<file.h5ad>, dest = "<file.rds>") and the scConvert_cli(h5ad -> rds) fallback path. Both sites are now wrapped to match the LoadH5AD pattern; cleanup remains best-effort.
- CI: skip h5ad -> rds integration tests on Windows HDF5 1.12.x. Wrapping close_all() removes the explicit close error, but on the CRAN Windows hdf5r 1.3.12 binary (HDF5 1.12.1) the per-ID R6 finalizer also fires at GC via private$closeFun(id) and reports H5Fclose: decrementing file ID failed. These finalizer errors run outside any calling context and cannot be caught by tryCatch; R CMD check surfaces them as test errors. The two tests in tests/testthat/test-cli-integration.R that unavoidably route through the hdf5r R-hub path (scConvert_cli: h5ad -> rds and the h5ad -> rds -> h5ad roundtrip; rds is not in cli_formats, so the C binary can’t help) now skip_if when running on Windows with HDF5 1.12.x. Linux and macOS CI (HDF5 1.14+) keep full coverage.
- C reader: respect source string encoding (HDF5 >= 2.0 compat). sc_get_str_attr, sc_get_str_array_attr, sc_copy_group_attrs, the copy_attr_cb in sc_dataframe.c, the h5seurat-factor and h5ad-categorical readers, and the group-copy attribute paths in sc_groups.c all forced sc_create_vlen_str_type() (UTF-8 vlen) as the H5Aread/H5Dread memtype. hdf5r writes vlen strings with CSET=ASCII; HDF5 >= 2.0 refuses the implicit ASCII<->UTF-8 conversion and the read silently fails. Symptom: on HDF5 2.x, readH5AD() dropped every obs metadata column past nFeature_RNA because the column-order attribute read returned all empty strings. Fix: use the source attribute / dataset’s own datatype as the memtype and check H5Aread / H5Dread return values explicitly. CI did not catch this because the CI matrix uses HDF5 1.14, which permits the implicit conversion.
- writeH5AD: normalize obsm embedding orientation. Both .writeH5AD_c and DirectSeuratToH5AD() now defensively transpose reduction embeddings whose first dim doesn’t match n_cells, and skip with a warning if neither dim matches. Prevents the C writer from emitting obsm datasets with HDF5 shape (n_dims, n_cells), which fails anndata.read_h5ad(). Test: tests/testthat/test-regression-fixes.R asserts f['obsm/X_pca'].shape == (n_cells, n_dims) via h5py for both writer paths.
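The obsm orientation rule in the last entry (keep, transpose, or skip with a warning) is easy to state as code. The following is a minimal Python sketch under stated assumptions, not the package's R or C code: normalize_embedding is a hypothetical name, and the embedding is modelled as a plain list of rows.

```python
import warnings


def normalize_embedding(emb, n_cells):
    """Return emb with cells on the first axis, or None if no axis matches n_cells."""
    n_rows = len(emb)
    n_cols = len(emb[0]) if n_rows else 0
    if n_rows == n_cells:
        return emb  # already (n_cells, n_dims)
    if n_cols == n_cells:
        # Stored as (n_dims, n_cells): transpose so that rows are cells.
        return [list(col) for col in zip(*emb)]
    warnings.warn("embedding skipped: neither dimension matches n_cells")
    return None


# Hypothetical 2-dimensional embedding of 3 cells, stored the wrong way round.
flipped = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fixed = normalize_embedding(flipped, n_cells=3)
```

Checking the first axis before the second means a square embedding is left untouched, which is the conservative choice when orientation cannot be inferred from shape alone.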
scConvert 0.2.0
Release Date: 2026-05-04
Breaking changes
- HDF5 1.14+ required. SystemRequirements bumped from HDF5 (>= 1.10.0) to HDF5 (>= 1.14.0). The hdf5r close path segfaults on libhdf5 1.10.x when closing files with many open child IDs (groups/datasets opened via [[]] subsetting); this is upstream and not catchable in R. Modern Linux distros (Ubuntu 24.04+, Debian 12+, Fedora 38+), recent macOS packages, and Bioconductor’s Rhdf5lib all ship 1.14+, so this matches real deployments.
New features
Existing native readers (carried from development)
- Native Stereo-seq GEF reader (LoadStereoSeqGef()). Pure-R reader for BGI .gef and .cellbin.gef files using hdf5r only. Handles both the square-bin and cell-bin schemas documented by STOmics. No Python, stereopy, or reticulate dependency. Spot coordinates are stored in meta.data$spatial_x/y and misc$spatial_technology = "StereoSeq".
- Native CosMx SMI reader (LoadCosMx()). Thin R wrapper around Seurat::LoadNanostring() that validates the canonical flat-file bundle (*exprMat*.csv, *metadata*.csv, *fov_positions*.csv, *tx_file*.csv) and tags the result with misc$spatial_technology = "CosMx". No squidpy or reticulate dependency.
- CLI auto-delegation for vendor raw formats. The scconvert C binary now auto-detects .gef, .cellbin.gef, and CosMx bundle directories and transparently delegates to the R backend via Rscript + execvp(). Users can write scconvert mosta.gef mosta.h5ad directly. Rscript lookup happens up-front; paths are absolutised; no shell-parsed command strings.
- FOV round-trip through h5ad. writeH5AD() now serializes FOV@boundaries and FOV@molecules into a stable uns/spatial/{library}/segmentation/ and uns/spatial/{library}/molecules/ contract. readH5AD() automatically rebuilds any FOV library it finds on load. Backward-compatible with squidpy and scanpy (they ignore unknown uns/spatial/{lib}/ children).
- CLI varp preservation. The new sc_stream_varp() in src/sc_groups.c mirrors sc_stream_obsp() and maps /varp/ (h5ad) to /misc/__varp__/ (h5seurat). Handles both sparse (CSR/CSC group) and dense (array dataset) varp entries. Closes the manuscript limitation “CLI does not preserve varp”.
- Loom factor-level preservation. writeLoom() now stores factor levels and ordered flag under /scConvert_extensions/col_factor_levels/{name}, outside col_attrs so loompy/scanpy continue to read the file without errors. readLoom() restores the factors on load.
- h5mu per-modality uns round-trip. writeH5MU() and readH5MU() now mirror each modality’s uns group under obj@misc[["__h5mu_uns_per_mod__"]][[modality]] so per-modality uns entries survive h5mu round-trips instead of being flattened.
- IMC multi-image support. SeuratSpatialToH5AD() now iterates over every image in deterministic sorted order instead of processing only Images()[1]. Fixes the IMC 14/15 -> 11/13 double-roundtrip degradation documented in dev/NOTES.md section 3.
P0 robustness fixes (Codex review response, Part A)
- SOMA / SpatialData generic dispatch. Lambdas registered for the scConvert generic now accept filename = so scConvert.character() reaches the right method. Adds tests/testthat/test-generic-dispatch.R.
- CLI build hygiene. CLI .o files are isolated to src/cli_obj/; make -f Makefile.cli install-bin copies the binary to inst/bin/scconvert; sc_find_cli() prefers it over the source-tree src/scconvert.
- Bounded-memory R sparse streaming. .stream_sparse_group() now reads in 64 MiB chunks (tunable via options("scConvert.stream_chunk_bytes")) instead of materialising whole sparse matrices. Adds tests/testthat/test-stream-memory.R.
- Canonical h5mu layout on write. writeH5MU() now writes the top-level /var (concat of modalities), /obsmap/{mod} (always 0..n-1 for Seurat sources), and /varmap/{mod} (block-diagonal with -1 sentinels) in the muon convention.
- Atomic SOMA / SpatialData writes. Both writers now build under a sibling temp name and rename on success, so a mid-write crash leaves the user’s existing path untouched. (writeSpatialData() initially shipped with an on.exit that deleted its own freshly-renamed output; fixed on 2026-05-01 via a write_succeeded disarm flag.)
- Python-validation tests in CI. tests/scverse-env.yml + setup-micromamba make tests/testthat/test-python-validation.R runnable on the GitHub runner. The test no longer hardcodes the macOS conda path.
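The build-under-a-sibling-name-then-rename pattern from the atomic-writes entry can be sketched as follows. This is an illustrative Python version for a single file, whereas the SOMA and SpatialData outputs are written by the package's R code. The names atomic_write and write_fn are hypothetical, and the succeeded flag only loosely mirrors the write_succeeded disarm flag mentioned above.

```python
import os
import tempfile


def atomic_write(dest, write_fn):
    """Write via a sibling temp file and rename it over dest only on success."""
    dest = os.path.abspath(dest)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(
        prefix=os.path.basename(dest) + ".", suffix=".tmp", dir=os.path.dirname(dest)
    )
    os.close(fd)
    succeeded = False
    try:
        write_fn(tmp)          # may raise part-way through
        os.replace(tmp, dest)  # swap the finished file into place
        succeeded = True
    finally:
        # Cleanup is disarmed on success so it can never touch the renamed output.
        if not succeeded and os.path.exists(tmp):
            os.remove(tmp)
```

If write_fn raises, the exception propagates, the partial temp file is removed, and whatever was at dest before the call is left as it was.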
Robustness
- C CLI memory-safety helpers. New sc_xmalloc(), sc_xcalloc(), sc_xrealloc(), and sc_check_mul_size() in src/sc_util.c replace raw malloc() calls with overflow-checked allocations at the dense-embedding transpose sites in src/sc_zarr.c:1043, 1668, 1674 and the column-buffer allocation in src/sc_loom.c:138. Prevents SIZE_MAX overflow on chip-scale embeddings.
- Defensive close_all wrap on direct-path conversion. R/Convert.R wraps hfile$close_all() in tryCatch for HDF5 1.10.x graceful degradation. (1.14+ is required and tested; the wrap is a no-op there.)
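The overflow check behind an allocation size such as n_cells * n_dims * sizeof(element) can be modelled in a few lines. The real helpers are C functions whose signatures are not given in this changelog, so the Python below is only a model of the arithmetic: check_mul_size is a hypothetical stand-in, and SIZE_MAX assumes a 64-bit size_t.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def check_mul_size(a, b):
    """Return a * b, raising OverflowError if the product would exceed SIZE_MAX."""
    if a < 0 or b < 0:
        raise ValueError("sizes must be non-negative")
    # Divide instead of multiplying: in C the product itself would already wrap.
    if a != 0 and b > SIZE_MAX // a:
        raise OverflowError("allocation size overflows size_t")
    return a * b


# Hypothetical dense embedding: 1,000,000 cells x 50 dims x 8-byte doubles.
n_bytes = check_mul_size(check_mul_size(1_000_000, 50), 8)
```

Python integers never wrap, so the comparison is the point of the sketch: the test b > SIZE_MAX // a is the form that stays valid in fixed-width arithmetic.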
Bug fixes
- readH5AD(): handle unsorted CSR column indices. scanpy-processed files such as scanpy.datasets.pbmc3k_processed() ship with CSR matrices whose column indices are not sorted within each row. Previously these produced an invalid dgCMatrix on read. readH5AD() and readH5MU() now detect this condition via .sort_dgc_indices() in R/LoadH5AD.R (and the sibling helper in R/LoadH5MU.R) and sort indices column-wise before constructing the sparse matrix. A regression test has been added in tests/testthat/test-regression-fixes.R.
- H5SeuratToZarr(): do not crash on 1D dense datasets. The direct h5Seurat -> Zarr converter previously attempted to infer (rows, cols) from a 1D dense HDF5 dataset and crashed with a dimensionality error. It now emits a warning and skips the offending dataset. Tracked as “Chain D” in the benchmark manuscript; regression test included.
- writeZarr(): skip scale.data layer when its shape differs from X. Seurat::ScaleData() produces an (n_hvg x n_cells) matrix whose row count does not match X’s n_genes. AnnData requires all layers/* to match X’s shape, so writeZarr() now skips any layer whose dimensions differ from the default assay’s data matrix. A regression test covers this case. (See R/SaveZarr.R:131.)
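Sorting the indices inside each indptr segment, as the first fix above describes, amounts to a per-segment sort that keeps every value paired with its index. The sketch below is a plain-Python illustration with hypothetical names and toy data. It is not the package's .sort_dgc_indices() helper, whose internals are not shown in this changelog.

```python
def sort_csr_indices(indptr, indices, data):
    """Sort the indices within each indptr segment, moving each value with its index."""
    sorted_indices = list(indices)
    sorted_data = list(data)
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        # Pair every index with its value so that both are reordered together.
        pairs = sorted(zip(indices[start:end], data[start:end]), key=lambda p: p[0])
        for offset, (col, value) in enumerate(pairs):
            sorted_indices[start + offset] = col
            sorted_data[start + offset] = value
    return sorted_indices, sorted_data


# Toy matrix with 2 rows; the column indices are unsorted inside both rows.
indptr = [0, 3, 5]
indices = [2, 0, 1, 3, 1]
data = [30, 10, 20, 5, 4]
new_indices, new_data = sort_csr_indices(indptr, indices, data)
```

indptr is unchanged by the sort because each segment keeps its length; only the order inside a segment moves.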
Testing
- Added tests/testthat/test-regression-fixes.R with three regression tests pinning the bugs listed above.
- Added tests/testthat/test-generic-dispatch.R (SOMA/SpatialData lambda signature pinning).
- Added tests/testthat/test-stream-memory.R (bounded-memory verification on tiny chunk budgets).
- Appended a canonical-h5mu-layout block to tests/testthat/test-h5mu-multimodal.R.
- Test suite is now 166 test_that blocks with 773 assertions on macOS / Ubuntu / Windows (was 137 / 448 at 0.1.0).
- 19 vignettes build cleanly under tools::buildVignettes() including the 6 with live Python interop chunks via reticulate.
scConvert 0.1.0
Release Date: 2026-03-10
Highlights
Initial public release of scConvert — a universal single-cell format converter for R.
Universal Format Conversion
- Support for 7 formats: h5ad, h5Seurat, h5mu, Loom, Zarr, RDS, and SingleCellExperiment
- Hub architecture with 30+ conversion paths via scConvert()
- Direct HDF5 paths for h5ad/h5Seurat without intermediate loading
Direct h5ad Loading
- readH5AD() for native h5ad-to-Seurat conversion without intermediate files
- Sparse (CSR/CSC) and dense matrix support
- Categorical metadata, dimensional reductions, neighbor graphs, and spatial data
MuData (h5mu) Multimodal Support
- readH5MU() / writeH5MU() for multimodal single-cell data
- Automatic modality-to-assay name mapping (rna->RNA, prot->ADT, atac->ATAC)
- No MuDataSeurat or Python dependency required
Zarr AnnData Support
- readZarr() and writeZarr() for Zarr-based AnnData stores (v2 format)
- Sparse CSR/CSC and dense matrix support
- Categorical metadata and dimensional reduction preservation
Spatial Data (Visium)
- Bidirectional Visium spatial data conversion with image reconstruction
- Proper coordinate handling and scale factor preservation
- Compatible with scanpy/squidpy spatial analysis workflows
C CLI Binary
- Standalone scconvert binary for h5ad/h5Seurat/h5mu conversions
- Streaming on-disk conversion without R or Python runtime
- Options: --assay, --gzip, --overwrite, --quiet