Benchmarks

This page documents the benchmarking suite and how to interpret the results. The benchmarks are designed for realistic tractography workloads (HPC scale), not for CI. They focus on file size, throughput, and interactive spatial queries.

Data

All benchmarks use a real dataset from the Human Connectome Project (Young Adult dataset). The original size of the tck file is 18.5GB. The streamlines were generated by tckgen and the data per vertex were generated by tcksift2. This data was converted to TRX format (float16 positions) with nibabel.

TRX size vs streamline count

This benchmark creates subsets of the reference TRX file and measures the on-disk size for different streamline counts. This also calculates the size with and without additional features such as DPS, DPV and groups.

TRX file size vs streamlines — File size (MB) as a function of streamline count.

Translate + stream write throughput

This benchmark simulates loading a TRX file, applying a spatial transform, and saving it to a new TRX file (preserving DPV, DPS, groups, etc.). The workflow is:

Load the input TRX file (decompresses to a temp directory).
Copy all metadata files (header, offsets) and data array directories (dps, dpv, groups) to a fresh output directory using kernel-level filesystem copies (fcopyfile on macOS, sendfile on Linux). The data never passes through the process’s mapped address space.
Release DPS/DPV mmaps immediately after the copy so their pages are not counted against the benchmark’s RSS.
Iterate through every position point, apply a +1 mm translation in x/y/z, and stream the translated positions directly to the output directory.
Pack the output directory into a .trx zip archive using zip_source_file (libzip buffered I/O), which reads files in chunks without mapping them into the process address space.

The reported max_rss_kb is the per-iteration RSS delta — the peak increase in resident memory observed during that single iteration, measured with mach_task_basic_info (macOS) or /proc/self/status VmRSS (Linux). This avoids contamination from previous benchmark iterations whose pages may still be resident.

Translate + stream write time — End-to-end time for translating and rewriting streamlines.

Translate + stream write RSS — Peak RSS delta during translate + stream write. Because DPS/DPV are copied at the filesystem level and their mmaps are released before position processing begins, only the positions chunk buffer (configurable via `TRX_BENCH_CHUNK_BYTES`, default 1 GB) and the output stream contribute to the measured RSS.

Spatial slab query latency

Streamlines need to be visualized. One common method for this is to break the tractogram into spatial slabs and query each slab individually. Both mrview and DSI Studio have methods for showing sliced streamlines in 2D.

The trx-cpp library provides a few methods for subsetting a large set of streamlines. This benchmark uses the Axis-aligned bounding box (AABB) method to create subsets of the tractogram.

After computing the AABBs, it issues 20 spatial queries using 5 mm slabs that sweep through the tractogram volume. Each slab query mimics a GUI slice update and records its timing so distributions can be visualized. To keep results representative of interactive use, each query returns at most 500 randomly sampled streamlines from all those intersecting the slab (configurable via TRX_BENCH_MAX_QUERY_STREAMLINES).

Slab query timings — Distribution of per-slab query latency.

Running the benchmarks

Build into build-release (the default expected by run_benchmarks.sh), then use the helper script to run all groups and generate plots:

cmake -S . -B build-release \
  -DCMAKE_BUILD_TYPE=Release \
  -DTRX_BUILD_BENCHMARKS=ON
cmake --build build-release --target bench_trx_realdata

# Run all benchmark groups (filesize, translate+write, query) in sequence.
./bench/run_benchmarks.sh

# Generate plots into docs/_static/benchmarks.
Rscript bench/plot_bench.R --bench-dir bench --out-dir docs/_static/benchmarks

run_benchmarks.sh accepts the following flags:

Flag	Description
`--reference PATH`	Path to the reference `.trx` file (required).
`--out-dir DIR`	Output directory for JSON results (default: `bench/`).
`--build-dir DIR`	Path to the CMake build directory (default: `build-release`).
`--profile core\|full`	`core` (default) skips DPV and compression for large streamline counts to keep run time manageable; `full` runs every combination.
`--verbose`	Enable per-benchmark progress logging.

Environment variables (set before calling run_benchmarks.sh or the binary directly):

Variable	Description
`TRX_BENCH_BUFFER_MULTIPLIER`	Scale I/O buffer sizes for slow storage (HDD, NFS). Default: `1`.
`TRX_BENCH_CHUNK_BYTES`	Positions chunk size for translate+write (default: `1073741824` = 1 GB).
`TRX_BENCH_MAX_STREAMLINES`	Cap the maximum streamline count tested (default: `10000000`).
`TRX_QUERY_TIMINGS_PATH`	If set, per-slab query latencies are appended as JSONL to this path.
`TRX_RSS_SAMPLES_PATH`	If set, time-series RSS samples during file-size runs are appended as JSONL.

Example — slower storage with a reduced streamline cap:

TRX_BENCH_BUFFER_MULTIPLIER=4 \
TRX_BENCH_MAX_STREAMLINES=5000000 \
  ./bench/run_benchmarks.sh \
    --reference /path/to/reference.trx \
    --out-dir bench/results_hdd

RSS measurement

The max_rss_kb counter in every benchmark reports the per-iteration RSS delta: the maximum increase in resident memory observed during a single benchmark iteration, sampled at the start and end of each iteration. This avoids contamination from earlier iterations whose pages may still be resident in the OS page cache.

On macOS the current RSS is read via mach_task_basic_info.resident_size; on Linux via VmRSS in /proc/self/status. The run_benchmarks.sh script runs each benchmark group (filesize, translate, query) as a separate process to prevent cross-group RSS accumulation.