A command-line utility for working with BINSEQ files.
bqtools provides tools to encode, decode, manipulate, and analyze BINSEQ files.
It supports all BINSEQ variants (*.bq, *.cbq, *.vbq) and makes use of the binseq library.
BINSEQ is a binary file format family designed for high-performance processing of DNA sequences. It currently has three variants: BQ, VBQ, and CBQ.
- BQ (*.bq): Optimized for fixed-length DNA sequences without quality scores (2bit/4bit).
- VBQ (*.vbq): Optimized for variable-length DNA sequences with optional quality scores and headers (2bit/4bit).
- CBQ (*.cbq): Optimized for variable-length DNA sequences with optional quality scores and headers (2bit + N).
All support single and paired sequences and make use of two-bit or four-bit encoding for efficient nucleotide packing using bitnuc and efficient parallel FASTX processing using paraseq.
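To make the packing claim concrete, here is a back-of-the-envelope sketch of per-read sequence storage at 2 and 4 bits per base. The helper `packed_bytes` is hypothetical (it is not part of bqtools), and the estimate ignores file headers, quality scores, padding, and any word alignment the real formats apply.

```shell
# Hypothetical helper: minimum bytes needed to hold LEN bases at BITS bits per base.
# Ignores file headers, padding, and word alignment used by the actual formats.
packed_bytes() {
  len=$1
  bits=$2
  echo $(( (len * bits + 7) / 8 ))
}

packed_bytes 150 2   # 150-base read at 2 bits per base
packed_bytes 150 4   # same read at 4 bits per base
```

For a 150-base read this gives 38 bytes at 2-bit and 75 bytes at 4-bit, against 150 bytes as plain ASCII.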
For more information about BINSEQ, see our paper where we describe the format family, applications, and benchmark against other sequencing formats.
TL;DR:
*.cbq is the recommended format for most applications.
For most applications the BINSEQ variant of choice is *.cbq.
This format is lossless by default and supports variable-length sequences.
It achieves better compression than *.vbq and *.bq by using blocked-columnar compression of sequence attributes.
It can optionally exclude quality scores and headers (but they are included by default).
For an overview of the format check out the BINSEQ docs.
If your application only requires sequences and has fixed-length reads then *.bq is the best choice.
It is the fastest variant but is lossy by design.
Note:
*.vbq was originally designed for variable-length sequences with quality scores and headers, but it is now deprecated in favor of *.cbq, which is more compressible, lossless, and has faster decoding.
- Encode: Convert FASTA or FASTQ files to a BINSEQ format
- Decode: Convert a BINSEQ file back to FASTA, FASTQ, or TSV format
- Cat: Concatenate multiple BINSEQ files
- Info: Show information and statistics about a BINSEQ file.
- Grep: Search for fixed-string, regex, or fuzzy matches in BINSEQ files.
- Split: Split a BINSEQ file into multiple files based on matching patterns.
- Pipe: Create named pipes for efficient data processing with legacy tools that don't support BINSEQ, optionally spawning and supervising the consumer commands directly (-x/-X).
bqtools can be installed using cargo, the Rust package manager:
cargo install bqtools
To install cargo you can follow the instructions on the official Rust website.
# Clone the repository
git clone https://github.com/arcinstitute/bqtools.git
cd bqtools
# Install
cargo install --path .
# Check installation
bqtools --help
bqtools supports the following feature flags:
- htslib: Enable support for reading SAM/BAM/CRAM files using the htslib library (default).
- gcs: Enable support for reading Google Cloud Storage files.
- fuzzy: Enable fuzzy matching in the grep command using the sassy library.
To enable fuzzy matching, bqtools must be compiled using a native target cpu:
# Install from source
cargo install --path . -F fuzzy;
# Or install from crates but enforce native target cpu
export RUSTFLAGS="-C target-cpu=native"; cargo install bqtools -F fuzzy;
To selectively enable/disable feature flags:
# (for fuzzy matching support sassy requires native target cpu)
export RUSTFLAGS="-C target-cpu=native";
# Install bqtools without htslib/gcs but with fuzzy matching
cargo install bqtools --no-default-features -F fuzzy
#
# Install bqtools without htslib but with fuzzy matching and gcs
cargo install bqtools --no-default-features -F fuzzy,gcs
# Get help information
bqtools --help
# Get help for specific commands
bqtools encode --help
bqtools decode --help
bqtools cat --help
bqtools info --help
bqtools grep --help
bqtools split --help
bqtools pipe --help
bqtools qc --help
bqtools accepts input from stdin or from file paths.
It will auto-determine the input format and compression status.
Convert FASTA/FASTQ files to BINSEQ:
# Encode a single file to bq
bqtools encode input.fastq -o output.bq
# Encode a single file to vbq
bqtools encode input.fastq -o output.vbq
# Encode a single file to vbq with 4bit encoding
bqtools encode input.fastq -o output.vbq -S4
# Encode a file stream to bq (auto-determine input format and compression status)
/bin/cat input.fastq.zst | bqtools encode -o output.bq
# Encode paired-end reads
bqtools encode input_R1.fastq input_R2.fastq -o output.bq
# Encode paired-end reads to vbq
bqtools encode input_R1.fastq input_R2.fastq -o output.vbq
# Encode a SAM/BAM/CRAM file to BINSEQ
bqtools encode input.bam -fb -o output.bq
# Encode a paired-end CRAM file to BINSEQ (sorted by read name)
bqtools encode input.paired.cram -I -fb -o output.vbq
# Specify a policy for handling non-ATCG nucleotides (2-bit only)
bqtools encode input.fastq -o output.bq -p r # Randomly draw A/C/G/T for each N
# Set threads for parallel processing
bqtools encode input.fastq -o output.bq -T 4
# Include sequencing headers in the encoding (unused by .bq)
bqtools encode input.fastq -o output.vbq -H
# Encode with ARCHIVE mode (useful for genomes, cDNA libraries, and larger sequences)
# where there are common Ns, large sequence sizes, and headers are important
bqtools encode input.fasta -o output.vbq -A
Available policies for handling non-ATCG nucleotides:
- i: Ignore sequences with non-ATCG characters
- p: Break on invalid sequences
- r: Randomly draw a nucleotide for each N (default)
- a: Set all Ns to A
- c: Set all Ns to C
- g: Set all Ns to G
- t: Set all Ns to T
Note: These are only applied when encoding with 2-bit.
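Because the policy only matters for records that contain non-ATCG characters, it can help to count them before choosing one. The following is a minimal sketch using awk, assuming an uncompressed FASTQ with exactly four lines per record and upper-case bases. The helper `count_non_acgt` and the sample reads are illustrative and not part of bqtools.

```shell
# Hypothetical helper: count FASTQ records whose sequence line contains
# anything other than upper-case A/C/G/T. Assumes strict 4-line FASTQ records.
# Reads the files given as arguments, or stdin when none are given.
count_non_acgt() {
  awk 'NR % 4 == 2 && /[^ACGT]/ { n++ } END { print n + 0 }' "$@"
}

# Three made-up records; only the second contains an N.
printf '@r1\nACGT\n+\nIIII\n@r2\nACNT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' | count_non_acgt
```

If the count is zero, the choice of policy has no effect on the encoded output.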
Encoding FASTX files into BINSEQ is often IO-bound per-file and won't benefit much from parallelism.
However, file-level parallelism is still possible.
bqtools provides some options for making use of file-level parallelism by encoding into separate BINSEQ files or encoding many FASTX files into a single BINSEQ file.
bqtools will automatically find the pairs in the input files and respect pairing if the --paired flag is used.
To encode everything into a single BINSEQ file you can use the --collate flag.
# encodes all FASTX files into separate BINSEQ files
bqtools encode /path/to/fastx/*.fastq.gz
# encodes all paired FASTX files into separate paired-BINSEQ files
bqtools encode /path/to/fastx/*.fastq.gz --paired
# encodes all FASTX files into a single BINSEQ file
bqtools encode /path/to/fastx/*.fastq.gz -o some.vbq --collate
# encodes all FASTX files into a single paired-BINSEQ file
bqtools encode /path/to/fastx/*.fastq.gz -o some.vbq --collate --paired
You might have a directory or nested subdirectories with multiple FASTX files or FASTX file pairs.
bqtools makes use of the efficient walkdir crate to recursively identify all FASTX files with various compression formats.
It will then balance the provided file/file pairs among the thread pool to ensure efficient parallel encoding.
All options provided by bqtools encode will be passed through to the sub-encoders.
# Encode all FASTX files as CBQ
bqtools encode --recursive --mode cbq ./
# Encode all paired FASTX files as VBQ
bqtools encode --recursive --paired --mode vbq ./
# Encode as BQ recursively with a max-subdirectory depth of 2
bqtools encode --recursive --mode bq --depth 2 ./
Convert BINSEQ files back to FASTA/FASTQ/TSV:
# Decode to FASTQ (default)
bqtools decode input.bq -o output.fastq
# Decode to compressed FASTQ (gzip/zstd)
bqtools decode input.bq -o output.fastq.gz
bqtools decode input.bq -o output.fastq.zst
# Decode to FASTA
bqtools decode input.bq -o output.fa -f a
# Decode paired-end reads into separate files
bqtools decode input.bq --prefix output
# Creates output_R1.fastq and output_R2.fastq
# Specify which read of a pair to output
bqtools decode input.bq -o output.fastq -m 1 # Only first read
bqtools decode input.bq -o output.fastq -m 2 # Only second read
# Specify output format
bqtools decode input.bq -o output.tsv -f t # TSV format
Combine multiple BINSEQ files:
bqtools cat file1.bq file2.bq file3.bq -o combined.bq
Show information and statistics about a BINSEQ file.
bqtools info input.cbq
# print out the VBQ index
bqtools info input.vbq --show-index
# print out the CBQ block headers
bqtools info input.cbq --show-headers
# export as json
bqtools info input.cbq --json
Note: using info without the --json flag will format the number of records with underscores as thousands separators. To avoid this behavior or to pass raw numerical values forward, use the --json flag.
You can search for specific subsequences or regular expressions within BINSEQ files.
When multiple patterns are given, the default logic is AND (i.e. all patterns must match).
The logic can be changed to OR (i.e. at least one pattern must match) with the --or-logic option.
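The AND/OR distinction can be illustrated with standard grep on plain text. This is only an analogy under stated assumptions: each line stands in for one record, the sequences are made up, and bqtools itself is not invoked.

```shell
# Three made-up sequences, one per line, standing in for records.
seqs='ACGTATCCA
AGTTTTA
ACGTATCCAAGTTTTA'

# AND: a line must pass both filters (chained greps).
and_count=$(printf '%s\n' "$seqs" | grep -E 'ACGT[AG]TCCA' | grep -c -E 'AG(TTTT|CCCC)A')

# OR: a line passes if either expression matches.
or_count=$(printf '%s\n' "$seqs" | grep -c -E -e 'ACGT[AG]TCCA' -e 'AG(TTTT|CCCC)A')

echo "AND=$and_count OR=$or_count"
```

Only the third sequence contains both patterns, so the AND count is 1, while all three contain at least one pattern, so the OR count is 3.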
# See full options list
bqtools grep --help
# Search for a specific regex in either sequence
bqtools grep input.bq "ACGT[AC]TCCA"
# Search for a specific subsequence (in primary sequence)
bqtools grep input.bq -r "ATCG"
# Search for a regular expression (in extended)
bqtools grep input.bq -R "AT[CG]"
# Search for multiple regular expressions in either
bqtools grep input.bq "ACGT[AG]TCCA" "AG(TTTT|CCCC)A"
# Search for multiple regular expressions (OR-logic)
bqtools grep input.bq "ACGT[AG]TCCA" "AG(TTTT|CCCC)A" --or-logic
# Only search for patterns within a specified range per sequence (basepairs 30-80)
bqtools grep input.bq "ACGT[AG]TCCA" --range 30..80
# Only search for patterns within a specified range per sequence (basepairs 0-80)
bqtools grep input.bq "ACGT[AG]TCCA" --range ..80
# Only search for patterns within a specified range per sequence (basepairs 80-max)
bqtools grep input.bq "ACGT[AG]TCCA" --range 80..
bqtools also supports fuzzy matching by making use of sassy.
This requires installing with the fuzzy feature flag (see installation above):
# Run grep with fuzzy matching (-z)
bqtools grep input.bq "ACGTACGT" -z
# Run fuzzy matching with an edit distance of 2
bqtools grep input.bq "ACGTACGT" -z -k2
# Run fuzzy matching but only write inexact matches
bqtools grep input.bq "ACGTACGT" -zi
bqtools can also handle a large collection of patterns, which can be provided on the CLI as a file.
Pattern files can be plain text (one pattern per line), FASTA (sequences are used as patterns), or TSV (two columns: alias and pattern). The format is auto-detected.
For FASTA and TSV files the header/alias is used as the pattern name in output; plain text patterns use the pattern string itself.
You can provide files for either primary/extended, just primary, or just extended patterns with the relevant flags.
Note that patterns loaded from files are matched solely with OR logic.
Pattern files can also be used with fuzzy matching and with the pattern counting described below.
Regex is also fully supported, and files can additionally be combined with CLI arguments.
If your patterns are all fixed strings (and not regex), you can improve performance by using the -x/--fixed flag.
This will use the more efficient Aho-Corasick algorithm to match patterns.
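As a concrete reference for the three pattern-file layouts described above, the sketch below writes one small example of each. The aliases (`barcode_a`, `barcode_b`), sequences, and file names are hypothetical, and the files go to a scratch directory so nothing in the working directory is overwritten.

```shell
# Hypothetical aliases and sequences, written to a scratch directory.
workdir=$(mktemp -d)

# Plain text: one pattern per line.
printf 'ACGTACGT\nTCGATCGA\n' > "$workdir/patterns.txt"

# FASTA: headers become pattern names, sequences become patterns.
printf '>barcode_a\nACGTACGT\n>barcode_b\nTCGATCGA\n' > "$workdir/patterns.fa"

# TSV: two tab-separated columns, alias then pattern.
printf 'barcode_a\tACGTACGT\nbarcode_b\tTCGATCGA\n' > "$workdir/patterns.tsv"

ls "$workdir"
```

Any of these files can then be passed to the --file, --sfile, or --xfile options.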
# Run grep with patterns from a plain text file (one pattern per line)
bqtools grep input.bq --file patterns.txt
# Run grep with patterns from a FASTA file (sequences used as patterns)
bqtools grep input.bq --file patterns.fa
# Run grep with patterns from a file (primary)
bqtools grep input.bq --sfile patterns.txt
# Run grep with patterns from a file (extended)
bqtools grep input.bq --xfile patterns.txt
# Run grep with fixed-string patterns from a file
bqtools grep input.bq --file patterns.txt -x
You can count the number of matching records with -C or get the fraction of matching records with --frac:
# Count the number of matching records
bqtools grep input.bq "ACGTACGT" -C
# Count matching records and show fraction of total
bqtools grep input.bq "ACGTACGT" -F
The output of --frac is a TSV with three columns: [Count, Total, Fraction]
bqtools also introduces a new feature for counting the occurrences of individual patterns.
This is useful for seeing how many times each pattern occurs across a sequencing dataset without having to iterate over the dataset multiple times using traditional methods.
Some important notes are:
- A pattern will only be counted once across a sequencing record (primary and extended)
- A sequencing record may contribute to multiple pattern occurrences
- Providing multiple patterns will match records with OR logic (this differs from the bqtools grep default, which uses AND logic when multiple patterns are provided)
- Regular expressions are supported and treated as a single pattern (e.g. ACGT|TCGA will return a single output row but match on both ACGT and TCGA).
- Invert is supported for counting patterns and will return the number of records a pattern does not occur in.
If your patterns are all fixed strings (and not regex), you can improve performance by using the -x/--fixed flag.
This will use the more efficient Aho-Corasick algorithm to match patterns.
The throughput gains for this can be massive for pattern counting, especially when dealing with high numbers of patterns.
# Count the number of occurrences for each of three expressions
bqtools grep input.bq "ACGTACGT" "TCGATCGA$" "AAA(TT|CC)AAA" -P
# Count the number of occurrences for each of three patterns with fuzzy matching
bqtools grep input.bq "ACGTACGT" "TCGATCGA" "AAAAAAAA" -Pz
# Count the number of records a pattern does not occur in
bqtools grep input.bq "ACGTACGT" "TCGATCGA" "AAAAAAAA" -Pv
# Count the number of occurrences for each pattern from a file
bqtools grep input.bq --file patterns.txt -P
# Count the number of occurrences for each pattern from a file (fixed strings)
bqtools grep input.bq --file patterns.txt -Px
The output of pattern count is a TSV with three columns: [Name, Count, Fraction of Total]. When patterns are loaded from a FASTA or TSV file, the header/alias is used as the name; otherwise, the pattern string itself is used.
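Since the pattern-count output is a plain TSV, it can be post-processed with standard tools. The sketch below ranks rows by the Count column. It assumes tab-separated rows with no header line, which is not confirmed here; the helper `top_patterns` and the sample rows are made up for illustration.

```shell
# Hypothetical helper: keep the N rows with the highest Count (column 2).
# Assumes tab-separated rows of Name, Count, Fraction with no header line.
top_patterns() {
  sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-10}"
}

# Made-up rows in the three-column layout described above.
printf 'barcode_a\t120\t0.12\nbarcode_b\t640\t0.64\nbarcode_c\t15\t0.015\n' | top_patterns 2
```

If the real output does include a header row, it would need to be skipped (for example with `tail -n +2`) before sorting.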
# Count patterns from a FASTA file (names column shows FASTA headers)
bqtools grep input.bq --file patterns.fa -P
Split a BINSEQ file into separate files based on which pattern each record matches.
Patterns are provided through the same pattern files as grep (plain text, FASTA, or TSV with alias/sequence).
Each output file is named after the pattern alias, and records matching no pattern are written to an unmatched file.
A record is only written when it matches exactly one alias; ambiguous records (matching multiple aliases) are treated as unmatched.
Like grep, the backend is auto-selected: fixed-string patterns use Aho-Corasick (or force with -x/--fixed), regex patterns use the regex backend, and -z/--fuzzy enables fuzzy matching.
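The assignment rule described above (exactly one matching alias, otherwise unmatched) can be sketched with awk on plain text. This is an illustration of the rule only, not how bqtools implements it: it assumes fixed-string patterns, one pattern per alias, and one sequence per input line, and the helper `assign_records` is hypothetical.

```shell
# Hypothetical helper: label each input sequence with the single alias whose
# pattern it contains, or "unmatched" when zero or several aliases match.
# $1 is a TSV of alias<TAB>pattern; sequences are read from stdin.
assign_records() {
  awk -F '\t' '
    NR == FNR { alias[FNR] = $1; pattern[FNR] = $2; n = FNR; next }
    {
      hits = 0; last = ""
      for (i = 1; i <= n; i++)
        if (index($0, pattern[i]) > 0) { hits++; last = alias[i] }
      print $0 "\t" (hits == 1 ? last : "unmatched")
    }
  ' "$1" -
}

# Made-up patterns and sequences.
patterns=$(mktemp)
printf 'a\tAAAA\nb\tCCCC\n' > "$patterns"
printf 'GGAAAAGG\nGGCCCCGG\nAAAACCCC\nGGGGTTTT\n' | assign_records "$patterns"
rm -f "$patterns"
```

The third sequence contains both patterns and the fourth contains neither, so both are labeled unmatched.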
# See full options list
bqtools split --help
# Split into per-pattern files (named by FASTA header / TSV alias)
bqtools split input.cbq --file patterns.tsv
# Write outputs to a specific directory
bqtools split input.cbq --file patterns.tsv --basepath ./split_outs
# Split on primary or extended sequence patterns
bqtools split input.cbq --sfile primary.fa
bqtools split input.cbq --xfile extended.fa
# Force fixed-string (Aho-Corasick) matching
bqtools split input.cbq --file patterns.tsv -x
# Split with fuzzy matching (edit distance of 2)
bqtools split input.cbq --file patterns.fa -z -k2
# Skip writing the unmatched file
bqtools split input.cbq --file patterns.tsv --skip-unmatched
Output files with fewer than a minimum number of records are removed (defaults to 1, dropping empty files).
Use --min-records N to raise the threshold, or --min-records 0 to keep all files.
# Only keep output files with at least 100 records
bqtools split input.cbq --file patterns.tsv --min-records 100
# Keep all output files, including empty ones
bqtools split input.cbq --file patterns.tsv --min-records 0
Stream BINSEQ data to legacy tools through named pipes for parallel processing.
Because BINSEQ is a new format, many tools don't support it yet.
bqtools pipe creates a server that splits a BINSEQ file into multiple named pipes,
enabling parallel processing with tools that expect FASTQ/FASTA files.
Importantly, if your tool supports multiple parallel threads (i.e. parallelizes input files), you can make use of this feature to significantly improve performance.
# Create 4 named pipes (8 files for paired-end data, 4 files for single-end data)
# Pipes (single): fifo_[1234].fq
# Pipes (paired): fifo_[0123]_R[12].fq
bqtools pipe input.vbq -p 4 -b fifo &
# Process in parallel with tools that don't support BINSEQ
ls fifo_*.fq | xargs -P 4 -I {} sh -c 'legacy-tool {} > {}.out'
Managing FIFOs by hand (backgrounding the server, globbing paths, cleaning up)
is error-prone. The -x/--exec and -X/--exec-batch flags let bqtools pipe
spawn the consumer processes for you, wire them up to the FIFOs, and wait for
them to finish before tearing everything down.
-x / --exec runs one shell command per pipe, substituting these tokens:
| Token | Expands to |
|---|---|
| {} | the FIFO path (single-end) |
| {R1} | the R1 FIFO path (paired-end) |
| {R2} | the R2 FIFO path (paired-end) |
| {n} | the pipe index (0, 1, …) for per-shard output |
# Single-end: one `legacy-tool` invocation per pipe, in parallel
bqtools pipe input.cbq -p 4 -x 'legacy-tool {} > shard_{n}.out'
# Paired-end: each invocation receives its own R1/R2 pair
bqtools pipe paired.cbq -p 4 -x 'legacy-tool --in1 {R1} --in2 {R2} -o out_{n}.bam'
# Process only one mate by referencing just {R1} (R2 FIFOs are never created)
bqtools pipe paired.cbq -p 4 -x 'legacy-tool {R1} > r1_{n}.out'
-X / --exec-batch runs a single command, substituting a space-joined
list of all FIFO paths. This suits tools that accept many input files as
positional arguments and parallelize internally.
# Single-end: all FIFO paths joined into one argument list
bqtools pipe input.cbq -p 4 -X 'legacy-tool {} > merged.out'
# Paired-end: {R1} and {R2} each expand to their full list
bqtools pipe paired.cbq -p 4 -X 'legacy-tool --in1 {R1} --in2 {R2}'
In batch mode, writing {R1} {R2} adjacent in the template interleaves the
paths as pairs (r1_0 r2_0 r1_1 r2_1 …) so positional-argument tools receive
each pair together. When the tokens appear separately, each expands to its own
contiguous list.
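The two expansion styles can be sketched as follows. The helpers are hypothetical and only build the argument strings; they do not call bqtools. The FIFO names are assumed to follow the fifo_[0123]_R[12].fq pattern shown for manually created pipes, which may differ from what -X generates internally.

```shell
# Hypothetical helper: paths interleaved as pairs (r1_0 r2_0 r1_1 r2_1 ...),
# as when {R1} {R2} are adjacent in the template.
interleaved_paths() {
  base=$1; n=$2; out=""; i=0
  while [ "$i" -lt "$n" ]; do
    out="${out:+$out }${base}_${i}_R1.fq ${base}_${i}_R2.fq"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: one contiguous list for a single mate (R1 or R2),
# as when the tokens appear separately in the template.
contiguous_paths() {
  base=$1; n=$2; mate=$3; out=""; i=0
  while [ "$i" -lt "$n" ]; do
    out="${out:+$out }${base}_${i}_${mate}.fq"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

interleaved_paths fifo 2      # fifo_0_R1.fq fifo_0_R2.fq fifo_1_R1.fq fifo_1_R2.fq
contiguous_paths fifo 2 R1    # fifo_0_R1.fq fifo_1_R1.fq
```

The interleaved form suits tools that take mates as consecutive positional arguments, while the contiguous form suits tools with separate flags for each mate.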
Notes:
- -x and -X are mutually exclusive.
- The template is validated up front ({} for single-end, at least one of {R1}/{R2} for paired-end) so a malformed template fails fast instead of leaving an unread FIFO open.
- bqtools pipe exits non-zero if any spawned command exits non-zero.
- {n} only applies to -x; it has no meaning in -X (a single invocation).
Key features:
- Each pipe streams a portion of the BINSEQ file sequentially
- No disk I/O for intermediate files - data flows through memory
- Automatic paired-end handling (_R1/_R2 pairs)
- Optionally spawn and supervise consumer commands with -x/-X
- Blocks until all pipes are fully read (prevents data loss)
- Auto-scales to CPU count with -p0 (default)
- Pipes can be read sequentially or in parallel without blocking.
Note: This feature is not available on Windows.
Run FastQC-inspired quality control on a BINSEQ file and write a Markdown summary report plus per-module TSV files to an output directory.
# Run all QC modules with default settings
bqtools qc input.cbq
# Write results to a specific directory (default: ./bqtools-qc)
bqtools qc input.cbq -o qc-results
# Only QC a span of records
bqtools qc input.cbq --span 0..100000
# Skip specific modules
bqtools qc input.cbq --skip-dup-levels --skip-overrepresented
# Set the number of leading records sampled for duplication-level and
# overrepresented-sequence estimation (0 uses all records)
bqtools qc input.cbq --dup-sample-size 50000
# Set the minimum percentage of sampled reads a sequence must represent
# to be flagged as overrepresented
bqtools qc input.cbq --overrepresented-threshold 0.5
# Set threads for parallel processing
bqtools qc input.cbq -T 8
Modules (each toggled off independently with a --skip-* flag):
- Per-base sequence quality (--skip-base-qual)
- Per-sequence quality (--skip-seq-qual)
- Per-base sequence content (--skip-base-content)
- Per-sequence GC content (--skip-seq-gc)
- Sequence length distribution (--skip-seq-length)
- Sequence duplication levels (--skip-dup-levels)
- Overrepresented sequences (--skip-overrepresented)
Output directory contents:
- summary.md: overview table (read/pair counts) and a headline section per enabled module
- base_quality_R1.tsv / base_quality_R2.tsv
- seq_quality_R1.tsv / seq_quality_R2.tsv
- base_content_R1.tsv / base_content_R2.tsv
- gc_content_R1.tsv / gc_content_R2.tsv
- seq_length_R1.tsv / seq_length_R2.tsv
- duplication_levels_R1.tsv / duplication_levels_R2.tsv
- overrepresented_sequences_R1.tsv / overrepresented_sequences_R2.tsv
For paired-end input, each module writes separate _R1/_R2 files and the
summary report splits its section into ### R1/### R2 subsections;
single-end input only produces the _R1 files and an unsplit section.
Teyssier N, Dobin A (2026) BINSEQ: A family of high-performance binary formats for nucleotide sequences. PLoS Comput Biol 22(5): e1014181. https://doi.org/10.1371/journal.pcbi.1014181