Frequently Asked Questions¶
General¶
What is MetaGraph?¶
MetaGraph is a framework for scalable construction and querying of very large annotated de Bruijn graphs. It uses succinct representations to index sequence collections efficiently, requiring only 2-4 bits per k-mer.
Key capabilities:
Index trillions of k-mers with minimal memory
Query sequences with exact matching or alignment
Annotate k-mers with labels, counts, or coordinates
Support DNA, RNA, and protein sequences
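In practice, an index is typically built, annotated, and queried with a handful of commands. A minimal sketch with placeholder filenames (each step is covered in the sections below):
# construct the de Bruijn graph
metagraph build -k 31 -o graph input.fa
# attach labels to the k-mers
metagraph annotate -i graph.dbg --anno-filename -o annotation input.fa
# search query sequences against the annotated index
metagraph query -i graph.dbg -a annotation.annodbg query.fa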
When should I use MetaGraph?¶
Good for:
Large-scale sequence search (millions of samples)
Metagenomic classification
Pangenome analysis
Multi-sample comparison
Not ideal for:
Single small datasets (MetaGraph might be overkill – use BLAST)
Full alignment profiles or exact alignments (use dedicated aligners)
Installation¶
Which installation method?¶
Conda (easiest):
conda install -c bioconda metagraph
Docker (isolated):
docker run ghcr.io/ratschlab/metagraph:master
Source (for custom alphabets or latest features)
See Installation.
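For a source build, the typical sequence looks roughly like the following (a sketch only; exact paths, CMake options, and dependencies are described in Installation):
git clone --recursive https://github.com/ratschlab/metagraph.git
mkdir -p metagraph/metagraph/build && cd metagraph/metagraph/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j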
Which alphabet?¶
DNA (default): DNA/RNA sequences (A,C,G,T)
DNA5: With ambiguous bases (A,C,G,T,N)
DNA_CASE_SENSITIVE: When case matters (A,C,G,T,N,a,c,g,t)
Protein: Amino acid sequences (A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,Z,X)
Graph Construction¶
What k-mer size?¶
General rules:
DNA/RNA: k=21-41 (default: 31)
Protein: k=7-15 (default: 10)
Longer k = more specific, less sensitive
Must be ≤ sequence/read length
How many threads to use?¶
Most routines in MetaGraph are well parallelized, so the more threads, the better. Using more threads than the number of physical cores is not recommended.
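For example (a sketch; -p is the thread flag used throughout the examples below):
# build a graph using 16 threads
metagraph build -k 31 -p 16 -o graph input.fa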
What buffer size?¶
When constructing very large graphs, it is recommended to use disk swap to limit the RAM usage:
metagraph build --disk-swap <TMP_DIR> --mem-cap-gb 10
For small graphs: 1 GB is enough (default)
For large graphs: the more, the better (50-80 GB is always sufficient, even for trillions of k-mers)
Larger than 80 GB is not recommended
<TMP_DIR> may be on a spinning disk (an SSD might help but is not necessary)
Canonical vs Primary graph?¶
Workflow:
Build a canonical graph from reads (contains each k-mer and its reverse complement)
Extract primary contigs:
metagraph transform --to-fasta --primary-kmers
Build primary graph (half the size, same information)
Important: Use primary graphs with RowDiff annotations for best compression.
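A minimal sketch of this workflow with placeholder filenames (the same steps are shown in more detail in the next answer):
# 1) canonical graph from reads
metagraph build -k 31 --mode canonical -o graph reads.fasta.gz
# 2) extract primary contigs
metagraph transform --to-fasta --primary-kmers -o primary_contigs graph.dbg
# 3) primary graph (half the size, same information)
metagraph build -k 31 --mode primary -o graph_primary primary_contigs.fasta.gz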
How to handle large datasets?¶
Use disk swap:
metagraph build -k 31 --disk-swap /tmp --mem-cap-gb 50 input.fa
Build sample graphs separately:
# Per sample
for sample in samples/*.fasta.gz; do
    metagraph build -k 31 --mode basic -o "$sample" "$sample"
    metagraph transform --to-fasta --primary-kmers -o "${sample}.contigs" "${sample}.dbg"
done
# Joint graph from all contigs
ls samples/*.contigs.fasta.gz | metagraph build -k 31 --mode canonical -o joint
# Extract primary contigs from the joint graph
metagraph transform --to-fasta --primary-kmers -o joint_contigs_primary joint.dbg
# Joint primary graph from all contigs
metagraph build -k 31 --mode primary -o joint_primary joint_contigs_primary.fasta.gz
Extremely large graphs:
Extremely large succinct graphs can be constructed by building their parts separately and writing them to disk on the fly with the flag --inplace. In such cases, don't forget to index suffix ranges afterwards with metagraph transform --index-ranges ...
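A schematic outline combining the flags above (only the flags named in this answer are taken from the documentation; all other arguments are placeholders, and the exact invocation may differ, so consult the construction docs):
# build the graph in parts, writing them to disk on the fly
metagraph build -k 31 --inplace --disk-swap <TMP_DIR> --mem-cap-gb 50 -o huge_graph input.fasta.gz
# index suffix ranges afterwards
metagraph transform --index-ranges ... huge_graph.dbg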
Annotation¶
Which annotation type?¶
By scale:
In most scenarios, start with: column (default)
Fast and easy to construct: row_flat
Fast queries and small: row_diff_flat
Very large scale: row_diff_brwt (best compression)
By feature:
K-mer counts: int_brwt or row_diff_int_brwt
Coordinates: brwt_coord or row_diff_brwt_coord
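Annotations are typically built as column first and then converted. A hedged sketch of such a conversion (the transform_anno subcommand and --anno-type flag are assumptions based on the tool's usual interface; RowDiff-based types additionally require a multi-stage transform, see the documentation):
# convert the default column annotation to a flat row-major representation
metagraph transform_anno --anno-type row_flat -o annotation annotation.column.annodbg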
How to annotate with counts?¶
Recommended (for large data):
# Build weighted sample graph
metagraph build -k 31 --count-kmers -o sample.graph sample.fasta.gz
# Extract contigs with counts
metagraph transform --to-fasta -o contigs sample.graph.dbg
# Creates: contigs.fasta.gz + contigs.kmer_counts.gz
# Annotate joint graph
metagraph annotate -i joint.dbg --count-kmers \
--anno-filename -o annotation contigs.fasta.gz
Simple (not recommended for large data):
metagraph annotate -i graph.dbg --count-kmers \
--anno-filename -o annotation input.fa
What if my k-mer counts are very large?
Pass the --count-width flag to specify the number of bits used to represent the counts. The default is 8 bits (max count 255). E.g., with 12 bits, the max count is 4095:
metagraph build -k 31 --count-kmers --count-width 12 -o sample.graph sample.fasta.gz
metagraph transform --to-fasta -o contigs sample.graph.dbg
metagraph annotate -i joint.dbg --count-kmers --count-width 12 \
--anno-filename -o annotation contigs.fasta.gz
Querying¶
How to query?¶
Command line:
# Presence/absence
metagraph query -i graph.dbg -a annotation.annodbg query.fa
# With threshold (80% k-mers must match)
metagraph query --discovery-fraction 0.8 ...
# K-mer counts
metagraph query --query-mode counts -a annotation.int_brwt.annodbg ...
# Coordinates
metagraph query --query-mode coords -a annotation.brwt_coord.annodbg ...
Python API:
Start MetaGraph in server mode:
metagraph server_query -i graph.dbg -a annotation.annodbg --port 5555 -p 10 --mmap
Query the index:
from metagraph.client import GraphClient
client = GraphClient('localhost', 5555, api_path='')
results = client.search('ACGTACGT', discovery_fraction=0.8)
What is discovery_fraction?¶
Minimum fraction of k-mers that must match:
0.0: Any match (at least 1 k-mer must match)
0.8: ≥80% k-mers match (recommended for classification)
1.0: All k-mers match (strict)
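For example, to require that every k-mer of the query is present (flags as in the query examples above):
metagraph query --discovery-fraction 1.0 -i graph.dbg -a annotation.annodbg query.fa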
How to align?¶
Exact k-mer matching:
metagraph align --map -i graph.dbg reads.fa
Sequence-to-graph alignment:
metagraph align -i graph.dbg reads.fa
# With annotation (label-consistent paths)
metagraph align -i graph.dbg -a annotation.annodbg reads.fa
# Adjust sensitivity
metagraph align --align-min-seed-length 15 --min-exact-match 0.7 ...
Performance¶
Memory requirements?¶
Graph:
Default construction: for k=31, at least 16 bytes per k-mer plus overhead (1B k-mers ≈ 18 GB)
Construction with disk swap: buffer size plus output graph size
Succinct (stat): ~4 bits/k-mer (10B k-mers ≈ 5 GB)
Succinct (small): ~2 bits/k-mer (10B k-mers ≈ 2.5 GB)
Large-scale indexing:
When indexing large-scale datasets, everything depends on the chosen buffer sizes. With carefully selected parameters, one can build and annotate graphs with trillions of k-mers. For real examples, see https://github.com/ratschlab/metagraph/blob/master/metagraph/experiments/large_index_scripts.md.
How to speed up construction?¶
Use more threads:
-p 64
Pre-process the input sequences with KMC:
kmc -k31 -m40 -sm input.fasta.gz output /tmp/
Parallelize samples: build sample graphs in parallel, with 4-8 threads (-p ...) on each (see the sketch below).
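For example, per-sample builds can be run concurrently with GNU parallel (assuming it is installed; the metagraph flags are the same as in the per-sample loop shown earlier):
# 8 samples at a time, 4 threads each
ls samples/*.fasta.gz | parallel -j 8 "metagraph build -k 31 -p 4 --mode basic -o {} {}"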
How to reduce size?¶
Use primary graph: 50% smaller than canonical (only applies to indexing reads)
Use RowDiff<Multi-BRWT> for annotation (10-20% of Column)
Relax BRWT: metagraph relax_brwt --relax-arity 32 ... (5-20% smaller)
Transform graph: metagraph transform --state small ...
Filter out low-abundance k-mers: kmc -ci2 ... before building to remove singleton k-mers
Use other graph cleaning techniques: metagraph clean ... to remove sequencing errors
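A sketch combining the "Transform graph" and "Relax BRWT" steps above (filenames are placeholders; the annotation file extension depends on its type):
# convert the succinct graph to the smaller 'small' representation
metagraph transform --state small -o graph_small graph.dbg
# relax the Multi-BRWT tree of an existing BRWT annotation
metagraph relax_brwt --relax-arity 32 -o annotation_relaxed annotation.brwt.annodbg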
How to reduce RAM usage while querying?¶
Use memory mapping (--mmap): see Memory mapping
Reduce the batch size: metagraph query --batch-size ...
Alternative: use disk-backed formats: row_disk or row_diff_disk
When to use memory mapping?¶
When the graphs are extremely large
When the available RAM is limited
When not to use memory mapping?¶
Never use it with slow (e.g., spinning) disks, unless it’s for a single stats check
Troubleshooting¶
Top issues and solutions:
Installation fails¶
Conda: Try installing in a fresh environment
Docker: Pull latest: docker pull ghcr.io/ratschlab/metagraph:master
Source: Check the compiler version and dependencies
See Installation and Troubleshooting.
Out of memory¶
Solution: Use disk swap
metagraph build --disk-swap /tmp --mem-cap-gb 50 ...
metagraph annotate --disk-swap /tmp --mem-cap-gb 10 ...
RowDiff transform generates no output¶
Solution: Remember that for annotations with coordinates and counts, the output files have the same name (*.column.annodbg) as the input files and hence should be written to a new directory. (See Convert count-aware annotations.)