Help

Search DNA, RNA, and protein sequences across public archives using annotated de Bruijn graphs with exact matching or sensitive alignment.

Contact & support

Report issues or request features:

Quick start

Paste sequence — Paste FASTA or upload a .fasta/.fa file on the Search page. The UI validates FASTA and shows size/sequence count.
Pick databases — Choose from INSDC subsets (Microbe, Fungi, Plants, Metazoa incl. Human/Mouse), reference/assembled sets (RefSeq, UHGG, Tara Oceans), or proteins (UniParc).
Run search — Start with Exact match; switch to Alignment for lower‑identity or noisy sequences.

Web limits: up to 10 sequences per web query. For larger batches, use the web Application Programming Interface (API) or Command Line Interface (CLI).
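
To check a file against the web limit before pasting it, the records can be counted locally. This is a minimal sketch that assumes one header line (starting with ">") per sequence; the function names are hypothetical and not part of MetaGraph.

```shell
# Count FASTA records by counting header lines (lines starting with ">").
count_fasta_records() {
  # grep -c exits non-zero when the count is 0, so its status is ignored.
  grep -c '^>' "$1" || true
}

# Succeed if the file holds between 1 and 10 records
# (10 is the web limit quoted above).
fits_web_limit() {
  n=$(count_fasta_records "$1")
  n=${n:-0}
  [ "$n" -ge 1 ] && [ "$n" -le 10 ]
}
```

For example, `fits_web_limit query.fa || echo "use the API or CLI"` prints the hint when query.fa is empty or holds more than 10 sequences.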

How‑to guides

DNA/RNA/protein, filters, and database choices

DNA / RNA / protein searches

DNA/RNA: Use nucleotide FASTA. Start with Exact; if identity is low or variants are expected, toggle Alignment.
Proteins: Use the UniParc index for amino‑acid sequences.

Filters & parameters

Mode: Exact (fast) vs Alignment (sensitive).
Top labels: Limit the number of accessions returned.
Discovery threshold: The minimum fraction of k-mers that must match. Lower thresholds (lenient) return more hits; higher thresholds (strict) return fewer, higher-confidence hits. Adjust this parameter to balance sensitivity and specificity.
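
The effect of the discovery threshold can be worked out by hand. A sequence of length L contains L - k + 1 k-mers, so a threshold t requires roughly t × (L - k + 1) of them to match. The sketch below uses k = 31 purely as an illustration (the k of each public index is not stated here), and rounding up is an assumption about how the fraction is applied.

```shell
# Number of k-mers in a sequence of length LEN: LEN - K + 1 (0 if LEN < K).
num_kmers() {
  len=$1; k=$2
  if [ "$len" -lt "$k" ]; then echo 0; else echo $((len - k + 1)); fi
}

# Smallest number of matching k-mers that reaches the threshold fraction.
# Arguments: sequence length, k, threshold (0..1). Rounds up (assumption).
min_matching_kmers() {
  total=$(num_kmers "$1" "$2")
  awk -v n="$total" -v t="$3" 'BEGIN {
    x = n * t
    c = int(x)
    if (x - c > 1e-9) c++   # round up, ignoring floating-point noise
    print c
  }'
}
```

With a 100-base query and k = 31 there are 70 k-mers: `min_matching_kmers 100 31 0.8` gives 56, while `min_matching_kmers 100 31 0.5` gives 35.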

Choosing databases

See the Databases page for live coverage and whether an index includes counts or coordinates.

Interpreting results

Understanding your MetaGraph search results

Results Overview

MetaGraph returns hits organized by database and accession. As in BLAST, results are ranked by relevance, but instead of E-values, MetaGraph uses the discovery threshold and k-mer matching to identify significant matches.

Key Result Metrics

Identity — Fraction of query bases identical in the matched path. In alignment mode, this shows the percentage of exact matches between your query and the target sequence. Higher identity (closer to 100%) indicates a closer match.
Coverage — Fraction of the query covered by matches/alignments. This tells you how much of your input sequence was found in the database. For RNA-seq data with counts, coverage approximates expression profiles across samples.
Discovery Threshold — The minimum fraction of k-mers that must match. Lower thresholds (lenient) return more hits; higher thresholds (strict) return fewer, higher-confidence hits. Adjust this parameter to balance sensitivity and specificity.
Matched Accessions / Labels — Each hit corresponds to a specific sample, genome, or protein record in the database. Click on accession numbers to view the original sequence record (when available).
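
As a worked example of the two fractions above, suppose a 200-base query has 180 bases covered by an alignment, and 171 of those positions are identical. The helper below is a hypothetical illustration of the arithmetic, and it assumes identity is measured over the aligned positions; it is not MetaGraph's own calculation.

```shell
# Print A/B as a percentage with one decimal place ("NA" if B is not positive).
percent() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (b <= 0) { print "NA"; exit }
    printf "%.1f\n", 100 * a / b
  }'
}
```

Here `percent 180 200` gives a coverage of 90.0 and `percent 171 180` gives an identity of 95.0.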

Additional Information (Database-Dependent)

Counts — Available for some databases. Shows per-k-mer/sample abundances, useful for quantifying expression levels in RNA-seq or metagenomic data. Higher counts suggest higher abundance in the source sample.
Coordinates — Available for reference genome databases (e.g., RefSeq). Shows genomic positions where your query matches, enabling precise localization on assembled chromosomes or contigs.
Taxonomy — Provided when databases are organized by taxonomic classification. Helps identify which organisms or taxonomic groups contain your query sequence. Use the AI Organism Identification feature for automated species inference.

Result Table Navigation

Sort columns — Click column headers to sort by identity, coverage, or other metrics.
Filter results — Use the search box to filter by accession name or taxonomic group.
Pagination — Navigate through pages if many hits are found. Adjust rows per page (10, 25, 50, 100) for convenience.
Export — Download results as CSV or JSON for further analysis.

No Results Found?

If your search returns no hits, consider:

Lowering the discovery threshold — Try 0.5 or lower for more lenient matching.
Selecting different databases — Your sequence might be present in databases you haven't searched.
Checking sequence type — Ensure you've selected "Nucleotide" or "Amino acid" correctly.
Verifying FASTA format — Make sure your input is properly formatted (see Quick start section).
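
The last two checks can be approximated locally before re-running a search. This is a rough heuristic sketch with hypothetical function names: it treats a file as nucleotide when every sequence character is one of A, C, G, T, U, or N, so a short protein made only of those letters would be misclassified.

```shell
# Succeed if the first non-empty line is a FASTA header (starts with ">").
starts_with_header() {
  first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
  case $first in
    '>'*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Guess whether a FASTA file holds nucleotide or amino-acid sequences.
guess_alphabet() {
  # Drop header lines, then delete nucleotide codes and whitespace.
  # Anything left over is not a plain nucleotide character.
  rest=$(grep -v '^>' "$1" | tr -d 'ACGTUNacgtun \t\r\n')
  if [ -z "$rest" ]; then
    echo "nucleotide"
  else
    echo "amino acid or other"
  fi
}
```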

Exporting & sharing

Download CSV: Export your results in CSV format by clicking the "Export CSV" button on the detailed results page. This includes all match information, identity scores, coverage, and additional metadata.
Shareable links: Copy the URL from the Results page to share a specific query view. Locked searches can be shared permanently.
Result expiration: Search results expire after 48 hours (2 days) by default. To preserve results for up to 6 months, click the "Lock Search" button on the results page.

Performance tips

Pick the right mode: Exact for high-identity matches (fast); Alignment for divergent/noisy sequences. Alignment mode is significantly more expensive, so jobs take considerably longer to complete.
Web limits: Keep web queries small (≤ 10 sequences). Use API/CLI for larger jobs.
Timeouts: Split large inputs, or run via CLI on a workstation/cluster.
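
One way to split a large input is to cut the FASTA file into batches of at most 10 records each. The sketch below uses awk; the function name and the PREFIX_N.fa output naming are hypothetical choices, not something MetaGraph requires.

```shell
# Split a FASTA file into chunks of at most N records each.
# Arguments: input file, N, output prefix.
# Writes PREFIX_1.fa, PREFIX_2.fa, and so on.
split_fasta() {
  awk -v n="$2" -v prefix="$3" '
    /^>/ {
      # Start a new output file every n records.
      if (count % n == 0) {
        if (out != "") close(out)
        out = prefix "_" (int(count / n) + 1) ".fa"
      }
      count++
    }
    # Lines before the first header are skipped.
    out != "" { print > out }
  ' "$1"
}
```

For example, `split_fasta big.fa 10 batch` turns a 25-record file into batch_1.fa and batch_2.fa with 10 records each and batch_3.fa with the remaining 5.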

Examples

See the dedicated Examples page, or run a pre‑filled search:

🧬 2 short sequences — Two short DNA sequences for quick testing with reference sequences. Try it
💊 5 AMR genes — Five important antimicrobial resistance genes: NDM-1, VIM-2, OXA-48, MCR-1, and TET(M). Search across metagenomes, microbial, and fungal databases. Try it
🦠 SARS-CoV-2 Critical genome elements — A compact FASTA from Wuhan-Hu-1 reference containing spike RBD and S1/S2 furin-cleavage window. Search across all databases. Try it
⭐ 3 famous proteins — Human Insulin, Hemoglobin, and Green Fluorescent Protein (GFP). Search across protein databases. Try it

Using the API

Production endpoint: https://metagraph.ethz.ch:8081/search

Minimal Working Example

This example has been tested and verified to work!

Step 1: Create a Search

curl -X POST "https://metagraph.ethz.ch:8081/search" \
  -H "Content-Type: application/json" \
  -d '{
    "queries": [
      {
        "db": "refseq85_coord",
        "q": "ATGCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG"
      }
    ]
  }'

Note: Use just the sequence (no FASTA header)
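
If the query lives in a FASTA file, the header and line breaks have to be removed before the sequence goes into the "q" field. A minimal sketch (the function name is hypothetical) that prints the first record as one bare line:

```shell
# Print the first record of a FASTA file as a single bare sequence line.
first_sequence() {
  awk '
    /^>/ { if (seen) exit; seen = 1; next }        # stop at the second header
    seen { gsub(/[ \t\r]/, ""); printf "%s", $0 }  # join wrapped lines
    END  { print "" }
  ' "$1"
}
```

The result can be captured with `SEQ=$(first_sequence query.fa)` and substituted into the request body.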

Step 2: Check Status

curl "https://metagraph.ethz.ch:8081/search/{search_id}/status"

Keep checking until status is "done"

Step 3: Get Results

curl "https://metagraph.ethz.ch:8081/search/{search_id}/results"
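
The three steps can be tied together in a small script. This is a sketch under stated assumptions: the response format is not documented here, so completion is detected by looking for the text "done" (in double quotes) anywhere in the status response, and the search id is passed in by hand after reading it from the Step 1 response. The helper names and the FETCH override are hypothetical.

```shell
API="https://metagraph.ethz.ch:8081/search"

# Build the JSON body for a single query (database name, bare sequence).
# Assumes neither value contains characters that need JSON escaping.
build_payload() {
  printf '{"queries":[{"db":"%s","q":"%s"}]}' "$1" "$2"
}

# Poll the status endpoint until the response contains "done".
# Arguments: search id, seconds between polls (default 5),
# maximum number of polls (default 60).
# FETCH may be set to another command for testing; it defaults to curl.
wait_for_search() {
  id=$1; interval=${2:-5}; tries=${3:-60}
  while [ "$tries" -gt 0 ]; do
    body=$(${FETCH:-curl -fsS} "$API/$id/status") || body=""
    case $body in
      *'"done"'*) return 0 ;;   # assumption: status text appears verbatim
    esac
    tries=$((tries - 1))
    sleep "$interval"
  done
  return 1   # gave up before the search finished
}

# Usage (not run here):
#   curl -fsS -X POST "$API" -H "Content-Type: application/json" \
#        -d "$(build_payload refseq85_coord "$SEQ")"
#   wait_for_search "$SEARCH_ID" && curl -fsS "$API/$SEARCH_ID/results"
```

Capping the number of polls keeps a stuck or expired search from looping forever.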

Installing the command line interface

Conda

Install the latest release on Linux or macOS with Anaconda:

conda install -c bioconda -c conda-forge metagraph

Docker

If Docker is available on the system, you can get started immediately with

docker pull ghcr.io/ratschlab/metagraph:master
docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
    metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa

Replace ${HOME} with a directory on the host system to map it under /mnt in the container.

More installation options

By default, the Docker image executes the binary compiled for the DNA alphabet {A,C,G,T}. To run the binary compiled for the DNA5 or Protein alphabet, replace metagraph with metagraph_DNA5 or metagraph_Protein, respectively, e.g.:

docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
    metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa

For more complex workflows, consider running Docker in interactive mode:

$ docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master

root@5c42291cc9cf:/# ls /mnt/
root@5c42291cc9cf:/# metagraph --version

Install From Sources

To compile from source (e.g., for builds with a custom alphabet or other configurations), see the online documentation.

Using the command line interface

Typical workflow

Build a de Bruijn graph from FASTA files, FASTQ files, or KMC k-mer counters:
./metagraph build
Annotate graph using the column compressed annotation:
./metagraph annotate
Transform the built annotation to a different annotation scheme:
./metagraph transform_anno
Query annotated graph:
./metagraph query

Example

DATA="../tests/data/transcripts_1000.fa"

./metagraph build -k 12 -o transcripts_1000 $DATA

./metagraph annotate -i transcripts_1000.dbg --anno-filename -o transcripts_1000 $DATA

./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA

./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000.dbg

More command examples

Print usage

./metagraph

Build graph

Simple build

./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 \
                        -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt

Build with disk swap (use to limit the RAM usage)

./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 --disk-swap <GRAPH_DIR> \
                        -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt

Build from k-mers filtered with KMC

K=20
./KMC/kmc -ci5 -t4 -k$K -m5 -fm <FILE>.fasta.gz <FILE>.cutoff_5 ./KMC
./metagraph build -v -p 4 -k $K --mem-cap-gb 10 -o graph <FILE>.cutoff_5.kmc_pre

Annotate graph

./metagraph annotate -v --anno-type row --fasta-anno \
                           -i primates.dbg \
                           -o primates \
                           ~/fasta_zurich/refs_chimpanzee_primates.fa

Convert annotation to Multi-BRWT

1. Cluster columns

./metagraph transform_anno -v --linkage --greedy \
                           -o linkage.txt \
                           --subsample R \
                           -p NCORES \
                           primates.column.annodbg

2. Construct Multi-BRWT

./metagraph transform_anno -v -p NCORES --anno-type brwt \
                           --linkage-file linkage.txt \
                           -o primates \
                           --parallel-nodes V \
                           primates.column.annodbg

Query graph

./metagraph query -v -i <GRAPH_DIR>/graph.dbg \
                        -a <GRAPH_DIR>/annotation.column.annodbg \
                        --min-kmers-fraction-label 0.8 --labels-delimiter ", " \
                        query_seq.fa

Align to graph

./metagraph align -v -i <GRAPH_DIR>/graph.dbg query_seq.fa

Assemble sequences

./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                        -o assembled.fa \
                        --unitigs

Assemble differential sequences

./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                        --unitigs \
                        -a <GRAPH_DIR>/annotation.column.annodbg \
                        --diff-assembly-rules diff_assembly_rules.json \
                        -o diff_assembled.fa

Get stats

Stats for graph

./metagraph stats graph.dbg

Stats for annotation

./metagraph stats -a annotation.column.annodbg

Stats for both

./metagraph stats -a annotation.column.annodbg graph.dbg

Troubleshooting

No matches

Switch to Alignment and/or increase sensitivity by decreasing the discovery threshold.
Confirm alphabet: nucleotide vs amino acid; use UniParc for proteins.
Try a broader database set.
Raw‑read indexes use moderate cleaning (MetaGraph and Logan use different strategies to remove infrequent k-mers). Assembled/reference/protein indexes are indexed losslessly with coordinates.

Very large inputs / timeouts

Split into smaller batches or run via API or command line locally.
Prefer exact matching for quick results; re‑run top candidates in alignment mode.

Rate limits / queued jobs

Use the API for medium-sized jobs and the local command line version for large jobs.

FAQs

How fresh is the data?

See the Databases page for the live snapshot date and coverage.

Which databases are available?

INSDC subsets (Microbe, Fungi, Plants, Metazoa incl. Human/Mouse), SRA‑MetaGut, RefSeq, UHGG, Tara Oceans, UniParc. See Databases for the active set.

What are the limits?

The web UI supports up to 10 sequences per query (maximum length 50k). For larger runs, use the API or the local command line interface.

How reproducible are results?

Indexes are versioned; some include counts or coordinates. Record index name/version and the detailed search parameters.

Are there known limitations?

Raw‑read indexes use moderate cleaning (MetaGraph and Logan use different strategies to remove infrequent k-mers). Assembled/reference/protein indexes are indexed losslessly with coordinates.

Glossary

k‑mer: A substring of length k used for indexing/search.
de Bruijn graph (DBG): Nodes are k‑mers; edges connect overlapping k‑mers.
Exact match: Match query k‑mers exactly against the graph (fast).
Alignment (sequence‑to‑graph): Seed–chain–extend alignment to recover closest paths.
Identity: Fraction of matching positions between query and match.
Coverage: Fraction of query covered by matches/alignments.
AMR: Antimicrobial resistance; e.g., CARD genes across metagenomes.
UHGG: Unified Human Gastrointestinal Genome collection (human gut microbiome resource).
MetaSUB: Metagenomics & Metadesign of Subways & Urban Biomes - global urban microbiome project.
Tara Oceans: Global ocean microbiome survey expedition studying marine plankton ecosystems.
RefSeq: NCBI Reference Sequence database - comprehensive, non-redundant collection of sequences.
Logan contigs: Contigs generated by the Logan project, which keeps only k-mers that appear at least 2 times.
Accession: Unique identifier for a biological sequence in public databases (e.g., NCBI, ENA).
gnomAD: Genome Aggregation Database - large-scale human genetic variation resource.
MetaGraph contigs: Assembled contigs generated in the MetaGraph paper for indexing life's biological sequences.
INSDC: International Nucleotide Sequence Database Collaboration - umbrella for NCBI (GenBank & SRA), EMBL‑EBI (ENA), and Japan's DDBJ (including DRA).
SRA/ENA/DRA: Sequence Read Archive (NCBI), European Nucleotide Archive (EMBL-EBI), and DDBJ Read Archive (Japan) - the three INSDC partners' raw sequencing data repositories.
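
To make the k-mer and de Bruijn graph entries concrete, the sketch below lists the k-mers of a short sequence and the edges between consecutive k-mers. It is a toy illustration with hypothetical function names and does not reflect how MetaGraph stores its graphs.

```shell
# Print every k-mer of a sequence, one per line.
# Arguments: sequence, k.
kmers() {
  awk -v s="$1" -v k="$2" 'BEGIN {
    for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
  }'
}

# Print de Bruijn graph edges for a sequence: consecutive k-mers
# overlap by k-1 characters, so each adjacent pair forms one edge.
dbg_edges() {
  kmers "$1" "$2" | awk 'NR > 1 { print prev " -> " $0 } { prev = $0 }'
}
```

For the sequence ACGTA with k = 3, `kmers ACGTA 3` lists ACG, CGT, and GTA, and `dbg_edges ACGTA 3` lists the edges ACG -> CGT and CGT -> GTA. Piping either output through `sort -u` gives the distinct nodes or edges when a sequence repeats k-mers.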