MetaGraph

©2019-2025 BMI LAB | ETH ZURICH

    • Help
    • Quick start
    • How‑to guides
    • Interpreting results
    • Export & share
    • Performance tips
    • Query examples
    • Using the API
    • Installing the Command Line Interface
    • Using the Command Line Interface
    • Troubleshooting
    • FAQs
    • Contact & support
    • Glossary

    Help

    Search DNA, RNA, and protein sequences across public archives using annotated de Bruijn graphs with exact matching or sensitive alignment.

    Contact & support

    Report issues or request features:

    • Email: metagraph@inf.ethz.ch
    • GitHub: ratschlab/metagraph
    • Feedback form: open the form
    Quick start
    1. Paste sequence — Paste FASTA or upload a .fasta/.fa file on the Search page. The UI validates FASTA and shows size/sequence count.
    2. Pick databases — Choose from INSDC subsets (Microbe, Fungi, Plants, Metazoa incl. Human/Mouse), reference/assembled sets (RefSeq, UHGG, Tara Oceans), or proteins (UniParc).
    3. Run search — Start with Exact match; switch to Alignment for lower‑identity or noisy sequences.

    Web limits: up to 10 sequences per web query. For larger batches, use the web Application Programming Interface (API) or Command Line Interface (CLI).
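
    As a rough pre-flight check before pasting into the web form, the sketch below parses FASTA text and counts the sequences against the 10-sequence web limit. The function names are illustrative and not part of MetaGraph; the validation done by the UI itself may be stricter.

```python
from typing import List, Tuple

WEB_MAX_SEQUENCES = 10  # web limit stated in the Quick start section


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fits_web_limit(text: str) -> bool:
    """True if the FASTA text holds at most WEB_MAX_SEQUENCES records."""
    return len(parse_fasta(text)) <= WEB_MAX_SEQUENCES
```

    If fits_web_limit returns False, the input is a candidate for the API or CLI instead.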

    How‑to guides
    DNA/RNA/protein, filters, and database choices

    DNA / RNA / protein searches

    • DNA/RNA: Use nucleotide FASTA. Start with Exact; if identity is low or variants are expected, toggle Alignment.
    • Proteins: Use the UniParc index for amino‑acid sequences.

    Filters & parameters

    • Mode: Exact (fast) vs Alignment (sensitive).
    • Top labels: Limit the number of accessions returned.
    • Discovery threshold: The minimum fraction of k-mers that must match. Lower thresholds (lenient) return more hits; higher thresholds (strict) return fewer, higher-confidence hits. Adjust this parameter to balance sensitivity and specificity.
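
    To make the discovery threshold concrete, here is a minimal sketch that assumes the index for one label can be modelled as a plain set of k-mers. It computes the fraction of query k-mers found in that set and compares it with the threshold. The helper names and the toy data are illustrative; the real index is a compressed annotated graph, not a Python set.

```python
from typing import List, Set


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def matched_fraction(query: str, indexed_kmers: Set[str], k: int) -> float:
    """Fraction of the query's k-mers that are present in the index."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k
    hits = sum(1 for km in query_kmers if km in indexed_kmers)
    return hits / len(query_kmers)


def passes_threshold(query: str, indexed_kmers: Set[str], k: int,
                     threshold: float) -> bool:
    """A label is reported only if enough query k-mers match."""
    return matched_fraction(query, indexed_kmers, k) >= threshold


# Toy example: one "sample" and a query that shares only its prefix.
sample = set(kmers("ACGTACGT", 4))
query = "ACGTAAAA"
print(matched_fraction(query, sample, 4))       # 2 of 5 k-mers match
print(passes_threshold(query, sample, 4, 0.3))  # lenient threshold
print(passes_threshold(query, sample, 4, 0.5))  # strict threshold
```

    With the lenient threshold the toy sample is reported; with the strict one it is dropped, which is the sensitivity/specificity trade-off described above.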

    Choosing databases

    See the Databases page for live coverage and whether an index includes counts or coordinates.

    Interpreting results
    Understanding your MetaGraph search results

    Results Overview

    MetaGraph returns hits organized by database and accession. Similar to BLAST, results are ranked by relevance, but instead of E-values, MetaGraph uses the discovery threshold and k-mer matching to identify significant matches.

    Key Result Metrics

    • Identity — Fraction of query bases identical in the matched path. In alignment mode, this shows the percentage of exact matches between your query and the target sequence. Higher identity (closer to 100%) indicates a closer match.
    • Coverage — Fraction of the query covered by matches/alignments. This tells you how much of your input sequence was found in the database. For RNA-seq data with counts, coverage approximates expression profiles across samples.
    • Discovery Threshold — The minimum fraction of k-mers that must match. Lower thresholds (lenient) return more hits; higher thresholds (strict) return fewer, higher-confidence hits. Adjust this parameter to balance sensitivity and specificity.
    • Matched Accessions / Labels — Each hit corresponds to a specific sample, genome, or protein record in the database. Click on accession numbers to view the original sequence record (when available).
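
    The identity and coverage definitions above can be illustrated on a toy gapped alignment. This is a sketch under stated assumptions: the alignment is given as two equal-length strings with "-" for gaps, identity is counted over aligned query bases, and coverage is the aligned share of the full query. The exact formulas MetaGraph uses internally may differ.

```python
def aligned_query_bases(aligned_query: str) -> int:
    """Number of query bases that take part in the alignment (gaps excluded)."""
    return sum(1 for q in aligned_query if q != "-")


def identity(aligned_query: str, aligned_target: str) -> float:
    """Fraction of aligned query bases that are identical to the target."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned strings must have the same length")
    bases = aligned_query_bases(aligned_query)
    if bases == 0:
        return 0.0
    same = sum(1 for q, t in zip(aligned_query, aligned_target)
               if q != "-" and q == t)
    return same / bases


def coverage(aligned_query: str, query_length: int) -> float:
    """Fraction of the full query covered by the alignment."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return aligned_query_bases(aligned_query) / query_length


# A 10-base query whose first 8 bases align with one mismatch.
query = "ACGTACGTAA"
print(identity("ACGTACGT", "ACGAACGT"))    # 7 of 8 aligned bases agree
print(coverage("ACGTACGT", len(query)))    # 8 of 10 query bases aligned
```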

    Additional Information (Database-Dependent)

    • Counts — Available for some databases. Shows per-k-mer/sample abundances, useful for quantifying expression levels in RNA-seq or metagenomic data. Higher counts suggest higher abundance in the source sample.
    • Coordinates — Available for reference genome databases (e.g., RefSeq). Shows genomic positions where your query matches, enabling precise localization on assembled chromosomes or contigs.
    • Taxonomy — Provided when databases are organized by taxonomic classification. Helps identify which organisms or taxonomic groups contain your query sequence. Use the AI Organism Identification feature for automated species inference.

    Result Table Navigation

    • Sort columns — Click column headers to sort by identity, coverage, or other metrics.
    • Filter results — Use the search box to filter by accession name or taxonomic group.
    • Pagination — Navigate through pages if many hits are found. Adjust rows per page (10, 25, 50, 100) for convenience.
    • Export — Download results as CSV or JSON for further analysis.

    No Results Found?

    If your search returns no hits, consider:

    • Lowering the discovery threshold — Try 0.5 or lower for more lenient matching.
    • Selecting different databases — Your sequence might be present in databases you haven't searched.
    • Checking sequence type — Ensure you've selected "Nucleotide" or "Amino acid" correctly.
    • Verifying FASTA format — Make sure your input is properly formatted (see Quick start section).
    Exporting & sharing
    • Download CSV: Export your results in CSV format by clicking the "Export CSV" button on the detailed results page. This includes all match information, identity scores, coverage, and additional metadata.
    • Shareable links: Copy the URL from the Results page to share a specific query view. Locked searches can be shared permanently.
    • Result expiration: Search results expire after 48 hours (2 days) by default. To preserve results for up to 6 months, click the "Lock Search" button on the results page.
    Performance tips
    • Pick the right mode: Exact for high identity matches (fast); Alignment for divergent/noisy sequences. Note that alignment mode is significantly more expensive and jobs will take considerably longer to complete.
    • Web limits: Keep web queries small (≤ 10 sequences). Use API/CLI for larger jobs.
    • Timeouts: Split large inputs, or run via CLI on a workstation/cluster.
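
    Splitting a large input can be sketched as follows, assuming the records are already available as (header, sequence) pairs. The batch size of 10 mirrors the web limit; suitable sizes for API or CLI runs are not specified here, and the helper names are illustrative.

```python
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[str, str]  # (header, sequence)


def batches(records: Sequence[Record], size: int = 10) -> Iterator[List[Record]]:
    """Yield consecutive groups of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def to_fasta(records: Sequence[Record]) -> str:
    """Serialise records back to FASTA text, one sequence per line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# 23 toy records become three queries of 10, 10, and 3 sequences.
records = [(f"seq{i}", "ACGT") for i in range(23)]
for number, batch in enumerate(batches(records), start=1):
    print(number, len(batch))
```

    Each batch can then be written out with to_fasta and submitted as its own query.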
    Examples

    See the dedicated Examples page, or run a pre‑filled search:

    • 🧬 2 short sequences — Two short DNA sequences for quick testing with reference sequences. Try it
    • 💊 5 AMR genes — Five important antimicrobial resistance genes: NDM-1, VIM-2, OXA-48, MCR-1, and TET(M). Search across metagenomes, microbial, and fungal databases. Try it
    • 🦠 SARS-CoV-2 Critical genome elements — A compact FASTA from Wuhan-Hu-1 reference containing spike RBD and S1/S2 furin-cleavage window. Search across all databases. Try it
    • ⭐ 3 famous proteins — Human Insulin, Hemoglobin, and Green Fluorescent Protein (GFP). Search across protein databases. Try it
    Using the API
    Production endpoint: https://metagraph.ethz.ch:8081/search

    Minimal Working Example

    This example has been tested and verified to work!

    Step 1: Create a Search

    curl -X POST "https://metagraph.ethz.ch:8081/search" \
      -H "Content-Type: application/json" \
      -d '{
        "queries": [
          {
            "db": "refseq85_coord",
            "q": "ATGCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG"
          }
        ]
      }'

    Note: Use just the sequence (no FASTA header)
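
    The same request can be prepared from Python. The sketch below uses only the fields shown in the curl example ("queries", "db", "q") and strips FASTA headers as the note requires. Sending one entry per sequence in a single "queries" list is an assumption, since the example above shows only one entry; the function names are illustrative.

```python
import json
import urllib.request
from typing import Iterable

SEARCH_URL = "https://metagraph.ethz.ch:8081/search"


def strip_fasta_header(record: str) -> str:
    """Drop '>' header lines and join the remaining lines into one sequence."""
    lines = (line.strip() for line in record.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


def build_search_request(db: str, sequences: Iterable[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request for a new search."""
    # ASSUMPTION: several sequences may share one "queries" list.
    payload = {
        "queries": [{"db": db, "q": strip_fasta_header(s)} for s in sequences]
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("refseq85_coord", [">example\nATGCGATCG\nTAGCTAG"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually submit: urllib.request.urlopen(request)
```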

    Step 2: Check Status

    curl "https://metagraph.ethz.ch:8081/search/{search_id}/status"

    Replace {search_id} with the ID of the search created in Step 1, and keep checking until status is "done"

    Step 3: Get Results

    curl "https://metagraph.ethz.ch:8081/search/{search_id}/results"
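
    Steps 2 and 3 can be combined into a small polling loop. This is a sketch, not a reference client: it assumes both endpoints return JSON and that the status response carries a "status" field, which the steps above imply but do not spell out. The polling interval and the cap on attempts are arbitrary choices, and the fetch function is injectable so the loop can be tried without network access.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://metagraph.ethz.ch:8081/search"


def http_get_json(url: str) -> dict:
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_results(search_id: str,
                     fetch: Callable[[str], dict] = http_get_json,
                     interval_s: float = 2.0,
                     max_polls: int = 150,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll the status endpoint until it reports "done", then fetch results."""
    status_url = f"{BASE_URL}/{search_id}/status"
    results_url = f"{BASE_URL}/{search_id}/results"
    for _ in range(max_polls):
        # ASSUMPTION: the status document has a top-level "status" key.
        if fetch(status_url).get("status") == "done":
            return fetch(results_url)
        sleep(interval_s)
    raise TimeoutError(f"search {search_id} not done after {max_polls} polls")
```

    A call such as wait_for_results(search_id) would then block until the search finishes or the poll limit is reached.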
    Installing the command line interface

    Conda

    Install the latest release on Linux or macOS with Anaconda:

    conda install -c bioconda -c conda-forge metagraph

    Docker

    If Docker is available on the system, get started immediately with:

    docker pull ghcr.io/ratschlab/metagraph:master
    docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
        metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa

    Replace ${HOME} with the host directory that should be mapped to /mnt in the container.

    More installation options

    By default, the container executes the binary compiled for the DNA alphabet {A,C,G,T}. To run the binary compiled for the DNA5 or Protein alphabet, replace metagraph with metagraph_DNA5 or metagraph_Protein, respectively. For example:

    docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
        metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa

    For more complex workflows, consider running Docker in interactive mode:

    $ docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master
    
    root@5c42291cc9cf:/# ls /mnt/
    root@5c42291cc9cf:/# metagraph --version

    Install From Sources

    To compile from source (e.g., for builds with a custom alphabet or other configurations), see the online documentation.

    Using the command line interface

    Typical workflow

    1. Build a de Bruijn graph from FASTA files, FASTQ files, or KMC k-mer counters:
      ./metagraph build
    2. Annotate the graph using the column-compressed annotation:
      ./metagraph annotate
    3. Transform the built annotation to a different annotation scheme:
      ./metagraph transform_anno
    4. Query annotated graph:
      ./metagraph query

    Example

    DATA="../tests/data/transcripts_1000.fa"
    
    ./metagraph build -k 12 -o transcripts_1000 $DATA
    
    ./metagraph annotate -i transcripts_1000.dbg --anno-filename -o transcripts_1000 $DATA
    
    ./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA
    
    ./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000.dbg
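
    For scripted runs, the example above can be driven from Python. The sketch below assembles exactly the four commands shown, with the same flags and file naming, and runs them in order, stopping at the first failure. The wrapper functions are illustrative and not part of MetaGraph.

```python
import shlex
import subprocess
from typing import List


def workflow_commands(data: str, name: str, k: int = 12,
                      binary: str = "./metagraph") -> List[List[str]]:
    """Build, annotate, query, and stats commands from the example above."""
    graph = f"{name}.dbg"                 # written by `build -o name`
    anno = f"{name}.column.annodbg"       # written by `annotate -o name`
    return [
        [binary, "build", "-k", str(k), "-o", name, data],
        [binary, "annotate", "-i", graph, "--anno-filename", "-o", name, data],
        [binary, "query", "-i", graph, "-a", anno, data],
        [binary, "stats", "-a", anno, graph],
    ]


def run_workflow(commands: List[List[str]]) -> None:
    """Run each command in turn; check=True aborts on a non-zero exit code."""
    for command in commands:
        print("+", shlex.join(command))
        subprocess.run(command, check=True)


commands = workflow_commands("../tests/data/transcripts_1000.fa", "transcripts_1000")
for command in commands:
    print(shlex.join(command))
# run_workflow(commands)  # uncomment where the metagraph binary is available
```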
    More command examples

    Print usage

    ./metagraph

    Build graph

    Simple build

    ./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 \
                            -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
    2>&1 | tee <LOG_DIR>/log.txt

    Build with disk swap (use to limit the RAM usage)

    ./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 --disk-swap <GRAPH_DIR> \
                            -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
    2>&1 | tee <LOG_DIR>/log.txt

    Build from k-mers filtered with KMC

    K=20
    ./KMC/kmc -ci5 -t4 -k$K -m5 -fm <FILE>.fasta.gz <FILE>.cutoff_5 ./KMC
    ./metagraph build -v -p 4 -k $K --mem-cap-gb 10 -o graph <FILE>.cutoff_5.kmc_pre
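
    The effect of the -ci5 cutoff in the KMC command (keep only k-mers seen at least 5 times) can be illustrated on toy data. This sketch counts k-mers naively in memory and ignores reverse complements, so it shows the idea of abundance filtering rather than KMC's actual behaviour or performance.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(sequences: Iterable[str], k: int) -> Counter:
    """Count every overlapping k-mer across all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def filter_by_min_count(counts: Counter, min_count: int) -> Set[str]:
    """Keep k-mers whose count is at least min_count (like a -ci cutoff)."""
    return {kmer for kmer, count in counts.items() if count >= min_count}


reads = ["AAAAA", "AAAC"]
counts = count_kmers(reads, 3)
print(dict(counts))                    # AAA is frequent, AAC is seen once
print(filter_by_min_count(counts, 2))  # only the frequent k-mer survives
```

    K-mers below the cutoff, which are often sequencing errors in raw reads, never reach the graph.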

    Annotate graph

    ./metagraph annotate -v --anno-type row --fasta-anno \
                               -i primates.dbg \
                               -o primates \
                               ~/fasta_zurich/refs_chimpanzee_primates.fa

    Convert annotation to Multi-BRWT

    1. Cluster columns

    ./metagraph transform_anno -v --linkage --greedy \
                               -o linkage.txt \
                               --subsample R \
                               -p NCORES \
                               primates.column.annodbg

    2. Construct Multi-BRWT

    ./metagraph transform_anno -v -p NCORES --anno-type brwt \
                               --linkage-file linkage.txt \
                               -o primates \
                               --parallel-nodes V \
                               primates.column.annodbg

    Query graph

    ./metagraph query -v -i <GRAPH_DIR>/graph.dbg \
                            -a <GRAPH_DIR>/annotation.column.annodbg \
                            --min-kmers-fraction-label 0.8 --labels-delimiter ", " \
                            query_seq.fa

    Align to graph

    ./metagraph align -v -i <GRAPH_DIR>/graph.dbg query_seq.fa

    Assemble sequences

    ./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                            -o assembled.fa \
                            --unitigs

    Assemble differential sequences

    ./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                            --unitigs \
                            -a <GRAPH_DIR>/annotation.column.annodbg \
                            --diff-assembly-rules diff_assembly_rules.json \
                            -o diff_assembled.fa

    Get stats

    Stats for graph

    ./metagraph stats graph.dbg

    Stats for annotation

    ./metagraph stats -a annotation.column.annodbg

    Stats for both

    ./metagraph stats -a annotation.column.annodbg graph.dbg
    Troubleshooting
    No matches
    • Switch to Alignment and/or increase sensitivity by decreasing the discovery threshold.
    • Confirm alphabet: nucleotide vs amino acid; use UniParc for proteins.
    • Try a broader database set.
    • Raw‑read indexes use moderate cleaning (MetaGraph and Logan use different strategies to remove infrequent k-mers), so rare k-mers from raw reads may be absent. Assembled/reference/protein indexes are indexed losslessly with coordinates.
    Very large inputs / timeouts
    • Split the input into smaller batches, or run it via the API or a local command line installation.
    • Prefer exact matching for quick results; re‑run top candidates in alignment mode.
    Rate limits / queued jobs
    • Use the API for medium-sized jobs and a local command line installation for large jobs.
    FAQs
    How fresh is the data?

    See the Databases page for the live snapshot date and coverage.

    Which databases are available?

    INSDC subsets (Microbe, Fungi, Plants, Metazoa incl. Human/Mouse), SRA‑MetaGut, RefSeq, UHGG, Tara Oceans, UniParc. See Databases for the active set.

    What are the limits?

    The web UI supports up to 10 sequences per query (maximum length 50k). For larger runs, use the API or a local command line installation.

    How reproducible are results?

    Indexes are versioned; some include counts or coordinates. Record index name/version and the detailed search parameters.

    Are there known limitations?

    Raw‑read indexes use moderate cleaning (MetaGraph and Logan use different strategies to remove infrequent k-mers). Assembled/reference/protein indexes are indexed losslessly with coordinates.

    Glossary
    k‑mer
    A substring of length k used for indexing/search.
    de Bruijn graph (DBG)
    Nodes are k‑mers; edges connect k‑mers that overlap by k-1 characters.
    Exact match
    Match query k‑mers exactly against the graph (fast).
    Alignment (sequence‑to‑graph)
    Seed–chain–extend alignment to recover closest paths.
    Identity
    Fraction of matching positions between query and match.
    Coverage
    Fraction of query covered by matches/alignments.
    AMR
    Antimicrobial resistance; e.g., CARD genes across metagenomes.
    UHGG
    Unified Human Gastrointestinal Genome collection (human gut microbiome resource).
    MetaSUB
    Metagenomics & Metadesign of Subways & Urban Biomes - global urban microbiome project.
    Tara Oceans
    Global ocean microbiome survey expedition studying marine plankton ecosystems.
    RefSeq
    NCBI Reference Sequence database - comprehensive, non-redundant collection of sequences.
    Logan contigs
    Contigs generated by the Logan project, which keeps only k-mers that appear at least 2 times.
    Accession
    Unique identifier for a biological sequence in public databases (e.g., NCBI, ENA).
    gnomAD
    Genome Aggregation Database - large-scale human genetic variation resource.
    MetaGraph contigs
    Assembled contigs generated in the MetaGraph paper for indexing life's biological sequences.
    INSDC
    International Nucleotide Sequence Database Collaboration - umbrella for NCBI (GenBank & SRA), EMBL‑EBI (ENA), and Japan's DDBJ (including DRA).
    SRA/ENA/DRA
    Sequence Read Archive (NCBI), European Nucleotide Archive (EMBL-EBI), and DDBJ Read Archive (Japan) - the three INSDC partners' raw sequencing data repositories.
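
    The glossary entries for k‑mer, de Bruijn graph, and Exact match can be tied together in a toy example. The sketch below builds a node-centric de Bruijn graph as a plain dictionary and checks whether every k-mer of a query is a node. It is for illustration only: MetaGraph stores graphs in compressed form, and the function names are not part of its API.

```python
from typing import Dict, Iterable, List


def build_dbg(sequences: Iterable[str], k: int,
              alphabet: str = "ACGT") -> Dict[str, List[str]]:
    """Map each k-mer (node) to the k-mers it overlaps by k-1 characters."""
    nodes = {seq[i:i + k] for seq in sequences
             for i in range(len(seq) - k + 1)}
    graph: Dict[str, List[str]] = {}
    for node in sorted(nodes):
        # A successor shares the node's last k-1 characters as its prefix.
        graph[node] = [node[1:] + c for c in alphabet if node[1:] + c in nodes]
    return graph


def exact_match(query: str, graph: Dict[str, List[str]], k: int) -> bool:
    """True if the query has at least one k-mer and all of them are nodes."""
    query_kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    return bool(query_kmers) and all(km in graph for km in query_kmers)


graph = build_dbg(["ACGTC", "CGTA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
print(exact_match("ACGTA", graph, 3))  # spells a path through both inputs
print(exact_match("ACGTT", graph, 3))  # GTT is not in the graph
```

    Note that ACGTA matches even though it appears in neither input sequence as a whole: the graph merges sequences that share k-mers, and annotations (labels) record which samples each k-mer came from.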