Search DNA, RNA, and protein sequences across public archives using annotated de Bruijn graphs with exact matching or sensitive alignment.
Web limits: up to 10 sequences per web query. For larger batches, use the web Application Programming Interface (API) or Command Line Interface (CLI).
See the Databases page for live coverage and whether an index includes counts or coordinates.
MetaGraph returns hits organized by database and accession. As in BLAST, results are ranked by relevance, but instead of E-values, MetaGraph uses a discovery threshold and k-mer matching to identify significant matches.
If your search returns no hits, consider:
See the dedicated Examples page, or run a pre‑filled search:
This example has been tested and verified to work!
curl -X POST "https://metagraph.ethz.ch:8081/search" \
  -H "Content-Type: application/json" \
  -d '{
    "queries": [
      {
        "db": "refseq85_coord",
        "q": "ATGCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG"
      }
    ]
  }'
Note: Use only the sequence itself (no FASTA header).
curl "https://metagraph.ethz.ch:8081/search/{search_id}/status"
Keep checking until the status is "done", then fetch the results:
curl "https://metagraph.ethz.ch:8081/search/{search_id}/results"
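The three calls above (submit, poll, fetch) can be chained in a small script. The following is a minimal sketch, not an official client. The helper names `fasta_to_sequence`, `build_payload`, and `wait_until_done` are hypothetical, the input is assumed to be a single-record FASTA file, and the poller assumes that the status response contains the literal string "done" once the search has finished, since the response format is not described here.

```shell
#!/bin/sh
# Sketch of helpers around the search API shown above (hypothetical names).

BASE_URL="https://metagraph.ethz.ch:8081"   # endpoint taken from the example above

# Drop FASTA header lines (those starting with '>') and join the remaining
# sequence lines into one string, since the API expects the bare sequence.
fasta_to_sequence() {
    grep -v '^>' "$1" | tr -d ' \t\r\n'
}

# Print the JSON request body for one database/sequence pair.
# Assumes neither argument contains characters that need JSON escaping.
build_payload() {
    printf '{"queries": [{"db": "%s", "q": "%s"}]}\n' "$1" "$2"
}

# Poll the status endpoint until the response mentions "done".
# ASSUMPTION: the status body contains the string "done" when finished.
# Gives up (non-zero exit status) after max_tries polls.
wait_until_done() (
    search_id="$1"; max_tries="${2:-60}"; delay="${3:-5}"
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        if curl -s "$BASE_URL/search/$search_id/status" | grep -q 'done'; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
)

# Example use (query.fa is a placeholder file name):
#   payload=$(build_payload refseq85_coord "$(fasta_to_sequence query.fa)")
#   curl -X POST "$BASE_URL/search" -H "Content-Type: application/json" -d "$payload"
#   wait_until_done "<search_id>" && curl "$BASE_URL/search/<search_id>/results"
```

Bounding the number of polls keeps the script from hanging if a search never completes or the network is unavailable.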
Install the latest release on Linux or Mac OS X with Anaconda:
conda install -c bioconda -c conda-forge metagraph
If Docker is available on the system, get started immediately with
docker pull ghcr.io/ratschlab/metagraph:master
docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa
and replace ${HOME} with a directory on the host system to map it under /mnt in the container.
By default, it executes the binary compiled for the DNA alphabet {A,C,G,T}. To run the binary compiled for the DNA5 or Protein alphabet, just replace metagraph with metagraph_DNA5 or metagraph_Protein, respectively, e.g.:
docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa
For more complex workflows, consider running Docker in interactive mode:
$ docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master
root@5c42291cc9cf:/# ls /mnt/
root@5c42291cc9cf:/# metagraph --version
To compile from source (e.g., for builds with a custom alphabet or other configurations), see the online documentation.
./metagraph build
./metagraph annotate
./metagraph transform_anno
./metagraph query
DATA="../tests/data/transcripts_1000.fa"
./metagraph build -k 12 -o transcripts_1000 $DATA
./metagraph annotate -i transcripts_1000.dbg --anno-filename -o transcripts_1000 $DATA
./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA
./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000.dbg
./metagraph
Simple build
./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 \
-o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt
Build with disk swap (use to limit the RAM usage)
./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 --disk-swap <GRAPH_DIR> \
-o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt
Build from k-mers filtered with KMC
K=20
./KMC/kmc -ci5 -t4 -k$K -m5 -fm <FILE>.fasta.gz <FILE>.cutoff_5 ./KMC
./metagraph build -v -p 4 -k $K --mem-cap-gb 10 -o graph <FILE>.cutoff_5.kmc_pre
./metagraph annotate -v --anno-type row --fasta-anno \
-i primates.dbg \
-o primates \
~/fasta_zurich/refs_chimpanzee_primates.fa
1. Cluster columns
./metagraph transform_anno -v --linkage --greedy \
-o linkage.txt \
--subsample R \
-p NCORES \
primates.column.annodbg
2. Construct Multi-BRWT
./metagraph transform_anno -v -p NCORES --anno-type brwt \
--linkage-file linkage.txt \
-o primates \
--parallel-nodes V \
primates.column.annodbg
./metagraph query -v -i <GRAPH_DIR>/graph.dbg \
-a <GRAPH_DIR>/annotation.column.annodbg \
--min-kmers-fraction-label 0.8 --labels-delimiter ", " \
query_seq.fa
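As we understand it, the `--min-kmers-fraction-label 0.8` option in the query above sets a threshold on the fraction of the query's k-mers that must be found under a label. The sketch below only illustrates that ratio on plain strings. It is not how MetaGraph computes it internally, and `kmer_fraction` is a hypothetical helper name.

```shell
#!/bin/sh
# Illustration only: fraction of the query's k-mers that also occur in a
# reference string standing in for the sequences under one label.
# usage: kmer_fraction K QUERY REFERENCE
kmer_fraction() {
    awk -v k="$1" -v q="$2" -v r="$3" 'BEGIN {
        # Collect every k-mer of the reference.
        for (i = 1; i + k - 1 <= length(r); i++) ref[substr(r, i, k)] = 1
        n = length(q) - k + 1            # number of k-mers in the query
        if (n < 1) { print "0.00"; exit }
        hits = 0
        # Count query k-mers that are present in the reference set.
        for (i = 1; i <= n; i++) if (substr(q, i, k) in ref) hits++
        printf "%.2f\n", hits / n
    }'
}

# Example: kmer_fraction 3 ACGTA ACGTT
# Query k-mers ACG, CGT, GTA; two of the three occur in the reference.
```

Under this reading, a label would be reported for a query only when the ratio reaches the chosen threshold (0.8 in the command above).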
./metagraph align -v -i <GRAPH_DIR>/graph.dbg query_seq.fa
./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
-o assembled.fa \
--unitigs
./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
--unitigs \
-a <GRAPH_DIR>/annotation.column.annodbg \
--diff-assembly-rules diff_assembly_rules.json \
-o diff_assembled.fa
Stats for graph
./metagraph stats graph.dbg
Stats for annotation
./metagraph stats -a annotation.column.annodbg
Stats for both
./metagraph stats -a annotation.column.annodbg graph.dbg
See the Databases page for the live snapshot date and coverage.
INSDC subsets (Microbe, Fungi, Plants, Metazoa incl. Human/Mouse), SRA‑MetaGut, RefSeq, UHGG, Tara Oceans, UniParc. See Databases for the active set.
The web UI supports up to 10 sequences per query (maximum length 50k). For larger runs, use the API or a local command-line interface.
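A FASTA file can be checked against these limits before it is pasted into the web form. This is a rough sketch under stated assumptions: "maximum length 50k" is read here as 50,000 characters per sequence, and `check_web_limits` is a hypothetical helper, not part of MetaGraph.

```shell
#!/bin/sh
# Pre-flight check against the web limits quoted above.
# ASSUMPTION: the 50k limit applies to each sequence individually.
MAX_SEQS=10
MAX_LEN=50000

# Print "OK" and return 0 if the FASTA file fits; otherwise print the
# reason and return 1.
check_web_limits() {
    awk -v max_seqs="$MAX_SEQS" -v max_len="$MAX_LEN" '
        /^>/ { n++; len = 0; next }      # header line starts a new record
        {
            gsub(/[ \t\r]/, "")          # ignore whitespace inside sequence lines
            len += length($0)
            if (len > longest) longest = len
        }
        END {
            if (n > max_seqs)      { printf "too many sequences: %d\n", n; exit 1 }
            if (longest > max_len) { printf "sequence too long: %d\n", longest; exit 1 }
            print "OK"
        }' "$1"
}

# Example: check_web_limits query.fa || echo "use the API or CLI instead"
```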
Indexes are versioned; some include counts or coordinates. Record index name/version and the detailed search parameters.
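One lightweight way to keep such a record is to append a line per search to a log file. The sketch below is only a suggestion: `log_search`, the tab-separated layout, and the file name are hypothetical, and the index version has to be supplied by the user (for example, as read from the Databases page).

```shell
#!/bin/sh
# Append one tab-separated provenance line per search:
# UTC timestamp, index name, index version, search parameters.
# usage: log_search LOGFILE INDEX_NAME INDEX_VERSION PARAMETERS...
log_search() (
    logfile="$1"; index="$2"; version="$3"
    shift 3
    printf '%s\t%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$index" "$version" "$*" >> "$logfile"
)

# Example (the version string is a placeholder to fill in):
#   log_search searches.tsv refseq85_coord "<version>" --min-kmers-fraction-label 0.8
```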
Raw‑read indexes use moderate cleaning (MetaGraph and Logan use different strategies to remove infrequent k-mers). Assembled/reference/protein indexes are indexed losslessly with coordinates.
Report issues or request features: