Sequence search
Attention
This page is in development
MetaGraph allows for query sequences to be searched against the graph alone, returning the closest path in the graph, or against the annotated graph, returning a set of associated labels. These are referred to as the align and query regimes, respectively.
Align sequences to the graph
Sequence alignment features can be accessed via metagraph align.
By default, the output is a sequence of graph paths that form a disjoint cover of the
query sequence. Depending on the desired level of sensitivity, alignment options range
from simply finding exact k-mer matches to performing a full alignment that finds a
best-scoring path in the graph.
Exact k-mer matching
Also referred to as pseudo-alignment, this feature is accessed via the additional --map flag.
This mode extracts the sequence of k-mers from a query sequence and reports the indices
of the corresponding nodes in the graph. An example command may be:
metagraph align --map -i MYGRAPH.dbg MYREADS.fa
Input sequences may be in FASTA or FASTQ format (uncompressed or gzipped).
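To make the idea of exact k-mer matching concrete, here is a minimal Python sketch of the lookup described above. It is a toy illustration only, not MetaGraph's implementation: the dictionary `toy_index` and the helper names `kmers` and `map_kmers` are hypothetical, and the node IDs are made up.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of `sequence`, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def map_kmers(sequence, k, node_index):
    """Pair each k-mer with its node index, using 0 for absent k-mers."""
    return [(kmer, node_index.get(kmer, 0)) for kmer in kmers(sequence, k)]


# Hypothetical toy index: k-mer -> nonzero node ID (0 is reserved for "absent").
toy_index = {"ACG": 1, "CGT": 2, "GTA": 3}

# "ACGTT" contains the 3-mers ACG, CGT and GTT; only the first two are indexed.
mapping = map_kmers("ACGTT", 3, toy_index)
```

A real graph stores its k-mers in a succinct data structure rather than a dictionary, but the reported result has the same shape: one node index per query k-mer.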
The output is in TSV format with the first column being an input k-mer and the second
column being the index of the corresponding node in the graph (or 0 if not present).
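As a sketch of how this output could be consumed downstream, the following Python snippet reads a two-column TSV and computes the fraction of k-mers found in the graph. The column layout is taken from the description above; the function names and the sample text are made up for illustration and were not produced by a real run.

```python
import csv
import io


def read_kmer_mapping(handle):
    """Yield (kmer, node_index) pairs from two-column TSV text."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], int(row[1])


def present_fraction(pairs):
    """Fraction of k-mers whose node index is nonzero (found in the graph)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for _, node in pairs if node != 0) / len(pairs)


# Made-up sample output: three k-mers found, one absent (index 0).
sample = "ACG\t17\nCGT\t0\nGTA\t42\nTAC\t43\n"
pairs = list(read_kmer_mapping(io.StringIO(sample)))
fraction = present_fraction(pairs)
```

In practice the handle would be the file or pipe that receives the output of metagraph align --map.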
For less verbose output, the additional --query-presence and --count-kmers flags are available.
--query-presence outputs one line per sequence indicating whether the sequence is present (1) or absent (0). A sequence is considered to be present if its fraction of present k-mers is at least d, as set by the --min-kmers-fraction-label flag.
--count-kmers outputs one line per sequence in TSV format. The first column is the sequence header, while the second column is of the form a/b/c, where a is the number of matching k-mers, b is the total number of k-mers, and c is the total number of unique matching k-mers (where reverse complements are considered to be matching).
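The a/b/c field can be unpacked with a few lines of Python. The sketch below assumes the two-column layout described above; the `KmerCounts` type, the function names, the sample line, and the threshold values are all hypothetical. Computing the present fraction as a/b is also an assumption about how the presence threshold is applied.

```python
from typing import NamedTuple


class KmerCounts(NamedTuple):
    header: str
    matching: int         # a: number of matching k-mers
    total: int            # b: total number of k-mers
    unique_matching: int  # c: number of unique matching k-mers


def parse_count_line(line):
    """Parse one 'header<TAB>a/b/c' line."""
    header, counts = line.rstrip("\n").split("\t")[:2]
    a, b, c = (int(x) for x in counts.split("/"))
    return KmerCounts(header, a, b, c)


def is_present(counts, min_fraction):
    """Presence test: matching / total must be at least `min_fraction`."""
    if counts.total == 0:
        return False
    return counts.matching / counts.total >= min_fraction


# Made-up example line: 45 of 50 k-mers match, 40 of them unique.
record = parse_count_line("read_1\t45/50/40\n")
```

With this record, a threshold of 0.7 would mark the sequence as present, while a threshold of 0.95 would not.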
Sequence-to-graph alignment
If the --map flag is not used, metagraph align performs sequence-to-graph alignment, approximately finding the path in the graph that best matches the query sequence. Alongside this path, this mode also returns an alignment score and a CIGAR string describing the edits needed to transform the spelling of the graph path into the query sequence. An example command may be:
metagraph align -i MYGRAPH.dbg MYREADS.fa
The output of the query is in TSV format, with one line per query sequence, where the columns are as follows:
Query name
Query sequence
Strand
Reference sequence (the spelling of the matched path)
Alignment score
Number of exact matches
CIGAR string-like alignment summary
Number of nucleotides trimmed from the prefix of the reference sequence
Reference label matches (;-separated, present if the -a flag is passed)
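The column list above maps naturally onto a small record type. The following Python sketch assumes the columns appear in exactly the order listed, that the score, match count, and trimmed-prefix columns are integers, and that the CIGAR-like summary consists of length/operation pairs. The `Alignment` type, the function names, and the example row are hypothetical and not taken from real MetaGraph output.

```python
import re
from typing import List, NamedTuple


class Alignment(NamedTuple):
    query_name: str
    query_sequence: str
    strand: str
    reference_sequence: str
    score: int
    num_matches: int
    cigar: str
    ref_prefix_trimmed: int
    labels: List[str]  # empty when no label column is present


def parse_alignment_line(line):
    """Parse one tab-separated alignment line into an Alignment record."""
    fields = line.rstrip("\n").split("\t")
    labels = fields[8].split(";") if len(fields) > 8 and fields[8] else []
    return Alignment(
        fields[0], fields[1], fields[2], fields[3],
        int(fields[4]), int(fields[5]), fields[6], int(fields[7]), labels,
    )


def cigar_operations(cigar):
    """Split a CIGAR-like string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Za-z=])", cigar)]


# Made-up example row with a label column.
row = "read_1\tACGTACGT\t+\tACGTTCGT\t12\t7\t4=1X3=\t0\tsampleA;sampleB\n"
aln = parse_alignment_line(row)
ops = cigar_operations(aln.cigar)
```

The regular expression accepts any single-letter operation (or =), so it does not depend on the exact set of operations MetaGraph emits.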
An important parameter is the seed length, which can be set with --align-min-seed-length and can be shorter than the value of k used to construct the graph.
If an annotator is provided with the -a flag, the returned alignments will be label-consistent, meaning that there is at least one label that is shared by all nodes on the path.
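Label consistency, as defined above, amounts to a non-empty intersection of the label sets along a path. The Python sketch below illustrates only that definition; it is not how MetaGraph enforces the constraint during alignment, and the function names and label sets are hypothetical.

```python
def shared_labels(path_labels):
    """Return the labels common to every node on a path.

    `path_labels` holds one collection of labels per node, in path order.
    """
    if not path_labels:
        return set()
    return set.intersection(*map(set, path_labels))


def is_label_consistent(path_labels):
    """A path is label-consistent if at least one label covers all its nodes."""
    return bool(shared_labels(path_labels))


# Made-up paths: label "B" appears on every node of the first path,
# while no single label covers every node of the second.
consistent_path = [{"A", "B"}, {"B", "C"}, {"B"}]
inconsistent_path = [{"A"}, {"B"}, {"A", "B"}]
```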