MetaGraph Docs
  • Docs
      • Installation
      • Quick start
      • Snakemake workflows
      • Python API
      • Sequence search
      • Sequence assembly
      • Resources
  • Page
      • Sequence search
        • Align sequences to the graph
          • Exact k-mer matching
          • Sequence-to-graph alignment
  • « Python API
  • Sequence assembly »
  • Home

Sequence search¶

Attention

This page is in development

MetaGraph allows for query sequences to be searched against the graph alone, returning the closest path in the graph, or against the annotated graph, returning a set of associated labels. These are referred to as the align and query regimes, respectively.

Align sequences to the graph¶

Sequence alignment features can be accessed via metagraph align. By default, a sequence of graph paths is output which form a disjoint cover of the query sequence. Depending on the desired level of sensitivity, alignment options range from simply finding exact k-mer matches to performing an alignment to finding a best-scoring path in the graph.

Exact k-mer matching¶

Also referred to as pseudo-alignment, this feature is accessed via the additional --map flag. This mode extracts the sequence of k-mers from a query sequence and reports the indices of the corresponding nodes in the graph. An example command may be:

metagraph align --map -i MYGRAPH.dbg MYREADS.fa

Input sequences may be in FASTA or FASTQ format (uncompressed or gzipped). The output is in TSV format with the first column being an input k-mer and the second column being the corresponding node in the graph (or 0 if not present).

For less verbose output, the additional --query-presence and --count-kmers flags are available.

  • --query-presence outputs one line per sequence indicating whether the sequence is present (1) or absent (0). A sequence is considered to be present if its fraction of present k-mers is at least d, as set by the --min-kmers-fraction-label flag.

  • --count-kmers outputs one line per sequence in TSV format. The first column is the sequence header, while the second column is of the form a/b/c, where a is the number of matching k-mers, b is the total number of k-mers, and c is the total number of unique matching k-mers (where reverse complements are considered to be matching).

Sequence-to-graph alignment¶

If the --map flag is not used, this enables sequence-to-graph alignment, approximately finding the best-matching path in the graph to the query sequence. Alongside this path, this mode also returns an alignment score and a CIGAR string describing the edits needed to transform the spelling of the graph path to the query sequence. An example command may be:

metagraph align -i MYGRAPH.dbg MYREADS.fa

The output of the query is in TSV format, with one line per query sequence, where the columns are as follows:

  1. Query name

  2. Query sequence

  3. Strand

  4. Reference sequence (the spelling of the matched path

  5. Alignment score

  6. Number of exact matches

  7. CIGAR string-like alignment summary

  8. Number of nucleotides trimmed from the prefix of the reference sequence

  9. Reference label matches (; separated, present if the -a flag is passed)

An important parameter is the seed length, which can be set with --align-min-seed-length and can be shorter than the value of k used to construct the graph.

If an annotator is provided with the -a flag, the returned alignments will be label-consistent, meaning that there is at least one label that is shared by all nodes on the path.

Back to top

Source

© Copyright 2022, Mikhail Karasikov, Harun Mustafa, Andre Kahles, Gunnar Ratsch, Daniel Danciu, Marc Zimmermann, Chris Barber.
Created using Sphinx 4.4.0.