Quick start

MetaGraph constructs indexes composed of two main elements: a k-mer index and an annotation matrix.

The k-mer index stores all k-mers from the input sequences and represents a de Bruijn graph. This index serves as a dictionary mapping k-mers to their unique positive identifiers.

The second element is a matrix encoding the relation between the k-mers and their attributes. These relations may represent, for instance:

  • k-mer i is present in sequence/genome j

  • k-mer i is present in SRA sample j

  • k-mer i is present in file j

  • k-mer i is present in a collection of sequences/files marked by j

  • k-mer i is highly expressed in sample j

The annotation matrix can also be supplemented with additional attributes to represent such quantities as:

  • k-mer i occurs \(c_i\) times in sample j (k-mer abundance)

  • k-mer i occurs at positions \(p_1,\dots,p_{c_i}\) in genome j (k-mer coordinates)

In the following, you will find simplified instructions and examples for constructing a MetaGraph index and querying it.

The indexing workflow in MetaGraph consists of two major steps: graph construction and annotation construction.

Construct graph

Basics

De Bruijn graphs are constructed with metagraph build:

metagraph build --verbose --parallel 4 -k 31 --outfile-base graph transcripts_1000.fa

where transcripts_1000.fa is a fasta/fastq file (may be gzipped) with input sequences. The same command can be shortened to:

metagraph build -v -p 4 -k 31 -o graph transcripts_1000.fa

It is possible to pass multiple input files:

metagraph build -v -p 4 -k 31 -o graph file_1.fa file_2.fa ...

or pipe a list of files:

find . -name "*.fa" | metagraph build -v -p 4 -k 31 -o graph

These commands construct a single de Bruijn graph (a k-mer index) from all k-mers extracted from the input sequences.

To see the list of all available flags, type metagraph build.

Note

There are various graph representations available in MetaGraph, which can be chosen via the flag --graph. However, the default succinct representation is usually the best choice because of its great scalability (requires only 2-4 bits per k-mer) and the ability to search sub-k-mers/k-mer ranges, which is important for sequence alignment.

To check the statistics for a constructed graph, type:

metagraph stats graph.dbg

Construct with disk swap

For very large inputs, graphs can be constructed with disk swap to limit the RAM usage (currently, available only for succinct graph representations). For example, the size of the buffer kept in memory can be set to 4 GiB as follows:

metagraph build -v -k 31 -o graph -p 36 --disk-swap /tmp --mem-cap-gb 4 file.fa ...

Using larger buffers usually speeds up the construction. However, using a 50-80 GiB buffer is always sufficient, even when constructing graphs containing trillions of k-mers.

Construct from KMC counters

Apart from the standard FASTA/FASTQ/VCF input formats (uncompressed or gzipped), a graph can be constructed from k-mer counters generated by KMC.

KMC is an extremely efficient tool for counting k-mers, which can be used to quickly pre-process the input and deduplicate/count/filter the input k-mers. For example, the following commands can be used to construct a graph only from k-mers occurring at least 5 times in the input:

K=31
# count k-mers with KMC
./KMC/kmc -ci5 -t4 -k$K -m5 -fm SRR403017.fasta.gz SRR403017.cutoff_5 ./KMC
# build graph with MetaGraph
metagraph build -v -p 4 -k $K -o graph SRR403017.cutoff_5.kmc_pre

Note

In the above example, we use ./KMC as the directory where KMC will store its intermediate caches. Depending on your input data, this directory should be at a location with a sufficient amount of free intermediate storage space.

Construct weighted graph

Since KMC does compute the exact counts of all k-mers from the input, this information can be taken into account when transforming into a de Bruijn graph. Specifically, a node weight (k-mer count) can be associated with every k-mer from the graph. All node weights supplement the graph in the form of an integer vector graph.dbg.weights, which can be constructed by simply adding the flag --count-kmers to the graph construction command:

metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.cutoff_5.kmc_pre

which will construct a de Bruijn graph graph.dbg and an array of counts counts.dbg.weights associated with its k-mers. This supplementing vector of counts can then be used for graph cleaning (see Graph cleaning) or indexing k-mer counts with pre-counting (see Annotate with pre-counting).

Note

A weighted graph can also be constructed directly from raw input sequences, without pre-counting with KMC, e.g.,:

metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.fasta.gz

This should be used when pre-processing with KMC is complicated or impossible, e.g., when indexing protein sequences.

Transform to other representations

To transform a succinct graph into a more compressed and smaller representation, run:

metagraph transform -v --state small -p 4 -o graph_small graph.dbg

Transform to sequences

To transform a graph back to sequences, it can be traversed to extract all its contigs:

metagraph transform -v --to-fasta -o contigs -p 4 graph.dbg

These sequences contain all k-mers indexed in the graph exactly once and can be used as their non-redundant (deduplicated) representation.

The assembled contigs are written to a compressed FASTA file, which can be inspected with:

zless contigs.fasta.gz

To extract all unitigs (linear paths in the graph) instead of contigs, add flag --unitigs to the same metagraph transform command.

Note

If the source de Bruijn graph is weighted (see section Construct weighted graph), the contigs (written to file *.fasta.gz) will be extracted along with the counts of their constituting k-mers, written to file *.kmer_counts.gz.

Construct canonical graph

When the input sequences are raw reads of unknown directionality (strandedness), it is natural to index along with each sequence its reverse complement.

MetaGraph has a special graph mode where each k-mer indexed in the graph automatically adds its reverse complement k-mer to the index. To build a canonical graph from a set of reads/sequences, add --mode canonical to the build command:

find . -name "*.fa" | metagraph build -v -p 4 -k 31 -o graph --mode canonical

Important

Canonical graphs should not be used in combination with delta-coded annotations of type RowDiff<*>. For canonical graphs, only half of the k-mers are annotated, which creates a huge number of “gaps” in the annotation, diminishing the effectiveness of the coding. Instead, canonical graphs should always be transformed to primary (see section Construct primary graph below) before annotating them.

Construct primary graph

Canonical graphs contain each k-mer in both of its forms (forward and reverse complement), but the same data structure can be modeled by storing only one of them, implicitly modeling the other. Often, different tools achieve this by only storing the lexicographically smallest of the two k-mers. However, it is not possible to efficiently implement this with the succinct graph representation. Hence, we relax this constraint and pick any of the two forms of each k-mer. In a nutshell, this representation is constructed by fully traversing the canonical graph and marking a k-mer as primary if it was reached before its reverse complement in the traversal. The graph containing only primary k-mers is called a primary graph.

The algorithm for primarization of a canonical graph is as follows:

  1. First, extract a set of primary contigs (stretches of primary k-mers) from the canonical graph:

    metagraph transform -v --to-fasta --primary-kmers -o primary_contigs -p 4 graph.dbg
    
  2. Then, construct a new graph from the primary contigs and mark this graph as primary by adding --mode primary to the build command:

    metagraph build -v -p 4 -k 31 -o graph_primary --mode primary primary_contigs.fasta.gz
    

Now, this new graph graph_primary.dbg emulates the original canonical graph (e.g., when querying or annotating). It represents the same information as the original canonical graph, while taking only half of the space.

Graph cleaning

For removing sequencing errors, MetaGraph provides routines for graph cleaning and k-mer filtering. These are based on the assumption that k-mers with relatively low abundance (low k-mer counts) in the input data were likely generated due to sequencing errors, and hence should be dropped. Moreover, to make the cleaning procedure more robust, the decision about filtering out a k-mer can be based on the median abundance of the unitig to which this k-mer belongs. That is, k-mers with low abundance are preserved if they are situated in a unitig with sufficiently many highly abundant k-mers.

K=31
metagraph build -v -p 4 -k $K --count-kmers -o graph SRR403017.fasta.gz

metagraph clean -v -p 4 --to-fasta --prune-tips $((2*$K)) --prune-unitigs 0 --fallback 2 \
                -o SRR403017_clean_contigs graph.dbg

zless SRR403017_clean_contigs.fasta.gz

Note

The default parameters in metagraph clean correspond to no cleaning. That is, an equivalent of metagraph transform --to-fasta, which extracts from the input de Bruijn graph all contigs, without removing any k-mers.

For cleaning graphs constructed from high-throughput Illumina reads, the recommended parameters are --prune-tips <2k> --prune-unitigs 0 --fallback 2, which implements the cleaning procedure proposed in McCortex (Turner et al., 2018) and includes the following steps:

  1. Prune all tips shorter than 2k, where k is the k-mer length.

  2. Compute a threshold for the minimum k-mer abundance as follows. Assume the number of k-mers with sequencing errors (erroneous k-mers) follows a Poisson distribution with a Gamma distributed mean. Also, assume that all k-mers with an abundance of 3 or less are generated due to sequencing errors. Based on these numbers, fit a Poisson distribution and pick a threshold such that k-mers predicted to be erroneous make up at most 0.1% of the total k-mer coverage at that abundance level. If the chosen threshold keeps less than 20% of the total coverage, deem the automatic estimation procedure unsuccessful and use the fallback value of 2 instead (set by flag --fallback).

  3. Traverse the graph (where all short tips have already been removed in step 1) and fetch all unitigs with a median k-mer abundance greater or equal to the threshold defined in step 2.

Once all clean contigs (or unitigs) are extracted from a de Bruijn graph, construct a clean de Bruijn graph from them.

Tip

When indexing multiple read sets, the recommended workflow is to build a sample de Bruijn graph from each read set separately and clean these sample graphs independently (that is, extract clean contigs from each of them). Next, build a joint de Bruijn graph from all these clean contigs and finally annotate it using the generated clean contig sets instead of the original raw read sets.

Annotate graph

Once a graph is constructed, there are multiple ways to construct the corresponding annotation to encode its metadata.

Annotate sequence headers

For annotating each sequence with its header in the fasta/fastq file, run

metagraph annotate -v -i graph.dbg --anno-header -o annotation transcripts_1000.fa

This is a common annotation scenario when indexing reference sequences or assembled genomes.

To check the statistics for the constructed annotation, type:

metagraph stats -a annotation.column.annodbg

All annotation labels (column names) for an annotation matrix can be printed with:

metagraph stats --print-col-names -a annotation.column.annodbg

Annotate source filenames

To label all k-mers from the same file with a common label (for instance for the experiment discovery problem), the command is:

metagraph annotate -v -i graph.dbg --anno-filename -o annotation file_1.fa file_2.fa ...

which will annotate k-mers from the first file by label file_1.fa, k-mers from the second file by label file_2.fa, etc.

Annotate with disk swap

When the input files and the output annotation are very large, disk swap space can be used by setting flags --disk-swap and --mem-cap-gb, to limit the size of internal buffers and reduce RAM usage during annotation construction:

metagraph annotate -v -i graph.dbg --anno-filename --disk-swap /tmp --mem-cap-gb 1 \
                      -o annotation file_1.fa file_2.fa ...

Annotate files independently

It is recommended to independently construct a single annotation column per each input file. To do this in parallel and avoid loading the same graph multiple times, run one annotation command with the flag --separately added:

metagraph annotate -v -i graph.dbg --anno-filename --separately -p 4 --threads-each 9 \
                      -o annotation file_1.fa file_2.fa ...

This will create a new directory annotation/ with individual annotation columns:

file_1.fa.column.annodbg    file_2.fa.column.annodbg    ...

Note, in the command above we passed -p 4 --threads-each 9 to annotate 4 files at a time, in parallel, where each uses 9 threads. Thus, this uses 36 threads in total.

Tip

It is recommended to run annotation from a set of long (primary) contigs/unitigs, where all k-mers have already been deduplicated, especially when annotating a (primary) graph in the succinct representation. In contrast, annotating a succinct graph from separate k-mers (especially not deduplicated) will take orders of magnitude longer. The contigs serve as an equivalent non-redundant representation of the k-mer sets and, thus, result in the same graph annotation. Thus, in practice, for large inputs, it is recommended to construct individual (canonical) de Bruijn graphs from all read sets, called sample graphs, and transform them to contigs. These contig sets are then used instead of the original read sets to construct and annotate the joint (primary) graph.

Annotate from KMC counters

This might depend on the particular graph representation used to store the joint graph. However, with the succinct graph representation, it is never efficient to annotate a graph directly from KMC counters because in their format k-mers are not ordered, which leads to many random k-mer lookups in the BOSS table.

There is, however, an extra pre-processing step, which makes this task efficiently solvable. First, one can assemble a graph from the KMC counter (see Construct from KMC counters) and extract contigs from it (see Transform to sequences or Graph cleaning). Next, annotate the graph using these assembled contigs as a normal FASTA file instead of the original KMC counter.

Tip

Together, Construct from KMC counters and Annotate from KMC counters provide an efficient algorithm for constructing an annotated graph from many input files with sequences (e.g., indexing SRA experiments). Namely, this algorithm includes the following steps.

  1. Deduplicate all k-mers in all files with KMC (construct KMC counters, possibly with filtering by abundance)

  2. Construct sample graphs from all KMC counters

  3. Extract contigs from each sample graph (possibly primary contigs, can also be combined with graph cleaning)

  4. Build a large joint graph from all extracted contigs (possibly, in the primary mode)

  5. Annotate this joint graph using the same files with contigs

Annotate using custom labels

To add a custom annotation label for all k-mers from an input file, add --anno-label <LABEL_NAME> when annotating the graph.

Index k-mer counts

MetaGraph supports indexing k-mer counts (k-mer abundances), e.g., to represent gene expression in RNA-seq data.

The counts can supplement graphs in any representation. To construct a MetaGraph index with k-mer counts (Counting de Bruijn graph), construct a de Bruijn graph as usual (see Construct graph) and then add --count-kmers to the annotation command, e.g.:

metagraph annotate -v -i graph.dbg --anno-filename --count-kmers -p 4 \
                      -o annotation transcripts_1000.fa

Along with the normal (binary) graph annotation annotation.column.annodbg, this command will also create an array of corresponding k-mer counts annotation.column.annodbg.counts. These counts represent how many times each k-mer indexed in graph.dbg occurs in the input file transcripts_1000.fa.

Note

By default, each count is stored in an 8-bit integer, and all counts greater than 255 are clipped. This value, however, can be changed with the flag --count-width, to represent counts greater than 255 or, the other way around, clip all large counts when only lowly abundant k-mers are of interest.

All other flags (e.g., --separately and --disk-swap) described above are also supported similarly as for binary annotations.

The histogram for indexed k-mer counts can be viewed with:

metagraph stats -a annotation.column.annodbg --print-counts-hist

It is also possible to compute and print quantiles of all indexed counts. For instance, type the following command to compute the minimum non-zero count, as well as the median and the maximum count:

metagraph stats -a annotation.column.annodbg --count-quantiles '0 0.5 1'

Annotate with pre-counting

Similarly to the case of simple binary annotations considered above, it is recommended to pre-count k-mers for each annotation label (typically, sequencing sample) before annotating it. This technique consists in first constructing a weighted de Bruijn graph (see Construct weighted graph and note that it can be constructed from raw input sequences as well as from pre-computed KMC counters) and transforming it to contigs with counts associated with their constituting k-mers:

metagraph transform -v -p 4 --to-fasta -o contigs sample_graph.dbg

Then, the contigs written to contigs.fasta.gz and the counts associated with their k-mers written to contigs.kmer_counts.gz can be used when constructing a count-aware graph annotation:

metagraph annotate -v -i graph.dbg --anno-filename --count-kmers -p 4 \
                      -o annotation contigs.fasta.gz

Warning

If something went wrong and no counts could be read from file contigs.kmer_counts.gz, a warning message [warning] No k-mer counts found ... will be printed and metagraph annotate will proceed with the assumption that the count of each k-mer is equal to 1.

Query k-mer counts

For querying k-mer counts (abundances), for example, to see how highly a gene is expressed in the indexed RNA-seq samples, the annotation should be transformed to a representation that combines both the binary annotation matrix (from *.column.annodbg) and the count values (from *.column.annodbg.counts). For more details, see section Convert count-aware annotations.

Once the annotation is transformed, k-mer abundances can be queried with:

metagraph query --query-mode counts ...

Note that if flag --query-mode counts is not passed, the index will be queried in the default k-mer presence/absence regime.

Index k-mer coordinates

Besides indexing k-mer counts, MetaGraph supports indexing k-mer coordinates, that is, their positions in the source input. These may represent positions in a genome, positions of a k-mer in a raw SRA experiment (say, each read has 70 k-mers in it; then the second k-mer of the third read has coordinate 211). Depending on the target application and the final goal, it is both possible to consider each sequence of the input as a separate label and index the coordinates of its k-mers separately or, for the other extreme, put everything into a single label and use the annotated coordinates of the k-mers to find the borders of each indexed sequence in post-processing query results. In all cases, it is possible to reconstruct the original input from indexes of this kind, which makes this indexing method fully lossless (see more details in paper https://www.biorxiv.org/content/10.1101/2021.11.09.467907).

To construct a MetaGraph index with k-mer coordinates (represented as a Counting de Bruijn graph), construct a de Bruijn graph as usual (see Construct graph) and then add --coordinates to the annotation command, e.g.:

metagraph annotate -v -i graph.dbg --anno-filename --coordinates -p 4 \
                      -o annotation transcripts_1000.fa

Along with the normal (binary) graph annotation annotation.column.annodbg, this command will create an array of corresponding k-mer coordinates annotation.column.annodbg.coords. Annotation in the --anno-header mode is also supported. In that case, a new annotation label will be created for every sequence in the input, and the first coordinate for every starting k-mer will be re-set to 0 for every sequence. Note, however, that this assumes that all sequence headers in the FASTA file are unique and do not repeat. If this condition is not met, an error will be returned.

All other flags (e.g., --separately and --disk-swap) described above are also supported similarly as for binary annotations.

Query k-mer coordinates

Once a coordinate-aware annotation is constructed, it can be transformed into a more memory-efficient representation supporting querying (see Convert coordinate-aware annotations below) and then queried with:

metagraph query --query-mode coords ...

As the coordinate-aware annotations also contain the information about k-mer abundance, they can be queried to retrieve k-mer counts (simply pass --query-mode counts instead of --query-mode coords). Note that if neither --query-mode coords nor --query-mode counts is passed, the index will be queried in the default k-mer presence/absence regime.

Transform annotation

To enhance the query performance and reduce the memory footprint, annotations can be converted to other representations.

There are several different annotation representations available in MetaGraph (see the possible values for flag --anno-type in metagraph transform_anno). For instance, Rainbowfish can be used to achieve a very fast query speed, but it can be applied only to relatively small problem instances (about 100 GB) because of the limited compression performance and the complexity of the construction algorithm. In contrast, RowDiff<Multi-BRWT> typically achieves the best compression while still providing a good query performance, and thus, it is recommended for very large problem instances. Finally, RowDiff<RowSparse> provides a good trade-off between the query speed and compression performance.

Convert annotation to Rainbowfish

The conversion to Rainbowfish consists of two steps.

  1. First, convert the column-compressed annotation to the row-major representation:

    find . -name "*.column.annodbg" | metagraph transform_anno -v \
                                                 --anno-type row \
                                                 -o annotation ...
    
  2. Then, transform the row-major annotation to the compressed Rainbowfish representation:

    metagraph transform_anno -v --anno-type rbfish \
                                -o annotation \
                                annotation.row.annodbg
    

Convert annotation to Multi-BRWT

The conversion to Multi-BRWT can be done either

  • with a single command, e.g.:

    find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \
                                                    --greedy -o anno
    
  • or with pre-computing a column clustering with:

    find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \
                                                    --linkage --greedy -o linkage.txt
    

    and next converting the annotation to Multi-BRWT according to the pre-computed clustering (linkage):

    find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \
                                                    --linkage-file linkage.txt -o anno
    

Note

If the clustering is too slow, it probably means it uses too many subsampled rows. In this case, consider changing the value passed with flag --subsample <INT>. The 1M rows subsampled by default are usually enough even for very large annotations. Increasing this value usually does not lead to any significantly better compression.

Finally, the internal structure of the BRWT tree can be relaxed (which is always recommended to do) to increase the arity of its internal nodes and enhance the compression:

metagraph relax_brwt -v -p 18 -o anno_relaxed anno.brwt.annodbg

Note

By default, metagraph transform_anno --anno-type brwt uses disk swap for temporary files created for each annotation column (label), which might be inappropriate when the number of columns is too large (around a million or more). In such cases, pass an additional flag --disk-swap "" to compute everything in-memory without creating temp files.

Convert annotation to RowDiff<Multi-BRWT>

The conversion to RowDiff<Multi-BRWT> is done in two steps.

  1. Transform annotation columns *.column.annodbg to row_diff in three stages:

    find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \
                                        --anno-type row_diff --row-diff-stage 0 \
                                        -i graph.dbg --mem-cap-gb 300
    
    find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \
                                        --anno-type row_diff --row-diff-stage 1 \
                                        -i graph.dbg --mem-cap-gb 300
    
    find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \
                                        --anno-type row_diff --row-diff-stage 2 \
                                        -i graph.dbg --mem-cap-gb 300
    

    Note that this requires to pass the graph graph.dbg as well in order to derive the topology for the diff-transform.

  2. Transform the diff-transformed columns *.row_diff.annodbg to Multi-BRWT:

    find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \
                                                    -i graph.dbg \
                                                    --anno-type row_diff_brwt \
                                                    --greedy ...
    metagraph relax_brwt -v -p 18 \
                         --relax-arity 32 \
                         -o annotation_relaxed \
                         annotation.row_diff_brwt.annodbg
    

    Also see the above paragraph Convert annotation to Multi-BRWT for other options.

Convert annotation to RowDiff<RowSparse>

The conversion to RowDiff<RowSparse> is similar to Convert annotation to RowDiff<Multi-BRWT>. The first step is the same. In the second step, the diff-transformed columns *.row_diff.annodbg are converted to RowSparse:

find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \
                                                -i graph.dbg \
                                                --anno-type row_diff_sparse

Convert count-aware annotations

Converting a graph annotation supplemented with k-mer counts (*.column.annodbg + *.column.annodbg.counts) to Int-Multi-BRWT (int_brwt) is done exactly the same way as converting a binary annotation to Multi-BRWT (see Convert annotation to Multi-BRWT), with simply replacing --anno-type brwt with --anno-type int_brwt:

metagraph transform_anno --anno-type int_brwt --greedy ...

For converting to RowDiff<Int-Multi-BRWT> (row_diff_int_brwt), perform the same steps as when converting to RowDiff<Multi-BRWT> with the following exceptions. First, an additional flag --count-kmers has to be passed on step 1 (the row-diff transform). Second, on step 2, delta-transformed columns have file extension .column.annodbg and not .row_diff.annodbg. These columns should be passed to the metagraph transform_anno --anno-type row_diff_int_brwt command. The corresponding transformed delta counts will be loaded automatically.

For further examples on real data, see https://github.com/ratschlab/counting_dbg/blob/master/scripts.md#index-with-k-mer-counts.

Convert coordinate-aware annotations

Conversion to column_coord is straightforward.

Conversion to brwt_coord is analogous to brwt and int_brwt.

Conversion to row_diff_brwt_coord is analogous to row_diff_brwt and row_diff_int_brwt, where an additional flag --coordinates has to be passed.

Additionally, one can convert the delta-transformed columns with coordinates (after step --anno-type row_diff --coordinates) directly to the ColumnCompressed format (row_diff_coord), equivalent to row_diff_brwt_coord with the arity set to infinity, that is, all leaves (original labels) directly connected to the root of the BRWT tree.

Query index

Using Command Line Interface

To query a MetaGraph index (graph + annotation) using the command line interface (CLI), run metagraph query, e.g.:

metagraph query -i graph.dbg \
                -a annotation.column.annodbg \
                --discovery-fraction 0.1 \
                transcripts_1000.fa

For alignment, see metagraph align.

To load up a MetaGraph index in server mode for querying it with the Python API or via HTTP requests, run:

metagraph server_query -i graph.dbg \
                       -a annotation.column.annodbg \
                       --port <PORT> \
                       --parallel <NUM_THREADS>

Using Python API

See Python API

Column operations

MetaGraph supports operations aggregating multiple annotation columns to compute statistics for the k-mers and their counts, e.g.:

metagraph transform_anno --aggregate-columns -o out \
                            --min-count 2 --max-count 2 \
                            A.column.annodbg \
                            B.column.annodbg

to construct a new annotation column out.column.annodbg which computes AND between columns A and B.

Despite the simplistic appearance of this command, it can also compute many other complex operations. Below we provide a few common examples of such aggregating operations.

Examples

  1. Select all “unique” k-mers, that is, appearing only in a single annotation column among columns in annotation annotation.column.annodbg:

    metagraph transform_anno --aggregate-columns -o out \
                             --max-count 1 annotation.column.annodbg
    
  2. Select all “common” k-mers, that is, appearing in at least 95% of annotation columns:

    metagraph transform_anno --aggregate-columns -o out \
                             --min-fraction 0.95 annotation.column.annodbg
    

Note

This command (metagraph transform_anno --aggregate-columns ...) only supports annotations in the ColumnCompressed format, that is, constructed by the metagraph annotate command. If the input columns have associated counts (e.g., constructed with metagraph annotate --count-kmers ...), they can be loaded and used by the aggregator as well.

Aggregating function

In general, the following formula is used for aggregation:

\[\text{min-count} <= \sum_i 1\{\text{min-value} <= c_i <= \text{max-value}\} <= \text{max-count},\]

where \(c_i\) is the count for the current k-mer (see Index k-mer counts). If no counts are associated with the column, \(c_i = 1\) for every set bit and \(0\) otherwise. If this sum falls within specified \(\text{min-count}\) and \(\text{max-count}\), the bit in the aggregated column for this k-mer is set to 1, and the value of the sum is written as the count associated with that bit.

In other words, the output aggregated column is always supplemented with a count vector, which can be interpreted as normal k-mer counts. For instance, consider the following example.

  1. Suppose, first we want to compute in how many columns each k-mer occurs:

    metagraph transform_anno --aggregate-columns -o out annotation.column.annodbg
    
  2. Then we can inspect the histogram of k-mer frequencies to, say, find an appropriate threshold for maximum frequency:

    metagraph stats -a out.column.annodbg --print-counts-hist
    
  3. Finally, after inspecting the histogram and selecting a reasonable threshold (suppose, we decided to filter out all k-mers that occur in more than 10 columns), we can apply it to the aggregated column as to a normal original column with counts:

    metagraph transform_anno --aggregate-columns -o rare_kmers \
                             --max-value 10 out.column.annodbg
    

which generates the final column rare_kmers.column.annodbg with the mask indicating all k-mers occurring in 10 or fewer input columns in the original file annotation.column.annodbg.