Quick start¶
MetaGraph constructs indexes composed of two main elements: a k-mer index and an annotation matrix.
The k-mer index stores all k-mers from the input sequences and represents a de Bruijn graph. This index serves as a dictionary mapping k-mers to their unique positive identifiers.
The second element is a matrix encoding the relation between the k-mers and their attributes. These relations may represent, for instance:
k-mer
i
is present in sequence/genomej
k-mer
i
is present in SRA samplej
k-mer
i
is present in filej
k-mer
i
is present in a collection of sequences/files marked byj
k-mer
i
is highly expressed in samplej
The annotation matrix can also be supplemented with additional attributes to represent such quantities as:
k-mer
i
occurs \(c_i\) times in samplej
(k-mer abundance)k-mer
i
occurs at positions \(p_1,\dots,p_{c_i}\) in genomej
(k-mer coordinates)
In the following, you will find simplified instructions and examples for constructing a MetaGraph index and querying it.
The indexing workflow in MetaGraph consists of two major steps: graph construction and annotation construction.
Construct graph¶
Basics¶
De Bruijn graphs are constructed with metagraph build
:
metagraph build --verbose --parallel 4 -k 31 --outfile-base graph transcripts_1000.fa
where transcripts_1000.fa
is a fasta/fastq file (may be gzipped) with input sequences. The same
command can be shortened to:
metagraph build -v -p 4 -k 31 -o graph transcripts_1000.fa
It is possible to pass multiple input files:
metagraph build -v -p 4 -k 31 -o graph file_1.fa file_2.fa ...
or pipe a list of files:
find . -name "*.fa" | metagraph build -v -p 4 -k 31 -o graph
These commands construct a single de Bruijn graph (a k-mer index) from all k-mers extracted from the input sequences.
To see the list of all available flags, type metagraph build
.
Note
There are various graph representations available in MetaGraph, which can be chosen via the flag --graph
.
However, the default succinct
representation is usually the best choice because of its great scalability (requires only 2-4 bits per k-mer) and the ability to search sub-k-mers/k-mer ranges, which is important for sequence alignment.
To check the statistics for a constructed graph, type:
metagraph stats graph.dbg
Construct with disk swap¶
For very large inputs, graphs can be constructed with disk swap to limit the RAM usage (currently, available only for succinct
graph representations).
For example, the size of the buffer kept in memory can be set to 4 GiB as follows:
metagraph build -v -k 31 -o graph -p 36 --disk-swap /tmp --mem-cap-gb 4 file.fa ...
Using larger buffers usually speeds up the construction. However, using a 50-80 GiB buffer is always sufficient, even when constructing graphs containing trillions of k-mers.
Construct from KMC counters¶
Apart from the standard FASTA/FASTQ/VCF input formats (uncompressed or gzipped), a graph can be constructed from k-mer counters generated by KMC.
KMC is an extremely efficient tool for counting k-mers, which can be used to quickly pre-process the input and deduplicate/count/filter the input k-mers. For example, the following commands can be used to construct a graph only from k-mers occurring at least 5 times in the input:
K=31
# count k-mers with KMC
./KMC/kmc -ci5 -t4 -k$K -m5 -fm SRR403017.fasta.gz SRR403017.cutoff_5 ./KMC
# build graph with MetaGraph
metagraph build -v -p 4 -k $K -o graph SRR403017.cutoff_5.kmc_pre
Note
In the above example, we use ./KMC
as the directory where KMC will store its
intermediate caches. Depending on your input data, this directory should be at a location
with a sufficient amount of free intermediate storage space.
Construct weighted graph¶
Since KMC does compute the exact counts of all k-mers from the input, this information can be taken into account when transforming into a de Bruijn graph.
Specifically, a node weight (k-mer count) can be associated with every k-mer from the graph.
All node weights supplement the graph in the form of an integer vector graph.dbg.weights
,
which can be constructed by simply adding the flag --count-kmers
to the graph construction command:
metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.cutoff_5.kmc_pre
which will construct a de Bruijn graph graph.dbg
and an array of counts counts.dbg.weights
associated with its k-mers.
This supplementing vector of counts can then be used for graph cleaning (see Graph cleaning) or indexing k-mer counts
with pre-counting (see Annotate with pre-counting).
Note
A weighted graph can also be constructed directly from raw input sequences, without pre-counting with KMC, e.g.,:
metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.fasta.gz
This should be used when pre-processing with KMC is complicated or impossible, e.g., when indexing protein sequences.
Transform to other representations¶
To transform a succinct
graph into a more compressed and smaller representation, run:
metagraph transform -v --state small -p 4 -o graph_small graph.dbg
Transform to sequences¶
To transform a graph back to sequences, it can be traversed to extract all its contigs:
metagraph transform -v --to-fasta -o contigs -p 4 graph.dbg
These sequences contain all k-mers indexed in the graph exactly once and can be used as their non-redundant (deduplicated) representation.
The assembled contigs are written to a compressed FASTA file, which can be inspected with:
zless contigs.fasta.gz
To extract all unitigs (linear paths in the graph) instead of contigs, add flag --unitigs
to the same metagraph transform
command.
Note
If the source de Bruijn graph is weighted (see section Construct weighted graph), the contigs (written to file *.fasta.gz
)
will be extracted along with the counts of their constituting k-mers, written to file *.kmer_counts.gz
.
Construct canonical graph¶
When the input sequences are raw reads of unknown directionality (strandedness), it is natural to index along with each sequence its reverse complement.
MetaGraph has a special graph mode where each k-mer indexed in the graph automatically adds its reverse complement k-mer to the index. To build a canonical graph from a set of reads/sequences, add --mode canonical
to the build command:
find . -name "*.fa" | metagraph build -v -p 4 -k 31 -o graph --mode canonical
Important
Canonical graphs should not be used in combination with delta-coded annotations of type RowDiff<*>
.
For canonical graphs, only half of the k-mers are annotated, which creates a huge number of “gaps” in
the annotation, diminishing the effectiveness of the coding. Instead, canonical graphs should always
be transformed to primary (see section Construct primary graph below) before annotating them.
Construct primary graph¶
Canonical graphs contain each k-mer in both of its forms (forward and reverse complement), but the same data structure can be modeled by storing only one of them, implicitly modeling the other.
Often, different tools achieve this by only storing the lexicographically smallest of the two
k-mers. However, it is not possible to efficiently implement this with the succinct
graph representation.
Hence, we relax this constraint and pick any of the two forms of each k-mer.
In a nutshell, this representation is constructed by fully traversing the canonical graph and marking a k-mer as primary if it was reached before its reverse complement in the traversal.
The graph containing only primary k-mers is called a primary graph.
The algorithm for primarization of a canonical graph is as follows:
First, extract a set of primary contigs (stretches of primary k-mers) from the canonical graph:
metagraph transform -v --to-fasta --primary-kmers -o primary_contigs -p 4 graph.dbg
Then, construct a new graph from the primary contigs and mark this graph as primary by adding
--mode primary
to the build command:metagraph build -v -p 4 -k 31 -o graph_primary --mode primary primary_contigs.fasta.gz
Now, this new graph graph_primary.dbg
emulates the original canonical graph (e.g., when querying
or annotating). It represents the same information as the original canonical graph, while taking only
half of the space.
Graph cleaning¶
For removing sequencing errors, MetaGraph provides routines for graph cleaning and k-mer filtering. These are based on the assumption that k-mers with relatively low abundance (low k-mer counts) in the input data were likely generated due to sequencing errors, and hence should be dropped. Moreover, to make the cleaning procedure more robust, the decision about filtering out a k-mer can be based on the median abundance of the unitig to which this k-mer belongs. That is, k-mers with low abundance are preserved if they are situated in a unitig with sufficiently many highly abundant k-mers.
K=31
metagraph build -v -p 4 -k $K --count-kmers -o graph SRR403017.fasta.gz
metagraph clean -v -p 4 --to-fasta --prune-tips $((2*$K)) --prune-unitigs 0 --fallback 2 \
-o SRR403017_clean_contigs graph.dbg
zless SRR403017_clean_contigs.fasta.gz
Note
The default parameters in metagraph clean
correspond to no cleaning. That is, an equivalent of metagraph transform --to-fasta
, which extracts from the input de Bruijn graph all contigs, without removing any k-mers.
For cleaning graphs constructed from high-throughput Illumina reads, the recommended parameters are
--prune-tips <2k> --prune-unitigs 0 --fallback 2
, which implements the cleaning procedure proposed in McCortex (Turner et al., 2018) and includes the following steps:
Prune all tips shorter than 2k, where k is the k-mer length.
Compute a threshold for the minimum k-mer abundance as follows. Assume the number of k-mers with sequencing errors (erroneous k-mers) follows a Poisson distribution with a Gamma distributed mean. Also, assume that all k-mers with an abundance of 3 or less are generated due to sequencing errors. Based on these numbers, fit a Poisson distribution and pick a threshold such that k-mers predicted to be erroneous make up at most 0.1% of the total k-mer coverage at that abundance level. If the chosen threshold keeps less than 20% of the total coverage, deem the automatic estimation procedure unsuccessful and use the fallback value of 2 instead (set by flag
--fallback
).Traverse the graph (where all short tips have already been removed in step 1) and fetch all unitigs with a median k-mer abundance greater or equal to the threshold defined in step 2.
Once all clean contigs (or unitigs) are extracted from a de Bruijn graph, construct a clean de Bruijn graph from them.
Tip
When indexing multiple read sets, the recommended workflow is to build a sample de Bruijn graph from each read set separately and clean these sample graphs independently (that is, extract clean contigs from each of them). Next, build a joint de Bruijn graph from all these clean contigs and finally annotate it using the generated clean contig sets instead of the original raw read sets.
Annotate graph¶
Once a graph is constructed, there are multiple ways to construct the corresponding annotation to encode its metadata.
Annotate sequence headers¶
For annotating each sequence with its header in the fasta/fastq file, run
metagraph annotate -v -i graph.dbg --anno-header -o annotation transcripts_1000.fa
This is a common annotation scenario when indexing reference sequences or assembled genomes.
To check the statistics for the constructed annotation, type:
metagraph stats -a annotation.column.annodbg
All annotation labels (column names) for an annotation matrix can be printed with:
metagraph stats --print-col-names -a annotation.column.annodbg
Annotate source filenames¶
To label all k-mers from the same file with a common label (for instance for the experiment discovery problem), the command is:
metagraph annotate -v -i graph.dbg --anno-filename -o annotation file_1.fa file_2.fa ...
which will annotate k-mers from the first file by label file_1.fa
, k-mers from the second file by label file_2.fa
, etc.
Annotate with disk swap¶
When the input files and the output annotation are very large, disk swap space can be used
by setting flags --disk-swap
and --mem-cap-gb
, to limit the size of internal buffers
and reduce RAM usage during annotation construction:
metagraph annotate -v -i graph.dbg --anno-filename --disk-swap /tmp --mem-cap-gb 1 \
-o annotation file_1.fa file_2.fa ...
Annotate files independently¶
It is recommended to independently construct a single annotation column per each input file.
To do this in parallel and avoid loading the same graph multiple times, run one annotation
command with the flag --separately
added:
metagraph annotate -v -i graph.dbg --anno-filename --separately -p 4 --threads-each 9 \
-o annotation file_1.fa file_2.fa ...
This will create a new directory annotation/
with individual annotation columns:
file_1.fa.column.annodbg file_2.fa.column.annodbg ...
Note, in the command above we passed -p 4 --threads-each 9
to annotate 4 files at a time,
in parallel, where each uses 9 threads. Thus, this uses 36 threads in total.
Tip
It is recommended to run annotation from a set of long (primary) contigs/unitigs,
where all k-mers have already been deduplicated, especially when annotating a (primary) graph
in the succinct
representation. In contrast, annotating a succinct
graph from
separate k-mers (especially not deduplicated) will take orders of magnitude longer.
The contigs serve as an equivalent non-redundant representation of the k-mer sets and, thus,
result in the same graph annotation.
Thus, in practice, for large inputs, it is recommended to construct
individual (canonical) de Bruijn graphs from all read sets, called sample graphs, and
transform them to contigs. These contig sets are then used instead of the original read
sets to construct and annotate the joint (primary) graph.
Annotate from KMC counters¶
This might depend on the particular graph representation used to store the joint graph.
However, with the succinct
graph representation, it is never efficient to annotate a graph directly
from KMC counters because in their format k-mers are not ordered, which leads to many random k-mer
lookups in the BOSS table.
There is, however, an extra pre-processing step, which makes this task efficiently solvable. First, one can assemble a graph from the KMC counter (see Construct from KMC counters) and extract contigs from it (see Transform to sequences or Graph cleaning). Next, annotate the graph using these assembled contigs as a normal FASTA file instead of the original KMC counter.
Tip
Together, Construct from KMC counters and Annotate from KMC counters provide an efficient algorithm for constructing an annotated graph from many input files with sequences (e.g., indexing SRA experiments). Namely, this algorithm includes the following steps.
Deduplicate all k-mers in all files with KMC (construct KMC counters, possibly with filtering by abundance)
Extract contigs from each sample graph (possibly primary contigs, can also be combined with graph cleaning)
Build a large joint graph from all extracted contigs (possibly, in the primary mode)
Annotate this joint graph using the same files with contigs
Annotate using custom labels¶
To add a custom annotation label for all k-mers from an input file, add --anno-label <LABEL_NAME>
when annotating the graph.
Index k-mer counts¶
MetaGraph supports indexing k-mer counts (k-mer abundances), e.g., to represent gene expression in RNA-seq data.
The counts can supplement graphs in any representation.
To construct a MetaGraph index with k-mer counts (Counting de Bruijn graph), construct a de Bruijn graph as usual (see Construct graph)
and then add --count-kmers
to the annotation command, e.g.:
metagraph annotate -v -i graph.dbg --anno-filename --count-kmers -p 4 \
-o annotation transcripts_1000.fa
Along with the normal (binary) graph annotation annotation.column.annodbg
, this command will also create an
array of corresponding k-mer counts annotation.column.annodbg.counts
.
These counts represent how many times each k-mer indexed in graph.dbg
occurs in the input file transcripts_1000.fa
.
Note
By default, each count is stored in an 8-bit integer, and all counts greater than 255 are clipped.
This value, however, can be changed with the flag --count-width
, to represent counts greater than 255 or, the other way around,
clip all large counts when only lowly abundant k-mers are of interest.
All other flags (e.g., --separately
and --disk-swap
) described above are also supported similarly as for binary annotations.
The histogram for indexed k-mer counts can be viewed with:
metagraph stats -a annotation.column.annodbg --print-counts-hist
It is also possible to compute and print quantiles of all indexed counts. For instance, type the following command to compute the minimum non-zero count, as well as the median and the maximum count:
metagraph stats -a annotation.column.annodbg --count-quantiles '0 0.5 1'
Annotate with pre-counting¶
Similarly to the case of simple binary annotations considered above, it is recommended to pre-count k-mers for each annotation label (typically, sequencing sample) before annotating it. This technique consists in first constructing a weighted de Bruijn graph (see Construct weighted graph and note that it can be constructed from raw input sequences as well as from pre-computed KMC counters) and transforming it to contigs with counts associated with their constituting k-mers:
metagraph transform -v -p 4 --to-fasta -o contigs sample_graph.dbg
Then, the contigs written to contigs.fasta.gz
and the counts associated with their k-mers written to contigs.kmer_counts.gz
can be used when constructing a count-aware graph annotation:
metagraph annotate -v -i graph.dbg --anno-filename --count-kmers -p 4 \
-o annotation contigs.fasta.gz
Warning
If something went wrong and no counts could be read from file contigs.kmer_counts.gz
, a warning message
[warning] No k-mer counts found ...
will be printed and metagraph annotate
will proceed with the assumption that the count of each k-mer is equal to 1.
Query k-mer counts¶
For querying k-mer counts (abundances), for example, to see how highly a gene is expressed in the indexed RNA-seq samples,
the annotation should be transformed to a representation that combines both the binary annotation matrix
(from *.column.annodbg
) and the count values (from *.column.annodbg.counts
).
For more details, see section Convert count-aware annotations.
Once the annotation is transformed, k-mer abundances can be queried with:
metagraph query --query-mode counts ...
Note that if flag --query-mode counts
is not passed, the index will be queried in the default k-mer presence/absence regime.
Index k-mer coordinates¶
Besides indexing k-mer counts, MetaGraph supports indexing k-mer coordinates, that is, their positions in the source input. These may represent positions in a genome, positions of a k-mer in a raw SRA experiment (say, each read has 70 k-mers in it; then the second k-mer of the third read has coordinate 211). Depending on the target application and the final goal, it is both possible to consider each sequence of the input as a separate label and index the coordinates of its k-mers separately or, for the other extreme, put everything into a single label and use the annotated coordinates of the k-mers to find the borders of each indexed sequence in post-processing query results. In all cases, it is possible to reconstruct the original input from indexes of this kind, which makes this indexing method fully lossless (see more details in paper https://www.biorxiv.org/content/10.1101/2021.11.09.467907).
To construct a MetaGraph index with k-mer coordinates (represented as a Counting de Bruijn graph), construct a de Bruijn graph
as usual (see Construct graph) and then add --coordinates
to the annotation command, e.g.:
metagraph annotate -v -i graph.dbg --anno-filename --coordinates -p 4 \
-o annotation transcripts_1000.fa
Along with the normal (binary) graph annotation annotation.column.annodbg
, this command will create an array of corresponding
k-mer coordinates annotation.column.annodbg.coords
. Annotation in the --anno-header
mode is also supported. In that case,
a new annotation label will be created for every sequence in the input, and the first coordinate for every starting k-mer
will be re-set to 0 for every sequence. Note, however, that this assumes that all sequence headers in the FASTA file are unique
and do not repeat. If this condition is not met, an error will be returned.
All other flags (e.g., --separately
and --disk-swap
) described above are also supported similarly as for binary annotations.
Query k-mer coordinates¶
Once a coordinate-aware annotation is constructed, it can be transformed into a more memory-efficient representation supporting querying (see Convert coordinate-aware annotations below) and then queried with:
metagraph query --query-mode coords ...
As the coordinate-aware annotations also contain the information about k-mer abundance, they can be queried to retrieve k-mer counts (simply pass --query-mode counts
instead of --query-mode coords
).
Note that if neither --query-mode coords
nor --query-mode counts
is passed, the index will be queried in the default k-mer presence/absence regime.
Transform annotation¶
To enhance the query performance and reduce the memory footprint, annotations can be converted to other representations.
There are several different annotation representations available in MetaGraph (see the possible values for flag --anno-type
in metagraph transform_anno
).
For instance, Rainbowfish
can be used to achieve a very fast query speed, but it can
be applied only to relatively small problem instances (about 100 GB) because of the limited
compression performance and the complexity of the construction algorithm.
In contrast, RowDiff<Multi-BRWT>
typically achieves
the best compression while still providing a good query performance, and thus, it is
recommended for very large problem instances.
Finally, RowDiff<RowSparse>
provides a good trade-off between the query speed and compression performance.
Convert annotation to Rainbowfish¶
The conversion to Rainbowfish consists of two steps.
First, convert the column-compressed annotation to the row-major representation:
find . -name "*.column.annodbg" | metagraph transform_anno -v \ --anno-type row \ -o annotation ...
Then, transform the row-major annotation to the compressed Rainbowfish representation:
metagraph transform_anno -v --anno-type rbfish \ -o annotation \ annotation.row.annodbg
Convert annotation to Multi-BRWT¶
The conversion to Multi-BRWT
can be done either
with a single command, e.g.:
find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \ --greedy -o anno
or with pre-computing a column clustering with:
find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \ --linkage --greedy -o linkage.txt
and next converting the annotation to Multi-BRWT according to the pre-computed clustering (linkage):
find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \ --linkage-file linkage.txt -o anno
Note
If the clustering is too slow, it probably means it uses too many subsampled rows. In this case, consider
changing the value passed with flag --subsample <INT>
. The 1M rows subsampled by default are usually enough
even for very large annotations. Increasing this value usually does not lead to any significantly better compression.
Finally, the internal structure of the BRWT tree can be relaxed (which is always recommended to do) to increase the arity of its internal nodes and enhance the compression:
metagraph relax_brwt -v -p 18 -o anno_relaxed anno.brwt.annodbg
Note
By default, metagraph transform_anno --anno-type brwt
uses disk swap for temporary files created for each annotation
column (label), which might be inappropriate when the number of columns is too large (around a million or more).
In such cases, pass an additional flag --disk-swap ""
to compute everything in-memory without creating temp files.
Convert annotation to RowDiff<Multi-BRWT>¶
The conversion to RowDiff<Multi-BRWT>
is done in two steps.
Transform annotation columns
*.column.annodbg
torow_diff
in three stages:find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \ --anno-type row_diff --row-diff-stage 0 \ -i graph.dbg --mem-cap-gb 300 find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \ --anno-type row_diff --row-diff-stage 1 \ -i graph.dbg --mem-cap-gb 300 find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \ --anno-type row_diff --row-diff-stage 2 \ -i graph.dbg --mem-cap-gb 300
Note that this requires to pass the graph
graph.dbg
as well in order to derive the topology for the diff-transform.Transform the diff-transformed columns
*.row_diff.annodbg
toMulti-BRWT
:find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \ -i graph.dbg \ --anno-type row_diff_brwt \ --greedy ... metagraph relax_brwt -v -p 18 \ --relax-arity 32 \ -o annotation_relaxed \ annotation.row_diff_brwt.annodbg
Also see the above paragraph Convert annotation to Multi-BRWT for other options.
Convert annotation to RowDiff<RowSparse>¶
The conversion to RowDiff<RowSparse>
is similar to Convert annotation to RowDiff<Multi-BRWT>. The first step is the same.
In the second step, the diff-transformed columns *.row_diff.annodbg
are converted to RowSparse
:
find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \
-i graph.dbg \
--anno-type row_diff_sparse
Convert count-aware annotations¶
Converting a graph annotation supplemented with k-mer counts (*.column.annodbg
+ *.column.annodbg.counts
)
to Int-Multi-BRWT (int_brwt
) is done exactly the same way as converting a binary annotation to Multi-BRWT (see Convert annotation to Multi-BRWT),
with simply replacing --anno-type brwt
with --anno-type int_brwt
:
metagraph transform_anno --anno-type int_brwt --greedy ...
For converting to RowDiff<Int-Multi-BRWT> (row_diff_int_brwt
), perform the same steps as when
converting to RowDiff<Multi-BRWT> with the following exceptions.
First, an additional flag --count-kmers
has to be passed on step 1 (the row-diff transform).
Second, on step 2, delta-transformed columns have file extension .column.annodbg
and not .row_diff.annodbg
.
These columns should be passed to the metagraph transform_anno --anno-type row_diff_int_brwt
command.
The corresponding transformed delta counts will be loaded automatically.
For further examples on real data, see https://github.com/ratschlab/counting_dbg/blob/master/scripts.md#index-with-k-mer-counts.
Convert coordinate-aware annotations¶
Conversion to column_coord
is straightforward.
Conversion to brwt_coord
is analogous to brwt
and int_brwt
.
Conversion to row_diff_brwt_coord
is analogous to row_diff_brwt
and row_diff_int_brwt
, where an additional flag --coordinates
has to be passed.
Additionally, one can convert the delta-transformed columns with coordinates (after step --anno-type row_diff --coordinates
)
directly to the ColumnCompressed format (row_diff_coord
), equivalent to row_diff_brwt_coord
with the arity set to infinity,
that is, all leaves (original labels) directly connected to the root of the BRWT tree.
Query index¶
Using Command Line Interface¶
To query a MetaGraph index (graph + annotation) using the command line interface (CLI), run metagraph query
, e.g.:
metagraph query -i graph.dbg \
-a annotation.column.annodbg \
--discovery-fraction 0.1 \
transcripts_1000.fa
For alignment, see metagraph align
.
To load up a MetaGraph index in server mode for querying it with the Python API or via HTTP requests, run:
metagraph server_query -i graph.dbg \
-a annotation.column.annodbg \
--port <PORT> \
--parallel <NUM_THREADS>
Using Python API¶
See Python API
Column operations¶
MetaGraph supports operations aggregating multiple annotation columns to compute statistics for the k-mers and their counts, e.g.:
metagraph transform_anno --aggregate-columns -o out \
--min-count 2 --max-count 2 \
A.column.annodbg \
B.column.annodbg
to construct a new annotation column out.column.annodbg
which computes AND between columns A and B.
Despite the simplistic appearance of this command, it can also compute many other complex operations. Below we provide a few common examples of such aggregating operations.
Examples¶
Select all “unique” k-mers, that is, appearing only in a single annotation column among columns in annotation
annotation.column.annodbg
:metagraph transform_anno --aggregate-columns -o out \ --max-count 1 annotation.column.annodbg
Select all “common” k-mers, that is, appearing in at least 95% of annotation columns:
metagraph transform_anno --aggregate-columns -o out \ --min-fraction 0.95 annotation.column.annodbg
Note
This command (metagraph transform_anno --aggregate-columns ...
) only supports annotations
in the ColumnCompressed format, that is, constructed by the metagraph annotate
command.
If the input columns have associated counts (e.g., constructed with metagraph annotate --count-kmers ...
),
they can be loaded and used by the aggregator as well.
Aggregating function¶
In general, the following formula is used for aggregation:
where \(c_i\) is the count for the current k-mer (see Index k-mer counts). If no counts are associated with the column, \(c_i = 1\) for every set bit and \(0\) otherwise. If this sum falls within specified \(\text{min-count}\) and \(\text{max-count}\), the bit in the aggregated column for this k-mer is set to 1, and the value of the sum is written as the count associated with that bit.
In other words, the output aggregated column is always supplemented with a count vector, which can be interpreted as normal k-mer counts. For instance, consider the following example.
Suppose, first we want to compute in how many columns each k-mer occurs:
metagraph transform_anno --aggregate-columns -o out annotation.column.annodbg
Then we can inspect the histogram of k-mer frequencies to, say, find an appropriate threshold for maximum frequency:
metagraph stats -a out.column.annodbg --print-counts-hist
Finally, after inspecting the histogram and selecting a reasonable threshold (suppose, we decided to filter out all k-mers that occur in more than 10 columns), we can apply it to the aggregated column as to a normal original column with counts:
metagraph transform_anno --aggregate-columns -o rare_kmers \ --max-value 10 out.column.annodbg
which generates the final column rare_kmers.column.annodbg
with the mask indicating all k-mers occurring in 10 or fewer input columns in the original file annotation.column.annodbg
.