.. _quick_start: Quick start =========== MetaGraph constructs indexes composed of two main elements: a k-mer index and an annotation matrix. The k-mer index stores all k-mers from the input sequences and represents a de Bruijn graph. This index serves as a dictionary mapping k-mers to their unique positive identifiers. .. It can also be used to map sub-k-mers (or spaced k-mers) to ranges of their identifiers (see TODO). The second element is a matrix encoding the relation between the k-mers and their attributes. These relations may represent, for instance: * k-mer ``i`` is present in sequence/genome ``j`` * k-mer ``i`` is present in SRA sample ``j`` * k-mer ``i`` is present in file ``j`` * k-mer ``i`` is present in a collection of sequences/files marked by ``j`` * k-mer ``i`` is highly expressed in sample ``j`` The annotation matrix can also be supplemented with additional attributes to represent such quantities as: * k-mer ``i`` occurs :math:`c_i` times in sample ``j`` (k-mer abundance) * k-mer ``i`` occurs at positions :math:`p_1,\dots,p_{c_i}` in genome ``j`` (k-mer coordinates) .. TODO: Describe counts/coordinate annotation In the following, you will find simplified instructions and examples for constructing a MetaGraph index and querying it. The indexing workflow in MetaGraph consists of two major steps: graph construction and annotation construction. .. _construct graph: Construct graph --------------- Basics ^^^^^^ De Bruijn graphs are constructed with ``metagraph build``:: metagraph build --verbose --parallel 4 -k 31 --outfile-base graph transcripts_1000.fa where ``transcripts_1000.fa`` is a fasta/fastq file (may be gzipped) with input sequences. The same command can be shortened to:: metagraph build -v -p 4 -k 31 -o graph transcripts_1000.fa It is possible to pass multiple input files:: metagraph build -v -p 4 -k 31 -o graph file_1.fa file_2.fa ... or pipe a list of files:: find . -name "*.fa" | metagraph build -v -p 4 -k 31 -o graph These commands construct a single de Bruijn graph (a k-mer index) from all k-mers extracted from the input sequences. To see the list of all available flags, type ``metagraph build``. .. note:: There are various graph representations available in MetaGraph, which can be chosen via the flag ``--graph``. However, the default ``succinct`` representation is usually the best choice because of its great scalability (requires only 2-4 bits per k-mer) and the ability to search sub-k-mers/k-mer ranges, which is important for sequence alignment. To check the statistics for a constructed graph, type:: metagraph stats graph.dbg Construct with disk swap """""""""""""""""""""""" For very large inputs, graphs can be constructed with disk swap to limit the RAM usage (currently, available only for ``succinct`` graph representations). For example, the size of the buffer kept in memory can be set to 4 GiB as follows:: metagraph build -v -k 31 -o graph -p 36 --disk-swap /tmp --mem-cap-gb 4 file.fa ... Using larger buffers usually speeds up the construction. However, using a 50-80 GiB buffer is always sufficient, even when constructing graphs containing trillions of k-mers. .. _construct_from_KMC: Construct from KMC counters ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Apart from the standard FASTA/FASTQ/VCF input formats (uncompressed or gzipped), a graph can be constructed from k-mer counters generated by `KMC `_. KMC is an extremely efficient tool for counting k-mers, which can be used to quickly pre-process the input and deduplicate/count/filter the input k-mers. For example, the following commands can be used to construct a graph only from k-mers occurring at least 5 times in the input:: K=31 # count k-mers with KMC ./KMC/kmc -ci5 -t4 -k$K -m5 -fm SRR403017.fasta.gz SRR403017.cutoff_5 ./KMC # build graph with MetaGraph metagraph build -v -p 4 -k $K -o graph SRR403017.cutoff_5.kmc_pre .. note:: In the above example, we use ``./KMC`` as the directory where KMC will store its intermediate caches. Depending on your input data, this directory should be at a location with a sufficient amount of free intermediate storage space. .. _construct_weighted_graph: Construct weighted graph """""""""""""""""""""""" Since KMC does compute the exact counts of all k-mers from the input, this information can be taken into account when transforming into a de Bruijn graph. Specifically, a node weight (k-mer count) can be associated with every k-mer from the graph. All node weights supplement the graph in the form of an integer vector ``graph.dbg.weights``, which can be constructed by simply adding the flag ``--count-kmers`` to the graph construction command:: metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.cutoff_5.kmc_pre which will construct a de Bruijn graph ``graph.dbg`` and an array of counts ``counts.dbg.weights`` associated with its k-mers. This supplementing vector of counts can then be used for graph cleaning (see :ref:`graph_cleaning`) or indexing k-mer counts with pre-counting (see :ref:`annotate_with_precounting`). .. note:: A weighted graph can also be constructed directly from raw input sequences, without pre-counting with KMC, e.g.,:: metagraph build -v -p 4 -k 31 --count-kmers -o graph SRR403017.fasta.gz This should be used when pre-processing with KMC is complicated or impossible, e.g., when indexing protein sequences. Transform to other representations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To transform a ``succinct`` graph into a more compressed and smaller representation, run:: metagraph transform -v --state small -p 4 -o graph_small graph.dbg .. _to_sequences: Transform to sequences ^^^^^^^^^^^^^^^^^^^^^^ To transform a graph back to sequences, it can be traversed to extract all its contigs:: metagraph transform -v --to-fasta -o contigs -p 4 graph.dbg These sequences contain all k-mers indexed in the graph exactly once and can be used as their non-redundant (deduplicated) representation. The assembled contigs are written to a compressed FASTA file, which can be inspected with:: zless contigs.fasta.gz To extract all unitigs (linear paths in the graph) instead of contigs, add flag ``--unitigs`` to the same ``metagraph transform`` command. .. note:: If the source de Bruijn graph is *weighted* (see section :ref:`construct_weighted_graph`), the contigs (written to file ``*.fasta.gz``) will be extracted along with the counts of their constituting k-mers, written to file ``*.kmer_counts.gz``. Construct canonical graph ^^^^^^^^^^^^^^^^^^^^^^^^^ When the input sequences are raw reads of unknown directionality (strandedness), it is natural to index along with each sequence its reverse complement. MetaGraph has a special graph mode where each k-mer indexed in the graph automatically adds its reverse complement k-mer to the index. To build a canonical graph from a set of reads/sequences, add ``--mode canonical`` to the build command:: find . -name "*.fa" | metagraph build -v -p 4 -k 31 -o graph --mode canonical .. important:: Canonical graphs should not be used in combination with delta-coded annotations of type ``RowDiff<*>``. For canonical graphs, only half of the k-mers are annotated, which creates a huge number of "gaps" in the annotation, diminishing the effectiveness of the coding. Instead, *canonical* graphs should always be transformed to *primary* (see section :ref:`construct_primary_graph` below) before annotating them. .. _construct_primary_graph: Construct primary graph ^^^^^^^^^^^^^^^^^^^^^^^ Canonical graphs contain each k-mer in both of its forms (forward and reverse complement), but the same data structure can be modeled by storing only one of them, implicitly modeling the other. Often, different tools achieve this by only storing the lexicographically smallest of the two k-mers. However, it is not possible to efficiently implement this with the ``succinct`` graph representation. Hence, we relax this constraint and pick *any* of the two forms of each k-mer. In a nutshell, this representation is constructed by fully traversing the canonical graph and marking a k-mer as *primary* if it was reached before its reverse complement in the traversal. The graph containing only primary k-mers is called a *primary* graph. The algorithm for primarization of a canonical graph is as follows: 1. First, extract a set of primary contigs (stretches of primary k-mers) from the canonical graph:: metagraph transform -v --to-fasta --primary-kmers -o primary_contigs -p 4 graph.dbg 2. Then, construct a new graph from the primary contigs and mark this graph as *primary* by adding ``--mode primary`` to the build command:: metagraph build -v -p 4 -k 31 -o graph_primary --mode primary primary_contigs.fasta.gz Now, this new graph ``graph_primary.dbg`` emulates the original canonical graph (e.g., when querying or annotating). It represents the same information as the original canonical graph, while taking only half of the space. .. _graph_cleaning: Graph cleaning ^^^^^^^^^^^^^^ For removing sequencing errors, MetaGraph provides routines for graph cleaning and k-mer filtering. These are based on the assumption that k-mers with relatively low abundance (low k-mer counts) in the input data were likely generated due to sequencing errors, and hence should be dropped. Moreover, to make the cleaning procedure more robust, the decision about filtering out a k-mer can be based on the median abundance of the unitig to which this k-mer belongs. That is, k-mers with low abundance are preserved if they are situated in a unitig with sufficiently many highly abundant k-mers. :: K=31 metagraph build -v -p 4 -k $K --count-kmers -o graph SRR403017.fasta.gz metagraph clean -v -p 4 --to-fasta --prune-tips $((2*$K)) --prune-unitigs 0 --fallback 2 \ -o SRR403017_clean_contigs graph.dbg zless SRR403017_clean_contigs.fasta.gz .. note:: The default parameters in ``metagraph clean`` correspond to no cleaning. That is, an equivalent of ``metagraph transform --to-fasta``, which extracts from the input de Bruijn graph all contigs, without removing any k-mers. For cleaning graphs constructed from high-throughput Illumina reads, the recommended parameters are ``--prune-tips <2k> --prune-unitigs 0 --fallback 2``, which implements the cleaning procedure proposed in `McCortex `_ (Turner et al., 2018) and includes the following steps: 1. Prune all tips shorter than *2k*, where *k* is the k-mer length. 2. Compute a threshold for the minimum k-mer abundance as follows. Assume the number of k-mers with sequencing errors (erroneous k-mers) follows a Poisson distribution with a Gamma distributed mean. Also, assume that all k-mers with an abundance of 3 or less are generated due to sequencing errors. Based on these numbers, fit a Poisson distribution and pick a threshold such that k-mers predicted to be erroneous make up at most 0.1% of the total k-mer coverage at that abundance level. If the chosen threshold keeps less than 20% of the total coverage, deem the automatic estimation procedure unsuccessful and use the fallback value of 2 instead (set by flag ``--fallback``). 3. Traverse the graph (where all short tips have already been removed in step 1) and fetch all unitigs with a median k-mer abundance greater or equal to the threshold defined in step 2. Once all clean contigs (or unitigs) are extracted from a de Bruijn graph, construct a clean de Bruijn graph from them. .. tip:: When indexing multiple read sets, the recommended workflow is to build a *sample de Bruijn graph* from each read set separately and clean these sample graphs independently (that is, extract clean contigs from each of them). Next, build a joint de Bruijn graph from all these clean contigs and finally annotate it using the generated clean contig sets instead of the original raw read sets. Annotate graph -------------- Once a graph is constructed, there are multiple ways to construct the corresponding annotation to encode its metadata. Annotate sequence headers ^^^^^^^^^^^^^^^^^^^^^^^^^ For annotating each sequence with its header in the fasta/fastq file, run :: metagraph annotate -v -i graph.dbg --anno-header -o annotation transcripts_1000.fa This is a common annotation scenario when indexing reference sequences or assembled genomes. To check the statistics for the constructed annotation, type:: metagraph stats -a annotation.column.annodbg All annotation labels (column names) for an annotation matrix can be printed with:: metagraph stats --print-col-names -a annotation.column.annodbg Annotate source filenames ^^^^^^^^^^^^^^^^^^^^^^^^^ To label all k-mers from the same file with a common label (for instance for the experiment discovery problem), the command is:: metagraph annotate -v -i graph.dbg --anno-filename -o annotation file_1.fa file_2.fa ... which will annotate k-mers from the first file by label ``file_1.fa``, k-mers from the second file by label ``file_2.fa``, etc. Annotate with disk swap """"""""""""""""""""""" When the input files and the output annotation are very large, disk swap space can be used by setting flags ``--disk-swap`` and ``--mem-cap-gb``, to limit the size of internal buffers and reduce RAM usage during annotation construction:: metagraph annotate -v -i graph.dbg --anno-filename --disk-swap /tmp --mem-cap-gb 1 \ -o annotation file_1.fa file_2.fa ... Annotate files independently """""""""""""""""""""""""""" It is recommended to independently construct a single annotation column per each input file. To do this in parallel and avoid loading the same graph multiple times, run one annotation command with the flag ``--separately`` added:: metagraph annotate -v -i graph.dbg --anno-filename --separately -p 4 --threads-each 9 \ -o annotation file_1.fa file_2.fa ... This will create a new directory ``annotation/`` with individual annotation columns:: file_1.fa.column.annodbg file_2.fa.column.annodbg ... Note, in the command above we passed ``-p 4 --threads-each 9`` to annotate 4 files at a time, in parallel, where each uses 9 threads. Thus, this uses 36 threads in total. .. tip:: It is recommended to run annotation from a set of long (primary) contigs/unitigs, where all k-mers have already been deduplicated, especially when annotating a (primary) graph in the ``succinct`` representation. In contrast, annotating a ``succinct`` graph from separate k-mers (especially not deduplicated) will take orders of magnitude longer. The contigs serve as an equivalent non-redundant representation of the k-mer sets and, thus, result in the same graph annotation. **Thus, in practice,** for large inputs, it is recommended to construct individual (canonical) de Bruijn graphs from all read sets, called sample graphs, and transform them to contigs. These contig sets are then used instead of the original read sets to construct and annotate the joint (primary) graph. .. _annotate_from_KMC: Annotate from KMC counters """""""""""""""""""""""""" This might depend on the particular graph representation used to store the joint graph. However, with the ``succinct`` graph representation, it is never efficient to annotate a graph directly from KMC counters because in their format k-mers are not ordered, which leads to many random k-mer lookups in the BOSS table. There is, however, an extra pre-processing step, which makes this task efficiently solvable. First, one can assemble a graph from the KMC counter (see :ref:`construct_from_KMC`) and extract contigs from it (see :ref:`to_sequences` or :ref:`graph_cleaning`). Next, annotate the graph using these assembled contigs as a normal FASTA file instead of the original KMC counter. .. tip:: Together, :ref:`construct_from_KMC` and :ref:`annotate_from_KMC` provide an efficient algorithm for constructing an annotated graph from many input files with sequences (e.g., indexing SRA experiments). Namely, this algorithm includes the following steps. #. Deduplicate all k-mers in all files with KMC (construct KMC counters, possibly with filtering by abundance) #. :ref:`Construct sample graphs from all KMC counters ` #. :ref:`Extract contigs ` from each sample graph (possibly *primary* contigs, can also be combined with :ref:`graph cleaning `) #. Build a large joint graph from all extracted contigs (possibly, :ref:`in the primary mode `) #. Annotate this joint graph using the same files with contigs Annotate using custom labels ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To add a custom annotation label for all k-mers from an input file, add ``--anno-label `` when annotating the graph. .. _indexing counts: Index k-mer counts ^^^^^^^^^^^^^^^^^^^^^ MetaGraph supports indexing k-mer counts (k-mer abundances), e.g., to represent gene expression in RNA-seq data. The counts can supplement graphs in any representation. To construct a MetaGraph index with k-mer counts (Counting de Bruijn graph), construct a de Bruijn graph as usual (see :ref:`construct graph`) and then add ``--count-kmers`` to the annotation command, e.g.:: metagraph annotate -v -i graph.dbg --anno-filename --count-kmers -p 4 \ -o annotation transcripts_1000.fa Along with the normal (binary) graph annotation ``annotation.column.annodbg``, this command will also create an array of corresponding k-mer counts ``annotation.column.annodbg.counts``. These counts represent how many times each k-mer indexed in ``graph.dbg`` occurs in the input file ``transcripts_1000.fa``. .. note:: By default, each count is stored in an 8-bit integer, and all counts greater than 255 are clipped. This value, however, can be changed with the flag ``--count-width``, to represent counts greater than 255 or, the other way around, clip all large counts when only lowly abundant k-mers are of interest. All other flags (e.g., ``--separately`` and ``--disk-swap``) described above are also supported similarly as for binary annotations. The histogram for indexed k-mer counts can be viewed with:: metagraph stats -a annotation.column.annodbg --print-counts-hist It is also possible to compute and print quantiles of all indexed counts. For instance, type the following command to compute the *minimum* non-zero count, as well as the *median* and the *maximum* count:: metagraph stats -a annotation.column.annodbg --count-quantiles '0 0.5 1' .. _annotate_with_precounting: Annotate with pre-counting """""""""""""""""""""""""" Similarly to the case of simple binary annotations considered above, it is recommended to pre-count k-mers for each annotation label (typically, sequencing sample) before annotating it. This technique consists in first constructing a *weighted* de Bruijn graph (see :ref:`construct_weighted_graph` and note that it can be constructed from raw input sequences as well as from pre-computed KMC counters) and transforming it to contigs with counts associated with their constituting k-mers:: metagraph transform -v -p 4 --to-fasta -o contigs sample_graph.dbg Then, the contigs written to ``contigs.fasta.gz`` and the counts associated with their k-mers written to ``contigs.kmer_counts.gz`` can be used when constructing a count-aware graph annotation:: metagraph annotate -v -i graph.dbg --anno-filename --count-kmers -p 4 \ -o annotation contigs.fasta.gz .. warning:: If something went wrong and no counts could be read from file ``contigs.kmer_counts.gz``, a warning message ``[warning] No k-mer counts found ...`` will be printed and ``metagraph annotate`` will proceed with the assumption that the count of each k-mer is equal to 1. Query k-mer counts """""""""""""""""" For querying k-mer counts (abundances), for example, to see how highly a gene is expressed in the indexed RNA-seq samples, the annotation should be transformed to a representation that combines both the binary annotation matrix (from ``*.column.annodbg``) and the count values (from ``*.column.annodbg.counts``). For more details, see section :ref:`transform_count_annotations`. Once the annotation is transformed, k-mer abundances can be queried with:: metagraph query --query-mode counts ... Note that if flag ``--query-mode counts`` is not passed, the index will be queried in the default k-mer presence/absence regime. .. _indexing coordinates: Index k-mer coordinates ^^^^^^^^^^^^^^^^^^^^^^^^^^ Besides indexing k-mer counts, MetaGraph supports indexing k-mer coordinates, that is, their positions in the source input. These may represent positions in a genome, positions of a k-mer in a raw SRA experiment (say, each read has 70 k-mers in it; then the second k-mer of the third read has coordinate 211). Depending on the target application and the final goal, it is both possible to consider each sequence of the input as a separate label and index the coordinates of its k-mers separately or, for the other extreme, put everything into a single label and use the annotated coordinates of the k-mers to find the borders of each indexed sequence in post-processing query results. In all cases, it is possible to reconstruct the original input from indexes of this kind, which makes this indexing method fully lossless (see more details in paper ``_). .. TODO: mention trace-consistent alignment To construct a MetaGraph index with k-mer coordinates (represented as a Counting de Bruijn graph), construct a de Bruijn graph as usual (see :ref:`construct graph`) and then add ``--coordinates`` to the annotation command, e.g.:: metagraph annotate -v -i graph.dbg --anno-filename --coordinates -p 4 \ -o annotation transcripts_1000.fa Along with the normal (binary) graph annotation ``annotation.column.annodbg``, this command will create an array of corresponding k-mer coordinates ``annotation.column.annodbg.coords``. Annotation in the ``--anno-header`` mode is also supported. In that case, a new annotation label will be created for every sequence in the input, and the first coordinate for every starting k-mer will be re-set to 0 for every sequence. Note, however, that this assumes that all sequence headers in the FASTA file are unique and do not repeat. If this condition is not met, an error will be returned. All other flags (e.g., ``--separately`` and ``--disk-swap``) described above are also supported similarly as for binary annotations. Query k-mer coordinates """"""""""""""""""""""" Once a coordinate-aware annotation is constructed, it can be transformed into a more memory-efficient representation supporting querying (see :ref:`transform_coord_annotations` below) and then queried with:: metagraph query --query-mode coords ... As the coordinate-aware annotations also contain the information about k-mer abundance, they can be queried to retrieve k-mer counts (simply pass ``--query-mode counts`` instead of ``--query-mode coords``). Note that if neither ``--query-mode coords`` nor ``--query-mode counts`` is passed, the index will be queried in the default k-mer presence/absence regime. .. _transform annotation: Transform annotation ^^^^^^^^^^^^^^^^^^^^ To enhance the query performance and reduce the memory footprint, annotations can be converted to other representations. There are several different annotation representations available in MetaGraph (see the possible values for flag ``--anno-type`` in ``metagraph transform_anno``). For instance, ``Rainbowfish`` can be used to achieve a very fast query speed, but it can be applied only to relatively small problem instances (about 100 GB) because of the limited compression performance and the complexity of the construction algorithm. In contrast, ``RowDiff`` typically achieves the best compression while still providing a good query performance, and thus, it is recommended for very large problem instances. Finally, ``RowDiff`` provides a good trade-off between the query speed and compression performance. Convert annotation to Rainbowfish """"""""""""""""""""""""""""""""" The conversion to Rainbowfish consists of two steps. 1. First, convert the column-compressed annotation to the row-major representation:: find . -name "*.column.annodbg" | metagraph transform_anno -v \ --anno-type row \ -o annotation ... 2. Then, transform the row-major annotation to the compressed Rainbowfish representation:: metagraph transform_anno -v --anno-type rbfish \ -o annotation \ annotation.row.annodbg .. _to_multi_brwt: Convert annotation to Multi-BRWT """""""""""""""""""""""""""""""" The conversion to ``Multi-BRWT`` can be done either * with a single command, e.g.:: find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \ --greedy -o anno * or with pre-computing a column clustering with:: find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \ --linkage --greedy -o linkage.txt and next converting the annotation to Multi-BRWT according to the pre-computed clustering (linkage):: find . -name "*.column.annodbg" | metagraph transform_anno -v -p 18 --anno-type brwt \ --linkage-file linkage.txt -o anno .. note:: If the clustering is too slow, it probably means it uses too many subsampled rows. In this case, consider changing the value passed with flag ``--subsample ``. The 1M rows subsampled by default are usually enough even for very large annotations. Increasing this value usually does not lead to any significantly better compression. Finally, the internal structure of the BRWT tree can be relaxed (which is always recommended to do) to increase the arity of its internal nodes and enhance the compression:: metagraph relax_brwt -v -p 18 -o anno_relaxed anno.brwt.annodbg .. note:: By default, ``metagraph transform_anno --anno-type brwt`` uses disk swap for temporary files created for each annotation column (label), which might be inappropriate when the number of columns is too large (around a million or more). In such cases, pass an additional flag ``--disk-swap ""`` to compute everything in-memory without creating temp files. .. _to_row_diff_brwt: Convert annotation to RowDiff """"""""""""""""""""""""""""""""""""""""" The conversion to ``RowDiff`` is done in two steps. 1. Transform annotation columns ``*.column.annodbg`` to ``row_diff`` in three stages:: find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \ --anno-type row_diff --row-diff-stage 0 \ -i graph.dbg --mem-cap-gb 300 find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \ --anno-type row_diff --row-diff-stage 1 \ -i graph.dbg --mem-cap-gb 300 find . -name "*.column.annodbg" | metagraph transform_anno -v -p 36 \ --anno-type row_diff --row-diff-stage 2 \ -i graph.dbg --mem-cap-gb 300 Note that this requires to pass the graph ``graph.dbg`` as well in order to derive the topology for the diff-transform. 2. Transform the diff-transformed columns ``*.row_diff.annodbg`` to ``Multi-BRWT``:: find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \ -i graph.dbg \ --anno-type row_diff_brwt \ --greedy ... metagraph relax_brwt -v -p 18 \ --relax-arity 32 \ -o annotation_relaxed \ annotation.row_diff_brwt.annodbg Also see the above paragraph :ref:`to_multi_brwt` for other options. .. _to_row_diff_sparse: Convert annotation to RowDiff """"""""""""""""""""""""""""""""""""""""" The conversion to ``RowDiff`` is similar to :ref:`to_row_diff_brwt`. The first step is the same. In the second step, the diff-transformed columns ``*.row_diff.annodbg`` are converted to ``RowSparse``:: find . -name "*.row_diff.annodbg" | metagraph transform_anno -v -p 18 \ -i graph.dbg \ --anno-type row_diff_sparse .. _transform_count_annotations: Convert count-aware annotations """"""""""""""""""""""""""""""" Converting a graph annotation supplemented with k-mer counts (``*.column.annodbg`` + ``*.column.annodbg.counts``) to Int-Multi-BRWT (``int_brwt``) is done exactly the same way as converting a binary annotation to Multi-BRWT (see :ref:`to_multi_brwt`), with simply replacing ``--anno-type brwt`` with ``--anno-type int_brwt``:: metagraph transform_anno --anno-type int_brwt --greedy ... For converting to RowDiff (``row_diff_int_brwt``), perform the same steps as when :ref:`converting to RowDiff\ ` with the following exceptions. First, an additional flag ``--count-kmers`` has to be passed on step 1 (the row-diff transform). Second, on step 2, delta-transformed columns have file extension ``.column.annodbg`` and not ``.row_diff.annodbg``. These columns should be passed to the ``metagraph transform_anno --anno-type row_diff_int_brwt`` command. The corresponding transformed delta counts will be loaded automatically. For further examples on real data, see ``_. .. _transform_coord_annotations: Convert coordinate-aware annotations """""""""""""""""""""""""""""""""""" Conversion to ``column_coord`` is straightforward. Conversion to ``brwt_coord`` is analogous to ``brwt`` and ``int_brwt``. Conversion to ``row_diff_brwt_coord`` is analogous to ``row_diff_brwt`` and ``row_diff_int_brwt``, where an additional flag ``--coordinates`` has to be passed. Additionally, one can convert the delta-transformed columns with coordinates (after step ``--anno-type row_diff --coordinates``) directly to the ColumnCompressed format (``row_diff_coord``), equivalent to ``row_diff_brwt_coord`` with the arity set to infinity, that is, all leaves (original labels) directly connected to the root of the BRWT tree. Query index ----------- Using Command Line Interface ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To query a MetaGraph index (graph + annotation) using the command line interface (CLI), run ``metagraph query``, e.g.:: metagraph query -i graph.dbg \ -a annotation.column.annodbg \ --discovery-fraction 0.1 \ transcripts_1000.fa For alignment, see ``metagraph align``. To load up a MetaGraph index in server mode for querying it with the Python API or via HTTP requests, run:: metagraph server_query -i graph.dbg \ -a annotation.column.annodbg \ --port \ --parallel Using Python API ^^^^^^^^^^^^^^^^ See :ref:`api` Column operations ----------------- MetaGraph supports operations aggregating multiple annotation columns to compute statistics for the k-mers and their counts, e.g.:: metagraph transform_anno --aggregate-columns -o out \ --min-count 2 --max-count 2 \ A.column.annodbg \ B.column.annodbg to construct a new annotation column ``out.column.annodbg`` which computes AND between columns A and B. Despite the simplistic appearance of this command, it can also compute many other complex operations. Below we provide a few common examples of such aggregating operations. Examples ^^^^^^^^ 1. Select all "unique" k-mers, that is, appearing only in a single annotation column among columns in annotation ``annotation.column.annodbg``:: metagraph transform_anno --aggregate-columns -o out \ --max-count 1 annotation.column.annodbg 2. Select all "common" k-mers, that is, appearing in at least 95% of annotation columns:: metagraph transform_anno --aggregate-columns -o out \ --min-fraction 0.95 annotation.column.annodbg .. note:: This command (``metagraph transform_anno --aggregate-columns ...``) only supports annotations in the ColumnCompressed format, that is, constructed by the ``metagraph annotate`` command. If the input columns have associated counts (e.g., constructed with ``metagraph annotate --count-kmers ...``), they can be loaded and used by the aggregator as well. Aggregating function ^^^^^^^^^^^^^^^^^^^^ In general, the following formula is used for aggregation: .. math:: \text{min-count} <= \sum_i 1\{\text{min-value} <= c_i <= \text{max-value}\} <= \text{max-count}, where :math:`c_i` is the count for the current k-mer (see :ref:`indexing counts`). If no counts are associated with the column, :math:`c_i = 1` for every set bit and :math:`0` otherwise. If this sum falls within specified :math:`\text{min-count}` and :math:`\text{max-count}`, the bit in the aggregated column for this k-mer is set to 1, and the value of the sum is written as the count associated with that bit. In other words, the output aggregated column is always supplemented with a count vector, which can be interpreted as normal k-mer counts. For instance, consider the following example. 1. Suppose, first we want to compute in how many columns each k-mer occurs:: metagraph transform_anno --aggregate-columns -o out annotation.column.annodbg 2. Then we can inspect the histogram of k-mer frequencies to, say, find an appropriate threshold for maximum frequency:: metagraph stats -a out.column.annodbg --print-counts-hist 3. Finally, after inspecting the histogram and selecting a reasonable threshold (suppose, we decided to filter out all k-mers that occur in more than 10 columns), we can apply it to the aggregated column as to a normal original column with counts:: metagraph transform_anno --aggregate-columns -o rare_kmers \ --max-value 10 out.column.annodbg which generates the final column ``rare_kmers.column.annodbg`` with the mask indicating all k-mers occurring in 10 or fewer input columns in the original file ``annotation.column.annodbg``.