.. _faq:

==========================
Frequently Asked Questions
==========================

General
=======

What is MetaGraph?
------------------

MetaGraph is a framework for scalable construction and querying of very large annotated de Bruijn graphs. It uses succinct representations to index sequence collections efficiently, requiring only 2-4 bits per k-mer.

**Key capabilities:**

- Index trillions of k-mers with minimal memory
- Query sequences with exact matching or alignment
- Annotate k-mers with labels, counts, or coordinates
- Support DNA, RNA, and protein sequences

When should I use MetaGraph?
-----------------------------

**Good for:**

- Large-scale sequence search (millions of samples)
- Metagenomic classification
- Pangenome analysis
- Multi-sample comparison

**Not ideal for:**

- Single small datasets (MetaGraph might be overkill; use BLAST instead)
- Full alignment profile or exact alignment (use dedicated aligners)

Installation
============

Which installation method?
--------------------------

1. **Conda** (easiest): ``conda install -c bioconda metagraph``
2. **Docker** (isolated): ``docker run ghcr.io/ratschlab/metagraph:master``
3. **Source** (for custom alphabets or latest features)

See :ref:`installation`.

Which alphabet?
---------------

- **DNA** (default): DNA/RNA sequences (A,C,G,T)
- **DNA5**: With ambiguous bases (A,C,G,T,N)
- **DNA_CASE_SENSITIVE**: When case matters (A,C,G,T,N,a,c,g,t)
- **Protein**: Amino acid sequences (A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,Z,X)

Graph Construction
==================

What k-mer size?
----------------

**General rules:**

- DNA/RNA: k=21-41 (default: 31)
- Protein: k=7-15 (default: 10)
- Longer k = more specific, less sensitive
- Must be ≤ sequence/read length

How many threads to use?
------------------------

Most routines in MetaGraph are well parallelized, so more threads generally help. Using more threads than the number of available cores is not recommended.

What buffer size?
-----------------

When constructing very large graphs, it is recommended to use disk swap to limit the RAM usage::

    metagraph build --disk-swap --mem-cap-gb 10

- For small graphs: 1 GB is enough (default)
- For large graphs: the more, the better (50-80 GB is always sufficient, even for trillions of k-mers)
- Larger than 80 GB is **not** recommended
- The directory passed to ``--disk-swap`` may be on a spinning disk (an SSD might help but is not necessary)

Canonical vs Primary graph?
----------------------------

**Workflow:**

1. Build a **canonical** graph from reads (contains each k-mer and its reverse complement)
2. Extract primary contigs: ``metagraph transform --to-fasta --primary-kmers``
3. Build a **primary** graph (half the size, same information)

**Important:** Use primary graphs with RowDiff annotations for best compression.

How to handle large datasets?
------------------------------

1. **Use disk swap:**

   .. code-block:: bash

      metagraph build -k 31 --disk-swap /tmp --mem-cap-gb 50 input.fa

2. **Build sample graphs separately:**

   .. code-block:: bash

      # Per sample
      for sample in samples/*.fasta.gz; do
          metagraph build -k 31 --mode basic -o "$sample" "$sample"
          metagraph transform --to-fasta --primary-kmers -o "${sample}.contigs" "${sample}.dbg"
      done

      # Joint graph from all contigs
      ls samples/*.contigs.fasta.gz | metagraph build -k 31 --mode canonical -o joint

      # Extract primary contigs from the joint graph
      metagraph transform --to-fasta --primary-kmers -o joint_contigs_primary joint.dbg

      # Joint primary graph from all contigs
      metagraph build -k 31 --mode primary -o joint_primary joint_contigs_primary.fasta.gz

3. **Extremely large graphs:**

   Extremely large succinct graphs can be constructed by building their parts separately and writing them to disk on the fly with the flag ``--inplace``. In such cases, don't forget to index suffix ranges afterwards with ``metagraph transform --index-ranges ...``.

Annotation
==========

Which annotation type?
----------------------

**By scale:**

- In most scenarios, start with: ``column`` (default)
- Fast and easy to construct: ``row_flat``
- Fast queries and small: ``row_diff_flat``
- Very large scale: ``row_diff_brwt`` (best compression)

**By feature:**

- K-mer counts: ``int_brwt`` or ``row_diff_int_brwt``
- Coordinates: ``brwt_coord`` or ``row_diff_brwt_coord``

How to annotate with counts?
-----------------------------

**Recommended (for large data):**

.. code-block:: bash

   # Build weighted sample graph
   metagraph build -k 31 --count-kmers -o sample.graph sample.fasta.gz

   # Extract contigs with counts
   metagraph transform --to-fasta -o contigs sample.graph.dbg
   # Creates: contigs.fasta.gz + contigs.kmer_counts.gz

   # Annotate joint graph
   metagraph annotate -i joint.dbg --count-kmers \
       --anno-filename -o annotation contigs.fasta.gz

**Simple (not recommended for large data):**

.. code-block:: bash

   metagraph annotate -i graph.dbg --count-kmers \
       --anno-filename -o annotation input.fa

**What if my k-mer counts are very large?**

Pass the ``--count-width`` flag to specify the number of bits used to represent the counts. The default is 8 bits (max count 255). For example, with 12 bits, the max count is 4095:

.. code-block:: bash

   metagraph build -k 31 --count-kmers --count-width 12 -o sample.graph sample.fasta.gz
   metagraph transform --to-fasta -o contigs sample.graph.dbg
   metagraph annotate -i joint.dbg --count-kmers --count-width 12 \
       --anno-filename -o annotation contigs.fasta.gz

Querying
========

How to query?
-------------

**Command line:**

.. code-block:: bash

   # Presence/absence
   metagraph query -i graph.dbg -a annotation.annodbg query.fa

   # With threshold (80% k-mers must match)
   metagraph query --discovery-fraction 0.8 ...

   # K-mer counts
   metagraph query --query-mode counts -a annotation.int_brwt.annodbg ...

   # Coordinates
   metagraph query --query-mode coords -a annotation.brwt_coord.annodbg ...

**Python API:**

1. Start metagraph in the server mode:

   .. code-block:: bash

      metagraph server_query -i graph.dbg -a annotation.annodbg --port 5555 -p 10 --mmap

2. Query the index:

   .. code-block:: python

      from metagraph.client import GraphClient

      client = GraphClient('localhost', 5555, api_path='')
      results = client.search('ACGTACGT', discovery_fraction=0.8)

What is discovery_fraction?
----------------------------

Minimum fraction of k-mers that must match:

- **0.0**: Any match (at least 1 k-mer must match)
- **0.8**: ≥80% of k-mers match (recommended for classification)
- **1.0**: All k-mers match (strict)

How to align?
-------------

**Exact k-mer matching:**

.. code-block:: bash

   metagraph align --map -i graph.dbg reads.fa

**Sequence-to-graph alignment:**

.. code-block:: bash

   metagraph align -i graph.dbg reads.fa

   # With annotation (label-consistent paths)
   metagraph align -i graph.dbg -a annotation.annodbg reads.fa

   # Adjust sensitivity
   metagraph align --align-min-seed-length 15 --min-exact-match 0.7 ...

Performance
===========

Memory requirements?
--------------------

**Graph:**

- Default construction: for k=31, at least 16 bytes per k-mer plus overhead (1B k-mers ≈ 18 GB)
- Construction with disk swap: buffer size plus output graph size
- Succinct (stat): ~4 bits/k-mer (10B k-mers ≈ 5 GB)
- Succinct (small): ~2 bits/k-mer (10B k-mers ≈ 2.5 GB)

**Large-scale indexing:**

When indexing large-scale datasets, memory usage is determined mainly by the chosen buffer sizes. With carefully selected parameters, one can build and annotate graphs with trillions of k-mers. For real examples, see https://github.com/ratschlab/metagraph/blob/master/metagraph/experiments/large_index_scripts.md.

How to speed up construction?
------------------------------

1. **Use more threads:** ``-p 64``
2. **Pre-process the input sequences with KMC:** ``kmc -k31 -m40 -sm input.fasta.gz output /tmp/``
3. **Parallelize samples:** Build sample graphs in parallel, with 4-8 threads (``-p ...``) on each.

How to reduce size?
-------------------

1. **Use primary graph:** 50% smaller than canonical (only applies to indexing reads)
2. **Use RowDiff** for annotation (10-20% of the size of **Column**)
3. **Relax BRWT:** ``metagraph relax_brwt --relax-arity 32 ...`` (5-20% smaller)
4. **Transform graph:** ``metagraph transform --state small ...``
5. **Filter out low-abundance k-mers:** ``kmc -ci2 ...`` before building to remove singleton k-mers
6. **Use other graph cleaning techniques:** ``metagraph clean ...`` to remove sequencing errors

How to reduce RAM usage while querying?
---------------------------------------

- **Use memory mapping (--mmap):** see :ref:`memory_mapping`
- **Reduce the batch size:** see ``metagraph query --batch-size ...``
- **Alternative: Use disk-backed formats:** ``row_disk`` or ``row_diff_disk``

When to use memory mapping?
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- When the graphs are extremely large
- When the available RAM is limited

When not to use memory mapping?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Never use it with slow (e.g., spinning) disks, unless it's for a single stats check

Troubleshooting
===============

Top issues and solutions:

Installation fails
------------------

- **Conda:** Try installing in a fresh environment
- **Docker:** Pull the latest image: ``docker pull ghcr.io/ratschlab/metagraph:master``
- **Source:** Check the compiler version and the dependencies

See :ref:`installation` and :ref:`troubleshooting`.

Out of memory
-------------

**Solution:** Use disk swap

.. code-block:: bash

   metagraph build --disk-swap /tmp --mem-cap-gb 50 ...
   metagraph annotate --disk-swap /tmp --mem-cap-gb 10 ...

RowDiff transform generates no output
-------------------------------------

**Solution:** For annotations with coordinates and counts, the output files have the same names (``*.column.annodbg``) as the input files, so they should be written to a new directory. (See :ref:`transform_count_annotations`.)
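
One way to catch this situation before running the transform is to check whether any input file would share its path with an output file. The following is a minimal sketch using only the Python standard library; ``find_name_collisions`` and the example paths are hypothetical helpers for illustration and are not part of MetaGraph. It assumes, as described above, that outputs keep the base names of their inputs.

```python
from pathlib import Path


def find_name_collisions(input_files, output_dir):
    """Return the input files whose names would clash in output_dir.

    Assumption: each output is written to output_dir under the same
    base name as its input (e.g. sample.column.annodbg).
    """
    out = Path(output_dir).resolve()
    collisions = []
    for name in input_files:
        src = Path(name).resolve()
        target = out / src.name
        # A clash occurs if the output would overwrite the input itself,
        # or if a file with that name already exists in output_dir.
        if target == src or target.exists():
            collisions.append(str(name))
    return collisions


if __name__ == "__main__":
    inputs = ["/data/anno/sample.column.annodbg"]  # hypothetical path
    print(find_name_collisions(inputs, "/data/anno"))
```

An empty result means the chosen output directory is safe under the stated assumption; a non-empty result lists the inputs that call for a different output directory.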