Snakemake workflows¶
Since each indexing workflow in MetaGraph comprises several steps, we provide automated pipelines to make the process easier and more straightforward for the most common scenarios.
Installation¶
The metagraph-workflows CLI ships as a separate Python package. The
metagraph conda recipe only installs the C++ binary, so the workflow
wrapper needs an extra pip install step alongside it:
conda create -n metagraph-workflows python=3.8
conda activate metagraph-workflows
conda install -c bioconda -c conda-forge metagraph # the metagraph binary
pip install -U "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows"
After this, the metagraph binary and the metagraph-workflows
command are both available on PATH.
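A quick way to confirm this from Python is sketched below, using only the standard library. The helper name missing_tools is illustrative and not part of the metagraph-workflows package.

```python
import shutil


def missing_tools(tools=("metagraph", "metagraph-workflows")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not on PATH:", ", ".join(missing))
    else:
        print("metagraph and metagraph-workflows are both on PATH")
```

An empty list means both commands resolve in the active conda environment.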
Creating graphs and annotations¶
A single command runs the full pipeline — graph construction, annotation, and all row-diff / BRWT transforms — for a set of samples:
metagraph-workflows build samples.txt -o /tmp/mygraph --primary
samples.txt is a text file listing input paths (one per line); a
directory of sample files also works. Process substitution is supported
too, so you can pipe a glob inline:
metagraph-workflows build <(ls /data/samples/*.fa) -o /tmp/mygraph --primary
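Where process substitution is not available (it is a bash/zsh feature), the list file can be written ahead of time instead. The following is a minimal sketch: the helper name write_sample_list, the *.fa pattern, and the choice to write resolved absolute paths are illustrative assumptions, not requirements of the workflow.

```python
from pathlib import Path


def write_sample_list(sample_dir, list_path, pattern="*.fa"):
    """Write one input path per line (sorted, resolved) and return the paths."""
    paths = sorted(
        p.resolve() for p in Path(sample_dir).glob(pattern) if p.is_file()
    )
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {sample_dir}")
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, write_sample_list("/data/samples", "samples.txt") produces a file that can be passed to metagraph-workflows build in place of the inline glob.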
The same pipeline can be invoked from a Python script:
from metagraph_workflows import cli
cli.run_workflow('/tmp/mygraph', samples='samples.txt', k=31, build_primary_graph=True)
The pipelines are written in the Snakemake workflow management system and can also be directly invoked using the snakemake command line tool (see below).
Usage¶
Typically, the following steps would be performed:
Prepare a list of input files (or a directory).
Construct a MetaGraph index: invoke metagraph-workflows build. Tell the workflow how much hardware is available and what kind of index you want; the per-stage memory caps, thread packing, and BRWT clustering parameters are derived automatically.
Important parameters you may want to set:
-p N and --mem-gb GB for the hardware budget
-k for k-mer length (default 31)
--primary for primary graph mode (recommended for most workloads)
--disk-swap-dir DIR to enable on-disk spill buffers
--anno-source (header or filename; default filename)
--anno-type FMT to choose / add output annotation formats
--with-counts or --with-coords for count- / coordinate-aware annotation (mutually exclusive)
--graph EXISTING.dbg to reuse an already-built graph and run only the annotation + transform stages
An example invocation:
metagraph-workflows build samples.txt -o /tmp/mygraph \
-k 31 --primary \
-p 34 --mem-gb 70 --disk-swap-dir /scratch/swap
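When the invocation is generated from a script, it can help to assemble the argument list programmatically. The sketch below only builds the command from the flags listed above and does not run anything; build_command and its keyword names are hypothetical and not part of the metagraph-workflows API.

```python
import shlex


def build_command(samples, out_dir, *, k=31, primary=True, threads=None,
                  mem_gb=None, disk_swap_dir=None,
                  with_counts=False, with_coords=False):
    """Assemble a `metagraph-workflows build` argument list (does not run it)."""
    if with_counts and with_coords:
        # The two annotation modes cannot be combined in this workflow.
        raise ValueError("--with-counts and --with-coords are mutually exclusive")
    cmd = ["metagraph-workflows", "build", str(samples),
           "-o", str(out_dir), "-k", str(k)]
    if primary:
        cmd.append("--primary")
    if threads is not None:
        cmd += ["-p", str(threads)]
    if mem_gb is not None:
        cmd += ["--mem-gb", str(mem_gb)]
    if disk_swap_dir is not None:
        cmd += ["--disk-swap-dir", str(disk_swap_dir)]
    if with_counts:
        cmd.append("--with-counts")
    if with_coords:
        cmd.append("--with-coords")
    return cmd


cmd = build_command("samples.txt", "/tmp/mygraph", threads=34, mem_gb=70,
                    disk_swap_dir="/scratch/swap")
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the command looks right.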
Count-aware annotations¶
The workflow supports these count-aware annotation formats:
int_brwt
row_diff_int_brwt
row_diff_int_disk
To enable counts explicitly, pass --with-counts. If no annotation format is specified,
the default switches from relax.row_diff_brwt to row_diff_int_brwt:
metagraph-workflows build transcript_paths.txt -o [OUTPUT_DIR] \
-k 31 --primary --with-counts --count-width 12
You can also select a count-aware format directly via --anno-type; this
automatically enables count-aware mode:
metagraph-workflows build transcript_paths.txt -o [OUTPUT_DIR] \
-k 31 --primary --anno-type row_diff_int_brwt --count-width 12
Use --count-width to control the stored numeric range for counts
(valid range: 2..32, default: 8).
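A rough way to choose a width for a given dataset is sketched below. It assumes that --count-width is a number of bits, so that width w can represent counts from 0 to 2^w - 1; that interpretation is not stated above, and the helper name min_count_width is illustrative.

```python
def min_count_width(max_count, lo=2, hi=32):
    """Smallest width in [lo, hi] whose range 0..2**w - 1 holds max_count.

    Assumes the width is a bit count. Returns `hi` when max_count does not
    fit even at the widest setting, in which case larger counts would not
    be representable.
    """
    if max_count < 0:
        raise ValueError("max_count must be non-negative")
    width = max(lo, int(max_count).bit_length())
    return min(width, hi)
```

Under this assumption, the default width of 8 covers counts up to 255, and the --count-width 12 used in the examples covers counts up to 4095.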
When reusing an output directory, the workflow keeps count and non-count intermediates in separate mode-specific directories to avoid stale artifact reuse.
Coordinate-aware annotations¶
The workflow supports these coordinate-aware annotation formats:
brwt_coord
row_diff_coord
row_diff_brwt_coord
row_diff_disk_coord
To enable coordinates explicitly, pass --with-coords. If no annotation format is specified,
the default switches from relax.row_diff_brwt to row_diff_brwt_coord:
metagraph-workflows build transcript_paths.txt -o [OUTPUT_DIR] \
-k 31 --with-coords
You can also select a coordinate-aware format directly via --anno-type; this
automatically enables coordinate-aware mode:
metagraph-workflows build transcript_paths.txt -o [OUTPUT_DIR] \
-k 31 --anno-type row_diff_brwt_coord
Coordinates are typically indexed for reference sequences, where preserving the original sequence context is important. For this use case, primary graph mode is usually not recommended.
When coordinate-aware annotation is built with --anno-source filename
(the default; one column per input file), the workflow additionally builds
a CoordToHeader sidecar at <graph>.seqs. This is a second
metagraph annotate --index-header-coords pass over the input fastas;
the pass uses the column order recovered from the final annotation via
metagraph stats --print-col-names so that file labels and coord
offsets stay aligned even after BRWT clustering reorders columns. The
sidecar lets metagraph query --query-mode coords and metagraph
align report hits as <header>/<N>:<positions> instead of file-based
coordinates. In --anno-source header mode each sequence already has
its own column, so no sidecar is needed and the rule is skipped.
The loader always looks at <graph>.seqs (it strips the full
annotation extension and appends .seqs), so a single sidecar is
emitted regardless of how many coord formats are requested. When
multiple are requested, the sidecar is derived from the first one in
the list – pick that format when querying with coords.
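The path rule described above can be sketched as follows. This is an illustration, not the loader's code: it assumes annotation files are named <graph>.<format>.annodbg, and the helper name sidecar_path is hypothetical.

```python
# Coordinate-aware formats listed in this section.
COORD_FORMATS = (
    "brwt_coord",
    "row_diff_coord",
    "row_diff_brwt_coord",
    "row_diff_disk_coord",
)


def sidecar_path(annotation_path):
    """Strip the assumed '.<format>.annodbg' extension and append '.seqs'."""
    stem = annotation_path
    if stem.endswith(".annodbg"):
        stem = stem[: -len(".annodbg")]
    for fmt in COORD_FORMATS:
        # The leading dot keeps 'brwt_coord' from matching inside
        # 'row_diff_brwt_coord'.
        if stem.endswith("." + fmt):
            stem = stem[: -len(fmt) - 1]
            break
    return stem + ".seqs"
```

Under that naming assumption, every coordinate format built for the same graph maps to the same sidecar path, which is why only one sidecar is written.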
Count-aware and coordinate-aware modes are mutually exclusive in this workflow.
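The mode-selection rules from the two sections above can be summarised in a short sketch. This is not the workflow's own code: resolve_mode is a hypothetical helper, and the mode names binary, counts and coords are borrowed from the output directory names for illustration.

```python
COUNT_FORMATS = {"int_brwt", "row_diff_int_brwt", "row_diff_int_disk"}
COORD_FORMATS = {"brwt_coord", "row_diff_coord",
                 "row_diff_brwt_coord", "row_diff_disk_coord"}
DEFAULT_FORMAT = {
    "binary": "relax.row_diff_brwt",
    "counts": "row_diff_int_brwt",
    "coords": "row_diff_brwt_coord",
}


def resolve_mode(anno_types=(), with_counts=False, with_coords=False):
    """Return (mode, anno_types) following the rules described above."""
    requested = set(anno_types)
    # A count- or coordinate-aware format implies the corresponding mode.
    counts = with_counts or bool(requested & COUNT_FORMATS)
    coords = with_coords or bool(requested & COORD_FORMATS)
    if counts and coords:
        raise ValueError(
            "count-aware and coordinate-aware modes are mutually exclusive")
    mode = "counts" if counts else "coords" if coords else "binary"
    if not anno_types:
        # No explicit format: fall back to the default for the mode.
        anno_types = (DEFAULT_FORMAT[mode],)
    return mode, tuple(anno_types)
```

For instance, requesting row_diff_int_brwt together with --with-coords is rejected, while --with-counts alone resolves to the row_diff_int_brwt default.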
Row-diff transform outputs¶
The annotate step writes columns.<mode>/<basename>.column.annodbg for each input sequence
file. Row-diff stages 0–2 then write under rd_cols.<mode>/ (for example rd_cols.binary/,
rd_cols.coords/, or rd_cols.counts.w8/ depending on configuration):
Binary annotation mode (no --with-counts / --with-coords): stage 2 emits <basename>.row_diff.annodbg per column (the RowDiffColumnAnnotator format).
Count or coordinate mode: stage 2 emits <basename>.column.annodbg and, when applicable, <basename>.column.annodbg.counts and/or <basename>.column.annodbg.coords.
See metagraph-workflows build -h for more details.
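For checking a finished or partially finished run, the file layout described above can be enumerated per input file. The sketch below makes two assumptions that are not stated above: the columns.<mode>/ and rd_cols.<mode>/ directories sit directly under the output directory, and the .counts / .coords companion files appear in count and coordinate mode respectively. The helper name expected_outputs is hypothetical.

```python
from pathlib import PurePosixPath


def expected_outputs(out_dir, basename, mode="binary"):
    """List the per-column files described above for one input file.

    `mode` is the directory suffix, e.g. "binary", "coords" or "counts.w8".
    """
    out = PurePosixPath(out_dir)
    # Annotate step: one column file per input sequence file.
    files = [out / f"columns.{mode}" / f"{basename}.column.annodbg"]
    rd_dir = out / f"rd_cols.{mode}"
    if mode == "binary":
        files.append(rd_dir / f"{basename}.row_diff.annodbg")
    else:
        files.append(rd_dir / f"{basename}.column.annodbg")
        if mode.startswith("counts"):
            files.append(rd_dir / f"{basename}.column.annodbg.counts")
        elif mode.startswith("coords"):
            files.append(rd_dir / f"{basename}.column.annodbg.coords")
    return [str(f) for f in files]
```

Comparing this list against the files actually present is one way to see which row-diff stage a run has reached.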
Once a MetaGraph index has been created, it can be queried either with the
metagraph command-line tool or by starting the MetaGraph server, on a laptop or on another suitable machine, and querying it with the Python API client.
There is also a Jupyter notebook showing the whole process, from indexing to API querying, on a simple example.
Workflow management¶
The following snakemake options are exposed in the build subcommand:
--dryrun: see what workflow steps would be done
--force (corresponds to --forceall in snakemake): force run all steps
Directly invoking Snakemake workflow¶
The metagraph-workflows command is only a wrapper around a snakemake workflow. You can also
directly invoke the snakemake workflow (assuming you checked out the metagraph git repository):
cd metagraph/workflows
snakemake --forceall --configfile default.yml \
--config k=5 seqs_file_list_path='transcript_paths.txt' output_directory=/tmp/mygraph \
annotation_labels_source=header --cores 2
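The same invocation can be assembled from a Python dictionary, which is convenient when the config values come from another script. This is a minimal sketch that only builds the argument list; snakemake_command is a hypothetical helper, and the config keys are the ones used in the command above.

```python
import shlex


def snakemake_command(config, *, configfile="default.yml", cores=2,
                      forceall=False):
    """Build a snakemake argument list with key=value config overrides."""
    cmd = ["snakemake"]
    if forceall:
        cmd.append("--forceall")
    cmd += ["--configfile", configfile, "--config"]
    cmd += [f"{key}={value}" for key, value in config.items()]
    cmd += ["--cores", str(cores)]
    return cmd


config = {
    "k": 5,
    "seqs_file_list_path": "transcript_paths.txt",
    "output_directory": "/tmp/mygraph",
    "annotation_labels_source": "header",
}
print(shlex.join(snakemake_command(config, forceall=True)))
```

The list is meant to be run from metagraph/workflows, for example with subprocess.run(cmd, cwd="metagraph/workflows", check=True).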