MetaGraph
Ultra Scalable Framework for DNA Search, Alignment, Assembly
About
The MetaGraph framework allows for indexing and analysis of very large biological sequence collections, producing compressed indexes that can represent several petabases of input data. The indexes can be efficiently queried with any query sequence of interest. Read more in the paper preprint.
Drawing on raw sequencing data available in public archives such as SRA or ENA, MetaGraph makes this treasure trove of information directly accessible to full-text search, helping to determine whether a given sequence has ever been observed before and, if so, in which context.
The API supports both exact k-mer matching and inexact search (alignment). Search results carry the annotations attached to the matches in the index, providing information on, e.g., the sample source or other associated metadata.
531,736 Plant SRA experiments (1.1 Pbp), 923 billion k-mers
121,900 Fungi SRA experiments (160 Tbp), 130 billion k-mers
446,506 Bacterial SRA experiments (221 Tbp), 39.5 billion k-mers
242,619 Human gut metagenome samples (725 Tbp), 297 billion k-mers
4,220 Microbial MetaSUB samples (7.2 Tbp), 35.2 billion k-mers
Indexing workflow
The MetaGraph framework is designed to work with a wide range of input data sets, from a few samples to the contents of entire archives with hundreds of thousands of records. The indexing workflow always follows the same principle: each input sample is transformed into a refined sample graph from which errors have been removed, and the sample graphs are then merged into a joint metagraph index. Each input sample is annotated in the joint index as a subgraph. This graph index, enriched with metadata, can then be used for downstream applications such as sequence search or differential assembly.
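The clean-then-merge principle can be illustrated with a toy sketch. This is not the MetaGraph implementation or its API: it assumes plain Python sets in place of compressed graphs, and it assumes a simple k-mer count threshold (the hypothetical `min_count` parameter) as a stand-in for error removal. The function names `clean_sample` and `build_joint_index` are invented for this example.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def clean_sample(reads, k, min_count=2):
    """Return the k-mers seen at least min_count times in one sample.

    Assumption: rare k-mers are treated as sequencing errors and dropped.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def build_joint_index(samples, k, min_count=2):
    """Merge cleaned per-sample k-mer sets into one k-mer -> labels map."""
    index = defaultdict(set)
    for label, reads in samples.items():
        for km in clean_sample(reads, k, min_count):
            index[km].add(label)  # annotate the k-mer with its sample
    return dict(index)


# Hypothetical toy input: two samples with a handful of short reads.
samples = {
    "sample_A": ["ACGTAC", "ACGTAC", "ACGTTT"],
    "sample_B": ["CGTACG", "CGTACG"],
}
index = build_joint_index(samples, k=4)
```

In this toy input, the k-mers `CGTT` and `GTTT` occur only once in `sample_A` and are therefore discarded before merging, while k-mers shared by both samples end up annotated with both labels.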
Sequence query
Once constructed, a MetaGraph index provides a powerful resource for data analysis. All information contained in the index can be efficiently retrieved via sequence search. The current framework offers two query modes. High-throughput exact k-mer matching intersects the k-mers present in the query with the k-mers in the index and returns the annotation labels of the intersection. Sequence-to-graph alignment returns all paths in the graph within a certain edit distance, together with their associated annotation labels.
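As a rough illustration of the exact-matching mode only, the sketch below intersects the k-mers of a query with a toy index and reports, per annotation label, the fraction of query k-mers found. The dictionary-based index, the function name `query_exact`, and the `min_fraction` parameter are assumptions made for this example and do not reflect the MetaGraph API. Sequence-to-graph alignment is not covered.

```python
from collections import Counter


def kmers(seq, k):
    """Return the list of all k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def query_exact(index, query, k, min_fraction=0.0):
    """Return {label: fraction of query k-mers carrying that label}.

    index is a plain dict mapping k-mer -> set of annotation labels.
    Labels below min_fraction are omitted from the result.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return {}  # query shorter than k: nothing to match
    hits = Counter()
    for km in query_kmers:
        for label in index.get(km, ()):
            hits[label] += 1
    total = len(query_kmers)
    return {
        label: n / total
        for label, n in hits.items()
        if n / total >= min_fraction
    }


# Hypothetical toy index: k-mer -> labels of the samples containing it.
toy_index = {
    "ACGT": {"sample_A"},
    "CGTA": {"sample_A", "sample_B"},
    "GTAC": {"sample_A", "sample_B"},
    "TACG": {"sample_B"},
}
result = query_exact(toy_index, "ACGTACG", k=4)
```

Here the query contains four 4-mers, and each sample matches three of them, so both labels are reported with a fraction of 0.75. Raising `min_fraction` filters out weakly matching labels.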
Differential assembly
Differential assembly is another concept proposed in MetaGraph. It aims to find sequences specific to certain groups of interest, for instance tissue-specific splice variants in RNA. As a special case, it can also be used for simple large-scale de novo genome assembly.
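The selection step behind this idea can be sketched on a toy annotated index: keep the k-mers that are common in an in-group of labels and rare in an out-group. This is a simplified illustration under stated assumptions, not MetaGraph's method. The function name `differential_kmers`, the thresholds `min_in` and `max_out`, and the sample labels are hypothetical, and the subsequent step of assembling the selected k-mers into sequences is omitted.

```python
def differential_kmers(index, in_group, out_group, min_in=1.0, max_out=0.0):
    """Select k-mers frequent in in_group and rare in out_group.

    index maps k-mer -> set of labels. A k-mer is kept when it occurs in
    at least a fraction min_in of the in-group labels and in at most a
    fraction max_out of the out-group labels.
    """
    in_group, out_group = set(in_group), set(out_group)
    if not in_group:
        raise ValueError("in_group must contain at least one label")
    selected = set()
    for km, labels in index.items():
        in_frac = len(labels & in_group) / len(in_group)
        # An empty out-group imposes no constraint.
        out_frac = len(labels & out_group) / len(out_group) if out_group else 0.0
        if in_frac >= min_in and out_frac <= max_out:
            selected.add(km)
    return selected


# Hypothetical toy index with two liver samples and one brain sample.
toy_index = {
    "ACGT": {"liver_1", "liver_2"},
    "CGTA": {"liver_1", "liver_2", "brain_1"},
    "GTAC": {"liver_1"},
    "TACG": {"brain_1"},
}
strict = differential_kmers(toy_index, ["liver_1", "liver_2"], ["brain_1"])
relaxed = differential_kmers(
    toy_index, ["liver_1", "liver_2"], ["brain_1"], min_in=0.5
)
```

With the strict defaults only `ACGT` is selected, since it is present in every liver sample and absent from the brain sample. Relaxing `min_in` to 0.5 also admits `GTAC`, which appears in one of the two liver samples.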