assemble module

class mavis.assemble.Contig(sequence, score)[source]

Bases: object

add_mapped_sequence(read, multimap=1)[source]
complexity()[source]
remap_coverage()[source]
remap_depth(query_range=None)[source]

the average depth of remapped reads over a given range of the contig sequence

Parameters:query_range (Interval) – 1-based inclusive range
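
As an illustration of what an average depth over a 1-based inclusive range means, here is a minimal sketch. It assumes remapped reads can be summarised as (start, end) intervals on the contig, and it ignores any multimap weighting; the function name and data layout are hypothetical and are not taken from the MAVIS source.

```python
def average_depth(read_ranges, query_range):
    """Average number of reads covering each position of query_range.

    Both read_ranges and query_range use 1-based inclusive coordinates.
    This is an illustrative stand-in, not the MAVIS implementation.
    """
    q_start, q_end = query_range
    covered = 0
    for start, end in read_ranges:
        # number of positions shared by the read and the query range
        overlap = min(end, q_end) - max(start, q_start) + 1
        if overlap > 0:
            covered += overlap
    return covered / (q_end - q_start + 1)


# two reads: one spans the whole range, one covers only its second half
depth = average_depth([(1, 10), (6, 15)], (1, 10))
print(depth)
```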
remap_score()[source]
class mavis.assemble.DeBruijnGraph(data=None, **attr)[source]

Bases: networkx.classes.digraph.DiGraph

wrapper for a basic digraph that enforces edge weights

Initialize a graph with edges, name, graph attributes.

Parameters:
  • data (input graph) – Data to initialize graph. If data=None (default) an empty graph is created. The data can be an edge list, or any NetworkX graph object. If the corresponding optional Python packages are installed the data can also be a NumPy matrix or 2d ndarray, a SciPy sparse matrix, or a PyGraphviz graph.
  • name (string, optional (default='')) – An optional name for the graph.
  • attr (keyword arguments, optional (default= no attributes)) – Attributes to add to graph as key=value pairs.

See also

convert

Examples

>>> G = nx.Graph()   # or DiGraph, MultiGraph, MultiDiGraph, etc
>>> G = nx.Graph(name='my graph')
>>> e = [(1,2),(2,3),(3,4)] # list of edges
>>> G = nx.Graph(e)

Arbitrary graph attribute pairs (key=value) may be assigned

>>> G=nx.Graph(e, day="Friday")
>>> G.graph
{'day': 'Friday'}
add_edge(n1, n2, freq=1)[source]

add a given edge to the graph; if the edge already exists, add the frequency to the existing frequency count
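
The accumulate-on-repeat behaviour described above can be sketched with a plain dictionary. The class below is a hypothetical stand-in written with the standard library only; it is not the DeBruijnGraph class, and returning 0 for a missing edge is an assumption made for the sketch.

```python
class FreqDiGraph:
    """Hypothetical frequency-weighted digraph (not the MAVIS class)."""

    def __init__(self):
        # maps (source, target) -> accumulated frequency
        self.freq = {}

    def add_edge(self, n1, n2, freq=1):
        # a repeated edge adds to the existing count instead of replacing it
        self.freq[(n1, n2)] = self.freq.get((n1, n2), 0) + freq

    def get_edge_freq(self, n1, n2):
        # assumption: an edge that was never added has frequency 0
        return self.freq.get((n1, n2), 0)


g = FreqDiGraph()
g.add_edge('ATC', 'TCG')
g.add_edge('ATC', 'TCG', freq=2)
print(g.get_edge_freq('ATC', 'TCG'))
```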

all_edges(*nodes, data=False)[source]
get_edge_freq(n1, n2)[source]

returns the freq from the data attribute for a specified edge

get_sinks(subgraph=None)[source]

returns all nodes with an outgoing degree of zero

get_sources(subgraph=None)[source]

returns all nodes with an incoming degree of zero
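
A short sketch of how sources and sinks follow from edge direction. It works on a plain list of (source, target) pairs rather than on a graph object, so the function name and input format are assumptions for illustration only.

```python
def sources_and_sinks(edges):
    """Return (sources, sinks) for a list of directed (source, target) edges."""
    nodes = set()
    has_incoming = set()
    has_outgoing = set()
    for src, tgt in edges:
        nodes.update((src, tgt))
        has_outgoing.add(src)
        has_incoming.add(tgt)
    # sources have no incoming edge; sinks have no outgoing edge
    return nodes - has_incoming, nodes - has_outgoing


edges = [('AT', 'TC'), ('TC', 'CG'), ('TC', 'CA')]
sources, sinks = sources_and_sinks(edges)
print(sorted(sources), sorted(sinks))
```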

trim_forks_by_freq(min_weight)[source]

for all nodes in the graph, if the node has an out-degree > 1 and one of the outgoing edges has freq < min_weight, then that outgoing edge is deleted
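
The fork-trimming rule can be sketched on a dictionary of edge frequencies. This follows the description literally (every low-frequency edge leaving a forking node is dropped); the real method may treat edge cases differently, and the function name and data layout here are hypothetical.

```python
def trim_forks(edge_freq, min_weight):
    """Return a copy of edge_freq without low-frequency edges at forks.

    edge_freq maps (source, target) -> frequency.
    """
    # group the targets of each source node to find its out-degree
    out_edges = {}
    for src, tgt in edge_freq:
        out_edges.setdefault(src, []).append(tgt)

    kept = dict(edge_freq)
    for src, targets in out_edges.items():
        if len(targets) > 1:  # only nodes that fork are considered
            for tgt in targets:
                if edge_freq[(src, tgt)] < min_weight:
                    del kept[(src, tgt)]
    return kept


edge_freq = {('A', 'B'): 5, ('A', 'C'): 1, ('B', 'D'): 1}
trimmed = trim_forks(edge_freq, min_weight=3)
print(trimmed)
```

In this example the edge from B to D survives even though its frequency is below the threshold, because B has an out-degree of 1 and is therefore not a fork.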

trim_noncutting_paths_by_freq(min_weight)[source]

trim any low-weight edges where another path of higher weight exists between the source and target

trim_tails_by_freq(min_weight)[source]

for any paths where all edges are lower than the minimum weight, trim the path

Parameters:min_weight (int) – the minimum weight for an edge to be retained
mavis.assemble.assemble(sequences, kmer_size, min_edge_trim_weight=3, assembly_max_paths=20, assembly_min_uniq=0.01, min_complexity=0, log=<function <lambda>>, **kwargs)[source]

for a set of sequences, creates a DeBruijnGraph, simplifies trailing and leading paths where edges fall below a weight threshold, and then returns all possible unitigs/contigs

drops any sequences too small to fit the kmer size

Parameters:
  • sequences (list of str) – a list of strings/sequences to assemble
  • kmer_size – the size of the kmer to use (see assembly_kmer_size)
  • min_edge_trim_weight – see assembly_min_edge_trim_weight
  • remap_min_match – Minimum match percentage of the remapped read (based on the exact matches in the cigar)
  • remap_min_overlap – defaults to the kmer size. Minimum amount of overlap between the contig and the remapped read
  • min_contig_length – Minimum length of assembled contigs to attempt remapping reads to. Shorter contigs will be ignored
  • remap_min_exact_match – see assembly_min_exact_match_to_remap
  • assembly_max_paths – see assembly_max_paths
  • log (function) – the log function
Returns:

a list of putative contigs

Return type:

list of Contig
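
To make the graph-building step concrete, the sketch below counts De Bruijn edges for a set of sequences and skips any sequence shorter than the kmer size. It assumes the common construction in which each kmer contributes an edge from its prefix to its suffix (both of length kmer_size - 1); whether MAVIS uses exactly this construction is not stated here, and the function name is hypothetical.

```python
from collections import Counter


def build_edge_counts(sequences, kmer_size):
    """Count (prefix, suffix) edges over all kmers of all sequences."""
    counts = Counter()
    for seq in sequences:
        if len(seq) < kmer_size:
            continue  # too small to fit the kmer size
        for i in range(len(seq) - kmer_size + 1):
            kmer = seq[i:i + kmer_size]
            # assumption: edge from the first k-1 bases to the last k-1 bases
            counts[(kmer[:-1], kmer[1:])] += 1
    return counts


counts = build_edge_counts(['ATCGG', 'TCGGA', 'AT'], 3)
for edge, freq in sorted(counts.items()):
    print(edge, freq)
```

Edges shared by overlapping sequences accumulate higher counts, which is what makes a weight threshold such as min_edge_trim_weight useful for pruning poorly supported paths.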

mavis.assemble.digraph_connected_components(graph, subgraph=None)[source]

the networkx module does not support deriving connected components from digraphs (only simple graphs); this function assumes that connection != reachable, which means there is no difference between connected components in a simple graph and a digraph

Parameters:graph (networkx.DiGraph) – the input graph to gather components from
Returns:returns a list of components which are lists of node names
Return type:list of list
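
Treating connection as independent of direction amounts to finding components of the underlying undirected graph. The following is a minimal standard-library sketch of that idea using a breadth-first search; the function name and the node/edge list inputs are assumptions and do not mirror the signature above.

```python
from collections import deque


def components_ignoring_direction(nodes, edges):
    """Return connected components as lists of node names.

    Edge direction is ignored: a and b are connected if either
    (a, b) or (b, a) is an edge.
    """
    neighbours = {n: set() for n in nodes}
    for src, tgt in edges:
        neighbours[src].add(tgt)
        neighbours[tgt].add(src)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


components = components_ignoring_direction(
    ['a', 'b', 'c', 'd'], [('a', 'b'), ('c', 'b')]
)
print([sorted(c) for c in components])
```

Here c cannot be reached from a by following edge directions, yet both land in the same component because direction is ignored.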
mavis.assemble.filter_contigs(contigs, assembly_min_uniq=0.01)[source]

given a list of contigs, removes similar contigs to leave the highest (of the similar) scoring contig only

mavis.assemble.kmers(s, size)[source]

for a sequence, compute and return a list of all kmers of a specified size

Parameters:
  • s (str) – the input sequence
  • size (int) – the size of the kmers
Returns:

the list of kmers

Return type:

list of str

Example

>>> kmers('abcdef', 2)
['ab', 'bc', 'cd', 'de', 'ef']
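
A sliding window is enough to reproduce the behaviour shown in the example. The one-line sketch below is an assumption about how such a function could be written, not a copy of the MAVIS source.

```python
def kmers_sketch(s, size):
    """Return every substring of length `size`, in order of position."""
    # a sequence shorter than `size` yields an empty list
    return [s[i:i + size] for i in range(len(s) - size + 1)]


result = kmers_sketch('abcdef', 2)
print(result)
```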
mavis.assemble.pull_contigs_from_component(assembly, component, min_edge_trim_weight, assembly_max_paths, log=<mavis.util.Log object>)[source]

builds contigs from a connected component of the assembly DeBruijn graph

Parameters:
  • assembly (DeBruijnGraph) – the assembly graph
  • component (list) – list of nodes which make up the connected component
  • min_edge_trim_weight (int) – the minimum weight to not remove a non cutting edge/path
  • assembly_max_paths (int) – the maximum number of paths allowed before the graph is further simplified
  • log (function) – the log function
Returns:

the paths/contigs and their scores

Return type:

Dict of int by str