mavis/bam/read

class SamRead

inherits pysam.AlignedSegment

Subclass to extend the pysam.AlignedSegment class adding some utility methods and convenient representations

Allows next_reference_name and reference_name to be set directly so that is does not depend on a bam header

SamRead.init()

def __init__(
    self, reference_name=None, next_reference_name=None, alignment_score=None, **kwargs
):

Args

reference_name
next_reference_name
alignment_score

SamRead.key()

uses a stored _key attribute, if available. This is to avoid the hash changing if the reference start (for example) is changed but also allow this attribute to be used and calculated for non SamRead objects

This way to change the hash behaviour the user must be explicit and use the set_key method

def key(self):

SamRead.deletion_sequences()

returns the reference sequences for all deletions

def deletion_sequences(self, reference_genome):

Args

reference_genome

SamRead.insertion_sequences()

returns the inserted sequence for all insertions

def insertion_sequences(self):

pileup()

For a given set of reads generate a pileup of all reads (excluding those for which the filter_func returns True)

def pileup(
    reads: Iterable[pysam.AlignedSegment], filter_func: Optional[Callable] = None
) -> List[Tuple[int, int]]:

Args

reads (Iterable[pysam.AlignedSegment]): reads to pileup
filter_func (Optional[Callable]): function which takes in a read and returns True if it should be ignored and False otherwise

Returns

List[Tuple[int, int]]: tuples of genomic position and read count at that position

Note

returns positions using 1-based indexing

breakpoint_pos()

assumes the breakpoint is the position following softclipping on the side with more softclipping (unless and orientation has been specified)

def breakpoint_pos(read: pysam.AlignedSegment, orient: str = ORIENT.NS) -> int:

Args

read (pysam.AlignedSegment): the read object
orient (str): the orientation

Returns

int: the position of the breakpoint in the input read

calculate_alignment_score()

calculates a score for comparing alignments

def calculate_alignment_score(read: pysam.AlignedSegment, consec_bonus=1) -> float:

Args

read (pysam.AlignedSegment): the input read
consec_bonus

Returns

float: the score

nsb_align()

given some reference string and a smaller sequence string computes the best non-space-breaking alignment i.e. an alignment that does not allow for indels (straight-match). Positions in the aligned segments are given relative to the length of the reference sequence (1-based)

def nsb_align(
    ref: str,
    seq: str,
    min_overlap_percent: float = 1,
    min_match: float = 0,
    min_consecutive_match=1,
    scoring_function: Callable = calculate_alignment_score,
) -> List[SamRead]:

Args

ref (str): the reference sequence
seq (str): the sequence being aligned
min_overlap_percent (float): the minimum amount of overlap of the input sequence to the reference should be a number between 0 and 1
min_match (float): the minimum number of matches compared to total
min_consecutive_match
scoring_function (Callable): any function that will take a read as input and return a float used in comparing alignments to choose the best alignment

Returns

List[SamRead]: list of aligned segments

Note

using a higher min_match may improve performance as low quality alignments are rejected more quickly. However this may also result in no match being returned when there is no high quality match to be found.

sequenced_strand()

determines the strand that was sequenced

def sequenced_strand(read: pysam.AlignedSegment, strand_determining_read: int = 2) -> str:

Args

read (pysam.AlignedSegment): the read being used to determine the strand
strand_determining_read (int): which read in the read pair is the same as the sequenced strand

Returns

str: the strand that was sequenced

Raises

ValueError: if strand_determining_read is not 1 or 2

Warning

if the input pair is unstranded the information will not be representative of the strand sequenced since the assumed convention is not followed

read_pair_type()

assumptions based on illumina pairs: only 4 possible combinations

def read_pair_type(read: pysam.AlignedSegment) -> str:

Args

read (pysam.AlignedSegment): the input read

Returns

str: the type of input read pair

Raises

NotImplementedError: for any read that does not fall into the four expected configurations (see below)

Note

++++> <---- is LR same-strand ++++> ++++> is LL opposite <---- <---- is RR opposite <---- ++++> is RL same-strand

orientation_supports_type()

checks if the orientation is compatible with the type of event

def orientation_supports_type(read: pysam.AlignedSegment, event_type: str) -> bool:

Args

read (pysam.AlignedSegment): a read from the pair
event_type (str): the type of event to check

Returns

bool: - True - the read pair is in the correct orientation for this event type - False - the read is not in the correct orientation

convert_events_to_softclipping()

given an alignment, simplifies the alignment by grouping everything past the first anchor and including the first event considered too large and unaligning them turning them into softclipping

def convert_events_to_softclipping(
    read: pysam.AlignedSegment,
    orientation: str,
    max_event_size: int,
    min_anchor_size: Optional[int] = None,
) -> pysam.AlignedSegment:

Args

read (pysam.AlignedSegment)
orientation (str)
max_event_size (int)
min_anchor_size (Optional[int])

Returns

pysam.AlignedSegment

sequence_complexity()

basic measure of sequence complexity

def sequence_complexity(seq: str) -> float:

Args

seq (str)

Returns

float

mavis/bam/read

class SamRead

SamRead.__init__()

SamRead.key()

SamRead.deletion_sequences()

SamRead.insertion_sequences()

pileup()

breakpoint_pos()

calculate_alignment_score()

nsb_align()

sequenced_strand()

read_pair_type()

orientation_supports_type()

convert_events_to_softclipping()

sequence_complexity()

SamRead.init()