mavis/bam/cigar
holds methods related to processing cigar tuples. Cigar tuples are generally an iterable list of tuples where the first element in each tuple is the CIGAR value (i.e. 1 for an insertion), and the second value is the frequency
EVENT_STATES
EVENT_STATES = {CIGAR.D, CIGAR.I, CIGAR.X}
ALIGNED_STATES
ALIGNED_STATES = {CIGAR.M, CIGAR.X, CIGAR.EQ}
REFERENCE_ALIGNED_STATES
REFERENCE_ALIGNED_STATES = ALIGNED_STATES | {CIGAR.D, CIGAR.N}
QUERY_ALIGNED_STATES
QUERY_ALIGNED_STATES = ALIGNED_STATES | {CIGAR.I, CIGAR.S}
CLIPPING_STATE
CLIPPING_STATE = {CIGAR.S, CIGAR.H}
recompute_cigar_mismatch()
for cigar tuples where M is used, recompute to replace with X/= for increased utility and specificity
def recompute_cigar_mismatch(read: pysam.AlignedSegment, ref: str) -> CigarTuples:
Args
- read (
pysam.AlignedSegment
): the input read - ref (
str
): the reference sequence
Returns
- CigarTuples: the cigar tuple
longest_fuzzy_match()
computes the longest sequence of exact matches allowing for 'x' event interrupts
def longest_fuzzy_match(cigar: CigarTuples, max_fuzzy_interupt: int = 1) -> int:
Args
- cigar (CigarTuples): cigar tuples
- max_fuzzy_interupt (
int
): number of mismatches allowed
Returns
int
longest_exact_match()
returns the longest consecutive exact match
def longest_exact_match(cigar: CigarTuples) -> int:
Args
- cigar (CigarTuples): the cigar tuples
Returns
int
score()
scoring based on sw alignment properties with gap extension penalties
def score(cigar: CigarTuples, **kwargs) -> int:
Args
- cigar (CigarTuples): list of cigar tuple values
Returns
int
: the score value
match_percent()
calculates the percent of aligned bases (matches or mismatches) that are matches
def match_percent(cigar: CigarTuples) -> float:
Args
- cigar (CigarTuples)
Returns
float
join()
given a number of cigar lists, joins them and merges any consecutive tuples with the same cigar value
def join(*pos):
Examples
>>> join([(1, 1), (4, 7)], [(4, 3), (2, 4)])
[(1, 1), (4, 10), (2, 4)]
extend_softclipping()
given some input cigar, extends softclipping if there are mismatches/insertions/deletions close to the end of the aligned portion. The stopping point is defined by the min_exact_to_stop_softclipping parameter. this function will throw an error if there is no exact match aligned portion to signal stop
def extend_softclipping(
cigar: CigarTuples, min_exact_to_stop_softclipping: int
) -> Tuple[CigarTuples, int]:
Args
- cigar (CigarTuples)
- min_exact_to_stop_softclipping (
int
): number of exact matches to terminate extension
Returns
- Tuple[CigarTuples,
int
]: new cigar list and shift from the original start position
compute()
given a ref and alt sequence compute the cigar string representing the alt
returns the cigar tuples along with the start position of the alt relative to the ref
def compute(
ref: str, alt: str, force_softclipping: bool = True, min_exact_to_stop_softclipping: int = 6
) -> Tuple[CigarTuples, int]:
Args
- ref (
str
) - alt (
str
) - force_softclipping (
bool
) - min_exact_to_stop_softclipping (
int
)
Returns
- Tuple[CigarTuples,
int
]
convert_for_igv()
igv does not support the extended CIGAR values for match v mismatch
def convert_for_igv(cigar: CigarTuples) -> CigarTuples:
Args
- cigar (CigarTuples)
Returns
Examples
>>> convert_for_igv([(7, 4), (8, 1), (7, 5)])
[(0, 10)]
alignment_matches()
counts the number of aligned bases irrespective of match/mismatch this is equivalent to counting all CIGAR.M
def alignment_matches(cigar: CigarTuples) -> int:
Args
- cigar (CigarTuples)
Returns
int
merge_indels()
For a given cigar tuple, merges adjacent insertions/deletions
def merge_indels(cigar: CigarTuples) -> CigarTuples:
Args
- cigar (CigarTuples)
Returns
Examples
>>> merge_indels([(CIGAR.EQ, 10), (CIGAR.I, 3), (CIGAR.D, 4), (CIGAR.I, 2), (CIGAR.D, 2), (CIGAR.EQ, 10)])
[(CIGAR.EQ, 10), (CIGAR.I, 5), (CIGAR.D, 6), (CIGAR.EQ, 10)]
hgvs_standardize_cigar()
extend alignments as long as matches are possible. call insertions before deletions
def hgvs_standardize_cigar(read: pysam.AlignedSegment, reference_seq: str) -> CigarTuples:
Args
- read (
pysam.AlignedSegment
) - reference_seq (
str
)
Returns
convert_string_to_cigar()
Given a cigar string, converts it to the appropriate cigar tuple
def convert_string_to_cigar(string: str) -> CigarTuples:
Args
- string (
str
)
Returns
Examples
>>> convert_string_to_cigar('8M2I1D9X')
[(CIGAR.M, 8), (CIGAR.I, 2), (CIGAR.D, 1), (CIGAR.X, 9)]
merge_internal_events()
merges events (insertions, deletions, mismatches) within a cigar if they are between exact matches on either side (anchors) and separated by less exact matches than the given parameter
does not merge two mismatches, must contain a deletion/insertion
def merge_internal_events(
cigar: CigarTuples, inner_anchor: int = 10, outer_anchor: int = 10
) -> CigarTuples:
Args
- cigar (CigarTuples): a list of tuples of cigar states and counts
- inner_anchor (
int
): minimum number of consecutive exact matches separating events - outer_anchor (
int
): minimum consecutively aligned exact matches to anchor an end for merging
Returns
- CigarTuples: new list of cigar tuples with merged events
Examples
>>> merge_internal_events([(CIGAR.EQ, 10), (CIGAR.X, 1), (CIGAR.EQ, 2), (CIGAR.D, 1), (CIGAR.EQ, 10)])
[(CIGAR.EQ, 10), (CIGAR.I, 3), (CIGAR.D, 4), (CIGAR.EQ, 10)]