mavis/constants
module responsible for small utility functions and constants used throughout the structural_variant package
PROGNAME
PROGNAME: str = 'mavis'
EXIT_OK
EXIT_OK: int = 0
EXIT_ERROR
EXIT_ERROR: int = 1
EXIT_INCOMPLETE
EXIT_INCOMPLETE: int = 2
COMPLETE_STAMP
COMPLETE_STAMP: str = 'MAVIS.COMPLETE'
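One way the exit codes and the completion stamp could work together is sketched below: a job reports `EXIT_OK` only when the stamp file is present in its output directory. The helper `job_exit_code` and the directory layout are assumptions made for this illustration and are not part of the module; the constants are re-declared so the snippet runs on its own.

```python
import os
import tempfile

# Re-declared so the snippet is self-contained; values match the listing above.
EXIT_OK = 0
EXIT_INCOMPLETE = 2
COMPLETE_STAMP = 'MAVIS.COMPLETE'


def job_exit_code(output_dir: str) -> int:
    """Hypothetical helper: EXIT_OK if the stamp file exists, else EXIT_INCOMPLETE."""
    stamp = os.path.join(output_dir, COMPLETE_STAMP)
    return EXIT_OK if os.path.exists(stamp) else EXIT_INCOMPLETE


with tempfile.TemporaryDirectory() as tmp:
    before = job_exit_code(tmp)  # no stamp has been written yet
    open(os.path.join(tmp, COMPLETE_STAMP), 'w').close()
    after = job_exit_code(tmp)  # the stamp file is now present
```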
CODON_SIZE
CODON_SIZE: int = 3
GAP
GAP: str = '-'
NA_MAPPING_QUALITY
NA_MAPPING_QUALITY: int = 255
DNA_ALPHABET
DNA_ALPHABET = alphabet = Gapped(ambiguous_dna, '-')
alphabet
DNA_ALPHABET = alphabet = Gapped(ambiguous_dna, '-')
DNA_ALPHABET.match
DNA_ALPHABET.match = lambda x, y: _match_ambiguous_dna(x, y)
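The body of `_match_ambiguous_dna` is not shown here. A minimal sketch of what ambiguous matching usually means is given below, assuming two symbols match when the sets of bases they can represent under the IUPAC nucleotide codes overlap. The function name `match_ambiguous` and the lookup table are illustrative and may differ from the module's own helper.

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they can stand for.
IUPAC_DNA = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}


def match_ambiguous(x: str, y: str) -> bool:
    """Illustrative only: True if the two symbols share at least one possible base."""
    # Symbols outside the table (for example the gap '-') only match themselves.
    xset = set(IUPAC_DNA.get(x.upper(), x.upper()))
    yset = set(IUPAC_DNA.get(y.upper(), y.upper()))
    return bool(xset & yset)
```

Under this assumption `match_ambiguous('A', 'N')` is true because `N` covers every base, while `match_ambiguous('R', 'Y')` is false because purines and pyrimidines share no base.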
START_AA
START_AA: str = 'M'
STOP_AA
STOP_AA: str = '*'
INTEGER_COLUMNS
INTEGER_COLUMNS = {
COLUMNS.break1_position_end,
COLUMNS.break1_position_start,
COLUMNS.break2_position_end,
COLUMNS.break2_position_start,
FLOAT_COLUMNS
FLOAT_COLUMNS = {
COLUMNS.break1_ewindow_count,
COLUMNS.break1_split_reads_forced,
COLUMNS.break1_split_reads,
COLUMNS.break2_ewindow_count,
COLUMNS.break2_split_reads_forced,
COLUMNS.break2_split_reads,
COLUMNS.cluster_size,
COLUMNS.contig_alignment_query_consumption,
COLUMNS.contig_alignment_rank,
COLUMNS.contig_alignment_score,
COLUMNS.contig_break1_read_depth,
COLUMNS.contig_break2_read_depth,
COLUMNS.contig_build_score,
COLUMNS.contig_read_depth,
COLUMNS.contig_remap_score,
COLUMNS.contig_remapped_reads,
COLUMNS.contigs_assembled,
COLUMNS.flanking_pairs_compatible,
COLUMNS.flanking_pairs,
COLUMNS.linking_split_reads,
COLUMNS.raw_break1_half_mapped_reads,
COLUMNS.raw_break1_split_reads,
COLUMNS.raw_break2_half_mapped_reads,
COLUMNS.raw_break2_split_reads,
COLUMNS.raw_flanking_pairs,
COLUMNS.raw_spanning_reads,
COLUMNS.repeat_count,
COLUMNS.spanning_reads,
BOOLEAN_COLUMNS
BOOLEAN_COLUMNS = {COLUMNS.opposing_strands, COLUMNS.stranded, COLUMNS.supplementary_call}
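The three column sets above suggest a type-casting step when rows are read from a tab-delimited file. The sketch below shows that idea on a small subset of columns. It assumes each `COLUMNS` attribute evaluates to its own name and that booleans are written as `True`/`False`; `cast_row` is a hypothetical helper, not a function of this module.

```python
# Small stand-ins for the module's sets (assumption: column values equal their names).
INTEGER_COLUMNS = {'break1_position_start', 'break1_position_end'}
FLOAT_COLUMNS = {'cluster_size', 'contig_alignment_score'}
BOOLEAN_COLUMNS = {'opposing_strands', 'stranded', 'supplementary_call'}


def cast_row(row: dict) -> dict:
    """Hypothetical helper: cast string cells according to the column sets."""
    result = {}
    for column, value in row.items():
        if column in INTEGER_COLUMNS:
            result[column] = int(value)
        elif column in FLOAT_COLUMNS:
            result[column] = float(value)
        elif column in BOOLEAN_COLUMNS:
            # Assumed text encoding of booleans; anything other than 'true' is False.
            result[column] = value.strip().lower() == 'true'
        else:
            result[column] = value  # all other columns stay as strings
    return result


row = cast_row({
    'break1_position_start': '100',
    'cluster_size': '2.5',
    'stranded': 'True',
    'library': 'L1',
})
```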
SUMMARY_LIST_COLUMNS
SUMMARY_LIST_COLUMNS = {
COLUMNS.annotation_figure,
COLUMNS.annotation_id,
COLUMNS.break1_split_reads,
COLUMNS.break2_split_reads,
COLUMNS.call_method,
COLUMNS.contig_alignment_score,
COLUMNS.contig_remapped_reads,
COLUMNS.contig_seq,
COLUMNS.event_type,
COLUMNS.flanking_pairs,
COLUMNS.pairing,
COLUMNS.product_id,
COLUMNS.spanning_reads,
COLUMNS.tools,
COLUMNS.tracking_id,
COLUMNS.dgv,
COLUMNS.known_sv_count,
class SPLICE_TYPE
inherits MavisNamespace
holds controlled vocabulary for allowed splice type classification values
Attributes
- RETAIN (str): an intron was retained
- SKIP (str): an exon was skipped
- NORMAL (str): no exons were skipped and no introns were retained. the normal/expected splicing pattern was followed
- MULTI_RETAIN (str): multiple introns were retained
- MULTI_SKIP (str): multiple exons were skipped
- COMPLEX (str): some combination of exon skipping and intron retention
class ORIENT
inherits MavisNamespace
holds controlled vocabulary for allowed orientation values
Attributes
- LEFT (str): left with respect to the positive/forward strand
- RIGHT (str): right with respect to the positive/forward strand
- NS (str): orientation is not specified
class PROTOCOL
inherits MavisNamespace
holds controlled vocabulary for allowed protocol values
Attributes
- GENOME (str)
- TRANS (str)
class DISEASE_STATUS
inherits MavisNamespace
holds controlled vocabulary for allowed disease status
Attributes
- DISEASED (str)
- NORMAL (str)
class STRAND
inherits MavisNamespace
holds controlled vocabulary for allowed strand values
Attributes
- POS (str): the positive/forward strand
- NEG (str): the negative/reverse strand
- NS (str): strand is not specified
class SVTYPE
inherits MavisNamespace
holds controlled vocabulary for acceptable structural variant classifications
Attributes
- ITRANS (str)
- INV (str)
- INS (str)
- DUP (str)
class CIGAR
inherits MavisNamespace
Enum-like. For readable cigar values
Attributes
- M: alignment match (can be a sequence match or mismatch)
- I: insertion to the reference
- D: deletion from the reference
- N: skipped region from the reference
- S: soft clipping (clipped sequences present in SEQ)
- H: hard clipping (clipped sequences NOT present in SEQ)
- P: padding (silent deletion from padded reference)
- EQ: sequence match (=)
- X: sequence mismatch
Note
descriptions are taken from the samfile documentation <https://samtools.github.io/hts-specs/SAMv1.pdf>
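To show how the operations listed above appear in practice, the sketch below parses a CIGAR string into `(operation, length)` tuples and computes how many reference bases the alignment covers. The integer codes follow the BAM encoding in the SAM specification; whether the `CIGAR` namespace uses these exact values is not stated above, so treat the mapping as an assumption. `parse_cigar` and `reference_span` are illustrative names.

```python
import re

# BAM operation codes from the SAM specification (assumed to match the namespace).
CIGAR_CODES = {'M': 0, 'I': 1, 'D': 2, 'N': 3, 'S': 4, 'H': 5, 'P': 6, '=': 7, 'X': 8}

# Operations that consume reference positions: M, D, N, = and X.
REFERENCE_CONSUMING = {0, 2, 3, 7, 8}


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string such as '5S10M' into (code, length) tuples."""
    return [
        (CIGAR_CODES[op], int(length))
        for length, op in re.findall(r'(\d+)([MIDNSHP=X])', cigar)
    ]


def reference_span(cigar_tuples: list) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(length for op, length in cigar_tuples if op in REFERENCE_CONSUMING)


tuples = parse_cigar('5S10M2I3D20M')
span = reference_span(tuples)  # 10 + 3 + 20 reference bases
```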
class PYSAM_READ_FLAGS
inherits MavisNamespace
Enum-like. For readable PYSAM flag constants
Attributes
- REVERSE (int): SEQ being reverse complemented
- MATE_REVERSE (int): SEQ of the next segment in the template being reverse complemented
- UNMAPPED (int): segment unmapped
- MATE_UNMAPPED (int): next segment in the template unmapped
- FIRST_IN_PAIR (int): the first segment in the template
- LAST_IN_PAIR (int): the last segment in the template
- SECONDARY (int): secondary alignment
- MULTIMAP (int): template having multiple segments in sequencing
- SUPPLEMENTARY (int): supplementary alignment
- TARGETED_ALIGNMENT (str)
- RECOMPUTED_CIGAR (str)
- BLAT_RANK (str)
- BLAT_SCORE (str)
- BLAT_ALIGNMENTS (str)
- BLAT_PERCENT_IDENTITY (str)
- BLAT_PMS (str)
Note
descriptions are taken from the samfile documentation <https://samtools.github.io/hts-specs/SAMv1.pdf>
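The integer flags above are bit masks, so several can be set on one read at once. The sketch below decodes a SAM FLAG value into the names used in this namespace. The bit values are the ones defined in the SAM specification; the listing above does not show the values stored in `PYSAM_READ_FLAGS`, so their equality is an assumption. `decode_flag` is a hypothetical helper.

```python
# FLAG bits as defined in the SAM specification (assumed to match the namespace).
SAM_FLAG_BITS = {
    'MULTIMAP': 0x1,
    'UNMAPPED': 0x4,
    'MATE_UNMAPPED': 0x8,
    'REVERSE': 0x10,
    'MATE_REVERSE': 0x20,
    'FIRST_IN_PAIR': 0x40,
    'LAST_IN_PAIR': 0x80,
    'SECONDARY': 0x100,
    'SUPPLEMENTARY': 0x800,
}


def decode_flag(flag: int) -> list:
    """Return the sorted names of all listed bits that are set in the FLAG value."""
    return sorted(name for name, bit in SAM_FLAG_BITS.items() if flag & bit)


# 83 = 1 + 2 + 16 + 64; bit 0x2 (properly aligned) is not part of the table above.
names = decode_flag(83)
```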
class FLAGS
inherits MavisNamespace
Attributes
- LQ (str)
class READ_PAIR_TYPE
inherits MavisNamespace
Attributes
- RR (str)
- LL (str)
- RL (str)
- LR (str)
class CALL_METHOD
inherits MavisNamespace
holds controlled vocabulary for allowed call methods
Attributes
- CONTIG (str): a contig was assembled and aligned across the breakpoints
- SPLIT (str): the event was called by split read
- FLANK (str): the event was called by flanking read pair
- SPAN (str): the event was called by spanning read
- INPUT (str)
class GENE_PRODUCT_TYPE
inherits MavisNamespace
controlled vocabulary for gene products
Attributes
- SENSE (str): the gene product is a sense fusion
- ANTI_SENSE (str): the gene product is anti-sense
class PRIME
inherits MavisNamespace
Attributes
- FIVE (int): five prime
- THREE (int): three prime
class GIEMSA_STAIN
inherits MavisNamespace
holds controlled vocabulary relating to stains of chromosome bands
Attributes
- GNEG (str)
- GPOS33 (str)
- GPOS50 (str)
- GPOS66 (str)
- GPOS75 (str)
- GPOS25 (str)
- GPOS100 (str)
- ACEN (str)
- GVAR (str)
- STALK (str)
class COLUMNS
inherits MavisNamespace
Column names for i/o files used throughout the pipeline
Attributes
- tracking_id (str)
- library (str)
- cluster_id (str)
- cluster_size (str)
- dgv (str)
- validation_id (str)
- annotation_id (str)
- product_id (str)
- event_type (str)
- pairing (str)
- inferred_pairing (str)
- gene1 (str)
- gene1_direction (str)
- gene2 (str)
- gene2_direction (str)
- gene1_aliases (str)
- gene2_aliases (str)
- gene_product_type (str)
- transcript1 (str)
- transcript2 (str)
- fusion_splicing_pattern (str)
- fusion_cdna_coding_start (str)
- fusion_cdna_coding_end (str)
- fusion_mapped_domains (str)
- fusion_sequence_fasta_id (str)
- fusion_sequence_fasta_file (str)
- fusion_protein_hgvs (str)
- annotation_figure (str)
- annotation_figure_legend (str)
- genes_encompassed (str)
- genes_overlapping_break1 (str)
- genes_overlapping_break2 (str)
- genes_proximal_to_break1 (str)
- genes_proximal_to_break2 (str)
- break1_chromosome (str)
- break1_position_start (str)
- break1_position_end (str)
- break1_orientation (str)
- exon_last_5prime (str)
- exon_first_3prime (str)
- break1_strand (str)
- break1_seq (str)
- break2_chromosome (str)
- break2_position_start (str)
- break2_position_end (str)
- break2_orientation (str)
- break2_strand (str)
- break2_seq (str)
- opposing_strands (str)
- stranded (str)
- protocol (str)
- disease_status (str)
- tools (str)
- call_method (str)
- break1_ewindow (str)
- break1_ewindow_count (str)
- break1_homologous_seq (str)
- break1_split_read_names (str)
- break1_split_reads (str)
- break1_split_reads_forced (str)
- break2_ewindow (str)
- break2_ewindow_count (str)
- break2_homologous_seq (str)
- break2_split_read_names (str)
- break2_split_reads (str)
- break2_split_reads_forced (str)
- contig_alignment_query_consumption (str)
- contig_alignment_score (str)
- contig_alignment_query_name (str)
- contig_read_depth (str)
- contig_break1_read_depth (str)
- contig_break2_read_depth (str)
- contig_alignment_rank (str)
- contig_build_score (str)
- contig_remap_score (str)
- contig_remap_coverage (str)
- contig_remapped_read_names (str)
- contig_remapped_reads (str)
- contig_seq (str)
- contig_strand_specific (str)
- contigs_assembled (str)
- call_sequence_complexity (str)
- known_sv_count (str)
- spanning_reads (str)
- spanning_read_names (str)
- flanking_median_fragment_size (str)
- flanking_pairs (str)
- flanking_pairs_compatible (str)
- flanking_pairs_read_names (str)
- flanking_pairs_compatible_read_names (str)
- flanking_stdev_fragment_size (str)
- linking_split_read_names (str)
- linking_split_reads (str)
- raw_break1_half_mapped_reads (str)
- raw_break1_split_reads (str)
- raw_break2_half_mapped_reads (str)
- raw_break2_split_reads (str)
- raw_flanking_pairs (str)
- raw_spanning_reads (str)
- untemplated_seq (str)
- filter_comment (str)
- cdna_synon (str)
- protein_synon (str)
- supplementary_call (str)
- net_size (str)
- repeat_count (str)
- assumed_untemplated (str)
float_fraction()
cast input to a float
def float_fraction(num):
Args
- num: input to cast
Returns
- float
Raises
- TypeError: if the input cannot be cast to a float or the number is not between 0 and 1
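The documented behaviour can be sketched as follows. This is an illustration under stated assumptions, not the module's source: the name `float_fraction_sketch` is made up, the bounds 0 and 1 are treated as inclusive, and a failed conversion is re-raised as `TypeError` to match the Raises entry above.

```python
def float_fraction_sketch(num) -> float:
    """Illustrative re-implementation: cast to float and require 0 <= value <= 1."""
    try:
        value = float(num)
    except (TypeError, ValueError):
        # float() raises ValueError for strings such as 'abc'; the docs promise TypeError.
        raise TypeError('cannot cast to a float', num) from None
    # The chained comparison also rejects NaN, which is neither >= 0 nor <= 1.
    if not 0 <= value <= 1:
        raise TypeError('must be a value between 0 and 1', num)
    return value
```

A function of this shape can be passed as the `type` argument of `argparse.ArgumentParser.add_argument`, which reports a `TypeError` raised by the converter as an invalid-value error.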
reverse_complement()
wrapper for the Bio.Seq reverse_complement method
def reverse_complement(s: str) -> str:
Args
- s (str): the input DNA sequence
Returns
- str: the reverse complement of the input sequence
Examples
>>> reverse_complement('ATCCGGT')
'ACCGGAT'
Warning
assumes the input is a DNA sequence
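For readers without Biopython at hand, the operation itself is small enough to sketch in plain Python. The version below handles only the four unambiguous bases in either case; `reverse_complement_sketch` is an illustrative name, and the documented function wraps Bio.Seq rather than using this code.

```python
# Translation table swapping each base with its complement (A<->T, C<->G).
_COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement_sketch(s: str) -> str:
    """Complement every base, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]


result = reverse_complement_sketch('ATCCGGT')  # same input as the example above
```

Applying the function twice returns the original sequence, which makes a convenient sanity check.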
translate()
given a DNA sequence, translates it and returns the protein amino acid sequence
def translate(s: str, reading_frame: int = 0) -> str:
Args
- s (str): the input DNA sequence
- reading_frame (int): where to start translating the sequence
Returns
- str: the amino acid sequence
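The effect of `reading_frame` is easiest to see in a small stand-alone version. The sketch below uses the standard genetic code and makes three assumptions that the description above does not confirm: `reading_frame` is an offset into the sequence, a trailing partial codon is dropped, and codons containing unknown symbols become `X`. The name `translate_sketch` is illustrative.

```python
from itertools import product

CODON_SIZE = 3

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
_BASES = 'TCAG'
_AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa for codon, aa in zip(product(_BASES, repeat=CODON_SIZE), _AMINO_ACIDS)
}


def translate_sketch(s: str, reading_frame: int = 0) -> str:
    """Translate DNA to protein, starting at offset reading_frame."""
    s = s.upper()[reading_frame:]
    usable = len(s) - len(s) % CODON_SIZE  # ignore a trailing partial codon
    return ''.join(
        CODON_TABLE.get(s[i:i + CODON_SIZE], 'X')  # 'X' for codons not in the table
        for i in range(0, usable, CODON_SIZE)
    )


frame0 = translate_sketch('ATGGCCTAA')     # ATG GCC TAA
frame1 = translate_sketch('ATGGCCTAA', 1)  # TGG CCT, trailing AA dropped
```

Here `frame0` starts with the `START_AA` value `M` and ends with the `STOP_AA` value `*`, while shifting the frame by one base yields an unrelated peptide.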