mavis/constants
module responsible for small utility functions and constants used throughout the structural_variant package
PROGNAME
PROGNAME: str = 'mavis'
EXIT_OK
EXIT_OK: int = 0
EXIT_ERROR
EXIT_ERROR: int = 1
EXIT_INCOMPLETE
EXIT_INCOMPLETE: int = 2
COMPLETE_STAMP
COMPLETE_STAMP: str = 'MAVIS.COMPLETE'
CODON_SIZE
CODON_SIZE: int = 3
GAP
GAP: str = '-'
NA_MAPPING_QUALITY
NA_MAPPING_QUALITY: int = 255
DNA_ALPHABET
DNA_ALPHABET = alphabet = Gapped(ambiguous_dna, '-')
alphabet
DNA_ALPHABET = alphabet = Gapped(ambiguous_dna, '-')
DNA_ALPHABET.match
DNA_ALPHABET.match = lambda x, y: _match_ambiguous_dna(x, y)
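The body of `_match_ambiguous_dna` is not shown in this reference. As a rough illustration of what matching over an ambiguous DNA alphabet can look like, the sketch below treats two symbols as matching when their IUPAC ambiguity codes share at least one concrete base. The function name `match_ambiguous_dna_sketch` and the rule for gaps are assumptions, not the MAVIS implementation.

```python
# IUPAC nucleotide ambiguity codes mapped to the concrete bases they represent.
IUPAC_DNA = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}


def match_ambiguous_dna_sketch(x: str, y: str) -> bool:
    """Hypothetical stand-in: True if two single-symbol codes share a concrete base."""
    x, y = x.upper(), y.upper()
    if x not in IUPAC_DNA or y not in IUPAC_DNA:
        # Assumption: symbols outside the IUPAC table (e.g. the gap '-')
        # only match themselves.
        return x == y
    return bool(set(IUPAC_DNA[x]) & set(IUPAC_DNA[y]))
```

The real function may differ, for example by being asymmetric between query and reference.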
START_AA
START_AA: str = 'M'
STOP_AA
STOP_AA: str = '*'
INTEGER_COLUMNS
INTEGER_COLUMNS = {
COLUMNS.break1_position_end,
COLUMNS.break1_position_start,
COLUMNS.break2_position_end,
COLUMNS.break2_position_start,
FLOAT_COLUMNS
FLOAT_COLUMNS = {
COLUMNS.break1_ewindow_count,
COLUMNS.break1_split_reads_forced,
COLUMNS.break1_split_reads,
COLUMNS.break2_ewindow_count,
COLUMNS.break2_split_reads_forced,
COLUMNS.break2_split_reads,
COLUMNS.cluster_size,
COLUMNS.contig_alignment_query_consumption,
COLUMNS.contig_alignment_rank,
COLUMNS.contig_alignment_score,
COLUMNS.contig_break1_read_depth,
COLUMNS.contig_break2_read_depth,
COLUMNS.contig_build_score,
COLUMNS.contig_read_depth,
COLUMNS.contig_remap_score,
COLUMNS.contig_remapped_reads,
COLUMNS.contigs_assembled,
COLUMNS.flanking_pairs_compatible,
COLUMNS.flanking_pairs,
COLUMNS.linking_split_reads,
COLUMNS.raw_break1_half_mapped_reads,
COLUMNS.raw_break1_split_reads,
COLUMNS.raw_break2_half_mapped_reads,
COLUMNS.raw_break2_split_reads,
COLUMNS.raw_flanking_pairs,
COLUMNS.raw_spanning_reads,
COLUMNS.repeat_count,
COLUMNS.spanning_reads,
BOOLEAN_COLUMNS
BOOLEAN_COLUMNS = {COLUMNS.opposing_strands, COLUMNS.stranded, COLUMNS.supplementary_call}
SUMMARY_LIST_COLUMNS
SUMMARY_LIST_COLUMNS = {
COLUMNS.annotation_figure,
COLUMNS.annotation_id,
COLUMNS.break1_split_reads,
COLUMNS.break2_split_reads,
COLUMNS.call_method,
COLUMNS.contig_alignment_score,
COLUMNS.contig_remapped_reads,
COLUMNS.contig_seq,
COLUMNS.event_type,
COLUMNS.flanking_pairs,
COLUMNS.pairing,
COLUMNS.product_id,
COLUMNS.spanning_reads,
COLUMNS.tools,
COLUMNS.tools,
COLUMNS.tracking_id,
COLUMNS.dgv,
COLUMNS.known_sv_count,
class SPLICE_TYPE
inherits MavisNamespace
holds controlled vocabulary for allowed splice type classification values
Attributes
- RETAIN (str): an intron was retained
- SKIP (str): an exon was skipped
- NORMAL (str): no exons were skipped and no introns were retained. The normal/expected splicing pattern was followed
- MULTI_RETAIN (str): multiple introns were retained
- MULTI_SKIP (str): multiple exons were skipped
- COMPLEX (str): some combination of exon skipping and intron retention
class ORIENT
inherits MavisNamespace
holds controlled vocabulary for allowed orientation values
Attributes
- LEFT (str): left with respect to the positive/forward strand
- RIGHT (str): right with respect to the positive/forward strand
- NS (str): orientation is not specified
class PROTOCOL
inherits MavisNamespace
holds controlled vocabulary for allowed protocol values
Attributes
- GENOME (str)
- TRANS (str)
class DISEASE_STATUS
inherits MavisNamespace
holds controlled vocabulary for allowed disease status
Attributes
- DISEASED (str)
- NORMAL (str)
class STRAND
inherits MavisNamespace
holds controlled vocabulary for allowed strand values
Attributes
- POS (str): the positive/forward strand
- NEG (str): the negative/reverse strand
- NS (str): strand is not specified
class SVTYPE
inherits MavisNamespace
holds controlled vocabulary for acceptable structural variant classifications
Attributes
- ITRANS (str)
- INV (str)
- INS (str)
- DUP (str)
class CIGAR
inherits MavisNamespace
Enum-like. For readable cigar values
Attributes
- M: alignment match (can be a sequence match or mismatch)
- I: insertion to the reference
- D: deletion from the reference
- N: skipped region from the reference
- S: soft clipping (clipped sequences present in SEQ)
- H: hard clipping (clipped sequences NOT present in SEQ)
- P: padding (silent deletion from padded reference)
- EQ: sequence match (=)
- X: sequence mismatch
Note
descriptions are taken from the samfile documentation (https://samtools.github.io/hts-specs/SAMv1.pdf)
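To make the attribute descriptions above concrete, here is a minimal sketch that sums how many reference and query bases a CIGAR consumes. The integer codes are the BAM operation codes from the SAM specification; that the `CIGAR` namespace uses the same integers is an assumption, and `CIGAR_OPS`, `reference_length` and `query_length` are hypothetical names.

```python
from typing import List, Tuple

# BAM operation codes from the SAM specification (assumed to mirror CIGAR.*).
CIGAR_OPS = {'M': 0, 'I': 1, 'D': 2, 'N': 3, 'S': 4, 'H': 5, 'P': 6, 'EQ': 7, 'X': 8}

# Per the SAM specification: which operations advance along the reference/query.
CONSUMES_REFERENCE = {CIGAR_OPS[op] for op in ('M', 'D', 'N', 'EQ', 'X')}
CONSUMES_QUERY = {CIGAR_OPS[op] for op in ('M', 'I', 'S', 'EQ', 'X')}


def reference_length(cigar: List[Tuple[int, int]]) -> int:
    """Number of reference bases spanned by a list of (operation, length) tuples."""
    return sum(length for op, length in cigar if op in CONSUMES_REFERENCE)


def query_length(cigar: List[Tuple[int, int]]) -> int:
    """Number of query bases (as present in SEQ) covered by the CIGAR."""
    return sum(length for op, length in cigar if op in CONSUMES_QUERY)


# 5S 10M 2D 3I 4= written as (operation, length) tuples
example = [(4, 5), (0, 10), (2, 2), (1, 3), (7, 4)]
```

For `example`, the reference span is 10 + 2 + 4 = 16 and the query length is 5 + 10 + 3 + 4 = 22.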
class PYSAM_READ_FLAGS
inherits MavisNamespace
Enum-like. For readable PYSAM flag constants
Attributes
- REVERSE (int): SEQ being reverse complemented
- MATE_REVERSE (int): SEQ of the next segment in the template being reverse complemented
- UNMAPPED (int): segment unmapped
- MATE_UNMAPPED (int): next segment in the template unmapped
- FIRST_IN_PAIR (int): the first segment in the template
- LAST_IN_PAIR (int): the last segment in the template
- SECONDARY (int): secondary alignment
- MULTIMAP (int): template having multiple segments in sequencing
- SUPPLEMENTARY (int): supplementary alignment
- TARGETED_ALIGNMENT (str)
- RECOMPUTED_CIGAR (str)
- BLAT_RANK (str)
- BLAT_SCORE (str)
- BLAT_ALIGNMENTS (str)
- BLAT_PERCENT_IDENTITY (str)
- BLAT_PMS (str)
Note
descriptions are taken from the samfile documentation (https://samtools.github.io/hts-specs/SAMv1.pdf)
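The integer attributes above read like the FLAG bits of the SAM specification. The sketch below decodes a FLAG value into readable names. The pairing of each attribute name with a bit is inferred from the descriptions, not taken from the MAVIS source, and `SAM_FLAG_BITS` and `decode_flag` are hypothetical names.

```python
from typing import List

# FLAG bits from the SAM specification, keyed by the attribute names above.
# Assumption: each name corresponds to the bit whose description it carries.
SAM_FLAG_BITS = {
    'MULTIMAP': 0x1,
    'UNMAPPED': 0x4,
    'MATE_UNMAPPED': 0x8,
    'REVERSE': 0x10,
    'MATE_REVERSE': 0x20,
    'FIRST_IN_PAIR': 0x40,
    'LAST_IN_PAIR': 0x80,
    'SECONDARY': 0x100,
    'SUPPLEMENTARY': 0x800,
}


def decode_flag(flag: int) -> List[str]:
    """Return the names of all known bits set in a SAM FLAG value."""
    return [name for name, bit in SAM_FLAG_BITS.items() if flag & bit]
```

Bits that are not listed above (for example 0x2, "each segment properly aligned") are ignored by this sketch.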
class FLAGS
inherits MavisNamespace
Attributes
- LQ (str)
class READ_PAIR_TYPE
inherits MavisNamespace
Attributes
- RR (str)
- LL (str)
- RL (str)
- LR (str)
class CALL_METHOD
inherits MavisNamespace
holds controlled vocabulary for allowed call methods
Attributes
- CONTIG (str): a contig was assembled and aligned across the breakpoints
- SPLIT (str): the event was called by split read
- FLANK (str): the event was called by flanking read pair
- SPAN (str): the event was called by spanning read
- INPUT (str)
class GENE_PRODUCT_TYPE
inherits MavisNamespace
controlled vocabulary for gene products
Attributes
- SENSE (str): the gene product is a sense fusion
- ANTI_SENSE (str): the gene product is anti-sense
class PRIME
inherits MavisNamespace
Attributes
- FIVE (int): five prime
- THREE (int): three prime
class GIEMSA_STAIN
inherits MavisNamespace
holds controlled vocabulary relating to stains of chromosome bands
Attributes
- GNEG (str)
- GPOS33 (str)
- GPOS50 (str)
- GPOS66 (str)
- GPOS75 (str)
- GPOS25 (str)
- GPOS100 (str)
- ACEN (str)
- GVAR (str)
- STALK (str)
class COLUMNS
inherits MavisNamespace
Column names for i/o files used throughout the pipeline
Attributes
- tracking_id (str)
- library (str)
- cluster_id (str)
- cluster_size (str)
- dgv (str)
- validation_id (str)
- annotation_id (str)
- product_id (str)
- event_type (str)
- pairing (str)
- inferred_pairing (str)
- gene1 (str)
- gene1_direction (str)
- gene2 (str)
- gene2_direction (str)
- gene1_aliases (str)
- gene2_aliases (str)
- gene_product_type (str)
- transcript1 (str)
- transcript2 (str)
- fusion_splicing_pattern (str)
- fusion_cdna_coding_start (str)
- fusion_cdna_coding_end (str)
- fusion_mapped_domains (str)
- fusion_sequence_fasta_id (str)
- fusion_sequence_fasta_file (str)
- fusion_protein_hgvs (str)
- annotation_figure (str)
- annotation_figure_legend (str)
- genes_encompassed (str)
- genes_overlapping_break1 (str)
- genes_overlapping_break2 (str)
- genes_proximal_to_break1 (str)
- genes_proximal_to_break2 (str)
- break1_chromosome (str)
- break1_position_start (str)
- break1_position_end (str)
- break1_orientation (str)
- exon_last_5prime (str)
- exon_first_3prime (str)
- break1_strand (str)
- break1_seq (str)
- break2_chromosome (str)
- break2_position_start (str)
- break2_position_end (str)
- break2_orientation (str)
- break2_strand (str)
- break2_seq (str)
- opposing_strands (str)
- stranded (str)
- protocol (str)
- disease_status (str)
- tools (str)
- call_method (str)
- break1_ewindow (str)
- break1_ewindow_count (str)
- break1_homologous_seq (str)
- break1_split_read_names (str)
- break1_split_reads (str)
- break1_split_reads_forced (str)
- break2_ewindow (str)
- break2_ewindow_count (str)
- break2_homologous_seq (str)
- break2_split_read_names (str)
- break2_split_reads (str)
- break2_split_reads_forced (str)
- contig_alignment_query_consumption (str)
- contig_alignment_score (str)
- contig_alignment_query_name (str)
- contig_read_depth (str)
- contig_break1_read_depth (str)
- contig_break2_read_depth (str)
- contig_alignment_rank (str)
- contig_build_score (str)
- contig_remap_score (str)
- contig_remap_coverage (str)
- contig_remapped_read_names (str)
- contig_remapped_reads (str)
- contig_seq (str)
- contig_strand_specific (str)
- contigs_assembled (str)
- call_sequence_complexity (str)
- known_sv_count (str)
- spanning_reads (str)
- spanning_read_names (str)
- flanking_median_fragment_size (str)
- flanking_pairs (str)
- flanking_pairs_compatible (str)
- flanking_pairs_read_names (str)
- flanking_pairs_compatible_read_names (str)
- flanking_stdev_fragment_size (str)
- linking_split_read_names (str)
- linking_split_reads (str)
- raw_break1_half_mapped_reads (str)
- raw_break1_split_reads (str)
- raw_break2_half_mapped_reads (str)
- raw_break2_split_reads (str)
- raw_flanking_pairs (str)
- raw_spanning_reads (str)
- untemplated_seq (str)
- filter_comment (str)
- cdna_synon (str)
- protein_synon (str)
- supplementary_call (str)
- net_size (str)
- repeat_count (str)
- assumed_untemplated (str)
float_fraction()
cast input to a float
def float_fraction(num):
Args
- num: input to cast
Returns
float: the input cast to a float
Raises
TypeError: if the input cannot be cast to a float or the number is not between 0 and 1
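A minimal sketch of the documented behaviour, assuming the bounds 0 and 1 are inclusive and that every failure is reported as a `TypeError` as stated above. The name `float_fraction_sketch` and the error message are hypothetical, and the actual function may be written differently.

```python
def float_fraction_sketch(num):
    """Cast ``num`` to a float and check that it lies in [0, 1]."""
    try:
        value = float(num)
    except (TypeError, ValueError):
        # Unify both casting failures under TypeError, as documented.
        raise TypeError('Must be a value between 0 and 1')
    # Assumption: the interval is closed, so 0 and 1 themselves are accepted.
    if not 0 <= value <= 1:
        raise TypeError('Must be a value between 0 and 1')
    return value
```

A function of this shape can be passed as the `type=` callable of an `argparse` argument, since `argparse` reports a `TypeError` raised by the callable as an invalid value.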
reverse_complement()
wrapper for the Bio.Seq reverse_complement method
def reverse_complement(s: str) -> str:
Args
- s (str): the input DNA sequence
Returns
str: the reverse complement of the input sequence
Examples
>>> reverse_complement('ATCCGGT')
'ACCGGAT'
Warning
assumes the input is a DNA sequence
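For readers without Biopython at hand, the same operation can be sketched with the standard library only. This is an illustration, not the wrapper itself: `reverse_complement_sketch` is a hypothetical name and, unlike Bio.Seq, it handles only A, C, G, T and N.

```python
# Translation table for complementing unambiguous DNA bases (plus N).
_COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')


def reverse_complement_sketch(s: str) -> str:
    """Complement each base, then reverse the sequence."""
    return s.translate(_COMPLEMENT)[::-1]
```

With this sketch, `reverse_complement_sketch('ATCCGGT')` gives `'ACCGGAT'`, matching the example above.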
translate()
given a DNA sequence, translates it and returns the protein amino acid sequence
def translate(s: str, reading_frame: int = 0) -> str:
Args
- s (str): the input DNA sequence
- reading_frame (int): where to start translating the sequence
Returns
str: the amino acid sequence
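The sketch below illustrates the idea with the standard genetic code and no external dependencies. It is not the MAVIS implementation. Assumptions: `reading_frame` is taken modulo `CODON_SIZE`, a trailing partial codon is dropped, and codons containing ambiguous bases are rendered as `X`. The name `translate_sketch` is hypothetical.

```python
from itertools import product

CODON_SIZE = 3
STOP_AA = '*'

# Standard genetic code, listed in TCAG order with the third base varying fastest.
_BASES = 'TCAG'
_AMINO_ACIDS = (
    'FFLLSSSSYY**CC*W'
    'LLLLPPPPHHQQRRRR'
    'IIIMTTTTNNKKSSRR'
    'VVVVAAAADDEEGGGG'
)
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=CODON_SIZE), _AMINO_ACIDS)
}


def translate_sketch(s: str, reading_frame: int = 0) -> str:
    """Translate a DNA sequence into amino acids, starting at ``reading_frame``."""
    seq = s.upper()[reading_frame % CODON_SIZE:]
    # Ignore any incomplete codon at the end of the sequence.
    usable = len(seq) - len(seq) % CODON_SIZE
    return ''.join(
        CODON_TABLE.get(seq[i:i + CODON_SIZE], 'X')  # 'X' for unknown codons
        for i in range(0, usable, CODON_SIZE)
    )
```

For example, `translate_sketch('ATGGCCTAA')` yields `'MA*'` (start codon, alanine, stop), and with `reading_frame=1` the same input yields `'WP'`.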