genomic module

class mavis.annotate.genomic.Exon(start, end, transcript=None, name=None, intact_start_splice=True, intact_end_splice=True, seq=None, strand=None)[source]

Bases: mavis.annotate.base.BioInterval

Parameters:
  • start (int) – the genomic start position
  • end (int) – the genomic end position
  • name (str) – the name of the exon
  • transcript (PreTranscript) – the ‘parent’ transcript this exon belongs to
  • intact_start_splice (bool) – False if the starting splice site has been abrogated
  • intact_end_splice (bool) – False if the end splice site has been abrogated
Raises:

AttributeError – if the exon start > the exon end

Example

>>> Exon(15, 78)
acceptor

returns the genomic exonic position of the acceptor splice site

Type:int
acceptor_splice_site

the genomic range describing the acceptor splice site

Type:Interval
donor

returns the genomic exonic position of the donor splice site

Type:int
donor_splice_site

the genomic range describing the donor splice site

Type:Interval
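
The relationship between strand and the acceptor/donor positions can be sketched as follows. This is a minimal illustration and not the MAVIS implementation: the function name acceptor_and_donor is hypothetical, and it assumes the usual convention that the acceptor lies at the 5' end of the exon and the donor at its 3' end, read in the direction of transcription.

```python
def acceptor_and_donor(start, end, strand):
    """Return (acceptor, donor) exonic positions for one exon.

    Hypothetical helper, not part of MAVIS. Assumes the acceptor is the
    first exonic base and the donor the last exonic base when the exon
    is read in the direction of transcription.
    """
    if start > end:
        # mirrors the documented behaviour of raising on start > end
        raise AttributeError('exon start must be <= exon end')
    if strand == '+':
        return start, end
    if strand == '-':
        return end, start
    raise AttributeError('strand must be + or - to place the splice sites')
```

On the forward strand the acceptor is the exon start; on the reverse strand the two positions swap.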
transcript

the transcript this exon belongs to

Type:PreTranscript
class mavis.annotate.genomic.Gene(chr, start, end, name=None, strand='?', aliases=None, seq=None)[source]

Bases: mavis.annotate.base.BioInterval

Parameters:
  • chr (str) – the chromosome
  • name (str) – the gene name/id, e.g. ENSG0001
  • strand (STRAND) – the genomic strand, '+' or '-'
  • aliases (list of str) – a list of aliases. For example the hugo name could go here
  • seq (str) – genomic seq of the gene

Example

>>> Gene('X', 1, 1000, 'ENG0001', '+', ['KRAS'])
chr

returns the name of the chromosome that this gene resides on

get_seq(reference_genome, ignore_cache=False)[source]

the gene sequence is always given with respect to the positive (forward) strand, regardless of the gene strand

Parameters:
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input reference_genome
Returns:

the sequence of the gene

Return type:

str

key()[source]

see mavis.annotate.base.BioInterval.key()

spliced_transcripts

list of transcripts

Type:list of Transcript
to_dict()[source]

see mavis.annotate.base.BioInterval.to_dict()

transcript_priority(transcript)[source]

prioritizes transcripts from 0 to n-1, first by the best transcript flag and then by an alphanumeric sort of the transcript name

Warning

A lower number means a higher priority, so that sorting by priority works by default
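
A minimal sketch of this ordering, assuming a plain string comparison for the name sort. ToyTranscript and toy_priority are hypothetical names used only for illustration; they are not MAVIS classes or functions.

```python
from collections import namedtuple

# Hypothetical stand-in for a transcript; only the two fields used here.
ToyTranscript = namedtuple('ToyTranscript', ['name', 'is_best_transcript'])


def toy_priority(transcripts, transcript):
    """Return the rank (0 is highest priority) of one transcript.

    Best-flagged transcripts sort first, ties are broken by name.
    """
    ordered = sorted(
        transcripts,
        # False sorts before True, so negate the flag to put best first
        key=lambda t: (not t.is_best_transcript, t.name),
    )
    return ordered.index(transcript)


toy_transcripts = [
    ToyTranscript('ENST02', False),
    ToyTranscript('ENST01', False),
    ToyTranscript('ENST03', True),
]
```

Because the rank is ascending, passing it as a sort key places the best transcript at the front of the list.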

transcripts

list of unspliced transcripts

Type:list of PreTranscript
translations

list of translations

Type:list of Translation
class mavis.annotate.genomic.IntergenicRegion(chr, start, end, strand)[source]

Bases: mavis.annotate.base.BioInterval

Parameters:
  • chr (str) – the reference object/chromosome for this region
  • start (int) – the start of the IntergenicRegion
  • end (int) – the end of the IntergenicRegion
  • strand (STRAND) – the strand the region is defined on

Example

>>> IntergenicRegion('1', 1, 100, '+')
chr

returns the name of the chromosome that this region resides on

key()[source]

see mavis.annotate.base.BioInterval.key()

to_dict()[source]

see mavis.annotate.base.BioInterval.to_dict()

class mavis.annotate.genomic.PreTranscript(exons, gene=None, name=None, strand=None, spliced_transcripts=None, seq=None, is_best_transcript=False)[source]

Bases: mavis.annotate.base.BioInterval

creates a new transcript object

Parameters:
  • exons (list of Exon) – list of Exon that make up the transcript
  • genomic_start (int) – genomic start position of the transcript
  • genomic_end (int) – genomic end position of the transcript
  • gene (Gene) – the gene this transcript belongs to
  • name (str) – name of the transcript
  • strand (STRAND) – strand the transcript is on, defaults to the strand of the Gene if not specified
  • seq (str) – unspliced cDNA seq
convert_cdna_to_genomic(pos, splicing_pattern)[source]
Parameters:
  • pos (int) – cdna position
  • splicing_pattern (SplicingPattern) – list of genomic splice sites 3' 5' repeating
Returns:

the genomic equivalent

Return type:

int

convert_genomic_to_cdna(pos, splicing_pattern)[source]
Parameters:
  • pos (int) – the genomic position to be converted
  • splicing_pattern (SplicingPattern) – list of genomic splice sites 3' 5' repeating
Returns:

the cdna equivalent

Return type:

int

Raises:

IndexError – if the genomic position is not present in the cdna
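
The conversion can be pictured with the sketch below. It is not the MAVIS implementation: genomic_to_cdna is a hypothetical function, exons are plain (start, end) tuples rather than a SplicingPattern, and both genomic and cdna coordinates are assumed to be 1-based and inclusive.

```python
def genomic_to_cdna(exons, strand, pos):
    """Convert a genomic position to a 1-based cdna position.

    exons: list of non-overlapping (start, end) tuples, 1-based inclusive.
    Hypothetical helper for illustration only.
    """
    # walk the exons in the direction of transcription
    ordered = sorted(exons, reverse=(strand == '-'))
    offset = 0  # number of cdna bases contributed by earlier exons
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == '+' else end - pos
            return offset + within + 1
        offset += end - start + 1
    raise IndexError('genomic position is not present in the cdna', pos)
```

With exons (101, 110) and (201, 210), position 201 is the eleventh cdna base on the forward strand, while an intronic position such as 150 raises IndexError.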

convert_genomic_to_nearest_cdna(pos, splicing_pattern, stick_direction=None, allow_outside=True)[source]

converts a genomic position to its cdna equivalent or (if intronic) the nearest cdna position and the intronic shift

Parameters:
  • pos (int) – the genomic position
  • splicing_pattern (SplicingPattern) – the splicing pattern
Returns:

  • int - the exonic cdna position
  • int - the intronic shift

Return type:

tuple of int and int
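
One way to read the (cdna position, shift) pair is sketched below for the forward strand only. This is an assumption-laden illustration, not the MAVIS code: nearest_cdna is a hypothetical function, exons are plain (start, end) tuples, ties go to the upstream exon, and the stick_direction and allow_outside options are not modelled.

```python
def nearest_cdna(exons, pos):
    """Return (cdna position, shift) for a forward-strand genomic position.

    exons: sorted, non-overlapping (start, end) tuples, 1-based inclusive.
    shift is 0 inside an exon, positive when pos lies after the reported
    exonic base and negative when it lies before it.
    Hypothetical helper for illustration only.
    """
    if not exons:
        raise ValueError('at least one exon is required')
    offset = 0       # cdna bases contributed by exons already passed
    prev_end = None  # genomic end of the previous exon
    prev_cdna = None  # cdna position of that exon's last base
    for start, end in exons:
        if pos < start:
            to_next = start - pos
            if prev_end is not None and pos - prev_end <= to_next:
                # closer to (or equidistant from) the upstream exon
                return prev_cdna, pos - prev_end
            return offset + 1, -to_next
        if pos <= end:
            return offset + (pos - start) + 1, 0
        offset += end - start + 1
        prev_end, prev_cdna = end, offset
    # past the last exon: report relative to its final base
    return prev_cdna, pos - prev_end
```

In this sketch, positions before the first exon or after the last exon are reported relative to the first or last exonic base.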

exon_number(exon)[source]

exon numbering is based on the direction of translation

Parameters:exon (Exon) – the exon to be numbered
Returns:the exon number (1 based)
Return type:int
Raises:AttributeError – if the strand is not given or the exon does not belong to the transcript
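
A minimal sketch of strand-aware numbering, using plain (start, end) tuples in place of Exon objects. toy_exon_number is a hypothetical name and the code is an illustration under those assumptions rather than the MAVIS implementation.

```python
def toy_exon_number(exons, strand, exon):
    """Return the 1-based number of an exon within its transcript.

    exons: list of (start, end) tuples; exon: one of those tuples.
    Hypothetical helper for illustration only.
    """
    if strand not in ('+', '-'):
        raise AttributeError('strand must be given to number exons')
    if exon not in exons:
        raise AttributeError('exon does not belong to the transcript')
    # on the reverse strand the right-most exon comes first
    ordered = sorted(exons, reverse=(strand == '-'))
    return ordered.index(exon) + 1
```

The same exon therefore gets a different number depending on the strand of its transcript.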
gene

the gene this transcript belongs to

Type:Gene
generate_splicing_patterns()[source]

returns a list of splice sites to be connected as a splicing pattern

Returns:List of positions to be spliced together
Return type:list of SplicingPattern

see theory - predicting splicing patterns

get_cdna_seq(splicing_pattern, reference_genome=None, ignore_cache=False)[source]
Parameters:
  • splicing_pattern (SplicingPattern) – the list of splicing positions
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input reference_genome
Returns:

the spliced cDNA sequence

Return type:

str
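
The splicing step can be illustrated with the sketch below. It assumes, for illustration only, that the splicing positions are a flat list of genomic coordinates alternating between the last base of one exon and the first base of the next, and that the transcript is on the forward strand. toy_cdna_seq is a hypothetical function, not part of MAVIS.

```python
def toy_cdna_seq(unspliced_seq, tx_start, tx_end, splice_positions):
    """Join the exonic pieces of an unspliced forward-strand sequence.

    unspliced_seq covers genomic positions tx_start..tx_end (1-based,
    inclusive). Hypothetical helper for illustration only.
    """
    bounds = [tx_start] + sorted(splice_positions) + [tx_end]
    if len(bounds) % 2 != 0:
        raise ValueError('splice positions must come in pairs')
    pieces = []
    for i in range(0, len(bounds), 2):
        start, end = bounds[i], bounds[i + 1]
        # shift genomic coordinates to 0-based string indices
        pieces.append(unspliced_seq[start - tx_start:end - tx_start + 1])
    return ''.join(pieces)
```

For a 16-base transcript with splice positions 4 and 9, bases 5 to 8 are treated as the intron and removed.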

get_seq(reference_genome=None, ignore_cache=False)[source]
Parameters:
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input reference_genome
Returns:

the sequence of the transcript including introns (but relative to strand)

Return type:

str
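
"Relative to strand" means that a reverse-strand transcript is returned as the reverse complement of the reference slice. The sketch below shows the idea with a dict of plain strings standing in for the dict of Bio.SeqRecord objects; toy_transcript_seq is a hypothetical function and caching is not modelled.

```python
# translation table for complementing DNA bases
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def toy_transcript_seq(reference, chrom, start, end, strand):
    """Return the unspliced sequence of a region relative to its strand.

    reference: dict mapping chromosome name to a sequence string.
    start and end are 1-based and inclusive.
    Hypothetical helper for illustration only.
    """
    seq = reference[chrom][start - 1:end]
    if strand == '-':
        # complement each base, then reverse the result
        return seq.translate(COMPLEMENT)[::-1]
    return seq


toy_reference = {'1': 'ATGCCGTA'}
```

Here positions 2 to 5 of chromosome '1' read TGCC on the forward strand and GGCA on the reverse strand.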

transcripts

list of spliced transcripts

Type:list of Transcript
translations

list of translations associated with this transcript

Type:list of Translation
class mavis.annotate.genomic.Template(name, start, end, seq=None, bands=None)[source]

Bases: mavis.annotate.base.BioInterval

class mavis.annotate.genomic.Transcript(pre_transcript, splicing_patt, seq=None, translations=None)[source]

Bases: mavis.annotate.base.BioInterval

splicing pattern is given in genomic coordinates

Parameters:
  • pre_transcript (PreTranscript) – the unspliced transcript
  • splicing_patt (list of int) – the list of splicing positions
  • seq (str) – the cdna sequence
  • translations (list of Translation) – the list of translations of this transcript
convert_cdna_to_genomic(pos)[source]
Parameters:pos (int) – cdna position
Returns:the genomic equivalent
Return type:int
convert_genomic_to_cdna(pos)[source]
Parameters:pos (int) – the genomic position to be converted
Returns:the cdna equivalent
Return type:int
Raises:IndexError – if the genomic position is not present in the cdna
convert_genomic_to_nearest_cdna(pos, **kwargs)[source]
get_seq(reference_genome=None, ignore_cache=False)[source]
Parameters:
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input reference_genome
Returns:

the sequence corresponding to the spliced cdna

Return type:

str

unspliced_transcript

the unspliced transcript this splice variant belongs to

Type:PreTranscript