file_io module

module which holds all functions relating to loading reference files

class mavis.annotate.file_io.ReferenceFile(file_type, *filepaths, eager_load=False, assert_exists=False, **opt)[source]

Bases: object

Parameters:
  • *filepaths (str) – list of paths to load
  • file_type (str) – Type of file to load
  • eager_load (bool=False) – load the files immediately
  • assert_exists (bool=False) – check that all files exist
  • **opt – keyword arguments passed to the load function and used as part of the file cache key
Raises:FileNotFoundError – when assert_exists is set and an input file does not exist
CACHE = {}
LOAD_FUNCTIONS = {'aligner_reference': None, 'annotations': <function load_annotations>, 'dgv_annotation': <function load_masking_regions>, 'masking': <function load_masking_regions>, 'reference_genome': <function load_reference_genome>, 'template_metadata': <function load_templates>}

Mapping of file types (based on ENV name) to load functions

Type:dict
files_exist(not_empty=False)[source]
is_empty()[source]
is_loaded()[source]
load(ignore_cache=False, verbose=True)[source]

load (or return) the contents of a reference file and add it to the cache if enabled
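The caching behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not the MAVIS implementation: make_cache_key and load_cached are hypothetical names, and the exact key layout (file type, sorted paths, sorted options) is a guess based on the parameter descriptions.

```python
# Hypothetical sketch of load-or-return caching; not the MAVIS source.
CACHE = {}


def make_cache_key(file_type, filepaths, opt):
    """Build a hashable key from the file type, the paths and the load options.

    Assumption: option values are hashable (str, bool, int, None, ...).
    """
    return (file_type, tuple(sorted(filepaths)), tuple(sorted(opt.items())))


def load_cached(file_type, filepaths, loader, ignore_cache=False, **opt):
    """Return cached content when present; otherwise call loader and cache the result."""
    key = make_cache_key(file_type, filepaths, opt)
    if not ignore_cache and key in CACHE:
        return CACHE[key]
    content = loader(*filepaths, **opt)
    CACHE[key] = content
    return content
```

Because the options are part of the key, loading the same paths with different options produces separate cache entries.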

mavis.annotate.file_io.convert_tab_to_json(filepath, warn=<mavis.util.Log object>)[source]

given a file in the standard input format (see below), reads and returns the genes (and sub-objects) it describes

column name             example                    description
ensembl_transcript_id   ENST000001
ensembl_gene_id         ENSG000001
strand                  -1                         positive or negative 1
cdna_coding_start       44                         where translation begins relative to the start of the cdna
cdna_coding_end         150                        where translation terminates
genomic_exon_ranges     100-201;334-412;779-830    semi-colon delimited exon start/ends
AA_domain_ranges        DBD:220-251,260-271        semi-colon delimited list of domains
hugo_names              KRAS                       hugo gene name
Parameters:filepath (str) – path to the input tab-delimited file
Returns:a dictionary keyed by chromosome name with values of list of genes on the chromosome
Return type:dict of list of Gene by str

Example

>>> ref = load_reference_genes('filename')
>>> ref['1']
[Gene(), Gene(), ....]
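The two range columns in the input format can be parsed along the following lines. This is a minimal sketch, not the parser MAVIS uses: parse_exon_ranges and parse_domain_ranges are hypothetical helpers, and the assumption that regions within one domain are comma-separated is taken from the single example value in the column listing.

```python
def parse_exon_ranges(text):
    """Parse '100-201;334-412' into [(100, 201), (334, 412)]."""
    ranges = []
    for chunk in text.split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semi-colon
        start, end = chunk.split('-')
        ranges.append((int(start), int(end)))
    return ranges


def parse_domain_ranges(text):
    """Parse 'DBD:220-251,260-271' into {'DBD': [(220, 251), (260, 271)]}."""
    domains = {}
    for chunk in text.split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue
        name, regions = chunk.split(':', 1)
        for region in regions.split(','):
            start, end = region.split('-')
            domains.setdefault(name, []).append((int(start), int(end)))
    return domains


exons = parse_exon_ranges('100-201;334-412;779-830')
domains = parse_domain_ranges('DBD:220-251,260-271')
```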

Warning

does not load translations unless they start with ‘M’, end with ‘*’ and have a length that is a multiple of 3
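One possible reading of this warning is sketched below. The function name is hypothetical, and two points are assumptions rather than documented behaviour: that the length check applies to the coding region given by cdna_coding_start and cdna_coding_end (treated as inclusive), and that the ‘M’ and ‘*’ checks apply to the translated amino acid sequence.

```python
CODON_SIZE = 3


def translation_passes_checks(cdna_coding_start, cdna_coding_end, amino_acids):
    """Return True when a translation satisfies the three conditions in the warning.

    Assumptions: coding coordinates are inclusive; amino_acids is the translated
    sequence with '*' marking the stop codon.
    """
    coding_length = cdna_coding_end - cdna_coding_start + 1
    return (
        coding_length % CODON_SIZE == 0
        and amino_acids.startswith('M')
        and amino_acids.endswith('*')
    )
```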

mavis.annotate.file_io.load_annotations(*filepaths, warn=<mavis.util.Log object>, reference_genome=None, best_transcripts_only=False)[source]

loads gene models from an input file. Expects a tabbed or json file.

Parameters:
  • filepath (str) – path to the input file
  • verbose (bool) – output extra information to stdout
  • reference_genome (dict of Bio.SeqRecord by str) – dict of reference sequence by template/chr name
  • filetype (str) – json or tab/tsv. only required if the file type can’t be inferred from the path extension
Returns:

lists of genes keyed by chromosome name

Return type:

dict of list of Gene by str

mavis.annotate.file_io.load_masking_regions(*filepaths)[source]

reads a file of regions. The expected input format for the file is tab-delimited, and the header should contain the following columns:

  • chr: the chromosome
  • start: start of the region, 1-based inclusive
  • end: end of the region, 1-based inclusive
  • name: the name/label of the region

For example:

#chr    start   end     name
chr20   25600000        27500000        centromere
Parameters:filepath (str) – path to the input tab-delimited file
Returns:a dictionary keyed by chromosome name with values of lists of regions on the chromosome
Return type:dict of list of BioInterval by str

Example

>>> m = load_masking_regions('filename')
>>> m['1']
[BioInterval(), BioInterval(), ...]
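To make the file layout concrete, the sketch below reads the same format using only the standard library. It is a stand-in, not load_masking_regions itself: read_regions is a hypothetical helper, and it returns plain tuples instead of BioInterval objects.

```python
import csv
import io


def read_regions(handle):
    """Read tab-delimited regions and group them by chromosome name."""
    reader = csv.reader(handle, delimiter='\t')
    # the header may start with '#', as in the example above
    header = [column.lstrip('#') for column in next(reader)]
    regions = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        # start and end are 1-based inclusive, per the column descriptions
        regions.setdefault(record['chr'], []).append(
            (int(record['start']), int(record['end']), record['name'])
        )
    return regions


example = io.StringIO('#chr\tstart\tend\tname\nchr20\t25600000\t27500000\tcentromere\n')
regions = read_regions(example)
```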
mavis.annotate.file_io.load_reference_genes(*pos, **kwargs)[source]

Deprecated: use load_annotations() instead

mavis.annotate.file_io.load_reference_genome(*filepaths)[source]
Parameters:filepaths (list of str) – the paths to the files containing the input fasta genomes
Returns:a dictionary representing the sequences in the fasta file
Return type:dict of Bio.SeqRecord by str
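The shape of the returned mapping can be illustrated with a small standard-library FASTA reader. This is a sketch only: read_fasta is a hypothetical helper that maps sequence names to plain strings, whereas the function documented above returns Bio.SeqRecord values.

```python
import io


def read_fasta(handle):
    """Read FASTA text into a dict of sequence string keyed by sequence name."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # the name is the first whitespace-delimited token after '>'
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {key: ''.join(chunks) for key, chunks in parts.items()}


genome = read_fasta(io.StringIO('>chr1 example\nACGT\nTT\n>chr2\nGG\n'))
```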
mavis.annotate.file_io.load_templates(*filepaths)[source]

primarily useful if template drawings are required; it is not necessary otherwise. Assumes the input file is 0-indexed with [start,end) style. Columns are expected in the following order, tab-delimited. A header should not be given

  1. name
  2. start
  3. end
  4. band_name
  5. giemsa_stain

For example:

chr1    0       2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
Parameters:filename (str) – the path to the file with the cytoband template information
Returns:list of the templates loaded
Return type:list of Template
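The coordinate convention matters when combining this file with 1-based inclusive inputs such as the masking regions. The sketch below parses the five columns and converts each band from 0-indexed [start,end) to 1-based inclusive by adding one to the start. It is an illustration only: Band and read_bands are hypothetical names, and whether load_templates applies the same conversion is not stated here.

```python
import io
from collections import namedtuple

Band = namedtuple('Band', ['name', 'start', 'end', 'band_name', 'giemsa_stain'])


def read_bands(handle):
    """Read header-less, tab-delimited cytoband rows into a list of Band tuples."""
    bands = []
    for line in handle:
        line = line.rstrip('\r\n')
        if not line.strip():
            continue
        name, start, end, band_name, stain = line.split('\t')
        # 0-indexed half-open [start, end) becomes 1-based inclusive [start + 1, end]
        bands.append(Band(name, int(start) + 1, int(end), band_name, stain))
    return bands


example = io.StringIO(
    'chr1\t0\t2300000\tp36.33\tgneg\n'
    'chr1\t2300000\t5400000\tp36.32\tgpos25\n'
)
bands = read_bands(example)
```

After conversion, adjacent bands no longer share a boundary coordinate: the first ends at 2300000 and the second starts at 2300001.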
mavis.annotate.file_io.parse_annotations_json(data, reference_genome=None, best_transcripts_only=False, warn=<mavis.util.Log object>)[source]

parses a json of annotation information into annotation objects