file_io module

Module holding the functions that load reference files.

class mavis.annotate.file_io.ReferenceFile(file_type, *filepaths, eager_load=False, assert_exists=False, **opt)

Bases: object

Parameters:
*filepaths (str) – list of paths to load
file_type (str) – type of file to load
eager_load (bool=False) – load the files immediately
assert_exists (bool=False) – check that all files exist
**opt – keyword arguments passed to the load function and used as part of the file cache key

Raises: FileNotFoundError: when assert_exists and an input does not exist

CACHE = {}

LOAD_FUNCTIONS = {'aligner_reference': None, 'annotations': <function load_annotations>, 'dgv_annotation': <function load_masking_regions>, 'masking': <function load_masking_regions>, 'reference_genome': <function load_reference_genome>, 'template_metadata': <function load_templates>}

Mapping of file types (based on ENV name) to load functions

Type:	`dict`

files_exist(not_empty=False)

is_empty()

is_loaded()

load(ignore_cache=False, verbose=True): load (or return) the contents of a reference file and add it to the cache if enabled
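The caching behaviour described for load() can be illustrated with a minimal sketch. This is not the MAVIS implementation: the names `cache_key` and `cached_load` are hypothetical, the key layout (file type, sorted paths, sorted options) is an assumption based on the parameter description above, and treating `ignore_cache=True` as bypassing the cache for both reading and writing is also an assumption.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-in for a class-level cache; not the MAVIS CACHE itself.
CACHE: Dict[Tuple, Any] = {}


def cache_key(file_type: str, filepaths: Tuple[str, ...], opt: Dict[str, Any]) -> Tuple:
    """Build a hashable key from the file type, the sorted paths, and the sorted options.

    Assumes every option value is hashable.
    """
    return (file_type, tuple(sorted(filepaths)), tuple(sorted(opt.items())))


def cached_load(
    loader: Callable[..., Any],
    file_type: str,
    *filepaths: str,
    ignore_cache: bool = False,
    **opt: Any,
) -> Any:
    """Return cached content when present; otherwise call the loader and store the result."""
    if ignore_cache:
        # Assumption: ignoring the cache skips both the lookup and the store.
        return loader(*filepaths, **opt)
    key = cache_key(file_type, filepaths, opt)
    if key not in CACHE:
        CACHE[key] = loader(*filepaths, **opt)
    return CACHE[key]
```

Because the options are part of the key, the same paths loaded with different keyword arguments are stored as separate entries in this sketch.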

mavis.annotate.file_io.convert_tab_to_json(filepath, warn=<mavis.util.Log object>)

Given a file in the standard input format (see below), reads it and returns a list of genes (and sub-objects).

column name	example	description
ensembl_transcript_id	ENST000001
ensembl_gene_id	ENSG000001
strand	-1	positive or negative 1
cdna_coding_start	44	where translation begins relative to the start of the cdna
cdna_coding_end	150	where translation terminates
genomic_exon_ranges	100-201;334-412;779-830	semi-colon delimited exon start/ends
AA_domain_ranges	DBD:220-251,260-271	semi-colon delimited list of domains
hugo_names	KRAS	hugo gene name

Parameters:	filepath (str) – path to the input tab-delimited file
Returns:	a dictionary keyed by chromosome name with values of list of genes on the chromosome
Return type:	`dict` of `list` of `Gene` by `str`

Example

>>> ref = load_reference_genes('filename')
>>> ref['1']
[Gene(), Gene(), ....]

Warning

Does not load translations unless they start with ‘M’, end with ‘*’, and have a length that is a multiple of 3.
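The range columns in the table above (`genomic_exon_ranges` and `AA_domain_ranges`) can be split with plain string handling. The sketch below is illustrative only and is not taken from MAVIS: the helper names `parse_ranges` and `parse_domains` are hypothetical, and the domain layout (`name:start-end,start-end`, with domains separated by semi-colons) is inferred from the single example value shown.

```python
from typing import Dict, List, Tuple


def parse_ranges(text: str, sep: str = ";") -> List[Tuple[int, int]]:
    """Split 'start-end' pairs joined by `sep` into (start, end) integer tuples."""
    ranges = []
    for chunk in text.split(sep):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty fields and trailing separators
        start, end = chunk.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def parse_domains(text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Parse 'NAME:start-end,start-end' entries separated by semi-colons."""
    domains: Dict[str, List[Tuple[int, int]]] = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, _, region_text = entry.partition(":")
        # A domain may span several regions, separated by commas.
        domains.setdefault(name, []).extend(parse_ranges(region_text, sep=","))
    return domains


exons = parse_ranges("100-201;334-412;779-830")
domains = parse_domains("DBD:220-251,260-271")
```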

mavis.annotate.file_io.load_annotations(*filepaths, warn=<mavis.util.Log object>, reference_genome=None, best_transcripts_only=False)

Loads gene models from an input file. Expects a tab-delimited or JSON file.

Parameters:
filepath (str) – path to the input file
verbose (bool) – output extra information to stdout
reference_genome (`dict` of `Bio.SeqRecord` by `str`) – dict of reference sequence by template/chr name
filetype (str) – json or tab/tsv. Only required if the file type can’t be inferred from the path extension
Returns:	lists of genes keyed by chromosome name
Return type:	`dict` of `list` of `Gene` by `str`

mavis.annotate.file_io.load_masking_regions(*filepaths)

Reads a file of regions. The expected input format for the file is tab-delimited, and the header should contain the following columns:

chr: the chromosome
start: start of the region, 1-based inclusive
end: end of the region, 1-based inclusive
name: the name/label of the region

For example:

#chr    start   end     name
chr20   25600000        27500000        centromere

Parameters:	filepath (str) – path to the input tab-delimited file
Returns:	a dictionary keyed by chromosome name with values of lists of regions on the chromosome
Return type:	`dict` of `list` of `BioInterval` by `str`

Example

>>> m = load_masking_regions('filename')
>>> m['1']
[BioInterval(), BioInterval(), ...]
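As a rough illustration of the file layout described above, the following sketch groups rows by chromosome using only the standard library. It is not the MAVIS loader: `read_regions` is a hypothetical name, it reads from an open text handle rather than a path, and it stores plain `(start, end, name)` tuples where the real function returns `BioInterval` objects.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple

Region = Tuple[int, int, str]  # (start, end, name), 1-based inclusive


def read_regions(handle: TextIO) -> Dict[str, List[Region]]:
    """Group tab-delimited region rows by chromosome name."""
    regions: Dict[str, List[Region]] = defaultdict(list)
    header = None
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if header is None:
            # The first non-empty row is the header; drop the leading '#'.
            header = [col.lstrip("#") for col in row]
            continue
        record = dict(zip(header, row))
        regions[record["chr"]].append(
            (int(record["start"]), int(record["end"]), record["name"])
        )
    return dict(regions)


example = io.StringIO("#chr\tstart\tend\tname\nchr20\t25600000\t27500000\tcentromere\n")
masks = read_regions(example)
```

Looking columns up by header name rather than by position means the sketch does not depend on the column order in the file.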

mavis.annotate.file_io.load_reference_genes(*pos, **kwargs): Deprecated. Use load_annotations() instead.

mavis.annotate.file_io.load_reference_genome(*filepaths)

Parameters:	filepaths (list of str) – the paths to the files containing the input fasta genomes
Returns:	a dictionary representing the sequences in the fasta file
Return type:	`dict` of `Bio.SeqRecord` by `str`

mavis.annotate.file_io.load_templates(*filepaths)

Primarily useful if template drawings are required; it is not necessary otherwise. Assumes the input file is 0-indexed with [start,end) style. Columns are expected in the following order, tab-delimited. A header should not be given.

name
start
end
band_name
giemsa_stain

For example:

chr1    0       2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25

Parameters:	filename (str) – the path to the file with the cytoband template information
Returns:	list of the templates loaded
Return type:	`list` of `Template`
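The headerless, 0-indexed [start,end) layout described above can be read as in the sketch below, which also shows the arithmetic for converting each band to 1-based inclusive coordinates (add one to the start, keep the end). This is an illustration and not the MAVIS code: `Band` and `read_bands` are hypothetical names, and whether load_templates() performs this conversion internally is not stated here.

```python
import io
from typing import List, NamedTuple, TextIO


class Band(NamedTuple):
    """One cytoband row, stored with 1-based inclusive coordinates."""
    template: str
    start: int
    end: int
    name: str
    stain: str


def read_bands(handle: TextIO) -> List[Band]:
    """Parse headerless, tab-delimited cytoband rows given as 0-indexed [start, end)."""
    bands = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        template, start, end, name, stain = line.split("\t")
        # 0-indexed half-open to 1-based inclusive: shift the start, keep the end.
        bands.append(Band(template, int(start) + 1, int(end), name, stain))
    return bands


example = io.StringIO(
    "chr1\t0\t2300000\tp36.33\tgneg\n"
    "chr1\t2300000\t5400000\tp36.32\tgpos25\n"
)
bands = read_bands(example)
```

After conversion, adjacent bands no longer share a boundary value: the first band ends at 2300000 and the second starts at 2300001.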

mavis.annotate.file_io.parse_annotations_json(data, reference_genome=None, best_transcripts_only=False, warn=<mavis.util.Log object>): parses a JSON of annotation information into annotation objects