Contents¶
About¶
MAVIS is python command-line tool for the post-processing of structural variant calls. The general MAVIS pipeline consists of six main stages
Getting Help¶
All steps in the MAVIS pipeline are called following the main mavis entry point. The usage menu can be viewed by running without any arguments, or by giving the -h/–help option
mavis -h
Help sub-menus can be found by giving the pipeline step followed by no arguments or the -h options
mavis cluster -h
Common problems and questions are addressed on the wiki. If you have a question or issue that is not answered there (or already a github issue) please submit a github issue to our github page or contact us by email at mavis@bcgsc.ca
Install Instructions¶
There are 3 major steps to setting up and installing MAVIS.
1. Install Aligner¶
In addition to the python package dependencies, MAVIS also requires an aligner to be installed. Currently the only aligners supported are blat and bwa mem. For MAVIS to run successfully the aligner must be installed and accessible on the path. If you have a non-standard install you may find it useful to edit the PATH environment variable. For example
export PATH=/path/to/directory/containing/blat/binary:$PATH
blat is the default aligner. To configure MAVIS to use bwa mem as a default instead, use the MAVIS environment variables. Make sure to specify BOTH of the variables below to change the default aligner.
export MAVIS_ALIGNER='bwa mem'
export MAVIS_ALIGNER_REFERENCE=/path/to/mem/fasta/ref/file
After this has been installed MAVIS itself can be installed through pip
2. Install MAVIS¶
Install using pip¶
The easiest way to install MAVIS is through the python package manager, pip. If you do not have python3 installed it can be found here
Ensuring you have a recent version of pip and setuptools will improve the install experience. Older versions of pip and setuptools may have issues with obtaining some of the mavis python dependencies
pip install --upgrade pip setuptools
or (for Anaconda users)
conda update pip setuptools
If this is not a clean/new python install it may be useful to set up mavis in a virtual python environment
Then install mavis itself
pip install mavis
This will install mavis and its python dependencies.
Install using Buildout¶
Alternatively you can use the bootstrap/buildout to install mavis into bin/mavis
git clone https://github.com/bcgsc/mavis.git
cd mavis
pip install zc.buildout
python bootstrap.py
bin/buildout
This will install mavis and its python dependencies into eggs inside the cloned mavis directory which can be used by simply running bin/mavis
3. Build or Download Reference Files¶
After MAVIS is installed the reference files must be generated (or downloaded) before it can be run. A simple bash script to download the hg19 reference files and generate a MAVIS environment file is provided under mavis/tools for convenience.
cd /path/to/where/you/want/to/put/the/files
wget https://raw.githubusercontent.com/bcgsc/mavis/master/tools/get_hg19_reference_files.sh
bash get_hg19_reference_files.sh
source reference_inputs/hg19_env.sh
Once the above 3 steps are complete MAVIS is ready to be run. See the MAVIS tutorial to learn about running MAVIS.
Build the Sphinx Documentation¶
pip install .[docs]
sphinx-build docs/source/ html
Deploy to PyPi¶
Install m2r to ensure the README is converted nicely
pip install m2r
Install to build the egg
python setup.py install
Build the other distribution files
python setup.py sdist
Use twine to upload
twine upload -r pypi dist/*
Citation¶
If you use MAVIS as a part of your project please cite
Inputs¶
Reference Input Files¶
There are several reference files that are required for full functionality of the MAVIS pipeline. If the same reference file will be reused often then the user may find it helpful to set reasonable defaults. Default values for any of the reference file arguments can be configured through environment variables.
To improve the install experience for the users, different configurations of the MAVIS annotations file have been made available. These files can be downloaded below, or if the required configuration is not available, instructions on generating the annotations file can be found below.
File Name (Type/Format) | Environment Variable | Download |
---|---|---|
reference genome (fasta) | MAVIS_REFERENCE_GENOME |
|
annotations (JSON) | MAVIS_ANNOTATIONS |
|
masking (text/tabbed) | MAVIS_MASKING |
|
template metadata (text/tabbed) | MAVIS_TEMPLATE_METADATA |
|
DGV annotations (text/tabbed) | MAVIS_DGV_ANNOTATION |
|
aligner reference | MAVIS_ALIGNER_REFERENCE |
If the environment variables above are set they will be used as the default values when any step of the pipeline script is called (including generating the template config file)
Reference Genome¶
These are the sequence files in fasta format that are used in aligning and generating the fusion sequences.
Template Metadata¶
This is the file which contains the band information for the chromosomes. This is only used during visualization.
The structure of the file should look something like this
chr1 0 2300000 p36.33 gneg
chr1 2300000 5400000 p36.32 gpos25
chr1 5400000 7200000 p36.31 gneg
chr1 7200000 9200000 p36.23 gpos25
chr1 9200000 12700000 p36.22 gneg
Masking File¶
The masking file is a tab delimited file which contains regions that we should ignore calls in. This can be used to filter out regions with known false positives, bad mapping, centromeres, telomeres etc. An example of the expected format is shown below. The file should have four columns: chr, start, end and name.
#chr start end name
chr1 0 2300000 centromere
chr1 9200000 12700000 telomere
The pre-built masking files in the downloads table above are telomere regions, centromere regions (based on the cytoband file), and nspan regions (computed with tools/find_repeats.py).
Masking is not required (can provide a header-only file), but is recommended as it will improve performance and specificity.
Annotations¶
This is a custom file format. It is a JSON file which contains the gene, transcript, exon, translation and protein domain positional information
Pre-built annotation files can be downloaded above. The ‘best transcript’ flag is based on an in-house model. We have also pre-built the ensembl annotations file including non-coding transcripts below.
Warning
It is worth noting that using the reference annotation file including the non-coding genes will require an increase in the default amount of memory for the annotation step due to the increased size of the annotations file. On our standard COLO829 we increased the default memory for the annotation step from 12G to 18G.
Warning
the load_reference_genes()
will
only load valid translations. If the cds sequence in the annotation is not
a multiple of CODON_SIZE
or if a
reference genome (sequences) is given and the cds start and end are not
M and * amino acids as expected the translation is not loaded
Example of the JSON file structure can be seen below
[
{
"name": string,
"start": int,
"end": int
"aliases": [string, string, ...],
"transcripts": [
{
"name": string,
"start": int,
"end": int,
"exons": [
{"start": int, "end": int, "name": string},
...
],
"cdna_coding_start": int,
"cdna_coding_end": int,
"domains": [
{
"name": string,
"regions": [
{"start" aa_start, "end": aa_end}
],
"desc": string
},
...
]
},
...
]
},
...
}
The provided files were generated with Ensembl, however it can be generated from any database with the necessary information so long as the above JSON structure is respected.
Generating the Annotations from Ensembl¶
There is a helper script included with mavis to facilitate generating the custom annotations file from an instance of the Ensembl database. This uses the Ensembl perl api to connect and pull information from the database. This has been tested with both Ensembl69 and Ensembl79.
Instructions for downloading and installing the perl api can be found on the ensembl site
- Make sure the ensembl perl api modules are added to the PERL5LIB environment variable
Also ensure that the tools directory is on the PERL5LIB path so that the TSV.pm module can be found
INSTALL_PATH=$(pwd)
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/bioperl-live
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl/modules
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl-compara/modules
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl-variation/modules
PERL5LIB=${PERL5LIB}:$HOME/ensembl_79/ensembl-funcgen/modules
# include tools/TSV.pm module
PERL5LIB=${PERL5LIB}:$INSTALL_PATH/tools
export PERL5LIB
- Run the perl script
The below instructions are shown running from inside the tools directory to avoid prefixing the script name, but it is not required to be run from here provided the above step has been executed correctly.
you can view the help menu by running
perl generate_ensembl_json.pl
you can override the default parameters (based on hard-coded defaults or environment variable content) by providing arguments to the script itself
perl generate_ensembl_json.pl --best_transcript_file /path/to/best/transcripts/file --output /path/to/output/json/file.json
or if you have configured the environment variables as given in step 2, then simply provide the output path
perl generate_ensembl_json.pl --output /path/to/output/json/file.json
DGV (Database of Genomic Variants) Annotations¶
The DGV annotations file contains regions corresponding to what is found in the database of genomic variants. This is used to annotate events that are found in healthy control samples and therefore may not be of interest if looking for somatic events.
The above (downloads table) files were generated from from DGV and reformatted to have 4 columns after download. We used awk to convert the raw file
awk '{print $2"\t"$3"\t"$4"\t"$1} GRCh37_hg19_variants_2016-05-15.txt > dgv_hg19_variants.tab
Note in hg19 the column is called “name” and in hg38 the column is called “variantaccession”. An example is shown below
#chr start end name
1 1 2300000 nsv482937
1 10001 22118 dgv1n82
1 10001 127330 nsv7879
MAVIS standard input file format¶
These requirements pertain to the columns of input files from the various tools you want to merge. The input files should be tab-delimited text files. Comments at the top of may be included. Comments should begin with two hash marks. They will be ignored when the file is read
## This is a comment
The header row contains the column names and is the first row following the comments (or the first row if no comments are included). Optionally the header row may (or may not) begin with a hash which will be stripped out on read
## This is a comment
## this is another comment
# this is the header row
A simple input file might look as follows
## File created at: 2018-01-02
## Generated by: MAVIS v1.0.0
#break1_chromosome break1_position_start break1_position_end break2_chromosome break2_position_start break2_position_end
X 1234 1234 X 77965 77965
Required Columns¶
- break1_chromosome
- break1_position_start
- break1_position_end (can be the same as break1_position_start)
- break2_chromosome
- break2_position_start
- break2_position_end (can be the same as break2_position_start)
Optional Columns¶
Optional Columns that are not given as input will be added with default (or command line parameter options) during the clustering stage of MAVIS as some are required for subsequent pipeline steps
- break1_strand (defaults to not-specified during clustering)
- break1_orientation (expanded to all possible values during clustering)
- break2_strand (defaults to not-specified during clustering)
- break2_orientation (expanded to all possible values during clustering)
- opposing_strands (expanded to all possible values during clustering)
- stranded (defaults to False during clustering)
- library (defaults to command line library parameter during clustering)
- protocol (defaults to command line protocol parameter during clustering)
- tools (defaults to an empty string during clustering)
Summary by Pipeline Step¶
The different pipeline steps of MAVIS have different input column requirements. These are summarized below (for the pipeline steps which can act as the pipeline start)
column name | cluster | annotate | validate |
---|---|---|---|
break1_chromosome | ✔ | ✔ | ✔ |
break1_position_start | ✔ | ✔ | ✔ |
break1_position_end | ✔ | ✔ | ✔ |
break2_chromosome | ✔ | ✔ | ✔ |
break2_position_start | ✔ | ✔ | ✔ |
break2_position_end | ✔ | ✔ | ✔ |
break1_strand | |||
break1_orientation | ✔ | ✔ | |
break2_strand | |||
break2_orientation | ✔ | ✔ | |
opposing_strands | |||
stranded | |||
library | |||
protocol | |||
tools | |||
event_type | ✔ |
Some native tool outputs are supported and have built in methods to convert to the above format. Any unsupported tools can be used as long as the user converts the tools native output to match the above format.
Supported Dependencies¶
MAVIS integrates with SV callers, job schedulers, and aligners. While some of these dependencies are optional, all currently supported options are detailed below. The versions column in the tables below list all the versions which were tested for each tool. Each version listed is known to be compatible with MAVIS.
Job Schedulers¶
MAVIS can be run locally without a job scheduler (MAVIS_SCHEDULER=LOCAL
) however, due to the computational resources generally required, it is recommended that you
use one of the supported schedulers listed below.
Name | Version(s) | Environment Setting |
---|---|---|
TORQUE | 6.1.2 |
MAVIS_SCHEDULER=TORQUE |
SGE | 8.1.8 |
MAVIS_SCHEDULER=SGE |
SLURM | 17.02.1-2 |
MAVIS_SCHEDULER=SLURM |
Users requiring support for other schedulers may make a request by submitting an issue to our github page. Additionally, developers looking to extend the functionality may submit a pull request (Please see the guidelines for contributors).
MAVIS running locally uses the python concurrent.futures
library to manage jobs.
Aligners¶
Two aligners are supported bwa and blat (default).
Name | Version(s) | Environment Setting |
---|---|---|
blat | 36x2 36 |
MAVIS_ALIGNER=blat |
bwa mem | 0.7.15-r1140 0.7.12 |
MAVIS_ALIGNER='bwa mem' |
Note
When setting the aligner you will also need to set the aligner_reference to match
SV Callers¶
MAVIS supports output from a wide-variety of SV callers. Assumptions are made for each tool based on interpretation of the output and the publications for each tool. The tools and versions currently supported are given below. Versions listed indicate the version of the tool for which output files have been tested as input into MAVIS
MAVIS also supports a general VCF input. It should be noted however that the tool tracked will only be listed as ‘vcf’ then.
Name | Version(s) | MAVIS input | Publication |
---|---|---|---|
BreakDancer | 1.4.5 |
Tools main output file(s) |
[Chen-2009] |
BreakSeq | 2.2 |
work/breakseq.vcf.gz |
[Abyzov-2015] |
Chimerascan | 0.4.5 |
*.bedpe |
[Iyer-2011] |
CNVnator | 0.3.3 |
Tools main output file(s) |
[Abyzov-2011] |
DeFuse | 0.6.2 |
results/results.classify.tsv |
[McPherson-2011] |
DELLY | 0.6.1 0.7.3 |
combined.vcf (converted from bcf) |
[Rausch-2012] |
Manta | 1.0.0 |
{diploidSV,somaticSV}.vcf |
[Chen-2016] |
Pindel | 0.2.5b9 |
Tools main output file(s) |
[Ye-2009] |
Trans-ABySS | 1.4.8 (custom) |
{indels/events_novel_exons,fusions/*}.tsv |
[Robertson-2010] |
Strelka | 1.0.6 |
passed.somatic.indels.vcf |
[Saunders-2012] |
STAR-Fusion | 1.4.0 |
star-fusion.fusion_predictions.abridged.tsv |
[Haas-2017] |
Note
Trans-ABySS: The trans-abyss version used was an in-house dev version. However the output columns are compatible with 1.4.8 as that was the version branched from. Additionally, although indels can be used from both genome and transcriptome outputs of Trans-ABySS, it is reccommended to only use the genome indel calls as the transcriptome indels calls (for versions tested) introduce a very high number of false positives. This will slow down validation. It is much faster to simply use the genome indels for both genome and transcriptome.
DELLY Post-processing¶
Some post-processing on the delly output files is generally done prior to input. The output BCF files are converted to a VCF file
bcftools concat -f /path/to/file/with/vcf/list --allow-overlaps --output-type v --output combined.vcf
Writing A Custom Conversion Script¶
Logic Example - Chimerascan¶
The following is a description of how the conversion script for Chimerascan was generated. While this is a built-in conversion command now, the logic could also have been put in an external script. As mentioned above, there are a number of assumptions that had to be made about the tools output to convert it to the standard mavis format. Assumptions were then verified by reviewing at a series of called events in IGV. In the current example, Chimerascan output has six columns of interest that were used in the conversion
- start3p
- end3p
- strand3p
- start5p
- end5p
- strand5p
The above columns describe two segments which are joined. MAVIS requires the position of the join. It was assumed that the segments are always joined as a sense fusion. Using this assumption there are four logical cases to determine the position of the breakpoints.
i.e. the first case would be: If both strands are positive, then the end of the five-prime segment (end5p) is the first breakpoint and the start of the three-prime segment is the second breakpoint
The logic for all cases is shown in the code below
def _parse_chimerascan(row):
"""
transforms the chimerscan output into the common format for expansion. Maps the input column
names to column names which MAVIS can read
"""
std_row = {}
for retained_column in ['genes5p', 'genes3p']:
if retained_column in row:
std_row['{}_{}'.format(SUPPORTED_TOOL.CHIMERASCAN, retained_column)] = row[retained_column]
if TRACKING_COLUMN not in row:
std_row[TRACKING_COLUMN] = '{}-{}'.format(SUPPORTED_TOOL.CHIMERASCAN, row['chimera_cluster_id'])
std_row.update({COLUMNS.break1_chromosome: row['chrom5p'], COLUMNS.break2_chromosome: row['chrom3p']})
if row['strand5p'] == '+':
std_row[COLUMNS.break1_position_start] = row['end5p']
std_row[COLUMNS.break1_orientation] = ORIENT.LEFT
else:
std_row[COLUMNS.break1_position_start] = row['start5p']
std_row[COLUMNS.break1_orientation] = ORIENT.RIGHT
if row['strand3p'] == '+':
std_row[COLUMNS.break2_position_start] = row['start3p']
std_row[COLUMNS.break2_orientation] = ORIENT.RIGHT
else:
std_row[COLUMNS.break2_position_start] = row['end3p']
std_row[COLUMNS.break2_orientation] = ORIENT.LEFT
std_row[COLUMNS.opposing_strands] = row['strand5p'] != row['strand3p']
return std_row
Calling A Custom Conversion Script¶
Custom conversion scripts can be specified during automatic config generation using the
--external_conversion
option.
Note
Any external conversion scripts must take a -o
option which requires a single
outputfile argument. This outputfile must be the converted file output by the script.
Additionally, the conversion script must be specified by its full path name and have executeable permissions.
In the following example the user has created a custom conversion script my_convert_script.py
which they
are passing an input file named my_input1.txt
.
mavis config --external_conversion my_converted_input1 "my_convert_script.py my_input1.txt ... "
This will then be called during the pipeline step as
my_convert_script.py my_input1.txt ... -o /path/to/output/dir/converted_inputs/my_converted_input1.tab
You can also re-use the same conversion script if you have multiple inputs to convert simply by specifying a different alias
mavis config \
--external_conversion my_converted_input1 "my_convert_script.py my_input1.txt" \
--external_conversion my_converted_input2 "my_convert_script.py my_input2.txt"
General VCF inputs¶
Assuming that the tool outputting the VCF file follows standard conventions, then it is possible to use a general VCF conversion that is not tool-specific. Given the wide variety in content for VCF files, MAVIS makes a number of assumptions and the VCF conversion may not work for all VCFs. In general MAVIS follows the VCF 4.2 specification. If the input tool you are using differs, it would be better to use a custom conversion script.
Assumptions on non-standard INFO fields
PRECISE
if given, Confidence intervals are ignored if given in favour of exact breakpoint calls using pos and END as the breakpoint positionsCT
values if given are representative of the breakpoint orientations.CHR2
is given for all interchromosomal events
Running the Pipeline¶
Running MAVIS using a Job Scheduler¶
The setup step of MAVIS is set up to use a Job Schedulers job scheduler on a compute cluster. will generate submission scripts and a wrapper bash script for the user to execute on their cluster head node.
The MAVIS pipeline is highly configurable. Some pipeline steps (cluster, validate) are optional and can be automatically skipped. The standard pipeline is far-left.
The most common use case is auto-generating a configuration file and then running the pipeline setup step. The pipeline setup step will run clustering and create scripts for running the other steps.
mavis config .... -w config.cfg
mavis setup config.cfg -o /path/to/top/output_dir
This will create the build.cfg configuration file, which is used by the scheduler to submit jobs. To use a particular scheduler you will need to set the MAVIS_SCHEDULER environment variable. After the build configuration file has been created you can run the mavis schedule option to submit your jobs
ssh cluster_head_node
mavis schedule -o /path/to/output_dir --submit
This will submit a series of jobs with dependencies.
Dependency graph of MAVIS jobs for the standard pipeline setup. The notation on the arrows indicates the SLURM setting on the job to add the dependency on the previous job.
Configuring Scheduler Settings¶
There are multiple ways to configure the scheduler settings. Some of the configurable options are listed below
- queue
MAVIS_QUEUE
- memory_limit
MAVIS_MEMORY_LIMIT
- time_limit
MAVIS_TIME_LIMIT
- import_env
MAVIS_IMPORT_ENV
- scheduler
MAVIS_SCHEDULER
For example to set the job queue default using an environment variable
export MAVIS_QUEUE=QUEUENAME
Or it can also be added to the config file manually
[schedule]
queue = QUEUENAME
Troubleshooting Dependency Failures¶
The most common error to occur when running MAVIS on the cluster is a memory or time limit exception. These can be detected by running the schedule step or looking for dependency failures reported on the cluster. The suffix of the job name will be a number and will correspond to the suffix of the job directory.
mavis schedule -o /path/to/output/dir
This will report any failed jobs. For example if this were a crash report for one of the validation jobs we might expect to see something like below in the schedule output
[2018-05-31 13:02:06] validate
MV_<library>_<batch id>-<task id> (<job id>) is FAILED
CRASH: <error from log file>
Any jobs in an error, failed, etc. state can be resubmitted by running mavis schedule with the resubmit flag
mavis schedule -o /path/to/output/dir --resubmit
If a job has failed due to memory or time limits, editing the /path/to/output/dir/build.cfg
file can allow the user to change a job without resetting up and rerunning the other jobs.
For example, below is the configuration for a validation job
[MV_mock-A47933_batch-D2nTiy9AhGye4UZNapAik6]
stage = validate
job_ident = 1691742
name = MV_mock-A47933_batch-D2nTiy9AhGye4UZNapAik6
dependencies =
script = /path/to/output/dir/mock-A47933_diseased_transcriptome/validate/submit.sh
status = FAILED
output_dir = /path/to/output/dir/mock-A47933_diseased_transcriptome/validate/batch-D2nTiy9AhGye4UZNapAik6-{task_ident}
stdout = /path/to/output/dir/mock-A47933_diseased_transcriptome/validate/batch-D2nTiy9AhGye4UZNapAik6-{task_ident}/job-{name}-{job_ident}-{task_ident}.log
created_at = 1527641526
status_comment =
memory_limit = 18000
queue = short
time_limit = 57600
import_env = True
mail_user =
mail_type = NONE
concurrency_limit = None
task_list = 1
2
3
The memory_limit is in Mb and the time_limit is in seconds. Editing the values here will cause the job to be resubmitted with the new values.
Warning
Incorrectly editing the build.cfg file may have unanticipated results and require re-setting up MAVIS to fix.
Generally the user should ONLY edit memory_limit
and time_limit
values.
If memory errors are frequent then it would be better to adjust the default values (trans_validation_memory, validation_memory, time_limit)
MAVIS (Mini) Tutorial¶
This tutorial is based on the data included in the tests folder of MAVIS. The data files are very small and this tutorial is really only intended for testing a MAVIS install. The data here is simulated and results are not representitive of the typical events you would see reported from MAVIS. For a more complete tutorial with actual fusion gene examples, please see the MAVIS (Full) Tutorial below.
The first step is to clone or download a zip of the MAVIS repository (https://github.com/bcgsc/mavis). You will need the tests directory. The tag you check out should correspond to the MAVIS version you have installed
git clone https://github.com/bcgsc/mavis.git
git checkout v2.0.0
mv mavis/tests .
rm -r mavis
Now you should have a folder called tests
in your current directory. You will need to specify the scheduler
if you want to test one that is not the default. For example
export MAVIS_SCHEDULER=LOCAL
Since this is a trivial example, it can easily be run locally. By default MAVIS in local mode will run a maximum of 1 less than the current cpu count processes. If you are running other things on the same machine you may find it useful to set this directly.
export MAVIS_CONCURRENCY_LIMIT=2
The above will limit mavis to running 2 processes concurrently.
Now you are ready to run MAVIS itself. This can be done in two commands (since the config file we are going to use is already built). First set up the pipeline
mavis setup tests/data/pipeline_config.cfg -o output_dir
Now if you run the schedule step (without the submit flag, schedule acts as a checker) you should see something like
mavis schedule -o output_dir/
MAVIS: 1.8.4
hostname: gphost08.bcgsc.ca
[2018-06-01 12:19:31] arguments
command = 'schedule'
log = None
log_level = 'INFO'
output = 'output_dir/'
resubmit = False
submit = False
[2018-06-01 12:19:31] validate
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 is NOT SUBMITTED
[2018-06-01 12:19:31] annotate
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 is NOT SUBMITTED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 is NOT SUBMITTED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 is NOT SUBMITTED
[2018-06-01 12:19:31] pairing
MP_batch-s4W2Go4tinn49nkhSuusrE is NOT SUBMITTED
[2018-06-01 12:19:31] summary
MS_batch-s4W2Go4tinn49nkhSuusrE is NOT SUBMITTED
rewriting: output_dir/build.cfg
Adding the submit argument will start the pipeline
mavis schedule -o output_dir/ --submit
After this completes, run schedule without the submit flag again and you should see something like
MAVIS: 1.8.4
hostname: gphost08.bcgsc.ca
[2018-06-01 13:15:28] arguments
command = 'schedule'
log = None
log_level = 'INFO'
output = 'output_dir/'
resubmit = False
submit = False
[2018-06-01 13:15:28] validate
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 (zQJYndSMimaoALwcSSiYwi) is COMPLETED
MV_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 (BHFVf3BmXVrDUA5X4GGSki) is COMPLETED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 (tUpx3iabCrpR9iKu9rJtES) is COMPLETED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 (hgmH7nqPXZ49a8yTsxSUWZ) is COMPLETED
MV_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 (cEoRN582An3eAGALaSKmpJ) is COMPLETED
[2018-06-01 13:15:28] annotate
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-1 (tMHiVR8ueNokhBDnghXYo6) is COMPLETED
MA_mock-A36971_batch-s4W2Go4tinn49nkhSuusrE-2 (AsNpNdvUyhNtKmRZqRSPpR) is COMPLETED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-1 (k7qQiAzxfC2dnZwsGH7BzD) is COMPLETED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-2 (dqAuhhcVKejDvHGBXn22xb) is COMPLETED
MA_mock-A47933_batch-s4W2Go4tinn49nkhSuusrE-3 (eB69Ghed2xAdp2VRdaCJBf) is COMPLETED
[2018-06-01 13:15:28] pairing
MP_batch-s4W2Go4tinn49nkhSuusrE (6LfEgBtBsmGhQpLQp9rXmi) is COMPLETED
[2018-06-01 13:15:28] summary
MS_batch-s4W2Go4tinn49nkhSuusrE (HDJhXgKjRmseahcQ7mgNoD) is COMPLETED
rewriting: output_dir/build.cfg
run time (hh/mm/ss): 0:00:00
run time (s): 0
If you see the above, then MAVIS has completed correctly!
MAVIS (Full) Tutorial¶
The following tutorial is an introduction to running MAVIS. You will need to download the tutorial data. Additionally the instructions pertain to running MAVIS on a SLURM cluster. This tutorial will require more resources than the MAVIS (Mini) Tutorial above.
Getting the Tutorial Data¶
The tutorial data can be downloaded from the link below. Note that it may take a while as the download is ~29GB
wget http://www.bcgsc.ca/downloads/mavis/tutorial_data.tar.gz
tar -xvzf tutorial_data.tar.gz
The expected contents are
Path | Description |
---|---|
README | Information regarding the other files in the directory |
L1522785992_expected_events.tab | The events that we expect to find, either experimentally validated or ‘spiked’ in |
L1522785992_normal.sorted.bam | Paired normal library BAM file |
L1522785992_normal.sorted.bam.bai | BAM index |
L1522785992_trans.sorted.bam | Tumour transcriptome BAM file |
L1522785992_trans.sorted.bam.bai | BAM index file |
L1522785992_tumour.sorted.bam | Tumour genome BAM file |
L1522785992_tumour.sorted.bam.bai | BAM index file |
breakdancer-1.4.5/ | Contains the BreakDancer output which was run on the tumour genome BAM file |
breakseq-2.2/ | Contains the BreakSeq output which was run on the tumour genome BAM file |
chimerascan-0.4.5/ | Contains the ChimeraScan output which was run on the tumour transcriptome BAM file |
defuse-0.6.2/ | Contains the deFuse output which was run on the tumour transcriptome BAM file |
manta-1.0.0/ | Contains the Manta output which was run on the tumour genome and paired normal genome BAM files |
Downloading the Reference Inputs¶
Run the following to download the hg19 reference files and set up the environment variables for configuring MAVIS
wget https://raw.githubusercontent.com/bcgsc/mavis/master/tools/get_hg19_reference_files.sh
bash get_hg19_reference_files.sh
source reference_inputs/hg19_env.sh
Generating the Config File¶
The config command does most of the work of creating the config for you but there are a few things you need to tell it
- Where your bams are and what library they belong to
--library L1522785992-normal genome normal False tutorial_data/L1522785992_normal.sorted.bam
--library L1522785992-tumour genome diseased False tutorial_data/L1522785992_tumour.sorted.bam
--library L1522785992-trans transcriptome diseased True tutorial_data/L1522785992_trans.sorted.bam
- Where your SV caller output files (events) are
If they are raw tool output as in the current example you will need to use the convert argument to tell MAVIS the file type
--convert breakdancer tutorial_data/breakdancer-1.4.5/*txt breakdancer
--convert breakseq tutorial_data/breakseq-2.2/breakseq.vcf.gz breakseq
--convert chimerascan tutorial_data/chimerascan-0.4.5/chimeras.bedpe chimerascan
--convert defuse tutorial_data/defuse-0.6.2/results.classify.tsv defuse
--convert manta tutorial_data/manta-1.0.0/diploidSV.vcf.gz tutorial_data/manta-1.0.0/somaticSV.vcf manta
Note
For older versions of MAVIS the convert command may require the path to the file(s) be quoted and the strandedness be specified (default is False)
- Which events you should validate in which libraries
For this example, because we want to determine which events are germline/somatic we are going to pass all genome calls to both genomes. We can use either full file paths (if the input is already in the standard format) or the alias from a conversion (the first argument given to the convert option)
--assign L1522785992-trans chimerascan defuse
--assign L1522785992-tumour breakdancer breakseq manta
--assign L1522785992-normal breakdancer breakseq manta
Putting this altogether with a name to call the config, we have the command to generate the pipeline config. You should expect this step with these inputs to take about ~5GB memory.
mavis config \
--library L1522785992-normal genome normal False tutorial_data/L1522785992_normal.sorted.bam \
--library L1522785992-tumour genome diseased False tutorial_data/L1522785992_tumour.sorted.bam \
--library L1522785992-trans transcriptome diseased True tutorial_data/L1522785992_trans.sorted.bam \
--convert breakdancer tutorial_data/breakdancer-1.4.5/*txt breakdancer \
--convert breakseq tutorial_data/breakseq-2.2/breakseq.vcf.gz breakseq \
--convert chimerascan tutorial_data/chimerascan-0.4.5/chimeras.bedpe chimerascan \
--convert defuse tutorial_data/defuse-0.6.2/results.classify.tsv defuse \
--convert manta tutorial_data/manta-1.0.0/diploidSV.vcf.gz tutorial_data/manta-1.0.0/somaticSV.vcf manta \
--assign L1522785992-trans chimerascan defuse \
--assign L1522785992-tumour breakdancer breakseq manta \
--assign L1522785992-normal breakdancer breakseq manta \
-w mavis.cfg
Setting Up the Pipeline¶
The next step is running the setup stage. This will perform conversion, clustering, and creating the submission scripts for the other stages.
mavis setup mavis.cfg -o output_dir/
At this stage you should have something that looks like this. For simplicity not all files/directories have been shown.
output_dir/
|-- build.cfg
|-- converted_inputs
| |-- breakdancer.tab
| |-- breakseq.tab
| |-- chimerascan.tab
| |-- defuse.tab
| `-- manta.tab
|-- L1522785992-normal_normal_genome
| |-- annotate
| | |-- batch-aUmErftiY7eEWvENfSeJwc-1/
| | `-- submit.sh
| |-- cluster
| | |-- batch-aUmErftiY7eEWvENfSeJwc-1.tab
| | |-- cluster_assignment.tab
| | |-- clusters.bed
| | |-- filtered_pairs.tab
| | `-- MAVIS-batch-aUmErftiY7eEWvENfSeJwc.COMPLETE
| `-- validate
| |-- batch-aUmErftiY7eEWvENfSeJwc-1/
| `-- submit.sh
|-- pairing
| `-- submit.sh
`-- summary
`-- submit.sh
Submitting Jobs to the Cluster¶
The last step is simple, ssh to your head node of your SLURM cluster (or run locally if you have configured remote_head_ssh) and run the schedule step. This will submit the jobs and create the dependency chain
ssh head_node
mavis schedule -o output_dir --submit
The schedule step also acts as a built-in checker and can be run to check for errors or if the pipeline has completed.
mavis schedule -o output_dir
This should give you output something like below (times may vary) after your run completed correctly.
MAVIS: 2.0.0
hostname: gphost08.bcgsc.ca
[2018-06-02 19:47:56] arguments
command = 'schedule'
log = None
log_level = 'INFO'
output = 'output_dir/'
resubmit = False
submit = False
[2018-06-02 19:48:01] validate
MV_L1522785992-normal_batch-aUmErftiY7eEWvENfSeJwc (1701000) is COMPLETED
200 tasks are COMPLETED
run time: 609
MV_L1522785992-tumour_batch-aUmErftiY7eEWvENfSeJwc (1701001) is COMPLETED
200 tasks are COMPLETED
run time: 669
MV_L1522785992-trans_batch-aUmErftiY7eEWvENfSeJwc (1701002) is COMPLETED
23 tasks are COMPLETED
run time: 1307
[2018-06-02 19:48:02] annotate
MA_L1522785992-normal_batch-aUmErftiY7eEWvENfSeJwc (1701003) is COMPLETED
200 tasks are COMPLETED
run time: 622
MA_L1522785992-tumour_batch-aUmErftiY7eEWvENfSeJwc (1701004) is COMPLETED
200 tasks are COMPLETED
run time: 573
MA_L1522785992-trans_batch-aUmErftiY7eEWvENfSeJwc (1701005) is COMPLETED
23 tasks are COMPLETED
run time: 537
[2018-06-02 19:48:07] pairing
MP_batch-aUmErftiY7eEWvENfSeJwc (1701006) is COMPLETED
run time: 466
[2018-06-02 19:48:07] summary
MS_batch-aUmErftiY7eEWvENfSeJwc (1701007) is COMPLETED
run time: 465
parallel run time: 3545
rewriting: output_dir/build.cfg
run time (hh/mm/ss): 0:00:11
run time (s): 11
The parallel run time reported corresponds to the sum of the slowest job for each stage and does not include any queue time etc.
Configuration and Settings¶
Pipeline Configuration File¶
The pipeline can be run in steps or it can be configured using a configuration file and setup in a single step. Scripts will be generated to run all steps following clustering. The configuration file can be built from scratch or a template can be output as shown below
>>> mavis config --write template.cfg
This will create a template config file called template.cfg which can then be edited by the user. However this will be a simple config with no library information. To generate a configuration file with the library information as well as estimates for the fragment size parameters more inputs are required (see generating the config file for more information).
Environment Variables¶
Most of the default settings can be changed by using environment variables. The value given by the environment variables will be used as the new default. Config or command-line parameters will still override these settings.
All environment variables are prefixed with MAVIS and an underscore. Otherwise the variable name is the same as that used for the command line parameter or config setting (uppercased). For example to change the default minimum mapping quality used during the validate stage
>>> export MAVIS_MIN_MAPPING_QUALITY=10
Adjusting the Resource Requirements¶
Choosing the Number of Validation/Annotation Jobs¶
MAVIS chooses the number of jobs to split validate/annotate stages into based on two settings: max_files and min_clusters_per_file.
For example, in the following situation say you have: 1000 clusters, max_files=10
, and min_clusters_per_file=10
. Then
MAVIS will set up 10 validation jobs each with 100 events.
However, if min_clusters_per_file=500
, then MAVIS would only set up 2 jobs each with 500 events. This is because
min_clusters_per_file takes precedence over max_files.
Splitting into more jobs will lower the resource requirements per job (see resource requirements). The memory and time requirements for validation are linear with respect to the number of events to be validated.
Uninformative Filter¶
For example, if the user is only interested in events in genes, then the uninformative_filter can be used. This will drop all events that are not within a certain distance (max_proximity) to any annotation in the annotations reference file. These events will be dropped prior to the validation stage which results in significant speed up.
This can be set using the environment variable
export MAVIS_UNINFORMATIVE_FILTER=True
or in the pipeline config file
[cluster]
uninformative_filter = True
or as a command line argument to the cluster stage
mavis cluster --uninformative_filter True ....
Resource Requirements¶
MAVIS has been tested on both unix and linux systems. For the standard pipeline, the validation stage is the most computationally expensive. The memory and cpu requirements will vary with two main factors: the number of structural variants you are validating per job, and the size of the bam file you are validating against.
There are a number of settings that can be adjusted to reduce memory and cpu requirements depending on what the user is trying to analyze. See configuration and settings for more details.
Validation Resources¶

Resource Requirements (MAVIS 1.8.0) for each validation job of the COLO829 tumour genome. The BAM file for the tumour genome is 127GB. Validation jobs were tested splitting into: 100, 500, 1000, and 2500 structural variant validations per job. The effect of number of events validated on both memory and time is plotted above.
Annotation Resources¶
Similar trends were observed for the annotation step (see below) with regards to time elapsed. However the memory requirements remained more constant which is expected since, unlike validation, anntotation does not read more data in for more events.
Theory and Models¶
Introduction¶
In MAVIS structural variants (SVs) are defined by a pair of breakpoints
And a breakpoint is defined by
- chromosome
- base-pair range (start, end). This has a length of 1 for exact calls and more for uncertain/non-specific calls
- orientation. This is Left or Right with respect to the positive/forward strand. This defines which portion of the genome is ‘retained’
- strand. (only applicable to stranded transcriptome libraries)
So then a breakpoint pair is any two intervals on the reference genome which are adjacent in the mutant genome
Evidence¶
There are many ways that single reads or paired-end reads can act as support for an SV call.
In the figure above the red rectangle represents a deletion structural variant. The arrows are types of single or paired-end reads supporting the event: flanking read pairs (F), split reads (S), half-mapped reads (H), and spanning reads (N).
Types of Flanking evidence¶
One of the most confusing parts about working with contig and paired-end reads is relating them to the breakpoint so that you can determine which types will support an event. The flanking read types we outline here are similarly described by IGV. We have used similar coloring for the read pairs in the following diagrams to facilitate ease of use for those already familiar with viewing bam files in IGV.
Note
The major assumptions here are that the ‘normal’ read-pair is a read pair which has one read on the positive/forward strand and its partner on the negative/reverse strand. It is assumed that partners share a read name, as is the case for Illumina reads.
Deletion¶
For a deletion, we expect the flanking reads to be in the normal orientation but that the fragment size should be abnormal (for large deletions).
Flanking read pair evidence for a deletion event. the read pairs will have a larger than expected fragment size when mapped to the reference genome because in the mutant genome they are closer together, owing to the deletion event. (B1) The first breakpoint which has a left orientation. (B2) The second breakpoint which has a right orientation. Both breakpoints would be on the positive strand (assuming that the input is stranded) which means that the first read in the pair would be on the positive strand and the second read in the pair would be on the negative/reverse strand.
Insertion¶
Flanking read pair evidence for an insertion event. The read pairs will have a smaller than expected fragment size when mapped to the reference genome because in the mutant genome they are father apart, owing to the insertion event. (B1) The first breakpoint which has a left orientation. (B2) The second breakpoint which has a right orientation. Both breakpoints would be on the positive strand (assuming that the input is stranded) which means that the first read in the pair would be on the positive strand and the second read in the pair would be on the negative/reverse strand.
Duplication¶
Flanking read pair evidence for a tandem duplication event. The read pairs will have an abnormal orientation but still the same strands as the normal read pair. (B1) The first breakpoint will be on the positive strand and have a right orientation. (B2) The second breakpoint will be on the positive strand and have a left orientation.
Inversion¶
Translocation¶
Compatible Flanking Pairs¶
For insertion and duplication events compatible flanking pairs are collected. This means that flanking pairs that support a duplication may be used as compatible flanking evidence for an insertion (in the same region) and similarly flanking pairs which support an insertion may be compatible flanking evidence for a duplication
The event depicted above may be called as either a duplication or an insertion (depending on the input call). If the even were called as a duplication the reads in green would be the flanking supoprt and the reads in blue would be given as compatible flanking support. If the event were called as an insertion the reverse would apply.
Calculating the Evidence Window¶
Basic Terms used in describing read pairs are shown above: fragment size: the distance between the pair; read length: the length of the read; fragment size: the combined length of both reads and the fragment size
We make some base assumptions with regards to paired-end read data:
Note
the distribution of fragment sizes approximately follows a normal distribution
Note
the most common fragment size is the unmutated ‘normal’ fragment
With the above assumptions we take the median fragment size to be the expected normal.
Given that we expect mutations and therefore abnormal fragment sizes, we use a modified method to calculate the median standard deviation (see code below). We calculate the squared distance away from the median for each fragment and then take a fraction of this to be ‘normal’ variation. So the most abnormal portion is ignored, assuming it is supposed to be abnormal. This results in a calculation as follows.
from statistics import median
import math
fragments = [abs(read.template_length) for read in reads] # the fragment sizes of the reads
f = 0.95 # fraction
m = median(fragments) # get the median fragment size value
X = [math.pow(i - m, 2) for i in fragments] # take the square error for each point
end = int(round(len(X) * f))
X = sorted(X)[0:end]
stdev = math.sqrt(sum(X) / len(X))
This gives us an idea of when to judge an fragment size as abnormal and where we expect our normal read pairs fragment sizes to fall.
Distribution of fragment sizes (absolute values) of proper read pairs. The black curve representings the fit for a normal distribution using the standard deviation calculated with all data points. The blue curve is the expected distribution using a 0.95 fraction of the data. The thick vertical black line is the median and the thin black lines are standard deviations away from the median.
As we can see from the diagram above, removing the outliers reproduces the observed distribution better than using all data points
We use this in two ways
- to find flanking evidence supporting deletions and insertions
- to estimate the window size for where we will need to read from the bam when looking for evidence for a given event
The _generate_window()
function uses the above concepts.
The user will define the median_fragment_size
the stdev_fragment_size
, and the stdev_count_abnormal
parameters defined in the VALIDATION_DEFAULTS
class.
If the library has a transcriptome protocol this becomes a bit more complicated and we must take into account the
possible annotations when calculating the evidence window. see
_generate_window()
for more
Calling Breakpoints by Flanking Evidence¶
Breakpoints are called by contig, split-read, or flanking pairs evidence. Contigs and split reads are used to call exact breakpoints, where breakpoints called by flanking reads are generally assigned a probabilistic range.
The metrics used here are similar to those used in calculating the evidence window. We use the max_expected_fragment_size as the outer limit of how large the range can be. This is further refined taking into account the range spanned by the flanking read pair evidence and the position of the opposing breakpoint.
Calculation of the left-oriented breakpoint by flanking reads. Reads mapped to the breakpoint are shown in grey. The read on the right (black outline, no fill) demonstrates the read length used to narrow the right side bound of the estimated breakpoint interval.
Determining Flanking support¶
After a breakpoint has been called we can narrow the interval of expected fragment sizes using the size of the event. (Left) The colored portion of the graph represents the range in fragment sizes we expect for a normal/unmutated genome. (Right) For a deletion event we expect the fragment size to be bigger, outside the normal range. Reads that would flank the breakpoint should follow a similar distribution to the normal genome but the median will be shifted by the size of the event. The shaded portion of the graph represents the range in fragment sizes we expect for flanking pairs supporting the deletion event.
Classifying Events¶
The following decision tree is used in classifying events based on their breakpoints. Only valid combinations have
been shown. see classify()
Classification Decision Tree. The above diagram details the decsion logic for classifying events based on the orientation, strand and chromosomes or their respective breakpoints
Assembling Contigs¶
During validation, for each breakpoint pair, we attempt to assemble a contig to represent the sequence across the breakpoints. This is assembled from the supporting reads (split read, half-mapped read, flanking read pair, and spanning read) which have already been collected for the given event. The sequence from each read and its reverse complement are assembled into contigs using a DeBruijn graph. For strand specific events, we then attempt to resolve the sequence strand of the contig.
Annotating Events¶
We make the following assumptions when determining the annotations for each event
Note
If both breakpoints are in the same gene, they must also be in the same transcript
Note
If the breakpoint intervals overlap we do not annotate encompassed genes
Note
Encompassed and ‘nearest’ genes are reported without respect to strand
There are specific questions we want annotation to answer. We collect gene level annotations which describes things like what gene is near the breakpoint (useful in the case of a potential promoter swap); what genes (besides the one selected) also overlap the breakpoint; what genes are encompassed between the breakpoints (for example in a deletion event the genes that would be deleted).
Gene level annotations at each breakpoint. Note: genes which fall between a breakpoint pair, encompassed genes, will not be present for interchromosomal events (translocations)
Next there are the fusion-product level annotations. If the event result in a fusion transcript, the sequence of the fusion transcript is computed. This is translated to a putative amino acid sequence from which protein metrics such as the possible ORFs and domain sequences can be computed.
Predicting Splicing Patterns¶
After the events have been called and an annotation has been attached, we often want to predict information about the putative fusion protein, which may be a product. In some cases, when a fusion transcript disrupts a splice-site, it is not clear what the processed fusion transcript may be. MAVIS will calculate all possibilities according to the following model.
For a given list of non-abrogated splice sites (listed 5’ to 3’ on the strand of the transcript) donor splice sites are paired with all following as seen below
Multiple abrogated acceptors sites. As one can see above this situation will result in 3 different splicing patterns depending on which donor is paired with the 2nd acceptor site
Multiple abrogated donor sites. As one can see above this situation will result in 3 different splicing patterns depending on which acceptor is paired with the 2nd donor site
More complex examples are drawn below. There are five classifications (SPLICE_TYPE
) for the different splicing patterns:
- Retained intron (
RETAIN
) - Skipped exon (
SKIP
) - Multiple retained introns (
MULTI_RETAIN
) - Multiple skipped exons (
MULTI_SKIP
) - Some combination of retained introns and skipped exons (
COMPLEX
)
Pairing Similar Events¶
After breakpoints have been called and annotated we often need to see if the same event was found in different samples. To do this we will need to compare events between genome and transcriptome libraries. For this, the following model is proposed. To compare events between different protocol (genome vs transcriptome) we use the annotation overlying the genome breakpoint and the splicing model we defined above to predict where we would expect to find the transcriptomic breakpoints. This gives rise to the following basic cases.
Note
In all cases the predicted breakpoint is either the same as the genomic breakpoint, or it is the same as the nearest retained donor/acceptor to the breakpoint.
(A-D) The breakpoint lands in an exon and the five prime portion of the transcript is retained. (A) The original splicing pattern showing the placement of the genomic breakpoint and the retained five prime portion. (B) The first splice site following the breakpoint is a donor and the second donor is used. (C) The first splice site following the breakpoint is a donor and the first donor is used. (D) The first splice site following the breakpoint is an acceptor. (E-H) The breakpoint lands in an exon and the three prime portion of the transcript is retained. (E) The original splicing pattern showing the placement of the genomic breakpoint and the retained three prime portion. (F) The first splice site prior to the breakpoint is an acceptor and the first acceptor is used. (G) The first splice site prior to the breakpoint is an acceptor and the second acceptor is used. (H) The first splice site prior to the breakpoint is a donor
Literature¶
[Abyzov-2011] | Abyzov,A. et al. (2011) CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res., 21, 974–984. |
[Abyzov-2015] | Abyzov,A. et al. (2015) Analysis of deletion breakpoints from 1,092 humans reveals details of mutation mechanisms. Nat. Commun., 6, 7256. |
[Chen-2009] | Chen,K. et al. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods, 6, 677–681. |
[Chen-2016] | Chen,X. et al. (2016) Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics, 32, 1220–1222. |
[den-Dunnen-2016] | den Dunnen,J.T. et al. (2016) HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Hum. Mutat., 37, 564–569. |
[Iyer-2011] | Iyer,M.K. et al. (2011) ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics, 27, 2903–2904. |
[MacDonald-2014] | MacDonald,J.R. et al. (2014) The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res., 42, D986–92. |
[McPherson-2011] | McPherson,A. et al. (2011) deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput. Biol., 7, e1001138. |
[Rausch-2012] | Rausch,T. et al. (2012) DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28, i333–i339. |
[Robertson-2010] | Robertson,G. et al. (2010) De novo assembly and analysis of RNA-seq data. Nat. Methods, 7, 909–912. |
[Yates-2016] | Yates,A. et al. (2016) Ensembl 2016. Nucleic Acids Res., 44, D710–D716. |
[Ye-2009] | Ye,K. et al. (2009) Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25, 2865–2871. |
[Saunders-2012] | Saunders,C.T. et al. (2012) Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics, 28, 1811–1817. |
[Haas-2017] | Haas,B et al. (2017) STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq. doi: https://doi.org/10.1101/120295 |
Illustrations¶
Fusion Diagrams¶
These are diagrams produced during the annotate step. These represent the putative fusion events of a single breakpoint pair.
Fusion from transcriptome data. Intronic breakpoints here indicate retained intron sequence and a novel exon is predicted.
If the draw_fusions_only flag is set to False then all events will produce a diagram, even anti-sense fusions
Transcript Overlays¶
MAVIS supports generating diagrams of all transcripts for a given gene. These can be overlaid with markers and bam_file pileup data. This is particularly useful for visualizing splice site mutations.
The above diagram was generated using the overlay command
mavis overlay RB1 \
-o /path/to/output/dir \
--read_depth_plot rna /path/to/bam/file \
--marker M1 48939029 \
--annotations /path/to/mavis/annotations/reference/file
Guidelines for Contributors¶
Getting Started¶
If you are new to the project a good way to get started is by adding to the documentation, or adding unit tests where there is a lack of code coverage.
Installing for Development (new to python projects?)¶
Clone the repository and switch to the development branch
git clone https://github.com/bcgsc/mavis.git
cd mavis
git checkout develop
Set up a python virtual environment. If you are developing in python setting up with a virtual environment can be incredibly helpful as it allows for a clean install to test. Instructions for setting up the environment are below
pip install virtualenv
virtualenv venv
source venv/bin/activate
Install the MAVIS python package. Running the setup in develop mode will ensure that your code changes are run when you run MAVIS from within that virtual environment
python setup.py develop
Run the unit tests and compute code coverage
python setup.py nosetests
Make the user manual (optional)
cd docs
make html
The contents of the user manual can then be viewed by opening the build/html/index.html in any available web browser (i.e. google-chrome, firefox, etc.)
Reporting a Bug¶
Please make sure to search through the issues before reporting a bug to ensure there isn’t already an open issue.
Coding Conventions¶
Formatting/Style¶
- In general, follow pep8 style guides (except maximum line width)
- docstrings should follow sphinx google code style
- any column name which may appear in any of the intermediate or final output files must be defined in
mavis.constants.COLUMNS
Types in docstrings¶
if you want to be more explicit with nested types, the following conventions are used throughout the code
- dictionary:
d = {<key>: <value>}
becomesdict of <value> by <key>
- list:
l = [1, 2, 3]
becomeslist of int
- mixed:
d = {'a': [1, 2, 3], 'b': [4, 5, 6]}
becomesdict of list of int by str
- tuples:
('a', 1)
becomestuple of str and int
Tests¶
- all new code must have unit tests in the tests subdirectory
- in general for assertEqual statements, the expected value is given first
Major Assumptions¶
Some assumptions have been made when developing this project. The major ones have been listed here to facilitate debugging/development if any of these are violated in the future.
- The input bam reads have stored the sequence wrt to the positive/forward strand and have not stored the reverse complement.
- The distribution of the fragment sizes in the bam file approximately follows a normal distribution.
Current Limitations¶
- Assembling contigs will always fail for repeat sequences as we do not resolve this. Unlike traditional assemblies we cannot assume even input coverage as we are taking a select portion of the reads to assemble.
- Currently no attempt is made to group/pair single events into complex events.
- Transcriptome validation uses a collapsed model of all overlapping transcripts and is not isoform specific. Allowing for isoform specific validation would be computationally expensive but may be considered as an optional setting for future releases.
Computing Code coverage¶
Since MAVIS uses multiple processes, it adds complexity to computing the code coverage. Running coverage normally will undereport. To ensure that the coverage module captures the information from the subprocesses we need to do the following
In our development python virtual environment put a coverage.pth file (ex. venv/lib/python3.6/site-packages/coverage.pth
) containing the following
import coverage; coverage.process_startup()
Additionally you will need to set the environment variable
export COVERAGE_PROCESS_START=/path/to/mavis/repo/mavis/.coveragerc
Package Documentation¶
holds submodules related to structural variants
align module¶
Should take in a sam file from a aligner like bwa aln or bwa mem and convert it into a
-
mavis.align.
SUPPORTED_ALIGNER
= MavisNamespace(BLAT='blat', BWA_MEM='bwa mem')¶ supported aligners
Type: MavisNamespace
-
class
mavis.align.
SplitAlignment
(*pos, **kwargs)[source]¶ Bases:
mavis.breakpoint.BreakpointPair
-
query_consumption
()[source]¶ fraction of the query sequence which is aligned (everything not soft-clipped) in either alignment
-
-
mavis.align.
align_sequences
(sequences, input_bam_cache, reference_genome, aligner, aligner_reference, aligner_output_file='aligner_out.temp', aligner_fa_input_file='aligner_in.fa', aligner_output_log='aligner_out.log', blat_limit_top_aln=25, blat_min_identity=0.7, clean_files=True, log=<mavis.util.Log object>, **kwargs)[source]¶ calls the alignment tool and parses the return output for a set of sequences
Parameters: - sequences (dict of str to str) – dictionary of sequences by name
- input_bam_cache (BamCache) – bam cache to be used as a template for reading the alignments
- reference_genome – the reference genome
- aligner (SUPPORTED_ALIGNER) – the name of the aligner to be used
- aligner_reference (str) – path to the aligner reference file
-
mavis.align.
call_paired_read_event
(read1, read2, is_stranded=False)[source]¶ For a given pair of reads call all applicable events. Assume there is a major event from both reads and then call indels from the individual reads
-
mavis.align.
call_read_events
(read, secondary_read=None, is_stranded=False)[source]¶ Given a read, return breakpoint pairs representing all putative events
-
mavis.align.
convert_to_duplication
(alignment, reference_genome)[source]¶ Given a breakpoint call, tests if the untemplated sequences matches the preceding reference sequence. If it does this is annotated as a duplication and the new breakpoint pair is returned. If not, then the original breakpoint pair is returned
-
mavis.align.
get_aligner_version
(aligner)[source]¶ executes a subprocess to try and run the aligner without arguments and parse the version number from the output
Example
>>> get_aligner_version('blat') '36x2'
annotate subpackage¶
Sub-package Documentation¶
Types of Output Files¶
expected name/suffix | file type/format | content |
---|---|---|
annotations.tab |
text/tabbed | annotated events |
annotations.fusion-cdna.fa |
fasta | putative fusion unspliced cDNA sequences |
drawings/*.svg |
SVG | diagrams |
drawings/*.legend.json |
JSON | diagram legend/metadata |
Algorithm Overview¶
see theory - annotating events
- read in breakpoint pairs
- generate strand-specific annotations (one annotation per strand, multiple if multiple genes/transcripts in the region)
- try building fusion transcripts for bp-specific calls
- generate SVG diagrams
Levels of Annotations¶
Overview of Class Relationships¶
The Annotation sub-package has objects for genetic annotations and related calculations. The basic layout of the
package is shown above. IS-A relationships are given by the blue arrows. HAS-A relationships are shown in black.
And reference_object/parent
type relationships are shown in red. Gene
is a gene. Start and end are
genomic positions wrt to the template/chr. PreTranscript
is the
unspliced transcript. Start and end are genomic positions wrt to the template/chr.
Transcript
: is the spliced transcript. Start and end coordinates are
1 to the length of the spliced product in base pairs.
Translation
: is the translation of the spliced transcript. Start and
end are cdna positions wrt the 5’ end of the spliced transcript. The start and end here describe the start and end
of the coding sequence
modules
base module¶
-
class
mavis.annotate.base.
BioInterval
(reference_object, start, end=None, name=None, seq=None, data=None, strand=None)[source]¶ Bases:
object
Parameters: Example
>>> b = BioInterval('1', 12572784, 12578898, 'q22.2') >>> b[0] 12572784 >>> b[1] 12578898
-
end
¶ the end position
Type: int
-
get_chr
()[source]¶ pulls chromosome information from the current object, or follows reference objects until the chromosome is found
Returns: the chromosome of this or any of its reference objects Return type: str Raises: AttributeError
– raised if the chromosome is not set on this or any of its reference objects
-
get_seq
(reference_genome=None, ignore_cache=False)[source]¶ get the sequence for the current annotation object
Raises: NotImplementedError
– abstract method
-
get_strand
()[source]¶ pulls strand information from the current object, or follows reference objects until the strand is found
Returns: the strand of this or any of its reference objects Return type: STRAND Raises: AttributeError
– raised if the strand is not set on this or any of its reference objects
-
is_reverse
¶ True if the gene is on the reverse/negative strand.
Raises: AttributeError
– if the strand is not specified
-
key
()[source]¶ tuple
: a tuple representing the items expected to be unique. for hashing and comparing
-
start
¶ the start position
Type: int
-
constants module¶
-
mavis.annotate.constants.
DEFAULTS
= WeakMavisNamespace(annotation_filters='choose_more_annotated,choose_transcripts_by_priority', draw_fusions_only=True, draw_non_synonymous_cdna_only=True, max_orf_cap=3, min_domain_mapping_match=0.9, min_orf_size=300)¶
-
mavis.annotate.constants.
SPLICE_SITE_RADIUS
= 2¶ number of bases away from an exon boundary considered to be part of the splice site such that if it were altered the splice site would be considered to be abrogated.
Type: int
-
mavis.annotate.constants.
SPLICE_TYPE
= MavisNamespace(COMPLEX='complex', MULTI_RETAIN='retained multiple introns', MULTI_SKIP='skipped multiple exons', NORMAL='normal', RETAIN='retained intron', SKIP='skipped exon')¶ holds controlled vocabulary for allowed splice type classification values
RETAIN
: an intron was retainedSKIP
: an exon was skippedNORMAL
: no exons were skipped and no introns were retained. the normal/expected splicing pattern was followedMULTI_RETAIN
: multiple introns were retainedMULTI_SKIP
: multiple exons were skippedCOMPLEX
: some combination of exon skipping and intron retention
Type: MavisNamespace
file_io module¶
module which holds all functions relating to loading reference files
-
class
mavis.annotate.file_io.
ReferenceFile
(file_type, *filepaths, eager_load=False, assert_exists=False, **opt)[source]¶ Bases:
object
Parameters: - Raises
- FileNotFoundError: when assert_exists and an input does not exist
-
CACHE
= {}¶
-
LOAD_FUNCTIONS
= {'aligner_reference': None, 'annotations': <function load_annotations>, 'dgv_annotation': <function load_masking_regions>, 'masking': <function load_masking_regions>, 'reference_genome': <function load_reference_genome>, 'template_metadata': <function load_templates>}¶ Mapping of file types (based on ENV name) to load functions
Type: dict
-
mavis.annotate.file_io.
convert_tab_to_json
(filepath, warn=<mavis.util.Log object>)[source]¶ given a file in the std input format (see below) reads and return a list of genes (and sub-objects)
column name example description ensembl_transcript_id ENST000001 ensembl_gene_id ENSG000001 strand -1 positive or negative 1 cdna_coding_start 44 where translation begins relative to the start of the cdna cdna_coding_end 150 where translation terminates genomic_exon_ranges 100-201;334-412;779-830 semi-colon demitited exon start/ends AA_domain_ranges DBD:220-251,260-271 semi-colon delimited list of domains hugo_names KRAS hugo gene name Parameters: filepath (str) – path to the input tab-delimited file Returns: a dictionary keyed by chromosome name with values of list of genes on the chromosome Return type: dict
oflist
ofGene
bystr
Example
>>> ref = load_reference_genes('filename') >>> ref['1'] [Gene(), Gene(), ....]
Warning
does not load translations unless then start with ‘M’, end with ‘*’ and have a length of multiple 3
-
mavis.annotate.file_io.
load_annotations
(*filepaths, warn=<mavis.util.Log object>, reference_genome=None, best_transcripts_only=False)[source]¶ loads gene models from an input file. Expects a tabbed or json file.
Parameters: - filepath (str) – path to the input file
- verbose (bool) – output extra information to stdout
- reference_genome (
dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr name - filetype (str) – json or tab/tsv. only required if the file type can’t be interpolated from the path extension
Returns: lists of genes keyed by chromosome name
Return type:
-
mavis.annotate.file_io.
load_masking_regions
(*filepaths)[source]¶ reads a file of regions. The expect input format for the file is tab-delimited and the header should contain the following columns
- chr: the chromosome
- start: start of the region, 1-based inclusive
- end: end of the region, 1-based inclusive
- name: the name/label of the region
For example:
#chr start end name chr20 25600000 27500000 centromere
Parameters: filepath (str) – path to the input tab-delimited file Returns: a dictionary keyed by chromosome name with values of lists of regions on the chromosome Return type: dict
oflist
ofBioInterval
bystr
Example
>>> m = load_masking_regions('filename') >>> m['1'] [BioInterval(), BioInterval(), ...]
-
mavis.annotate.file_io.
load_reference_genes
(*pos, **kwargs)[source]¶ Deprecated Use
load_annotations()
instead
-
mavis.annotate.file_io.
load_reference_genome
(*filepaths)[source]¶ Parameters: filepaths (list of str) – the paths to the files containing the input fasta genomes Returns: a dictionary representing the sequences in the fasta file Return type: dict
ofBio.SeqRecord
bystr
-
mavis.annotate.file_io.
load_templates
(*filepaths)[source]¶ primarily useful if template drawings are required and is not necessary otherwise assumes the input file is 0-indexed with [start,end) style. Columns are expected in the following order, tab-delimited. A header should not be given
- name
- start
- end
- band_name
- giemsa_stain
for example
chr1 0 2300000 p36.33 gneg chr1 2300000 5400000 p36.32 gpos25
Parameters: filename (str) – the path to the file with the cytoband template information Returns: list of the templates loaded Return type: list
ofTemplate
fusion module¶
-
class
mavis.annotate.fusion.
FusionTranscript
[source]¶ Bases:
mavis.annotate.genomic.PreTranscript
FusionTranscript is a PreTranscript built from two parent PreTranscripts. It has most of the same functionality as a regular PreTranscript except that it will not have a parent gene and retains a mapping of the new exons to the exons in the PreTranscript they originated from
Additionally the FusionTranscript is always constructed on the positive strand.
The preferred way to construct a FusionTranscript is through the build method.
-
classmethod
build
(ann, reference_genome, min_orf_size=None, max_orf_cap=None, min_domain_mapping_match=None)[source]¶ Parameters: - ann (Annotation) – the annotation object we want to build a FusionTranscript for
- reference_genome (
dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr name
Returns: the newly built fusion transcript
Return type:
-
exon_number
(exon)[source]¶ Parameters: exon (Exon) – the exon to be numbered Returns: the number of the exon in the original transcript (prior to fusion) Return type: int
-
get_cdna_seq
(splicing_pattern, reference_genome=None, ignore_cache=False)[source]¶ Parameters: Returns: the spliced cDNA seq
Return type:
-
classmethod
-
mavis.annotate.fusion.
determine_prime
(transcript, breakpoint)[source]¶ determine the side of the transcript 5’ or 3’ which is ‘kept’ given the breakpoint
Parameters: - transcript (Transcript) – the transcript
- breakpoint (Breakpoint) – the breakpoint
Returns: 5’ or 3’
Return type: PRIME
Raises: AttributeError
– if the orientation of the breakpoint or the strand of the transcript is not specified
genomic module¶
-
class
mavis.annotate.genomic.
Exon
(start, end, transcript=None, name=None, intact_start_splice=True, intact_end_splice=True, seq=None, strand=None)[source]¶ Bases:
mavis.annotate.base.BioInterval
Parameters: - start (int) – the genomic start position
- end (int) – the genomic end position
- name (str) – the name of the exon
- transcript (PreTranscript) – the ‘parent’ transcript this exon belongs to
- intact_start_splice (bool) – if the starting splice site has been abrogated
- intact_end_splice (bool) – if the end splice site has been abrogated
Raises: AttributeError
– if the exon start > the exon endExample
>>> Exon(15, 78)
-
acceptor
¶ returns the genomic exonic position of the acceptor splice site
Type: int
-
donor
¶ returns the genomic exonic position of the donor splice site
Type: int
-
transcript
¶ the transcript this exon belongs to
Type: PreTranscript
-
class
mavis.annotate.genomic.
Gene
(chr, start, end, name=None, strand='?', aliases=None, seq=None)[source]¶ Bases:
mavis.annotate.base.BioInterval
Parameters: Example
>>> Gene('X', 1, 1000, 'ENG0001', '+', ['KRAS'])
-
chr
¶ returns the name of the chromosome that this gene resides on
-
get_seq
(reference_genome, ignore_cache=False)[source]¶ gene sequence is always given wrt to the positive forward strand regardless of gene strand
Parameters: Returns: the sequence of the gene
Return type:
-
spliced_transcripts
¶ list of transcripts
Type: list
ofTranscript
-
transcript_priority
(transcript)[source]¶ prioritizes transcripts from 0 to n-1 based on best transcript flag and then alphanumeric name sort
Warning
Lower number means higher priority. This is to make sort work by default
-
transcripts
¶ list of unspliced transcripts
Type: list
ofPreTranscript
-
translations
¶ list of translations
Type: list
ofTranslation
-
-
class
mavis.annotate.genomic.
IntergenicRegion
(chr, start, end, strand)[source]¶ Bases:
mavis.annotate.base.BioInterval
Parameters: Example
>>> IntergenicRegion('1', 1, 100, '+')
-
chr
¶ returns the name of the chromosome that this region resides on
-
-
class
mavis.annotate.genomic.
PreTranscript
(exons, gene=None, name=None, strand=None, spliced_transcripts=None, seq=None, is_best_transcript=False)[source]¶ Bases:
mavis.annotate.base.BioInterval
creates a new transcript object
Parameters: - exons (
list
ofExon
) – list of Exon that make up the transcript - genomic_start (int) – genomic start position of the transcript
- genomic_end (int) – genomic end position of the transcript
- gene (Gene) – the gene this transcript belongs to
- name (str) – name of the transcript
- strand (STRAND) – strand the transcript is on, defaults to the strand of the Gene if not specified
- seq (str) – unspliced cDNA seq
-
convert_cdna_to_genomic
(pos, splicing_pattern)[source]¶ Parameters: - pos (int) – cdna position
- splicing_pattern (SplicingPattern) – list of genomic splice sites 3‘5’ repeating
Returns: the genomic equivalent
Return type:
-
convert_genomic_to_cdna
(pos, splicing_pattern)[source]¶ Parameters: - pos (int) – the genomic position to be converted
- splicing_pattern (SplicingPattern) – list of genomic splice sites 3‘5’ repeating
Returns: the cdna equivalent
Return type: Raises: IndexError
– when a genomic position not present in the cdna is attempted to be converted
-
convert_genomic_to_nearest_cdna
(pos, splicing_pattern, stick_direction=None, allow_outside=True)[source]¶ converts a genomic position to its cdna equivalent or (if intronic) the nearest cdna and shift
Parameters: - pos (int) – the genomic position
- splicing_pattern (SplicingPattern) – the splicing pattern
Returns: - int - the exonic cdna position
- int - the intronic shift
Return type: tuple of int and int
-
exon_number
(exon)[source]¶ exon numbering is based on the direction of translation
Parameters: exon (Exon) – the exon to be numbered Returns: the exon number (1 based) Return type: int Raises: AttributeError
– if the strand is not given or the exon does not belong to the transcript
-
generate_splicing_patterns
()[source]¶ returns a list of splice sites to be connected as a splicing pattern
Returns: List of positions to be spliced together Return type: list
ofSplicingPattern
-
get_cdna_seq
(splicing_pattern, reference_genome=None, ignore_cache=False)[source]¶ Parameters: - splicing_pattern (SplicingPattern) – the list of splicing positions
- reference_genome (
dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr name - ignore_cache (bool) – if True then stored sequences will be ignored and the function will attempt to retrieve the sequence using the positions and the input reference_genome
Returns: the spliced cDNA sequence
Return type:
-
get_seq
(reference_genome=None, ignore_cache=False)[source]¶ Parameters: Returns: the sequence of the transcript including introns (but relative to strand)
Return type:
-
transcripts
¶ list of spliced transcripts
Type: list
ofTranscript
-
translations
¶ list of translations associated with this transcript
Type: list
ofTranslation
- exons (
-
class
mavis.annotate.genomic.
Transcript
(pre_transcript, splicing_patt, seq=None, translations=None)[source]¶ Bases:
mavis.annotate.base.BioInterval
splicing pattern is given in genomic coordinates
Parameters: - pre_transcript (PreTranscript) – the unspliced transcript
- splicing_patt (
list
ofint
) – the list of splicing positions - seq (str) – the cdna sequence
- translations (
list
ofTranslation
) – the list of translations of this transcript
-
convert_cdna_to_genomic
(pos)[source]¶ Parameters: pos (int) – cdna position Returns: the genomic equivalent Return type: int
-
convert_genomic_to_cdna
(pos)[source]¶ Parameters: pos (int) – the genomic position to be converted Returns: the cdna equivalent Return type: int Raises: IndexError
– when a genomic position not present in the cdna is attempted to be converted
-
get_seq
(reference_genome=None, ignore_cache=False)[source]¶ Parameters: Returns: the sequence corresponding to the spliced cdna
Return type:
-
unspliced_transcript
¶ the unspliced transcript this splice variant belongs to
Type: PreTranscript
main module¶
-
mavis.annotate.main.
draw
(drawing_config, ann, reference_genome, template_metadata, drawings_directory)[source]¶ produces the svg diagram and json legend for a given annotation
-
mavis.annotate.main.
main
(inputs, output, library, protocol, reference_genome, annotations, template_metadata, min_domain_mapping_match=0.9, min_orf_size=300, max_orf_cap=3, annotation_filters='choose_more_annotated, choose_transcripts_by_priority', start_time=1558720112, draw_fusions_only=True, draw_non_synonymous_cdna_only=True, max_proximity=5000, **kwargs)[source]¶ Parameters: - inputs (
List
ofstr
) – list of input files to read - output (str) – path to the output directory
- reference_genome (
ReferenceFile
) – seeload_reference_genome()
- annotations (
ReferenceFile
) – seeload_reference_genes()
- template_metadata (
ReferenceFile
) – seeload_templates()
- min_domain_mapping_match (float) – min mapping match percent (0-1) to count a domain as mapped
- min_orf_size (int) – minimum size of an open reading frame to keep as a putative translation
- max_orf_cap (int) – the maximum number of open reading frame s to collect for any given event
- inputs (
protein module¶
-
class
mavis.annotate.protein.
Domain
(name, regions, translation=None, data=None)[source]¶ Bases:
object
Parameters: - name (str) – the name of the domain i.e. PF00876
- regions (
list
ofDomainRegion
) – the amino acid ranges that are part of the domain - transcript (Transcript) – the ‘parent’ transcript this domain belongs to
Raises: AttributeError
– if the end of any region is less than the startExample
>>> Domain('DNA binding domain', [(1, 4), (10, 24)], transcript)
-
align_seq
(input_sequence, reference_genome=None, min_region_match=0.5)[source]¶ align each region to the input sequence starting with the last one. then take the subset of sequence that remains to align the second last and so on return a list of intervals for the alignment. If multiple alignments are found, then raise an error
Parameters: Returns: tuple contains
- int: the number of matches
- int: the total number of amino acids to be aligned
list
ofDomainRegion
: the list of domain regions on the new input sequence
Return type: Raises: AttributeError
– if sequence information is not availableUserWarning
– if a valid alignment could not be found or no best alignment was found
-
get_seqs
(reference_genome=None, ignore_cache=False)[source]¶ returns the amino acid sequences for each of the domain regions associated with this domain in the order of the regions (sorted by start)
Parameters: reference_genome ( dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr nameReturns: list of amino acid sequences for each DomainRegion Return type: list
ofstr
Raises: AttributeError
– if there is not enough sequence information given to determine this
-
key
()[source]¶ tuple
: a tuple representing the items expected to be unique. for hashing and comparing
-
score_region_mapping
(reference_genome=None)[source]¶ compares the sequence in each DomainRegion to the sequence collected for that domain region from the translation object
Parameters: reference_genome ( dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr nameReturns: tuple contains - int: the number of matching amino acids
- int: the total number of amino acids
Return type: tuple of int and int
-
translation
¶ the Translation this domain belongs to
Type: Translation
-
class
mavis.annotate.protein.
Translation
(start, end, transcript=None, domains=None, seq=None, name=None)[source]¶ Bases:
mavis.annotate.base.BioInterval
describes the splicing pattern and cds start and end with reference to a particular transcript
Parameters: - start (int) – start of the coding sequence (cds) relative to the start of the first exon in the transcript
- end (int) – end of the coding sequence (cds) relative to the start of the first exon in the transcript
- transcript (Transcript) – the transcript this is a Translation of
- domains (
list
ofDomain
) – a list of the domains on this translation - sequence (str) – the cds sequence
-
convert_aa_to_cdna
(pos)[source]¶ Parameters: pos (int) – the amino acid position Returns: the cdna equivalent (with CODON_SIZE uncertainty) Return type: Interval
-
convert_cdna_to_aa
(pos)[source]¶ Parameters: pos (int) – the cdna position Returns: the protein/amino-acid position Return type: int Raises: AttributeError
– the cdna position is not translated
-
convert_genomic_to_cds
(pos)[source]¶ converts a genomic position to its cds (coding sequence) equivalent
Parameters: pos (int) – the genomic position Returns: the cds position (negative if before the initiation start site) Return type: int
-
convert_genomic_to_cds_notation
(pos)[source]¶ converts a genomic position to its cds (coding sequence) equivalent using hgvs cds notation
Parameters: pos (int) – the genomic position Returns: the cds position notation Return type: str Example
>>> tl = Translation(...) # a position before the translation start >>> tl.convert_genomic_to_cds_notation(1010) '-50' # a position after the translation end >>> tl.convert_genomic_to_cds_notation(2031) '*72' # an intronic position >>> tl.convert_genomic_to_cds_notation(1542) '50+10' >>> tl.convert_genomic_to_cds_notation(1589) '51-14'
-
convert_genomic_to_nearest_cds
(pos)[source]¶ converts a genomic position to its cds equivalent or (if intronic) the nearest cds and shift
Parameters: pos (int) – the genomic position Returns: - int - the cds position
- int - the intronic shift
Return type: tuple of int and int
-
get_aa_seq
(reference_genome=None, ignore_cache=False)[source]¶ Parameters: reference_genome ( dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr nameReturns: the amino acid sequence Return type: str Raises: AttributeError
– if the reference sequence has not been given and is not set
-
get_cds_seq
(reference_genome=None, ignore_cache=False)[source]¶ Parameters: reference_genome ( dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr nameReturns: the cds sequence Return type: str Raises: AttributeError
– if the reference sequence has not been given and is not set
-
get_seq
(reference_genome=None, ignore_cache=False)[source]¶ wrapper for the sequence method
Parameters: reference_genome ( dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr name
-
transcript
¶ the spliced transcript this translation belongs to
Type: Transcript
-
mavis.annotate.protein.
calculate_orf
(spliced_cdna_sequence, min_orf_size=None)[source]¶ calculate all possible open reading frames given a spliced cdna sequence (no introns)
Parameters: spliced_cdna_sequence (str) – the sequence Returns: list of open reading frame positions on the input sequence Return type: list
ofInterval
splicing module¶
-
class
mavis.annotate.splicing.
SpliceSite
(ref, pos, site_type, intact=True, start=None, end=None, strand=None, seq=None)[source]¶
-
class
mavis.annotate.splicing.
SplicingPattern
(*args, splice_type='normal')[source]¶ Bases:
list
-
classmethod
generate_patterns
(sites, is_reverse=False)[source]¶ returns a list of splice sites to be connected as a splicing pattern
Returns: List of positions to be spliced together Return type: list
ofSplicingPattern
-
classmethod
variant module¶
-
class
mavis.annotate.variant.
Annotation
(bpp, transcript1=None, transcript2=None, proximity=5000, data=None, **kwargs)[source]¶ Bases:
mavis.breakpoint.BreakpointPair
a fusion of two transcripts created by the associated breakpoint_pair will also hold the other annotations for overlapping and encompassed and nearest genes
Holds a breakpoint call and a set of transcripts, other information is gathered relative to these
Parameters: - bpp (BreakpointPair) – the breakpoint pair call. Will be adjusted and then stored based on the transcripts
- transcript1 (Transcript) – transcript at the first breakpoint
- transcript2 (Transcript) – Transcript at the second breakpoint
- data (dict) – optional dictionary to hold related attributes
- event_type (SVTYPE) – the type of event
-
add_gene
(input_gene)[source]¶ adds a input_gene to the current set of annotations. Checks which set it should be added to
Parameters: input_gene (input_gene) – the input_gene being added
-
class
mavis.annotate.variant.
IndelCall
(refseq, mutseq)[source]¶ Bases:
object
Given two sequences, Assuming there exists a single difference between the two call an indel which accounts for the change
Parameters:
-
mavis.annotate.variant.
annotate_events
(bpps, annotations, reference_genome, max_proximity=5000, min_orf_size=200, min_domain_mapping_match=0.95, max_orf_cap=3, log=<mavis.util.Log object>, filters=None)[source]¶ Parameters: - bpps (list of
BreakpointPair
) – list of events - annotations – reference annotations
- reference_genome (dict of string by string) – dictionary of reference sequences by name
- max_proximity (int) – see max_proximity
- min_orf_size (int) – see min_orf_size
- min_domain_mapping_match (float) – see min_domain_mapping_match
- max_orf_cap (int) – see max_orf_cap
- log (callable) – callable function to take in strings and time_stamp args
- filters (list of callable) – list of functions taking in a list and returning a list for filtering
Returns: list of the putative annotations
Return type: list of
Annotation
- bpps (list of
-
mavis.annotate.variant.
call_protein_indel
(ref_translation, fusion_translation, reference_genome=None)[source]¶ compare the fusion protein/aa sequence to the reference protein/aa sequence and return an hgvs notation indel call
Parameters: - ref_translation (Translation) – the reference protein/translation
- fusion_translation (Translation) – the fusion protein/translation
- reference_genome – the reference genome object used to fetch the reference translation AA sequence
Returns: the HGVS protein indel notation
Return type:
-
mavis.annotate.variant.
choose_more_annotated
(ann_list)[source]¶ for a given set of annotations if there are annotations which contain transcripts and annotations that are simply intergenic regions, discard the intergenic region annotations
similarly if there are annotations where both breakpoints fall in a transcript and annotations where one or more breakpoints lands in an intergenic region, discard those that land in the intergenic region
Parameters: ann_list (list of Annotation
) – list of input annotationsWarning
input annotations are assumed to be the same event (the same validation_id) the logic used would not apply to different events
Returns: the filtered list Return type: list of Annotation
-
mavis.annotate.variant.
choose_transcripts_by_priority
(ann_list)[source]¶ for each set of annotations with the same combinations of genes, choose the annotation with the most “best_transcripts” or most “alphanumeric” choices of transcript. Throw an error if they are identical
Parameters: ann_list (list of Annotation
) – input annotationsWarning
input annotations are assumed to be the same event (the same validation_id) the logic used would not apply to different events
Returns: the filtered list Return type: list of Annotation
-
mavis.annotate.variant.
flatten_fusion_translation
(translation)[source]¶ for a given fusion product (translation) gather the information to be output to the tabbed files
Parameters: translation (Translation) – the translation which is on the fusion transcript Returns: the dictionary of column names to values Return type: dict
assemble module¶
-
class
mavis.assemble.
Contig
(sequence, score)[source]¶ Bases:
object
-
class
mavis.assemble.
DeBruijnGraph
(data=None, **attr)[source]¶ Bases:
networkx.classes.digraph.DiGraph
wrapper for a basic digraph enforces edge weights
Initialize a graph with edges, name, graph attributes.
Parameters: - data (input graph) – Data to initialize graph. If data=None (default) an empty graph is created. The data can be an edge list, or any NetworkX graph object. If the corresponding optional Python packages are installed the data can also be a NumPy matrix or 2d ndarray, a SciPy sparse matrix, or a PyGraphviz graph.
- name (string, optional (default='')) – An optional name for the graph.
- attr (keyword arguments, optional (default= no attributes)) – Attributes to add to graph as key=value pairs.
See also
convert
Examples
>>> G = nx.Graph() # or DiGraph, MultiGraph, MultiDiGraph, etc >>> G = nx.Graph(name='my graph') >>> e = [(1,2),(2,3),(3,4)] # list of edges >>> G = nx.Graph(e)
Arbitrary graph attribute pairs (key=value) may be assigned
>>> G=nx.Graph(e, day="Friday") >>> G.graph {'day': 'Friday'}
-
add_edge
(n1, n2, freq=1)[source]¶ add a given edge to the graph, if it exists add the frequency to the existing frequency count
-
trim_forks_by_freq
(min_weight)[source]¶ for all nodes in the graph, if the node has an out-degree > 1 and one of the outgoing edges has freq < min_weight. then that outgoing edge is deleted
-
mavis.assemble.
assemble
(sequences, kmer_size, min_edge_trim_weight=3, assembly_max_paths=20, assembly_min_uniq=0.01, min_complexity=0, log=<function <lambda>>, **kwargs)[source]¶ for a set of sequences creates a DeBruijnGraph simplifies trailing and leading paths where edges fall below a weight threshold and the return all possible unitigs/contigs
drops any sequences too small to fit the kmer size
Parameters: - sequences (
list
ofstr
) – a list of strings/sequences to assemble - kmer_size – see assembly_kmer_size the size of the kmer to use
- min_edge_trim_weight – see assembly_min_edge_trim_weight
- remap_min_match – Minimum match percentage of the remapped read (based on the exact matches in the cigar)
- remap_min_overlap – defaults to the kmer size. Minimum amount of overlap between the contig and the remapped read
- min_contig_length – Minimum length of contigs assemble to attempt remapping reads to. Shorter contigs will be ignored
- remap_min_exact_match – see assembly_min_exact_match_to_remap
- assembly_max_paths – see assembly_max_paths
- log (function) – the log function
Returns: a list of putative contigs
Return type: - sequences (
-
mavis.assemble.
digraph_connected_components
(graph, subgraph=None)[source]¶ the networkx module does not support deriving connected components from digraphs (only simple graphs) this function assumes that connection != reachable this means there is no difference between connected components in a simple graph and a digraph
Parameters: graph (networkx.DiGraph) – the input graph to gather components from Returns: returns a list of compnents which are lists of node names Return type: list
oflist
-
mavis.assemble.
filter_contigs
(contigs, assembly_min_uniq=0.01)[source]¶ given a list of contigs, removes similar contigs to leave the highest (of the similar) scoring contig only
-
mavis.assemble.
kmers
(s, size)[source]¶ for a sequence, compute and return a list of all kmers of a specified size
Parameters: Returns: the list of kmers
Return type: Example
>>> kmers('abcdef', 2) ['ab', 'bc', 'cd', 'de', 'ef']
-
mavis.assemble.
pull_contigs_from_component
(assembly, component, min_edge_trim_weight, assembly_max_paths, log=<mavis.util.Log object>)[source]¶ builds contigs from the a connected component of the assembly DeBruijn graph
Parameters: - assembly (DeBruijnGraph) – the assembly graph
- component (list) – list of nodes which make up the connected component
- min_edge_trim_weight (int) – the minimum weight to not remove a non cutting edge/path
- assembly_max_paths (int) – the maximum number of paths allowed before the graph is further simplified
- log (function) – the log function
Returns: the paths/contigs and their scores
Return type:
bam subpackage¶
—
cache module¶
-
class
mavis.bam.cache.
BamCache
(bamfile, stranded=False)[source]¶ Bases:
object
caches reads by name to facilitate getting read mates without jumping around the file if we’ve already read that section
Parameters: bamfile (str) – path to the input bam file -
add_read
(read)[source]¶ Parameters: read (pysam.AlignedSegment) – the read to add to the cache
-
fetch
(input_chrom, start, stop, limit=10000, cache_if=<function BamCache.<lambda>>, filter_if=<function BamCache.<lambda>>, stop_on_cached_read=False)[source]¶ Parameters: - input_chrom (str) – chromosome name
- start (int) – start position
- end (int) – end position
- limit (int) – maximum number of reads to fetch
- cache_if (function) – if returns True then the read is added to the cache
- filter_if (function) – if returns True then the read is not returned as part of the result
- stop_on_cached_read (bool) – stop reading at the first read found that is already in the cache
Note
the cache_if and filter_if functions must be any function that takes a read as input and returns a boolean
Returns: a set of reads which overlap the input region Return type: set of pysam.AlignedSegment
-
fetch_from_bins
(input_chrom, start, stop, read_limit=10000, cache=False, sample_bins=3, cache_if=<function BamCache.<lambda>>, min_bin_size=10, filter_if=<function BamCache.<lambda>>)[source]¶ wrapper around the fetch method, returns a list to avoid errors with changing the file pointer position from within the loop. Also caches reads if requested and can return a limited read number
Parameters: - chrom (str) – the chromosome
- start (int) – the start position
- stop (int) – the end position
- read_limit (int) – the maximum number of reads to parse
- cache (bool) – flag to store reads
- sample_bins (int) – number of bins to split the region into
- cache_if (callable) – function to check to against a read to determine if it should be cached
- bin_gap_size (int) – gap between the bins for the fetch area
Returns: set of reads gathered from the region
Return type:
-
get_mate
(read, primary_only=True, allow_file_access=False)[source]¶ Parameters: - read (pysam.AlignedSegment) – the read
- primary_only (bool) – ignore secondary alignments
- allow_file_access (bool) – determines if the bam can be accessed to try to find the mate
Returns: list of mates of the input read
Return type:
-
get_read_reference_name
(read)[source]¶ Parameters: read (pysam.AlignedSegment) – the read we want the chromosome name for Returns: the name of the chromosome Return type: str
-
cigar module¶
holds methods related to processing cigar tuples. Cigar tuples are generally an iterable list of tuples where the first element in each tuple is the CIGAR value (i.e. 1 for an insertion), and the second value is the frequency
-
mavis.bam.cigar.
alignment_matches
(cigar)[source]¶ counts the number of aligned bases irrespective of match/mismatch this is equivalent to counting all CIGAR.M
-
mavis.bam.cigar.
compute
(ref, alt, force_softclipping=True, min_exact_to_stop_softclipping=6)[source]¶ given a ref and alt sequence compute the cigar string representing the alt
returns the cigar tuples along with the start position of the alt relative to the ref
-
mavis.bam.cigar.
convert_for_igv
(cigar)[source]¶ igv does not support the extended CIGAR values for match v mismatch
Example
>>> convert_for_igv([(7, 4), (8, 1), (7, 5)]) [(0, 10)]
-
mavis.bam.cigar.
convert_string_to_cigar
(string)[source]¶ Given a cigar string, converts it to the appropriate cigar tuple
Example
>>> convert_string_to_cigar('8M2I1D9X') [(CIGAR.M, 8), (CIGAR.I, 2), (CIGAR.D, 1), (CIGAR.X, 9)]
-
mavis.bam.cigar.
extend_softclipping
(cigar, min_exact_to_stop_softclipping)[source]¶ given some input cigar, extends softclipping if there are mismatches/insertions/deletions close to the end of the aligned portion. The stopping point is defined by the min_exact_to_stop_softclipping parameter. this function will throw an error if there is no exact match aligned portion to signal stop
Parameters: Returns: Return type:
-
mavis.bam.cigar.
hgvs_standardize_cigar
(read, reference_seq)[source]¶ extend alignments as long as matches are possible. call insertions before deletions
-
mavis.bam.cigar.
join
(*pos)[source]¶ given a number of cigar lists, joins them and merges any consecutive tuples with the same cigar value
Example
>>> join([(1, 1), (4, 7)], [(4, 3), (2, 4)]) [(1, 1), (4, 10), (2, 4)]
-
mavis.bam.cigar.
longest_exact_match
(cigar)[source]¶ returns the longest consecutive exact match
Parameters: cigar ( list
oftuple
ofint
andint
) – the cigar tuples
-
mavis.bam.cigar.
longest_fuzzy_match
(cigar, max_fuzzy_interupt=1)[source]¶ computes the longest sequence of exact matches allowing for ‘x’ event interrupts
Parameters: - cigar – cigar tuples
- max_fuzzy_interupt (int) – number of mismatches allowed
-
mavis.bam.cigar.
match_percent
(cigar)[source]¶ calculates the percent of aligned bases (matches or mismatches) that are matches
-
mavis.bam.cigar.
merge_indels
(cigar)[source]¶ For a given cigar tuple, merges adjacent insertions/deletions
Example
>>> merge_indels([(CIGAR.EQ, 10), (CIGAR.I, 3), (CIGAR.D, 4), (CIGAR.I, 2), (CIGAR.D, 2), (CIGAR.EQ, 10)]) [(CIGAR.EQ, 10), (CIGAR.I, 5), (CIGAR.D, 6), (CIGAR.EQ, 10)]
-
mavis.bam.cigar.
merge_internal_events
(cigar, inner_anchor=10, outer_anchor=10)[source]¶ merges events (insertions, deletions, mismatches) within a cigar if they are between exact matches on either side (anchors) and separated by less exact matches than the given parameter
does not merge two mismatches, must contain a deletion/insertion
Parameters: Returns: new list of cigar tuples with merged events
Return type: Example
>>> merge_internal_events([(CIGAR.EQ, 10), (CIGAR.X, 1), (CIGAR.EQ, 2), (CIGAR.D, 1), (CIGAR.EQ, 10)]) [(CIGAR.EQ, 10), (CIGAR.I, 3), (CIGAR.D, 4), (CIGAR.EQ, 10)]
-
mavis.bam.cigar.
recompute_cigar_mismatch
(read, ref)[source]¶ for cigar tuples where M is used, recompute to replace with X/= for increased utility and specificity
Parameters: - read (pysam.AlignedSegment) – the input read
- ref (str) – the reference sequence
Returns: the cigar tuple
Return type:
read module¶
-
class
mavis.bam.read.
SamRead
(reference_name=None, next_reference_name=None, alignment_score=None, **kwargs)[source]¶ Bases:
pysam.libcalignedsegment.AlignedSegment
Subclass to extend the pysam.AlignedSegment class adding some utility methods and convenient representations
Allows next_reference_name and reference_name to be set directly so that is does not depend on a bam header
-
key
()[source]¶ uses a stored _key attribute, if available. This is to avoid the hash changing if the reference start (for example) is changed but also allow this attribute to be used and calculated for non SamRead objects
This way to change the hash behaviour the user must be explicit and use the set_key method
-
-
mavis.bam.read.
breakpoint_pos
(read, orient='?')[source]¶ assumes the breakpoint is the position following softclipping on the side with more softclipping (unless and orientation has been specified)
Parameters: - read (
AlignedSegment
) – the read object - orient (ORIENT) – the orientation
Returns: the position of the breakpoint in the input read
Return type: - read (
-
mavis.bam.read.
calculate_alignment_score
(read, consec_bonus=1)[source]¶ calculates a score for comparing alignments
Parameters: read (pysam.AlignedSegment) – the input read Returns: the score Return type: float
-
mavis.bam.read.
convert_events_to_softclipping
(read, orientation, max_event_size, min_anchor_size=None)[source]¶ given an alignment, simplifies the alignment by grouping everything past the first anchor and including the first event considered too large and unaligning them turning them into softclipping
-
mavis.bam.read.
map_ref_range_to_query_range
(read, ref_range)[source]¶ Parameters: - ref_range (Interval) – 1-based inclusive
- read (pysam.AlignedSegment) – read used for the mapping
Returns: 1-based inclusive range
Return type:
-
mavis.bam.read.
nsb_align
(ref, seq, weight_of_score=0.5, min_overlap_percent=1, min_match=0, min_consecutive_match=1, scoring_function=<function calculate_alignment_score>)[source]¶ given some reference string and a smaller sequence string computes the best non-space-breaking alignment i.e. an alignment that does not allow for indels (straight-match). Positions in the aligned segments are given relative to the length of the reference sequence (1-based)
Parameters: - ref (str) – the reference sequence
- seq (str) – the sequence being aligned
- weight_of_score (float) – when scoring alignments this determines the amount of weight to place on the cigar match. Should be a number between 0 and 1
- min_overlap_percent (float) – the minimum amount of overlap of the input sequence to the reference should be a number between 0 and 1
- min_match (float) – the minimum number of matches compared to total
- scoring_function (callable) – any function that will take a read as input and return a float used in comparing alignments to choose the best alignment
Returns: list of aligned segments
Return type: Note
using a higher min_match may improve performance as low quality alignments are rejected more quickly. However this may also result in no match being returned when there is no high quality match to be found.
-
mavis.bam.read.
orientation_supports_type
(read, event_type)[source]¶ checks if the orientation is compatible with the type of event
Parameters: - read (
AlignedSegment
) – a read from the pair - event_type (SVTYPE) – the type of event to check
Returns: True
- the read pair is in the correct orientation for this event typeFalse
- the read is not in the correct orientation
Return type: - read (
-
mavis.bam.read.
pileup
(reads, filter_func=None)[source]¶ For a given set of reads generate a pileup of all reads (excluding those for which the filter_func returns True)
Parameters: - reads (iterable of pysam.AlignedSegment) – reads to pileup
- filter_func (callable) – function which takes in a read and returns True if it should be ignored and False otherwise
Returns: tuples of genomic position and read count at that position
Return type: iterable of tuple of int and int
Note
returns positions using 1-based indexing
-
mavis.bam.read.
read_pair_type
(read)[source]¶ assumptions based on illumina pairs: only 4 possible combinations
Parameters: read ( AlignedSegment
) – the input readReturns: the type of input read pair Return type: READ_PAIR_TYPE Raises: NotImplementedError
– for any read that does not fall into the four expected configurations (see below)++++> <---- is LR same-strand ++++> ++++> is LL opposite <---- <---- is RR opposite <---- ++++> is RL same-strand
-
mavis.bam.read.
sequenced_strand
(read, strand_determining_read=2)[source]¶ determines the strand that was sequenced
Parameters: - read (
AlignedSegment
) – the read being used to determine the strand - strand_determining_read (int) – which read in the read pair is the same as the sequenced strand
Returns: the strand that was sequenced
Return type: STRAND
Raises: ValueError
– if strand_determining_read is not 1 or 2Warning
if the input pair is unstranded the information will not be representative of the strand sequenced since the assumed convention is not followed
- read (
stats module¶
-
class
mavis.bam.stats.
BamStats
(median_fragment_size, stdev_fragment_size, read_length)[source]¶ Bases:
object
-
class
mavis.bam.stats.
Histogram
[source]¶ Bases:
dict
-
mavis.bam.stats.
compute_genome_bam_stats
(bam_file_handle, sample_bin_size, sample_size, min_mapping_quality=1, sample_cap=10000, distribution_fraction=0.99)[source]¶ computes various statistical measures relating the input bam file
Parameters: - bam_file_handle (pysam.AlignmentFile) – the input bam file handle
- sample_bin_size (int) – how large to make the sample bin (in bp)
- sample_size (int) – the number of genes to compute stats over
- log (callable) – outputs logging information
- min_mapping_quality (int) – the minimum mapping quality for a read to be used
- sample_cap (int) – maximum number of reads to collect for any given sample region
- distribution_fraction (float) – the proportion of the distribution to use in computing stdev
Returns: the fragment size median, stdev and the read length in a object
Return type:
-
mavis.bam.stats.
compute_transcriptome_bam_stats
(bam_cache, annotations, sample_size, min_mapping_quality=1, stranded=True, sample_cap=10000, distribution_fraction=0.97)[source]¶ computes various statistical measures relating the input bam file
Parameters: - bam_file_handle (BamCache) – the input bam file handle
- annotations (object) – see
load_reference_genes()
- sample_size (int) – the number of genes to compute stats over
- log (callable) – outputs logging information
- min_mapping_quality (int) – the minimum mapping quality for a read to be used
- stranded (bool) – if True then reads must match the gene strand
- sample_cap (int) – maximum number of reads to collect for any given sample region
- distribution_fraction (float) – the proportion of the distribution to use in computing stdev
Returns: the fragment size median, stdev and the read length in a object
Return type:
blat module¶
In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered
zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first
100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is
another little unusual feature in the .psl format. It has to do with how coordinates are handled on the
negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward
strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
– http://wiki.bits.vib.be/index.php/Blat
-
class
mavis.blat.
Blat
[source]¶ Bases:
object
-
static
millibad
(row, is_protein=False, is_mrna=True)[source]¶ this function is used in calculating percent identity direct translation of the perl code # https://genome.ucsc.edu/FAQ/FAQblat.html#blat4
-
static
pslx_row_to_pysam
(row, bam_cache, reference_genome)[source]¶ given a ‘row’ from reading a pslx file. converts the row to a BlatAlignedSegment object
Parameters:
-
static
score
(row, is_protein=False)[source]¶ direct translation from ucsc guidelines on replicating the web blat score https://genome.ucsc.edu/FAQ/FAQblat.html#blat4
below are lines from the perl code i’ve re-written in python
my $sizeMul = pslIsProtein($blockCount, $strand, $tStart, $tEnd, $tSize, $tStarts, $blockSizes); sizmul = 1 for DNA my $pslScore = $sizeMul * ($matches + ($repMatches >> 1) ) - $sizeMul * $misMatches - $qNumInsert - $tNumIns ert)
-
static
breakpoint module¶
-
class
mavis.breakpoint.
Breakpoint
(chr, start, end=None, orient='?', strand='?', seq=None)[source]¶ Bases:
mavis.interval.Interval
class for storing information about a SV breakpoint coordinates are given as 1-indexed
Parameters: Examples
>>> Breakpoint('1', 1, 2) >>> Breakpoint('1', 1) >>> Breakpoint('1', 1, 2, 'R', ) >>> Breakpoint('1', 1, orient='R')
-
key
¶
-
-
class
mavis.breakpoint.
BreakpointPair
(b1, b2, stranded=False, opposing_strands=None, untemplated_seq=None, data=None, **kwargs)[source]¶ Bases:
object
Parameters: - b1 (Breakpoint) – the first breakpoint
- b2 (Breakpoint) – the second breakpoint
- stranded (bool) – if not stranded then +/- is equivalent to -/+
- opposing_strands (bool) – are the strands at the breakpoint opposite? i.e. +/- instead of +/+
- untemplated_seq (str) – seq between the breakpoints that is not part of either breakpoint
- data (dict) – optional dictionary of attributes associated with this pair
Note
untemplated_seq should always be given wrt to the positive/forward reference strand
Example
>>> BreakpointPair(Breakpoint('1', 1), Breakpoint('1', 9999), opposing_strands=True) >>> BreakpointPair(Breakpoint('1', 1, strand='+'), Breakpoint('1', 9999, strand='-'))
-
breakpoint_sequence_homology
(reference_genome)[source]¶ for a given set of breakpoints matches the sequence opposite the partner breakpoint this sequence comparison is done with reference to a reference genome and does not use novel or untemplated sequence in the comparison. For this reason, insertions will never return any homologous sequence
small duplication event CTT => CTTCTT GATACATTTCTTCTTGAAAA reference ---------<========== first breakpoint ===========>-------- second breakpoint ---------CT-CT------ first break homology -------TT-TT-------- second break homology
Parameters: reference_genome ( dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr nameReturns: Return type: tuple Raises: AttributeError
– for non specific breakpoints
-
classmethod
classify
(pair, distance=None)[source]¶ uses the chr, orientations and strands to determine the possible structural_variant types that this pair could support
Parameters: - pair (BreakpointPair) – the pair to classify
- distance (callable) – if defined, will be passed to net size to use in narrowing the list of putative types (del vs ins)
Returns: a list of possible SVTYPE
Return type: Example
>>> bpp = BreakpointPair(Breakpoint('1', 1), Breakpoint('1', 9999), opposing_strands=True) >>> BreakpointPair.classify(bpp) ['inversion'] >>> bpp = BreakpointPair(Breakpoint('1', 1, orient='L'), Breakpoint('1', 9999, orient='R'), opposing_strands=False) >>> BreakpointPair.classify(bpp) {'deletion', 'insertion'}
-
flatten
()[source]¶ returns the key-value self for the breakpoint self information as can be written directly as a tab row
-
is_putative_indel
¶
cluster subpackage¶
Sub-package Documentation¶
The cluster sub-package is responsible for merging variants coming from different inputs (i.e. different tools).
Types of Output Files¶
expected name/suffix | file type/format | content |
---|---|---|
cluster_assignment.tab |
text/tabbed | |
uninformative_clusters.txt |
text | list of cluster ids that were dropped by annotation proximity filter |
clusters.bed |
bed | cluster positions |
cluster-*.tab |
text/tabbed | computed clusters |
Algorithm Overview¶
Collapse any duplicate breakpoint pairs
Split breakpoint pairs by type
Cluster breakpoint pairs by distance (within a type)
- Create a graph representation of the distances between pairs
- Find cliques up to a given input size (cluster_clique_size)
- Hierarchically cluster the cliques (allows redundant participation)
- For each input node/pair pick the best cluster(s)
Output the clusters and the mapping to the input pairs
modules
cluster module¶
-
mavis.cluster.cluster.
merge_breakpoint_pairs
(input_pairs, cluster_radius=200, cluster_initial_size_limit=25, verbose=False)[source]¶ two-step merging process
- merges all ‘small’ (see cluster_initial_size_limit) events as the union of all events that
- fall within the cluster_radius
- for all remaining events choose the ‘best’ merge for any event within cluster_radius of an
- existing node. Otherwise the node is added unmerged. The events in the second phase are done in order of smallest total breakpoint interval size to largest
Parameters: Returns: mapping of merged breakpoint pairs to the input pairs used in the merge
Return type: dict of list of BreakpointPair by BreakpointPair
-
mavis.cluster.cluster.
merge_by_union
(input_pairs, group_key, weight_adjustment=10, cluster_radius=200)[source]¶ for a given set of breakpoint pairs, merge the union of all pairs that are within the given distance (cluster_radius)
-
mavis.cluster.cluster.
merge_integer_intervals
(*intervals, weight_adjustment=0)[source]¶ Merges a set of integer intervals into a single interval where the center is the weighted mean of the input intervals. The weight is inversely proportional to the length of each interval. The length of the final interval is the average of the lengths of the input intervals capped in size so that it never extends beyond the union of the input intervals
Parameters: weight_adjustment (int) – add to length to lower weighting differences between small intervals
constants module¶
-
mavis.cluster.constants.
DEFAULTS
= WeakMavisNamespace(cluster_initial_size_limit=25, cluster_radius=100, limit_to_chr=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'], max_files=200, max_proximity=5000, min_clusters_per_file=50, uninformative_filter=False)¶
main module¶
-
mavis.cluster.main.
main
(inputs, output, strand_specific, library, protocol, disease_status, masking, annotations, limit_to_chr=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'], cluster_initial_size_limit=25, cluster_radius=100, uninformative_filter=False, max_proximity=5000, min_clusters_per_file=50, max_files=200, batch_id=None, split_only=False, start_time=1558720114, **kwargs)[source]¶ Parameters: - inputs (
List
ofstr
) – list of input files to read - output (str) – path to the output directory
- strand_specific (bool) – is the bam using a strand specific protocol
- library (str) – the library to look for in each of the input files
- protocol (PROTOCOL) – the sequence protocol (genome or transcriptome)
- masking (object) – see
load_masking_regions()
- cluster_clique_size (int) – the maximum size of cliques to search for using the exact algorithm
- cluster_radius (int) – distance (in breakpoint pairs) used in deciding to join bpps in a cluster
- uninformative_filter (bool) – if True then clusters should be filtered out if they are not within a specified (max_proximity) distance to any annotation
- max_proximity (int) – the maximum distance away an annotation can be before the uninformative_filter is applied
- annotations (ReferenceFile) – see
load_reference_genes()
- min_clusters_per_file (int) – the minimum number of clusters to output to a file
- max_files (int) – the maximum number of files to split clusters into
- inputs (
-
mavis.cluster.main.
split_clusters
(clusters, outputdir, batch_id, min_clusters_per_file=0, max_files=1, write_bed_summary=True)[source]¶ For a set of clusters creates a bed file representation of all clusters. Also splits the clusters evenly into multiple files based on the user parameters (min_clusters_per_file, max_files)
Returns: of output file names (not including the bed file) Return type: list
config module¶
-
class
mavis.config.
CustomHelpFormatter
(prog, indent_increment=2, max_help_position=24, width=None)[source]¶ Bases:
argparse.ArgumentDefaultsHelpFormatter
subclass the default help formatter to stop default printing for required arguments
-
class
mavis.config.
LibraryConfig
(library, protocol, disease_status, bam_file=None, inputs=None, read_length=None, median_fragment_size=None, stdev_fragment_size=None, strand_specific=False, strand_determining_read=2, **kwargs)[source]¶ Bases:
mavis.constants.MavisNamespace
holds library specific configuration information
-
class
mavis.config.
RangeAppendAction
(nmin=1, nmax=None, **kwargs)[source]¶ Bases:
argparse.Action
allows an argument to accept a range of arguments
-
mavis.config.
augment_parser
(arguments, parser, required=None)[source]¶ Adds options to the argument parser. Separate function to facilitate the pipeline steps all having a similar look/feel
-
mavis.config.
generate_config
(args, parser, log=<mavis.util.Log object>)[source]¶ Parameters: - parser (argparse.ArgumentParser) – the main parser
- required – the argparse required arguments group
- optional – the argparse optional arguments group
-
mavis.config.
get_metavar
(arg_type)[source]¶ For a given argument type, returns the string to be used for the metavar argument in add_argument
Example
>>> get_metavar(bool) '{True,False}'
-
mavis.config.
nameable_string
(input_string)[source]¶ A string that can be used for library and/or filenames
-
mavis.config.
validate_section
(section, namespace, use_defaults=False)[source]¶ given a dictionary of values, returns a new dict with the values casted to their appropriate type or set to a default if the value was not given
-
mavis.config.
write_config
(filename, include_defaults=False, libraries=[], conversions={}, log=<mavis.util.Log object>)[source]¶ Parameters: - filename (str) – path to the output file
- include_defaults (bool) – True if default parameters should be written to the config, False otherwise
- libraries (list of LibraryConfig) – library configuration sections
- conversions (dict of list by str) – conversion commands by alias name
- log (function) – function to pass output logging to
constants module¶
module responsible for small utility functions and constants used throughout the structural_variant package
-
mavis.constants.
CALL_METHOD
= MavisNamespace(CONTIG='contig', FLANK='flanking reads', INPUT='input', SPAN='spanning reads', SPLIT='split reads')¶ holds controlled vocabulary for allowed call methods
CONTIG
: a contig was assembled and aligned across the breakpointsSPLIT
: the event was called by split readFLANK
: the event was called by flanking read pairSPAN
: the event was called by spanning read
Type: MavisNamespace
-
mavis.constants.
CIGAR
= MavisNamespace(D=2, EQ=7, H=5, I=1, M=0, N=3, P=6, S=4, X=8)¶ Enum-like. For readable cigar values
M
: alignment match (can be a sequence match or mismatch)I
: insertion to the referenceD
: deletion from the referenceN
: skipped region from the referenceS
: soft clipping (clipped sequences present in SEQ)H
: hard clipping (clipped sequences NOT present in SEQ)P
: padding (silent deletion from padded reference)EQ
: sequence match (=)X
: sequence mismatch
note: descriptions are taken from the samfile documentation
Type: MavisNamespace
-
mavis.constants.
COLUMNS
= MavisNamespace(annotation_figure='annotation_figure', annotation_figure_legend='annotation_figure_legend', annotation_id='annotation_id', assumed_untemplated='assumed_untemplated', break1_chromosome='break1_chromosome', break1_ewindow='break1_ewindow', break1_ewindow_count='break1_ewindow_count', break1_ewindow_practical_coverage='break1_ewindow_practical_coverage', break1_homologous_seq='break1_homologous_seq', break1_orientation='break1_orientation', break1_position_end='break1_position_end', break1_position_start='break1_position_start', break1_seq='break1_seq', break1_split_read_names='break1_split_read_names', break1_split_reads='break1_split_reads', break1_split_reads_forced='break1_split_reads_forced', break1_strand='break1_strand', break2_chromosome='break2_chromosome', break2_ewindow='break2_ewindow', break2_ewindow_count='break2_ewindow_count', break2_ewindow_practical_coverage='break2_ewindow_practical_coverage', break2_homologous_seq='break2_homologous_seq', break2_orientation='break2_orientation', break2_position_end='break2_position_end', break2_position_start='break2_position_start', break2_seq='break2_seq', break2_split_read_names='break2_split_read_names', break2_split_reads='break2_split_reads', break2_split_reads_forced='break2_split_reads_forced', break2_strand='break2_strand', call_method='call_method', call_sequence_complexity='call_sequence_complexity', cdna_synon='cdna_synon', cluster_id='cluster_id', cluster_size='cluster_size', contig_alignment_query_consumption='contig_alignment_query_consumption', contig_alignment_query_name='contig_alignment_query_name', contig_alignment_rank='contig_alignment_rank', contig_alignment_score='contig_alignment_score', contig_break1_read_depth='contig_break1_read_depth', contig_break2_read_depth='contig_break2_read_depth', contig_build_score='contig_build_score', contig_read_depth='contig_read_depth', contig_remap_coverage='contig_remap_coverage', contig_remap_score='contig_remap_score', contig_remapped_read_names='contig_remapped_read_names', contig_remapped_reads='contig_remapped_reads', contig_seq='contig_seq', contig_strand_specific='contig_strand_specific', contigs_assembled='contigs_assembled', disease_status='disease_status', event_type='event_type', exon_first_3prime='exon_first_3prime', exon_last_5prime='exon_last_5prime', filter_comment='filter_comment', flanking_median_fragment_size='flanking_median_fragment_size', flanking_pairs='flanking_pairs', flanking_pairs_compatible='flanking_pairs_compatible', flanking_pairs_compatible_read_names='flanking_pairs_compatible_read_names', flanking_pairs_read_names='flanking_pairs_read_names', flanking_stdev_fragment_size='flanking_stdev_fragment_size', fusion_cdna_coding_end='fusion_cdna_coding_end', fusion_cdna_coding_start='fusion_cdna_coding_start', fusion_mapped_domains='fusion_mapped_domains', fusion_protein_hgvs='fusion_protein_hgvs', fusion_sequence_fasta_file='fusion_sequence_fasta_file', fusion_sequence_fasta_id='fusion_sequence_fasta_id', fusion_splicing_pattern='fusion_splicing_pattern', gene1='gene1', gene1_aliases='gene1_aliases', gene1_direction='gene1_direction', gene2='gene2', gene2_aliases='gene2_aliases', gene2_direction='gene2_direction', gene_product_type='gene_product_type', genes_encompassed='genes_encompassed', genes_overlapping_break1='genes_overlapping_break1', genes_overlapping_break2='genes_overlapping_break2', genes_proximal_to_break1='genes_proximal_to_break1', genes_proximal_to_break2='genes_proximal_to_break2', inferred_pairing='inferred_pairing', library='library', linking_split_read_names='linking_split_read_names', linking_split_reads='linking_split_reads', net_size='net_size', opposing_strands='opposing_strands', pairing='pairing', product_id='product_id', protein_synon='protein_synon', protocol='protocol', raw_break1_half_mapped_reads='raw_break1_half_mapped_reads', raw_break1_split_reads='raw_break1_split_reads', raw_break2_half_mapped_reads='raw_break2_half_mapped_reads', raw_break2_split_reads='raw_break2_split_reads', raw_flanking_pairs='raw_flanking_pairs', raw_spanning_reads='raw_spanning_reads', repeat_count='repeat_count', spanning_read_names='spanning_read_names', spanning_reads='spanning_reads', stranded='stranded', supplementary_call='supplementary_call', tools='tools', tracking_id='tracking_id', transcript1='transcript1', transcript2='transcript2', untemplated_seq='untemplated_seq', validation_id='validation_id')¶ Column names for i/o files used throughout the pipeline
- annotation_figure_legend
- annotation_figure
- annotation_id
- break1_chromosome
- break1_ewindow_count
- break1_ewindow_practical_coverage
- break1_ewindow
- break1_homologous_seq
- break1_orientation
- break1_position_end
- break1_position_start
- break1_seq
- break1_split_reads_forced
- break1_split_reads
- break1_strand
- break2_chromosome
- break2_ewindow_count
- break2_ewindow_practical_coverage
- break2_ewindow
- break2_homologous_seq
- break2_orientation
- break2_position_end
- break2_position_start
- break2_seq
- break2_split_reads_forced
- break2_split_reads
- break2_strand
- call_method
- cdna_synon
- cluster_id
- cluster_size
- contig_alignment_cigar
- contig_alignment_query_name
- contig_alignment_reference_start
- contig_alignment_score
- contig_build_score
- contig_remap_coverage
- contig_remap_score
- contig_remapped_read_names
- contig_remapped_reads
- contig_seq
- contig_strand_specific
- contigs_assembled
- call_sequence_complexity
- event_type
- flanking_median_fragment_size
- flanking_pairs_compatible
- flanking_pairs
- flanking_stdev_fragment_size
- fusion_cdna_coding_end
- fusion_cdna_coding_end
- fusion_cdna_coding_start
- fusion_mapped_domains
- fusion_sequence_fasta_file
- fusion_sequence_fasta_id
- fusion_splicing_pattern
- fusion_protein_hgvs
- gene1_aliases
- gene1_direction
- gene1
- gene2_aliases
- gene2_direction
- gene2
- gene_product_type
- genes_encompassed
- genes_overlapping_break1
- genes_overlapping_break2
- genes_proximal_to_break1
- genes_proximal_to_break2
- inferred_pairing
- library
- linking_split_reads
- net_size
- opposing_strands
- pairing
- product_id
- protein_synon
- protocol
- raw_break1_split_reads
- raw_break2_split_reads
- raw_flanking_pairs
- raw_spanning_reads
- spanning_read_names
- spanning_reads
- stranded
- tools
- tracking_id
- transcript1
- transcript2
- supplementary_call
- untemplated_seq
- validation_id
- repeat_count
- assumed_untemplated
Type: MavisNamespace
-
mavis.constants.
DISEASE_STATUS
= MavisNamespace(DISEASED='diseased', NORMAL='normal')¶ holds controlled vocabulary for allowed disease status
DISEASED
: diseasedNORMAL
: normal
Type: MavisNamespace
-
mavis.constants.
GENE_PRODUCT_TYPE
= MavisNamespace(ANTI_SENSE='anti-sense', SENSE='sense')¶ controlled vocabulary for gene products
SENSE
: the gene product is a sense fusionANTI_SENSE
: the gene product is anti-sense
Type: MavisNamespace
-
mavis.constants.
GIEMSA_STAIN
= MavisNamespace(ACEN='acen', GNEG='gneg', GPOS100='gpos100', GPOS25='gpos25', GPOS33='gpos33', GPOS50='gpos50', GPOS66='gpos66', GPOS75='gpos75', GVAR='gvar', STALK='stalk')¶ holds controlled vocabulary relating to stains of chromosome bands
Type: MavisNamespace
-
class
mavis.constants.
MavisNamespace
(*pos, **kwargs)[source]¶ Bases:
object
Namespace to hold module constants
Example
>>> nspace = MavisNamespace(thing=1, otherthing=2) >>> nspace.thing 1 >>> nspace.otherthing 2
-
DELIM
= '[;,\\s]+'¶ delimiter to use is parsing listable variables from the environment or config file
Type: str
-
add
(attr, value, defn=None, cast_type=None, nullable=False, env_overwritable=False, listable=False)[source]¶ Add an attribute to the name space
Parameters: - attr (str) – name of the attribute being added
- value – the value of the attribute
- defn (str) – the definition, will be used in generating documentation and help menus
- cast_type (callable) – the function to use in casting the value
- nullable (bool) – True if this attribute can have a None value
- env_overwritable (bool) – True if this attribute will be overriden by its environment variable equivalent
- listable (bool) – True if this attribute can have multiple values
Example
>>> nspace = MavisNamespace() >>> nspace.add('thing', 1, int, 'I am a thing') >>> nspace = MavisNamespace() >>> nspace.add('thing', 1, int) >>> nspace = MavisNamespace() >>> nspace.add('thing', 1) >>> nspace = MavisNamespace() >>> nspace.add('thing', value=1, cast_type=int, defn='I am a thing')
-
define
(attr, *pos)[source]¶ Get the definition of a given attribute or return a default (when given) if the attribute does not exist
Returns: definition for the attribute Return type: str Raises: KeyError
– the attribute does not exist and a default was not givenExample
>>> nspace = MavisNamespace() >>> nspace.add('thing', 1, defn='I am a thing') >>> nspace.add('otherthing', 2) >>> nspace.define('thing') 'I am a thing' >>> nspace.define('otherthing') Traceback (most recent call last): .... >>> nspace.define('otherthing', 'I am some other thing') 'I am some other thing'
-
enforce
(value)[source]¶ checks that the current namespace has a given value
Returns: the input value Raises: KeyError
– the value did not existExample
>>> nspace = MavisNamespace(thing=1, otherthing=2) >>> nspace.enforce(1) 1 >>> nspace.enforce(3) Traceback (most recent call last): ....
-
get
(key, *pos)[source]¶ get an attribute, return a default (if given) if the attribute does not exist
Example
>>> nspace = MavisNamespace(thing=1, otherthing=2) >>> nspace.get('thing', 2) 1 >>> nspace.get('nonexistant_thing', 2) 2 >>> nspace.get('nonexistant_thing') Traceback (most recent call last): ....
-
get_env_name
(attr)[source]¶ Get the name of the corresponding environment variable
Example
>>> nspace = MavisNamespace(a=1) >>> nspace.get_env_name('a') 'MAVIS_A'
-
is_env_overwritable
(attr)[source]¶ Returns: True if the variable is overrided by specifying the environment variable equivalent Return type: bool
-
items
()[source]¶ Example
>>> MavisNamespace(thing=1, otherthing=2).items() [('thing', 1), ('otherthing', 2)]
-
keys
()[source]¶ get the attribute keys as a list
Example
>>> MavisNamespace(thing=1, otherthing=2).keys() ['thing', 'otherthing']
-
classmethod
parse_listable_string
(string, cast_type=<class 'str'>, nullable=False)[source]¶ Given some string, parse it into a list
Example
>>> MavisNamespace.parse_listable_string('1,2,3', int) [1, 2, 3] >>> MavisNamespace.parse_listable_string('1;2,None', int, True) [1, 2, None]
-
reverse
(value)[source]¶ for a given value, return the associated key
Parameters: value – the value to get the key/attribute name for
Raises: Example
>>> nspace = MavisNamespace(thing=1, otherthing=2) >>> nspace.reverse(1) 'thing'
-
-
mavis.constants.
NA_MAPPING_QUALITY
= 255¶ mapping quality value to indicate mapping was not performed/calculated
Type: int
-
mavis.constants.
ORIENT
= MavisNamespace(LEFT='L', NS='?', RIGHT='R', compare=<function <lambda>>, expand=<function <lambda>>)¶ holds controlled vocabulary for allowed orientation values
LEFT
: left wrt to the positive/forward strandRIGHT
: right wrt to the positive/forward strandNS
: orientation is not specified
Type: MavisNamespace
-
mavis.constants.
PRIME
= MavisNamespace(FIVE=5, THREE=3)¶ holds controlled vocabulary
FIVE
: five primeTHREE
: three prime
Type: MavisNamespace
-
mavis.constants.
PROTOCOL
= MavisNamespace(GENOME='genome', TRANS='transcriptome')¶ holds controlled vocabulary for allowed protocol values
GENOME
: genomeTRANS
: transcriptome
Type: MavisNamespace
-
mavis.constants.
PYSAM_READ_FLAGS
= MavisNamespace(BLAT_ALIGNMENTS='ba', BLAT_PERCENT_IDENTITY='bi', BLAT_PMS='bp', BLAT_RANK='br', BLAT_SCORE='bs', FIRST_IN_PAIR=64, LAST_IN_PAIR=128, MATE_REVERSE=32, MATE_UNMAPPED=8, MULTIMAP=1, RECOMPUTED_CIGAR='rc', REVERSE=16, SECONDARY=256, SUPPLEMENTARY=2048, TARGETED_ALIGNMENT='ta', UNMAPPED=4)¶ Enum-like. For readable PYSAM flag constants
MULTIMAP
: template having multiple segments in sequencingUNMAPPED
: segment unmappedMATE_UNMAPPED
: next segment in the template unmappedREVERSE
: SEQ being reverse complementedMATE_REVERSE
: SEQ of the next segment in the template being reverse complementedFIRST_IN_PAIR
: the first segment in the templateLAST_IN_PAIR
: the last segment in the templateSECONDARY
: secondary alignmentSUPPLEMENTARY
: supplementary alignment
note: descriptions are taken from the samfile documentation
Type: MavisNamespace
-
mavis.constants.
STRAND
= MavisNamespace(NEG='-', NS='?', POS='+', compare=<function <lambda>>, expand=<function <lambda>>)¶ holds controlled vocabulary for allowed strand values
POS
: the positive/forward strandNEG
: the negative/reverse strandNS
: strand is not specified
Type: MavisNamespace
-
mavis.constants.
SUBCOMMAND
= MavisNamespace(ANNOTATE='annotate', CLUSTER='cluster', CONFIG='config', CONVERT='convert', OVERLAY='overlay', PAIR='pairing', SCHEDULE='schedule', SETUP='setup', SUMMARY='summary', VALIDATE='validate')¶ holds controlled vocabulary for allowed pipeline stage values
- annotate
- cluster
- config
- convert
- pairing
- pipeline
- schedule
- summary
- validate
Type: MavisNamespace
-
mavis.constants.
SVTYPE
= MavisNamespace(DEL='deletion', DUP='duplication', INS='insertion', INV='inversion', ITRANS='inverted translocation', TRANS='translocation')¶ holds controlled vocabulary for acceptable structural variant classifications
DEL
: deletionTRANS
: translocationITRANS
: inverted translocationINV
: inversionINS
: insertionDUP
: duplication
Type: MavisNamespace
-
mavis.constants.
float_fraction
(num)[source]¶ cast input to a float
Parameters: num – input to cast Returns: float Raises: TypeError
– if the input cannot be cast to a float or the number is not between 0 and 1
-
mavis.constants.
reverse_complement
(s)[source]¶ wrapper for the Bio.Seq reverse_complement method
Parameters: s (str) – the input DNA sequence Returns: the reverse complement of the input sequence Return type: str
Warning
assumes the input is a DNA sequence
Example
>>> reverse_complement('ATCCGGT') 'ACCGGAT'
error module¶
illustrate subpackage¶
constants module¶
-
mavis.illustrate.constants.
DEFAULTS
= WeakMavisNamespace(breakpoint_color='#000000', domain_color='#ccccb3', domain_mismatch_color='#b2182b', domain_name_regex_filter='^PF\\d+$', domain_scaffold_color='#000000', drawing_width_iter_increase=500, exon_min_focus_size=10, gene1_color='#657e91', gene1_color_selected='#518dc5', gene2_color='#325556', gene2_color_selected='#4c9677', label_color='#000000', mask_fill='#ffffff', mask_opacity=0.7, max_drawing_retries=5, novel_exon_color='#5D3F6A', scaffold_color='#000000', splice_color='#000000', width=1000)¶
diagram module¶
This is the primary module responsible for generating svg visualizations
-
mavis.illustrate.diagram.
draw_multi_transcript_overlay
(config, gene, vmarkers=None, window_buffer=0, plots=None, log=<mavis.util.Log object>)[source]¶
-
mavis.illustrate.diagram.
draw_sv_summary_diagram
(config, ann, reference_genome=None, templates=None, ignore_absent_templates=True, user_friendly_labels=True, template_display_label_prefix='', draw_reference_transcripts=True, draw_reference_genes=True, draw_reference_templates=True, draw_fusion_transcript=True, stack_reference_transcripts=False)[source]¶ this is the main drawing function. It decides between layouts where each view-level is split into one or two diagrams (side-by-side) dependant on whether the event is interchromosomal, within a single transcript, etc.
- Diagrams have four levels
- template
- gene
- transcript
- fusion transcript/translation
Parameters: - ann (Annotation) – the annotation object to be illustrated
- reference_genome (dict of str by str) – reference sequences
- templates (list of Template) – list of templates, used in drawing the template-level view
- ignore_absent_templates (bool) – if true then will not raise an error if the template information is not given but will not draw the template instead
- show_template (bool) – if false the template-level view is not drawn
- user_friendly_labels (bool) – if True, genes are labelled by their aliases (where possible) and domains are labeled by their names (where possible)
- template_display_label_prefix (str) – the character to precede the template label
elements module¶
This is the primary module responsible for generating svg visualizations
-
mavis.illustrate.elements.
draw_breakpoint
(config, canvas, breakpoint, width, height, label='')[source]¶ Parameters: - canvas (svgwrite.drawing.Drawing) – the main svgwrite object used to create new svg elements
- breakpoint (Breakpoint) – the breakpoint to draw
- width (int) – the pixel width
- height (int) – the pixel height
Returns: the group element for the diagram
Return type: svgwrite.container.Group
-
mavis.illustrate.elements.
draw_exon
(config, canvas, exon, width, height, fill, label='', translation=None)[source]¶ generates the svg object representing an exon
Parameters: Returns: the group element for the diagram
Return type: svgwrite.container.Group
Todo
add markers for exons with abrogated splice sites
-
mavis.illustrate.elements.
draw_exon_track
(config, canvas, transcript, mapping, colors=None, genomic_min=None, genomic_max=None, translation=None)[source]¶
-
mavis.illustrate.elements.
draw_gene
(config, canvas, gene, width, height, fill, label='', reference_genome=None)[source]¶ generates the svg object representing a gene
Parameters: Returns: the group element for the diagram
Return type: svgwrite.container.Group
-
mavis.illustrate.elements.
draw_genes
(config, canvas, genes, target_width, breakpoints=None, colors=None, labels=None, plots=None, masks=None)[source]¶ draws the genes given in order of their start position trying to minimize the number of tracks required to avoid overlap
Parameters: - canvas (svgwrite.drawing.Drawing) – the main svgwrite object used to create new svg elements
- target_width (int) – the target width of the diagram
- genes (
list
ofGene
) – the list of genes to draw - breakpoints (
list
ofBreakpoint
) – the breakpoints to overlay - colors (
dict
ofGene
andstr
) – dictionary of the colors assigned to each Gene as fill
Returns: - the group element for the diagram.
Has the added parameters of labels, height, and mapping
Return type: svgwrite.container.Group
-
mavis.illustrate.elements.
draw_legend
(config, canvas, swatches, border=True)[source]¶ generates an svg group object representing the legend
-
mavis.illustrate.elements.
draw_template
(config, canvas, template, target_width, labels=None, colors=None, breakpoints=None)[source]¶ Creates the template/chromosome illustration
Returns: the group element for the diagram Return type: svgwrite.container.Group
-
mavis.illustrate.elements.
draw_transcript_with_translation
(config, canvas, translation, labels, colors, mapping, reference_genome=None, genomic_min=None, genomic_max=None)[source]¶
-
mavis.illustrate.elements.
draw_ustranscript
(config, canvas, pre_transcript, target_width=None, breakpoints=[], labels=<mavis.illustrate.util.LabelMapping object>, colors={}, mapping=None, reference_genome=None, masks=None)[source]¶ builds an svg group representing the transcript. Exons are drawn in a track with the splicing information and domains are drawn in separate tracks below
if there are multiple splicing variants then multiple exon tracks are drawn
Parameters: - canvas (svgwrite.drawing.Drawing) – the main svgwrite object used to create new svg elements
- target_width (int) – the target width of the diagram
- pre_transcript (Transcript) – the transcript being drawn
- exon_color (str) – the color being used for the fill of the exons
- utr_color (str) – the color for the fill of the UTR regions
- abrogated_splice_sites (
list
ofint
) – list of positions to ignore as splice sites - breakpoints (
list
ofBreakpoint
) – the breakpoints to overlay
Returns: - the group element for the transcript diagram
Has the added parameters of labels, height, and mapping
Return type: svgwrite.container.Group
-
mavis.illustrate.elements.
draw_vmarker
(config, canvas, marker, width, height, label='', color=None)[source]¶ Parameters: - canvas (svgwrite.drawing.Drawing) – the main svgwrite object used to create new svg elements
- breakpoint (Breakpoint) – the breakpoint to draw
- width (int) – the pixel width
- height (int) – the pixel height
Returns: the group element for the diagram
Return type: svgwrite.container.Group
scatter module¶
-
class
mavis.illustrate.scatter.
ScatterPlot
(points, y_axis_label, ymax=None, ymin=None, xmin=None, xmax=None, hmarkers=None, height=100, point_radius=2, title='', yticks=None, colors=None, density=1, ymax_color='#FF0000')[source]¶ Bases:
object
holds settings that will go into matplotlib after conversion using the mapping system
-
mavis.illustrate.scatter.
bam_to_scatter
(bam_file, chrom, start, end, density, strand=None, axis_name=None, ymax=None, min_mapping_quality=0, ymax_color='#FF0000')[source]¶ pull data from a bam file to set up a scatter plot of the pileup
Parameters: - bam_file (str) – path to the bam file
- chrom (str) – chromosome name
- start (int) – genomic start position for the plot
- end (int) – genomic end position for the plot
- bin_size (int) – number of genomic positions to group together and average to reduce data
- strand (STRAND) – expected strand
- axis_name (str) – axis name
- ymax (int) – maximum value to plot the y axis
- min_mapping_quality (int) – minimum mapping quality for reads to be considered in the plot
Returns: the scatter plot representing the bam pileup
Return type:
-
mavis.illustrate.scatter.
draw_scatter
(ds, canvas, plot, xmapping, log=<mavis.util.Log object>)[source]¶ given a xmapping, draw the scatter plot svg group
Parameters: - ds (DiagramSettings) – the settings/constants to use for building the svg
- canvas (svgwrite.canvas) – the svgwrite object used to create new svg elements
- plot (ScatterPlot) – the plot to be drawn
- xmapping (
dict
ofInterval
byInterval
) – dict used for conversion of coordinates in the xaxis to pixel positions
util module¶
-
class
mavis.illustrate.util.
Tag
(elementname, content='', **kwargs)[source]¶ Bases:
svgwrite.base.BaseElement
-
mavis.illustrate.util.
dynamic_label_color
(color)[source]¶ calculates the luminance of a color and determines if a black or white label will be more contrasting
interval module¶
-
class
mavis.interval.
Interval
(start, end=None, freq=1, number_type=None)[source]¶ Bases:
object
Parameters: -
__and__
(other)[source]¶ the intersection of two intervals
Example
>>> Interval(1, 10) & Interval(5, 50) Interval(5, 10) >>> Interval(1, 2) & Interval(10, 11) None
-
__len__
()[source]¶ the length of the interval
Example
>>> len(Interval(1, 11)) 12
Warning
only works for integer intervals
-
__or__
(other)[source]¶ the union of two intervals
Example
>>> Interval(1, 10) | Interval(5, 50) Interval(1, 50) >>> Interval(1, 2) | Interval(10, 11) Interval(1, 11)
-
__sub__
(other)[source]¶ the difference of two intervals
Example
>>> Interval(1, 10) - Interval(5, 50) [Interval(1, 4)] >>> Interval(1, 2) - Interval(10, 11) [Interval(1, 2)] >>> Interval(1, 2) - Interval(-1, 10) []
-
center
¶ the middle of the interval
Example
>>> Interval(1, 10).center 5.5 >>> Interval(1, 11).center 6
-
classmethod
convert_ratioed_pos
(mapping, pos, forward_to_reverse=None)[source]¶ convert any given position given a mapping of intervals to another range
Parameters: Returns: the position in the alternate coordinate system given the input mapping
Return type: Raises: AttributeError
– if the input position is outside the set of input segmentsIndexError
– if the input position cannot be converted to the output system
Example
>>> mapping = {(1, 10): (101, 110), (11, 20): (555, 564)} >>> Interval.convert_pos(mapping, 5) 5 >>> Interval.convert_pos(mapping, 15) 559
-
classmethod
dist
(first, other)[source]¶ returns the minimum distance between intervals
Example
>>> Interval.dist((1, 4), (5, 7)) -1 >>> Interval.dist((5, 7), (1, 4)) 1 >>> Interval.dist((5, 8), (7, 9)) 0
-
classmethod
intersection
(*intervals)[source]¶ returns None if there is no intersection
Example
>>> Interval.intersection((1, 10), (2, 8), (7, 15)) Interval(7, 8) >>> Interval.intersection((1, 2), (5, 9)) None
-
classmethod
min_nonoverlapping
(*intervals)[source]¶ for a list of intervals, orders them and merges any overlap to return a list of non-overlapping intervals O(nlogn)
Example
>>> Interval.min_nonoverlapping((1, 10), (7, 8), (6, 14), (17, 20)) [Interval(1, 14), Interval(17, 20)]
-
classmethod
overlaps
(first, other)[source]¶ checks if two intervals have any portion of their given ranges in common
Parameters: Example
>>> Interval.overlaps(Interval(1, 4), Interval(5, 7)) False >>> Interval.overlaps(Interval(1, 10), Interval(10, 11)) True >>> Interval.overlaps((1, 10), (10, 11)) True
-
-
class
mavis.interval.
IntervalMapping
(mapping=None, opposing=None)[source]¶ Bases:
object
mapping between coordinate systems using intervals. source intervals cannot overlap but no such assertion is enforced on the target intervals
-
convert_pos
(pos)[source]¶ convert any given position given a mapping of intervals to another range
Parameters: pos (int) – a position in the first coordinate system Returns: the position in the alternate coordinate system given the input mapping - int: if simplify is True - Interval: if simplify is False Raises: IndexError
– if the input position is not in any of the mapped intervalsExample
>>> mapping = IntervalMapping(mapping={(1, 10): (101, 110), (11, 20): (555, 564)}) >>> mapping.convert_pos(5) 5 >>> mapping.convert_pos(15) 559
-
convert_ratioed_pos
(pos)[source]¶ convert any given position given a mapping of intervals to another range
Parameters: pos (Interval) – a position in the first coordinate system Returns: the position in the alternate coordinate system given the input mapping - int: if simplify is True - Interval: if simplify is False Raises: IndexError
– if the input position is not in any of the mapped intervalsExample
>>> mapping = IntervalMapping(mapping={(1, 10): (101, 110), (11, 20): (555, 564)}) >>> mapping.convert_pos(5) 5 >>> mapping.convert_pos(15) 559
-
pairing subpackage¶
Sub-package Documentation¶
This is the package responsible for pairing/grouping calls between libraries. In many cases this will be where somatic vs germline is determined or genomic only vs expressed.
Output Files¶
expected name/suffix | file type/format | content |
---|---|---|
mavis_paired_*.tab |
text/tabbed | call information and pairing information using product id |
Algorithm Overview¶
pairwise comparison of breakpoint pairs between libraries
fail if orientations do not match
fail if template/chromosomes do not match
if the protocols are mixed
- pass if the fusion products match at the sequence level
- pass if the breakpoint predicted from the genome matches the transcriptome breakpoint
if the protocols are the same
- pass if the breakpoints are co-located
filter matches based on annotations
- if both breakpoints have the same gene annotation, they must also both have the same transcript annotation
modules
constants module¶
-
mavis.pairing.constants.
DEFAULTS
= WeakMavisNamespace(contig_call_distance=10, flanking_call_distance=50, input_call_distance=20, spanning_call_distance=20, split_call_distance=20)¶
main module¶
-
mavis.pairing.main.
main
(inputs, output, annotations, flanking_call_distance=50, split_call_distance=20, contig_call_distance=10, spanning_call_distance=20, start_time=1558720115, **kwargs)[source]¶ Parameters: - inputs (
List
ofstr
) – list of input files to read - output (str) – path to the output directory
- flanking_call_distance (int) – pairing distance for pairing with an event called by flanking read pair
- split_call_distance (int) – pairing distance for pairing with an event called by split read
- contig_call_distance (int) – pairing distance for pairing with an event called by contig or spanning read
- inputs (
pairing module¶
-
mavis.pairing.pairing.
equivalent
(event1, event2, distances=None)[source]¶ compares two events by breakpoint position to see if they are equivalent
-
mavis.pairing.pairing.
inferred_equivalent
(event1, event2, reference_transcripts, distances=None)[source]¶ comparison of events using product prediction and breakpoint prediction
-
mavis.pairing.pairing.
pair_by_distance
(calls, distances, log=<mavis.util.Log object>, against_self=False)[source]¶ for a set of input calls, pair by distance
-
mavis.pairing.pairing.
predict_transcriptome_breakpoint
(breakpoint, transcript)[source]¶ for a given genomic breakpoint and the target transcript. Predicts the possible transcriptomic breakpoints that would be expected based on the splicing model for abrogated splice sites
Parameters: - breakpoint (Breakpoint) – the genomic breakpoint
- transcript (PreTranscript) – the transcript
validate subpackage¶
Sub-package Documentation¶
The validation sub-package is responsible for pulling supporting reads from the bam file and re-calling events based on the evidence in a standard notation.
Types of Output Files¶
A variety of intermediate output files are given for the user. These can be used to “drill down” further into events and also for developers debugging when adding new features, etc.
expected name/suffix | file type/format | content |
---|---|---|
*.raw_evidence.bam |
bam | raw evidence |
*.contigs.bam |
bam | aligned contigs |
*.evidence.bed |
bed | evidence collection window regions |
*.validation-passed.bed |
bed | validated event positions |
*.validation-failed.tab |
text/tabbed | failed events |
*.validation-passed.tab |
text/tabbed | validated events |
*.contigs.fa |
fasta | assembled contigs |
*.contigs.blat_out.pslx |
pslx | results from blatting contigs |
*.igv.batch |
IGV batch file | igv batch file |
Algorithm Overview¶
(For each breakpoint pair)
- Calculate the window/region to read from the bam and collect evidence
- Store evidence (flanking read pair, half-mapped read, spanning read, split read, compatible flanking pairs) which match the expected event type and position
- Assemble a contig from the collected reads. see theory - assembling contigs
Generate a fasta file containing all the contig sequences
Align contigs to the reference genome (currently blat is used to perform this step)
Make the final event calls. Each level of calls consumes all supporting reads so they are not re-used in subsequent levels of calls.
(For each breakpoint pair)
- call by contig
- call by spanning read
- call by split read
- call by flanking read pair. see theory - calling breakpoints by flanking evidence
Output new calls, evidence, contigs, etc
modules
base module¶
-
class
mavis.validate.base.
Evidence
(break1, break2, bam_cache, reference_genome, read_length, stdev_fragment_size, median_fragment_size, stranded=False, opposing_strands=None, untemplated_seq=None, data={}, classification=None, **kwargs)[source]¶ Bases:
mavis.breakpoint.BreakpointPair
Parameters: - breakpoint_pair (BreakpointPair) – the breakpoint pair to collect evidence for
- bam_cache (BamCache) – the bam cache (and assc file) to collect evidence from
- reference_genome (
dict
ofBio.SeqRecord
bystr
) – dict of reference sequence by template/chr name - data (dict) – a dictionary of data to associate with the evidence object
- classification (SVTYPE) – the event type
- protocol (PROTOCOL) – genome or transcriptome
-
assemble_contig
(log=<mavis.util.Log object>)[source]¶ uses the split reads and the partners of the half mapped reads to create a contig representing the sequence across the breakpoints
if it is not strand specific then sequences are sorted alphanumerically and only the first of a pair is kept (paired by sequence)
-
collect_compatible_flanking_pair
(read, mate, compatible_type)[source]¶ checks if a given read meets the minimum quality criteria to be counted as evidence as stored as support for this event
Parameters: - read (pysam.AlignedSegment) – the read to add
- mate (pysam.AlignedSegment) – the mate
- compatible_type (SVTYPE) – the type we are collecting for
Returns: - True: the pair was collected and stored in the current evidence object
- False: the pair was not collected
Return type: Raises: ValueError
– if the input reads are not a valid pair
-
collect_flanking_pair
(read, mate)[source]¶ checks if a given read meets the minimum quality criteria to be counted as evidence as stored as support for this event
Parameters: - read (pysam.AlignedSegment) – the read to add
- mate (pysam.AlignedSegment) – the mate
Returns: - True: the pair was collected and stored in the current evidence object
- False: the pair was not collected
Return type: Raises: ValueError
– if the input reads are not a valid pair
-
collect_from_outer_window
()[source]¶ determines if evidence should be collected from the outer window (looking for flanking evidence) or should be limited to the inner window (split/spanning/contig only)
Returns: True or False Return type: bool
-
collect_half_mapped
(read, mate)[source]¶ Parameters: - read (pysam.AlignedSegment) – the read to add
- mate (pysam.AlignedSegment) – the unmapped mate
Returns: - True: the read was collected and stored in the current evidence object
- False: the read was not collected
Return type: Raises: AssertionError
– if the mate is not unmapped
-
collect_spanning_read
(read)[source]¶ spanning read: a read covering BOTH breakpoints
This is only applicable to small events. Do not need to look for soft clipped reads here since they will be collected already
Parameters: read (pysam.AlignedSegment) – the putative spanning read Returns: - True: the read was collected and stored in the current evidence object
- False: the read was not collected
Return type: bool
-
collect_split_read
(read, first_breakpoint)[source]¶ adds a split read if it passes the criteria filters and raises a warning if it does not
Parameters: - read (pysam.AlignedSegment) – the read to add
- first_breakpoint (bool) – add to the first breakpoint (or second if false)
Returns: - True: the read was collected and stored in the current evidence object
- False: the read was not collected
Return type: Raises: NotSpecifiedError
– if the breakpoint orientation is not specified
-
compatible_type
¶
-
compute_fragment_size
(read, mate)[source]¶ Parameters: - read (pysam.AlignedSegment) –
- mate (pysam.AlignedSegment) –
Returns: interval representing the range of possible fragment sizes for this read pair
Return type:
-
decide_sequenced_strand
(reads)[source]¶ given a set of reads, determines the sequenced strand (if possible) and then returns the majority strand found
Parameters: reads (set of pysam.AlignedSegment
) – set of readsReturns: the sequenced strand Return type: STRAND Raises: ValueError
– input was an empty set or the ratio was not sufficient to decide on a strand
-
flatten
()[source]¶ returns the key-value self for the breakpoint self information as can be written directly as a tab row
-
load_evidence
(log=<mavis.util.Log object>)[source]¶ open the associated bam file and read and store the evidence does some preliminary read-quality filtering
-
max_expected_fragment_size
¶
-
min_expected_fragment_size
¶
-
putative_event_types
()[source]¶ Returns: list of the possible classifications Return type: list of SVTYPE
call module¶
-
class
mavis.validate.call.
EventCall
(b1, b2, source_evidence, event_type, call_method, contig=None, contig_alignment=None, untemplated_seq=None)[source]¶ Bases:
mavis.breakpoint.BreakpointPair
class for holding evidence and the related calls since we can’t freeze the evidence object directly without a lot of copying. Instead we use call objects which are basically just a reference to the evidence object and decisions on class, exact breakpoints, etc
Parameters: - evidence (Evidence) – the evidence object we are calling based on
- event_type (SVTYPE) – the type of structural variant
- breakpoint_pair (BreakpointPair) – the breakpoint pair representing the exact breakpoints
- call_method (CALL_METHOD) – the way the breakpoints were called
- contig (Contig) – the contig used to call the breakpoints (if applicable)
-
add_break1_split_read
(read)[source]¶ Parameters: read (pysam.AlignedSegment) – putative split read supporting the first breakpoint
-
add_break2_split_read
(read)[source]¶ Parameters: read (pysam.AlignedSegment) – putative split read supporting the second breakpoint
-
add_flanking_support
(flanking_pairs, is_compatible=False)[source]¶ counts the flanking read-pair support for the event called. The original source evidence may have contained evidence for multiple events and uses a larger range so flanking pairs here are checked specifically against the current breakpoint call
Returns: Return type: tuple
-
add_spanning_read
(read)[source]¶ Parameters: read (pysam.AlignedSegment) – putative spanning read
-
static
characterize_repeat_region
(event, reference_genome)[source]¶ For a given event, determines the number of repeats the insertion/duplication/deletion is following. This is most useful in flagging homopolymer regions. Will raise a ValueError if the current event is not an expected type or is non-specific.
Returns: int
- the number of repeatsstr
- the repeat sequenceReturn type: tuple
-
complexity
()[source]¶ The sequence complexity for the call. If called by contig then the complexity of the contig sequence, otherwise an average of the sequence complexity of the support based on the call method
-
flanking_metrics
()[source]¶ computes the median and standard deviation of the flanking pairs. Note that standard deviation is calculated wrt the median and not the average. Also that the fragment size is calculated as a range so the start and end of the range are used in computing these metrics
Returns: float
- the median fragment sizefloat
- the fragment size standard deviation wrt the median
Return type: tuple
-
has_compatible
¶
-
is_supplementary
()[source]¶ check if the current event call was the target event given the source evidence object or an off-target call, i.e. something that was called as part of the original target. This is important b/c if the current event was not one of the original target it may not be fully investigated in other libraries
-
mavis.validate.call.
call_events
(source_evidence)[source]¶ generates a set of event calls based on the evidence associated with the source_evidence object will also narrow down the event type
Parameters: - source_evidence (Evidence) – the input evidence
- event_type (SVTYPE) – the type of event we are collecting evidence for
Returns: list of calls
Return type:
-
mavis.validate.call.
filter_consumed_pairs
(pairs, consumed_reads)[source]¶ given a set of read tuples, returns all tuples where neither read in the tuple is in the consumed set
Parameters: - pairs (set of tuples of
pysam.AlignedSegment
andpysam.AlignedSegment
) – pairs to be filtered - consumed_reads – (set of
pysam.AlignedSegment
): set of reads that have been used/consumed
Returns: set of filtered tuples
Return type: set of tuples of
pysam.AlignedSegment
andpysam.AlignedSegment
Note
this will work with any hash-able object
Example
>>> pairs = {(1, 2), (3, 4), (5, 6)} >>> consumed_reads = {1, 2, 4} >>> filter_consumed_pairs(pairs, consumed_reads) {(5, 6)}
- pairs (set of tuples of
constants module¶
-
mavis.validate.constants.
DEFAULTS
= WeakMavisNamespace(aligner='blat', assembly_kmer_size=0.74, assembly_max_paths=8, assembly_min_edge_trim_weight=3, assembly_min_exact_match_to_remap=15, assembly_min_remap_coverage=0.9, assembly_min_remapped_seq=3, assembly_min_uniq=0.1, assembly_strand_concordance=0.51, blat_limit_top_aln=10, blat_min_identity=0.9, call_error=10, clean_aligner_files=False, contig_aln_max_event_size=50, contig_aln_merge_inner_anchor=20, contig_aln_merge_outer_anchor=15, contig_aln_min_anchor_size=50, contig_aln_min_extend_overlap=10, contig_aln_min_query_consumption=0.9, contig_aln_min_score=0.9, fetch_min_bin_size=50, fetch_reads_bins=5, fetch_reads_limit=3000, filter_secondary_alignments=True, fuzzy_mismatch_number=1, max_sc_preceeding_anchor=6, min_anchor_exact=6, min_anchor_fuzzy=10, min_anchor_match=0.9, min_call_complexity=0.1, min_double_aligned_to_estimate_insertion_size=2, min_flanking_pairs_resolution=10, min_linking_split_reads=2, min_mapping_quality=5, min_non_target_aligned_split_reads=1, min_sample_size_to_apply_percentage=10, min_softclipping=6, min_spanning_reads_resolution=5, min_splits_reads_resolution=3, outer_window_min_event_size=125, stdev_count_abnormal=3.0, strand_determining_read=2, trans_fetch_reads_limit=12000, trans_min_mapping_quality=0, write_evidence_files=True)¶ - aligner
- assembly_kmer_size
- assembly_max_paths
- assembly_min_edge_trim_weight
- assembly_min_exact_match_to_remap
- assembly_min_remap_coverage
- assembly_min_remapped_seq
- assembly_min_uniq
- assembly_strand_concordance
- blat_limit_top_aln
- blat_min_identity
- call_error
- contig_aln_max_event_size
- contig_aln_merge_inner_anchor
- contig_aln_merge_outer_anchor
- contig_aln_min_anchor_size
- contig_aln_min_extend_overlap
- contig_aln_min_query_consumption
- contig_aln_min_score
- fetch_min_bin_size
- fetch_reads_bins
- fetch_reads_limit
- filter_secondary_alignments
- fuzzy_mismatch_number
- max_sc_preceeding_anchor
- min_anchor_exact
- min_anchor_fuzzy
- min_anchor_match
- min_double_aligned_to_estimate_insertion_size
- min_flanking_pairs_resolution
- min_linking_split_reads
- min_mapping_quality
- min_non_target_aligned_split_reads
- min_sample_size_to_apply_percentage
- min_softclipping
- min_spanning_reads_resolution
- min_splits_reads_resolution
- outer_window_min_event_size
- stdev_count_abnormal
- strand_determining_read
evidence module¶
-
class
mavis.validate.evidence.
GenomeEvidence
(*pos, **kwargs)[source]¶ Bases:
mavis.validate.base.Evidence
-
compute_fragment_size
(read, mate=None)[source]¶ Parameters: - read (pysam.AlignedSegment) –
- mate (pysam.AlignedSegment) –
Returns: interval representing the range of possible fragment sizes for this read pair
Return type:
-
generate_window
(breakpoint)[source]¶ given some input breakpoint uses the current evidence setting to determine an appropriate window/range of where one should search for supporting reads
Parameters: - breakpoint (Breakpoint) – the breakpoint we are generating the evidence window for
- read_length (int) – the read length
- call_error (int) – adds a buffer to the calculations if confidence in the breakpoint calls is low can increase this
Returns: the range where reads should be read from the bam looking for evidence for this event
Return type:
-
-
class
mavis.validate.evidence.
TranscriptomeEvidence
(annotations, *pos, **kwargs)[source]¶ Bases:
mavis.validate.base.Evidence
-
compute_fragment_size
(read, mate)[source]¶ Parameters: - read (pysam.AlignedSegment) –
- mate (pysam.AlignedSegment) –
Returns: interval representing the range of possible fragment sizes for this read pair
Return type:
-
distance
(start, end, strand='?', chrom=None)[source]¶ give the current list of transcripts, computes the putative exonic/intergenic distance given two genomic positions. Intronic positions are ignored
Intergenic calculations are only done if exonic only fails
-
exon_boundary_shift_cigar
(read)[source]¶ given an input read, converts deletions to N when the deletion matches the exon boundaries. Also shifts alignments to correspond to the exon boundaries where possible
-
generate_window
(breakpoint)[source]¶ given some input breakpoint uses the current evidence setting to determine an appropriate window/range of where one should search for supporting reads
Parameters: - breakpoint (Breakpoint) – the breakpoint we are generating the evidence window for
- annotations (dict of str and list of Gene) – the set of reference annotations: genes, transcripts, etc
- read_length (int) – the read length
- median_fragment_size (int) – the median insert size
- call_error (int) – adds a buffer to the calculations if confidence in the breakpoint calls is low can increase this
- stdev_fragment_size – the standard deviation away from the median for regular (non STV) read pairs
Returns: the range where reads should be read from the bam looking for evidence for this event
Return type:
-
main module¶
-
mavis.validate.main.
main
(inputs, output, bam_file, strand_specific, library, protocol, median_fragment_size, stdev_fragment_size, read_length, reference_genome, annotations, masking, aligner_reference, start_time=1558720115, **kwargs)[source]¶ Parameters: - inputs (list) – list of input files containing the breakpoint pairs
- output (str) – path to the output directory
- bam_file (str) – path the bam file
- strand_specific (bool) – flag to indicate the input bam is using a strand specific protocol
- median_fragment_size (int) – the median fragment size
- stdev_fragment_size (int) – the standard deviation in fragment size
- read_length (int) – read length
- reference_genome (
ReferenceFile
) – seeload_reference_genome()
- annotations (
ReferenceFile
) – seeload_reference_genes()
- masking (
ReferenceFile
) – seeload_masking_regions()
- aligner_reference (
ReferenceFile
) – path to the aligner reference file (e.g 2bit file for blat)
schedule subpackage¶
modules
constants module¶
-
mavis.schedule.constants.
MAIL_TYPE
= MavisNamespace(ALL='ALL', BEGIN='BEGIN', END='END', FAIL='FAIL', NONE='NONE')¶ When the scheduler should notify mail_user about a job
ALL
- All other options (except none)BEGIN
- Send an email when the job startsEND
- Send an email when the job has terminatedFAIL
- Send an email if the job failsNONE
- Do not send email
-
mavis.schedule.constants.
OPTIONS
= WeakMavisNamespace(annotation_memory=12000, concurrency_limit=None, import_env=True, mail_type='NONE', mail_user='', memory_limit=16000, queue='', remote_head_ssh='', scheduler='SLURM', time_limit=57600, trans_validation_memory=18000, validation_memory=16000)¶ submission options
- annotation_memory
- concurrency_limit
- import_env
- mail_type
- mail_user
- memory_limit
- queue
- remote_head_ssh
- scheduler
- time_limit
- trans_validation_memory
- validation_memory
Type: MavisNamespace
-
mavis.schedule.constants.
SCHEDULER
= MavisNamespace(LOCAL='LOCAL', SGE='SGE', SLURM='SLURM', TORQUE='TORQUE')¶ scheduler types
Type: MavisNamespace
job module¶
-
class
mavis.schedule.job.
ArrayJob
(stage, task_list, **kwargs)[source]¶ Bases:
mavis.schedule.job.Job
Class for dealing with array jobs. Jobs with many tasks
Parameters: task_list ( list
orint
) – the ids of tasks in the job array-
logfile
(task_ident)[source]¶ returns the path to the logfile with job name and job id substituted into the stdout pattern
-
tasks
¶
-
-
class
mavis.schedule.job.
Job
(stage, output_dir, stdout=None, job_ident=None, name=None, dependencies=None, script=None, created_at=None, status='NOT SUBMITTED', status_comment='', **options)[source]¶ Bases:
object
Parameters: - stage (str) – the mavis pipleine stage this job belongs to
- job_ident (int) – the job number/id according to the scheduler being used
- output_dir (str) – path to the output directory where logs/stamps for this job will be written
- name (str) – the job name according to the scheduler being used
- dependencies (list of Job) – list of jobs which must complete for this job to run
- stdout (str) – basename of the file to write std output to
- script (str) – path to the script which contains the commands for the job
- created_at (int) – the time stamp for when the job was created (created != submitted)
- status (JOB_STATUS) – The current (since last checked) status of the job
- status_comment (str) – the comment which describes the status, generally this is used for reporting errors from the log file or failed dependencies (SLURM)
- options (**dict) – override default options specified by OPTIONS
-
display_name
¶ Used for identifying this job in an ini config file
-
class
mavis.schedule.job.
LogFile
(filename, status, message=None)[source]¶ Bases:
object
stores information about the log status
Parameters: -
STATUS
= MavisNamespace(COMPLETE='COMPLETE', CRASH='CRASH', EMPTY='EMPTY', INCOMPLETE='INCOMPLETE')¶ The status of the job based on parsing of the logfile
Type: MavisNamespace
-
-
class
mavis.schedule.job.
TorqueArrayJob
(stage, task_list, **kwargs)[source]¶ Bases:
mavis.schedule.job.ArrayJob
Parameters: task_list ( list
orint
) – the ids of tasks in the job array
local module¶
-
class
mavis.schedule.local.
LocalJob
(args, func, rank=None, response=None, *pos, **kwargs)[source]¶ Bases:
mavis.schedule.job.Job
Parameters:
-
class
mavis.schedule.local.
LocalScheduler
(*pos, **kwargs)[source]¶ Bases:
mavis.schedule.scheduler.Scheduler
Scheduler class for dealing with running mavis locally
pipeline module¶
-
class
mavis.schedule.pipeline.
Pipeline
(output_dir, scheduler, validations=None, annotations=None, pairing=None, summary=None, checker=None, batch_id='batch-9MDHZiENYtgbRM68QdRtXG')[source]¶ Bases:
object
Parameters: - output_dir (str) – path to main output directory for all mavis pipeline results
- scheduler (Scheduler) – the class for interacting with a job scheduler
- validations (
list
ofJob
) – list of validation jobs - annotations (
list
ofJob
) – list of annotation jobs - pairing (Job) – pairing job
- summary (Job) – summary job
- batch_id (str) – the batch id for this pipeline run. Used in avoinfing job name conflicts
-
ERROR_STATES
= {'CANCELLED', 'ERROR', 'FAILED', 'NOT SUBMITTED', 'UNKNOWN'}¶
-
classmethod
build
(config)[source]¶ Parameters: config (MavisConfig) – the main configuration. Note this is the config after all reference inputs have been loaded Returns: the pipeline instance with job dependencies information etc. Return type: Pipeline
-
check_status
(submit=False, resubmit=False, log=<mavis.util.Log object>)[source]¶ Check all jobs for completetion. Report any failures, etc.
Parameters: submit (bool) – submit any pending jobs
-
classmethod
read_build_file
(filepath)[source]¶ read the configuration file which stored the build information concerning jobs and dependencies
Parameters: filepath (str) – path to the input config file
-
mavis.schedule.pipeline.
annotate_args
(config, libconf)[source]¶ Pull arguments from the main config and library specific config to pass to annotate
Parameters: - config (MavisConfig) – the main program config
- libconf (LibraryConfig) – library specific configuration
-
mavis.schedule.pipeline.
cluster_args
(config, libconf)[source]¶ Pull arguments from the main config and library specific config to pass to cluster
Parameters: - config (MavisConfig) – the main program config
- libconf (LibraryConfig) – library specific configuration
-
mavis.schedule.pipeline.
parse_run_time
(filename)[source]¶ parses the run time listed at the end of a file following mavis conventions
-
mavis.schedule.pipeline.
run_conversion
(config, libconf, conversion_dir, assume_no_untemplated=True)[source]¶ Converts files if not already converted. Returns a list of filenames
-
mavis.schedule.pipeline.
stringify_args_to_command
(args)[source]¶ takes a list of arguments and prepares them for writing to a bash script
-
mavis.schedule.pipeline.
summary_args
(config)[source]¶ Pull arguments from the main config and library specific config to pass to summary
Parameters: - config (MavisConfig) – the main program config
- libconf (LibraryConfig) – library specific configuration
-
mavis.schedule.pipeline.
validate_args
(config, libconf)[source]¶ Pull arguments from the main config and library specific config to pass to validate
Parameters: - config (MavisConfig) – the main program config
- libconf (LibraryConfig) – library specific configuration
scheduler module¶
-
class
mavis.schedule.scheduler.
Scheduler
(concurrency_limit=None, remote_head_ssh='')[source]¶ Bases:
object
Class responsible for methods interacting with the scheduler
Parameters: concurrency_limit (int) – the maximum allowed concurrent processes. Defaults to one less than the total number available -
ENV_JOB_IDENT
= '{JOB_IDENT}'¶ the expected pattern of environment variables which store the job id
Type: str
-
ENV_TASK_IDENT
= '{TASK_IDENT}'¶ the expected pattern of environment variables which store the task id
Type: str
-
HEADER_PREFIX
= '#'¶
-
command
(command, shell=False)[source]¶ Wrapper to deal with subprocess commands. If configured and not on the head node currently, will send the command through ssh
Parameters: command (list or str) – the command can be a list or a string and is passed to the subprocess to be run Returns: the content returns from stdout of the subprocess Return type: str
-
-
class
mavis.schedule.scheduler.
SgeScheduler
(concurrency_limit=None, remote_head_ssh='')[source]¶ Bases:
mavis.schedule.scheduler.Scheduler
Class for managing interactions with the SGE scheduler
Parameters: concurrency_limit (int) – the maximum allowed concurrent processes. Defaults to one less than the total number available -
ENV_ARRAY_IDENT
= 'JOB_ID'¶
-
ENV_JOB_IDENT
= 'JOB_ID'¶ expected pattern for environment variables which store the job id
Type: str
-
ENV_JOB_NAME
= 'JOB_NAME'¶ expected pattern for environment variables which store the job name
Type: str
-
ENV_TASK_IDENT
= 'SGE_TASK_ID'¶
-
HEADER_PREFIX
= '#$'¶
-
MAIL_TYPE_MAPPING
= {'ALL': 'abes', 'BEGIN': 'b', 'END': 'e', 'FAIL': 'as', 'NONE': 'n'}¶ mapping from MAVIS mail type options to SGE mail options
Type: dict
-
STATE_MAPPING
= {'E': 'ERROR', 'R': 'RUNNING', 'T': 'ERROR', 'd': 'CANCELLED', 'h': 'PENDING', 'q': 'PENDING', 'r': 'RUNNING', 's': 'ERROR', 't': 'RUNNING', 'w': 'PENDING'}¶ mapping from SGE job states to their MAVIS JOB_STATUS equivalent
Type: dict
-
classmethod
parse_qacct
(content)[source]¶ parses the information produced by qacct
Parameters: content (str) – the content returned from the qacct command - Raises
- ValueError: when no job information is reported (this may happen due to a bad or too old job ID where information is no longer stored)
-
classmethod
parse_qstat
(content, job_id)[source]¶ parses the qstat content into rows/dicts representing individual jobs
Parameters: content (str) – content returned from the qstat command
-
-
class
mavis.schedule.scheduler.
SlurmScheduler
(concurrency_limit=None, remote_head_ssh='')[source]¶ Bases:
mavis.schedule.scheduler.Scheduler
Class for formatting commands to match a SLURM scheduler system SLURM docs can be found here https://slurm.schedmd.com
Parameters: concurrency_limit (int) – the maximum allowed concurrent processes. Defaults to one less than the total number available -
ENV_ARRAY_IDENT
= 'SLURM_ARRAY_JOB_ID'¶
-
ENV_JOB_IDENT
= 'SLURM_JOB_ID'¶
-
ENV_TASK_IDENT
= 'SLURM_ARRAY_TASK_ID'¶
-
format_dependencies
(job)[source]¶ returns a string representing the dependency argument
Parameters: job (Job) – the job the argument is being built for
-
classmethod
parse_sacct
(content)[source]¶ parses content returned from the sacct command
Parameters: content (str) – the content returned from the sacct command
-
classmethod
parse_scontrol_show
(content)[source]¶ parse the content from the command: scontrol show job <JOBID>
Parameters: content (str) – the content to be parsed
-
-
class
mavis.schedule.scheduler.
TorqueScheduler
(concurrency_limit=None, remote_head_ssh='')[source]¶ Bases:
mavis.schedule.scheduler.SgeScheduler
Class for managing interactions with the Torque scheduler
Parameters: concurrency_limit (int) – the maximum allowed concurrent processes. Defaults to one less than the total number available -
ENV_ARRAY_IDENT
= 'PBS_JOBID'¶
-
ENV_JOB_IDENT
= 'PBS_JOBID'¶ expected pattern for environment variables which store the job id
Type: str
-
ENV_JOB_NAME
= 'PBS_JOBNAME'¶ expected pattern for environment variables which store the job name
Type: str
-
ENV_TASK_IDENT
= 'PBS_ARRAYID'¶
-
MAIL_TYPE_MAPPING
= {'ALL': 'abef', 'BEGIN': 'b', 'END': 'e', 'FAIL': 'fa', 'NONE': 'p'}¶ mapping from MAVIS mail type options to Torque mail options
Type: dict
-
STATE_MAPPING
= {'C': 'COMPLETED', 'E': 'RUNNING', 'H': 'PENDING', 'Q': 'PENDING', 'R': 'RUNNING', 'S': 'ERROR', 'T': 'RUNNING', 'W': 'PENDING'}¶ mapping from Torque job states to their MAVIS JOB_STATUS equivalent
Type: dict
-
TAB_SIZE
= 8¶
-
summary subpackage¶
Sub-package Documentation¶
This is the package responsible for summarizing the calls between libraries. In many cases this will be where somatic vs germline is determined or genomic only vs expressed.
Output Files¶
expected name/suffix | file type/format | content |
---|---|---|
mavis_summary_*.tab |
text/tabbed | ? |
Algorithm Overview¶
TODO
modules
main module¶
-
mavis.summary.main.
main
(inputs, output, annotations, dgv_annotation=None, filter_cdna_synon=True, filter_protein_synon=False, filter_min_remapped_reads=5, filter_min_spanning_reads=5, filter_min_flanking_reads=10, filter_min_split_reads=5, filter_trans_homopolymers=True, filter_min_linking_split_reads=1, filter_min_complexity=0.2, flanking_call_distance=50, split_call_distance=20, contig_call_distance=10, spanning_call_distance=20, start_time=1558720115, **kwargs)[source]¶
summary module¶
-
mavis.summary.summary.
annotate_dgv
(bpps, dgv_regions_by_reference_name, distance=0)[source]¶ given a list of bpps and a dgv reference, annotate the events that are within the set distance of both breakpoints
Parameters:
-
mavis.summary.summary.
filter_by_annotations
(bpp_list, best_transcripts)[source]¶ Parameters: - bpp_list (list of BreakpointPair) – list of pairs to filter
- ( (best_transcripts) – class dict of
Transcript
bystr
): the best transcripts of the annotations based on their names
-
mavis.summary.summary.
filter_by_call_method
(bpp_list)[source]¶ Filters a set of breakpoint pairs to returns the call with the most evidence. Prefers contig evidence over spanning over split over flanking, etc.
-
mavis.summary.summary.
filter_by_evidence
(bpps, filter_min_remapped_reads=5, filter_min_spanning_reads=5, filter_min_flanking_reads=10, filter_min_split_reads=5, filter_min_linking_split_reads=1)[source]¶
-
mavis.summary.summary.
get_pairing_state
(current_protocol, current_disease_state, other_protocol, other_disease_state, is_matched=False, inferred_is_matched=False)[source]¶ given two libraries, returns the appropriate descriptor for their matched state
Parameters: - current_protocol (PROTOCOL) – the protocol of the current library
- current_disease_state (DISEASE_STATUS) – the disease status of the current library
- other_protocol (PROTOCOL) – protocol of the library being comparing to
- other_disease_state (DISEASE_STATUS) – disease status of the library being compared to
- is_matched (bool) – True if the libraries are paired
Returns: descriptor of the pairing of the two libraries
Return type: (PAIRING_STATE)
tools module¶
-
mavis.tools.
SUPPORTED_TOOL
= MavisNamespace(BREAKDANCER='breakdancer', BREAKSEQ='breakseq', CHIMERASCAN='chimerascan', CNVNATOR='cnvnator', DEFUSE='defuse', DELLY='delly', MANTA='manta', MAVIS='mavis', PINDEL='pindel', STARFUSION='starfusion', STRELKA='strelka', TA='transabyss', VCF='vcf')¶ Supported Tools used to call SVs and then used as input into MAVIS
- chimerascan [Iyer-2011]
- defuse [McPherson-2011]
- delly [Rausch-2012]
- manta [Chen-2016]
- pindel [Ye-2009]
- transabyss [Robertson-2010]
util module¶
-
class
mavis.util.
Log
(indent_str=' ', indent_level=0, level=20)[source]¶ Bases:
object
wrapper aroung the builtin logging to make it more readable
-
mavis.util.
bash_expands
(*expressions)[source]¶ expand a file glob expression, allowing bash-style brackets.
Returns: a list of files Return type: list Example
>>> bash_expands('./{test,doc}/*py') [...]
-
mavis.util.
cast
(value, cast_func)[source]¶ cast a value to a given type
Example
>>> cast('1', int) 1
-
mavis.util.
filter_on_overlap
(bpps, regions_by_reference_name)[source]¶ filter a set of breakpoint pairs based on overlap with a set of genomic regions
Parameters: - bpps (
list
ofBreakpointPair
) – list of breakpoint pairs to be filtered - regions_by_reference_name (
dict
oflist
ofBioInterval
bystr
) – regions to filter against
- bpps (
-
mavis.util.
generate_complete_stamp
(output_dir, log=<mavis.util.Log object>, prefix='MAVIS.', start_time=None)[source]¶ writes a complete stamp, optionally including the run time if start_time is given
Parameters: Returns: path to the complete stamp
Return type: Example
>>> generate_complete_stamp('some_output_dir') 'some_output_dir/MAVIS.COMPLETE'
-
mavis.util.
get_connected_components
(adj_matrix)[source]¶ for a dictionary representing an adjacency matrix of undirected edges returns the connected components
-
mavis.util.
get_env_variable
(arg, default, cast_type=None)[source]¶ Parameters: arg (str) – the argument/variable name Returns: the setting from the environment variable if given, otherwise the default value
-
mavis.util.
log_arguments
(args)[source]¶ output the arguments to the console
Parameters: args (Namespace) – the namespace to print arguments for
-
mavis.util.
mkdirp
(dirname)[source]¶ Make a directory or path of directories. Suppresses the error that is normally raised when the directory already exists
-
mavis.util.
read_bpp_from_input_file
(filename, expand_orient=False, expand_strand=False, expand_svtype=False, **kwargs)[source]¶ reads a file using the tab module. Each row is converted to a breakpoint pair and other column data is stored in the data attribute
Parameters: Returns: a list of pairs
Return type: Example
>>> read_bpp_from_input_file('filename') [BreakpointPair(), BreakpointPair(), ...]
One can also validate other expected columns that will go in the data attribute using the usual arguments to the tab.read_file function
>>> read_bpp_from_input_file('filename', cast={'index': int}) [BreakpointPair(), BreakpointPair(), ...]
Glossary¶
General Terms¶
- 2bit
- File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format7.
- BAM
- File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format5.1.
- bed
- File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format1.
- blat
- Alignment tool. see https://genome.ucsc.edu/FAQ/FAQblat.html#blat3 for instructions on download and install.
- BreakDancer
- BreakDancer is an SV caller. Source for BreakDancer can be found here [Chen-2009]
- breakpoint
- A breakpoint is a genomic position (interval) on some reference/template/chromosome which has a strand and orientation. The orientation describes the portion of the reference that is retained.
- breakpoint pair
- Basic definition of a structural variant. Does not automatically imply a classification/type.
- BreakSeq
- BreakSeq is an SV caller. Source for BreakSeq can be found here [Abyzov-2015]
- BWA
- BWA is an alignement tool. See https://github.com/lh3/bwa for install instructions.
- Chimerascan
- Chimerascan is an SV caller. Source for Chimerascan can be found here [Iyer-2011]
- CNVnator
- CNVnator is an SV caller. Source for CNVnator can be found here [Abyzov-2011]
- DeFuse
- DeFuse is an SV caller. Source for DeFuse can be found here [McPherson-2011]
- DELLY
- DELLY is an SV caller. Source for DELLY can be found here [Rausch-2012]
- event
- Used interchangeably with structural variant.
- event type
- Classification for a structural variant. see event_type.
- fasta
- File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format18.
- flanking read pair
- A pair of reads where one read maps to one side of a set of breakpoints and its mate maps to the other.
- half-mapped read
- A read whose mate is unaligned. Generally this refers to reads in the evidence stage that are mapped next to a breakpoint.
- HGVS
- Community based standard of reccommendations for variant notation. See http://varnomen.hgvs.org/
- IGV
- Integrative Genomics Viewer is a visualization tool. see http://software.broadinstitute.org/software/igv.
- IGV batch file
- This is a file format type defined by IGV see running IGV with a batch file.
- JSON
- JSON (JavaScript Object Notation) is a data file format. see https://www.w3schools.com/js/js_json_intro.asp.
- Manta
- Manta is an SV caller. Source for Manta can be found here [Chen-2016]
- Pindel
- Pindel is an SV caller. Source for Pindel can be found here [Ye-2009]
- psl
- File format specification. See https://genome.ucsc.edu/FAQ/FAQformat#format2.
- pslx
- Extended format of a psl.
- SGE
- Sun Grid Engine (SGE) is a job scheduling system for cluster management see http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html.
- SLURM
- SLURM is a job scheduling system for cluster management see https://slurm.schedmd.com/archive/slurm-17.02.1.
- spanning read
- Applies primarily to small structural variants. Reads which span both breakpoints.
- split read
- A read which aligns next to a breakpoint and is softclipped at one or more sides.
- STAR-Fusion
- STAR-Fusion is an SV caller. Source for STAR-Fusion can be found here [Haas-2017]
- Strelka
- Strelka is an SNV and small indel caller. Only the small indels can be processed, since SNVs are not currently suported. Source for Strelka can be found here [Saunders-2012]
- structural variant
- A genomic alteration that can be described by a pair of breakpoints and an event type. The two breakpoints represent regions in the genome that are broken apart and reattached together.
- SV
- Structural Variant
- SVG
- SVG (Scalable vector graph) is an image format. see https://www.w3schools.com/graphics/svg_intro.asp.
- TORQUE
- TORQUE is a job scheduling system for cluster management see http://www.adaptivecomputing.com/products/open-source/torque/.
- Trans-ABySS
- Trans-ABySS is an SV caller. Source for Trans-ABySS can be found here [Robertson-2010]
Configurable Settings¶
- aligner
SUPPORTED_ALIGNER
- The aligner to use to map the contigs/reads back to the reference e.g blat or bwa. The corresponding environment variable isMAVIS_ALIGNER
and the default value is'blat'
. Accepted values include:'bwa mem'
,'blat'
- aligner_reference
filepath
- Path to the aligner reference file used for aligning the contig sequences. The corresponding environment variable isMAVIS_ALIGNER_REFERENCE
and the default value isNone
- annotation_filters
str
- A comma separated list of filters to apply to putative annotations. The corresponding environment variable isMAVIS_ANNOTATION_FILTERS
and the default value is'choose_more_annotated,choose_transcripts_by_priority'
- annotation_memory
int
- Default memory limit (mb) for the annotation stage. The corresponding environment variable isMAVIS_ANNOTATION_MEMORY
and the default value is12000
- annotations
filepath
- Path to the reference annotations of genes, transcript, exons, domains, etc. The corresponding environment variable isMAVIS_ANNOTATIONS
and the default value is[]
- assembly_kmer_size
float_fraction
- The percent of the read length to make kmers for assembly. The corresponding environment variable isMAVIS_ASSEMBLY_KMER_SIZE
and the default value is0.74
- assembly_max_paths
int
- The maximum number of paths to resolve. this is used to limit when there is a messy assembly graph to resolve. the assembly will pre-calculate the number of paths (or putative assemblies) and stop if it is greater than the given setting. The corresponding environment variable isMAVIS_ASSEMBLY_MAX_PATHS
and the default value is8
- assembly_min_edge_trim_weight
int
- This is used to simplify the debruijn graph before path finding. edges with less than this frequency will be discarded if they are non-cutting, at a fork, or the end of a path. The corresponding environment variable isMAVIS_ASSEMBLY_MIN_EDGE_TRIM_WEIGHT
and the default value is3
- assembly_min_exact_match_to_remap
int
- The minimum length of exact matches to initiate remapping a read to a contig. The corresponding environment variable isMAVIS_ASSEMBLY_MIN_EXACT_MATCH_TO_REMAP
and the default value is15
- assembly_min_remap_coverage
float_fraction
- Minimum fraction of the contig sequence which the remapped sequences must align over. The corresponding environment variable isMAVIS_ASSEMBLY_MIN_REMAP_COVERAGE
and the default value is0.9
- assembly_min_remapped_seq
int
- The minimum input sequences that must remap for an assembled contig to be used. The corresponding environment variable isMAVIS_ASSEMBLY_MIN_REMAPPED_SEQ
and the default value is3
- assembly_min_uniq
float_fraction
- Minimum percent uniq required to keep separate assembled contigs. if contigs are more similar then the lower scoring, then shorter, contig is dropped. The corresponding environment variable isMAVIS_ASSEMBLY_MIN_UNIQ
and the default value is0.1
- assembly_strand_concordance
float_fraction
- When the number of remapped reads from each strand are compared, the ratio must be above this number to decide on the strand. The corresponding environment variable isMAVIS_ASSEMBLY_STRAND_CONCORDANCE
and the default value is0.51
- blat_limit_top_aln
int
- Number of results to return from blat (ranking based on score). The corresponding environment variable isMAVIS_BLAT_LIMIT_TOP_ALN
and the default value is10
- blat_min_identity
float_fraction
- The minimum percent identity match required for blat results when aligning contigs. The corresponding environment variable isMAVIS_BLAT_MIN_IDENTITY
and the default value is0.9
- breakpoint_color
str
- Breakpoint outline color. The corresponding environment variable isMAVIS_BREAKPOINT_COLOR
and the default value is'#000000'
- call_error
int
- Buffer zone for the evidence window. The corresponding environment variable isMAVIS_CALL_ERROR
and the default value is10
- clean_aligner_files
bool
- Remove the aligner output files after the validation stage is complete. not required for subsequent steps but can be useful in debugging and deep investigation of events. The corresponding environment variable isMAVIS_CLEAN_ALIGNER_FILES
and the default value isFalse
- cluster_initial_size_limit
int
- The maximum cumulative size of both breakpoints for breakpoint pairs to be used in the initial clustering phase (combining based on overlap). The corresponding environment variable isMAVIS_CLUSTER_INITIAL_SIZE_LIMIT
and the default value is25
- cluster_radius
int
- Maximum distance allowed between paired breakpoint pairs. The corresponding environment variable isMAVIS_CLUSTER_RADIUS
and the default value is100
- concurrency_limit
int
- The concurrency limit for tasks in any given job array or the number of concurrent processes allowed for a local run. The corresponding environment variable isMAVIS_CONCURRENCY_LIMIT
and the default value isNone
- contig_aln_max_event_size
int
- Relates to determining breakpoints when pairing contig alignments. for any given read in a putative pair the soft clipping is extended to include any events of greater than this size. the softclipping is added to the side of the alignment as indicated by the breakpoint we are assigning pairs to. The corresponding environment variable isMAVIS_CONTIG_ALN_MAX_EVENT_SIZE
and the default value is50
- contig_aln_merge_inner_anchor
int
- The minimum number of consecutive exact match base pairs to not merge events within a contig alignment. The corresponding environment variable isMAVIS_CONTIG_ALN_MERGE_INNER_ANCHOR
and the default value is20
- contig_aln_merge_outer_anchor
int
- Minimum consecutively aligned exact matches to anchor an end for merging internal events. The corresponding environment variable isMAVIS_CONTIG_ALN_MERGE_OUTER_ANCHOR
and the default value is15
- contig_aln_min_anchor_size
int
- The minimum number of aligned bases for a contig (m or =) in order to simplify. do not have to be consecutive. The corresponding environment variable isMAVIS_CONTIG_ALN_MIN_ANCHOR_SIZE
and the default value is50
- contig_aln_min_extend_overlap
int
- Minimum number of bases the query coverage interval must be extended by in order to pair alignments as a single split alignment. The corresponding environment variable isMAVIS_CONTIG_ALN_MIN_EXTEND_OVERLAP
and the default value is10
- contig_aln_min_query_consumption
float_fraction
- Minimum fraction of the original query sequence that must be used by the read(s) of the alignment. The corresponding environment variable isMAVIS_CONTIG_ALN_MIN_QUERY_CONSUMPTION
and the default value is0.9
- contig_aln_min_score
float_fraction
- Minimum score for a contig to be used as evidence in a call by contig. The corresponding environment variable isMAVIS_CONTIG_ALN_MIN_SCORE
and the default value is0.9
- contig_call_distance
int
- The maximum distance allowed between breakpoint pairs (called by contig) in order for them to pair. The corresponding environment variable isMAVIS_CONTIG_CALL_DISTANCE
and the default value is10
- dgv_annotation
filepath
- Path to the dgv reference processed to look like the cytoband file. The corresponding environment variable isMAVIS_DGV_ANNOTATION
and the default value is[]
- domain_color
str
- Domain fill color. The corresponding environment variable isMAVIS_DOMAIN_COLOR
and the default value is'#ccccb3'
- domain_mismatch_color
str
- Domain fill color on 0%% match. The corresponding environment variable isMAVIS_DOMAIN_MISMATCH_COLOR
and the default value is'#b2182b'
- domain_name_regex_filter
str
- The regular expression used to select domains to be displayed (filtered by name). The corresponding environment variable isMAVIS_DOMAIN_NAME_REGEX_FILTER
and the default value is'^PF\\d+$'
- domain_scaffold_color
str
- The color of the domain scaffold. The corresponding environment variable isMAVIS_DOMAIN_SCAFFOLD_COLOR
and the default value is'#000000'
- draw_fusions_only
bool
- Flag to indicate if events which do not produce a fusion transcript should produce illustrations. The corresponding environment variable isMAVIS_DRAW_FUSIONS_ONLY
and the default value isTrue
- draw_non_synonymous_cdna_only
bool
- Flag to indicate if events which are synonymous at the cdna level should produce illustrations. The corresponding environment variable isMAVIS_DRAW_NON_SYNONYMOUS_CDNA_ONLY
and the default value isTrue
- drawing_width_iter_increase
int
- The amount (in pixels) by which to increase the drawing width upon failure to fit. The corresponding environment variable isMAVIS_DRAWING_WIDTH_ITER_INCREASE
and the default value is500
- exon_min_focus_size
int
- Minimum size of an exon for it to be granted a label or min exon width. The corresponding environment variable isMAVIS_EXON_MIN_FOCUS_SIZE
and the default value is10
- fetch_min_bin_size
int
- The minimum size of any bin for reading from a bam file. increasing this number will result in smaller bins being merged or less bins being created (depending on the fetch method). The corresponding environment variable isMAVIS_FETCH_MIN_BIN_SIZE
and the default value is50
- fetch_reads_bins
int
- Number of bins to split an evidence window into to ensure more even sampling of high coverage regions. The corresponding environment variable isMAVIS_FETCH_READS_BINS
and the default value is5
- fetch_reads_limit
int
- Maximum number of reads, cap, to loop over for any given evidence window. The corresponding environment variable isMAVIS_FETCH_READS_LIMIT
and the default value is3000
- filter_cdna_synon
bool
- Filter all annotations synonymous at the cdna level. The corresponding environment variable isMAVIS_FILTER_CDNA_SYNON
and the default value isTrue
- filter_min_complexity
float_fraction
- Filter event calls based on call sequence complexity. The corresponding environment variable isMAVIS_FILTER_MIN_COMPLEXITY
and the default value is0.2
- filter_min_flanking_reads
int
- Minimum number of flanking pairs for a call by flanking pairs. The corresponding environment variable isMAVIS_FILTER_MIN_FLANKING_READS
and the default value is10
- filter_min_linking_split_reads
int
- Minimum number of linking split reads for a call by split reads. The corresponding environment variable isMAVIS_FILTER_MIN_LINKING_SPLIT_READS
and the default value is1
- filter_min_remapped_reads
int
- Minimum number of remapped reads for a call by contig. The corresponding environment variable isMAVIS_FILTER_MIN_REMAPPED_READS
and the default value is5
- filter_min_spanning_reads
int
- Minimum number of spanning reads for a call by spanning reads. The corresponding environment variable isMAVIS_FILTER_MIN_SPANNING_READS
and the default value is5
- filter_min_split_reads
int
- Minimum number of split reads for a call by split reads. The corresponding environment variable isMAVIS_FILTER_MIN_SPLIT_READS
and the default value is5
- filter_protein_synon
bool
- Filter all annotations synonymous at the protein level. The corresponding environment variable isMAVIS_FILTER_PROTEIN_SYNON
and the default value isFalse
- filter_secondary_alignments
bool
- Filter secondary alignments when gathering read evidence. The corresponding environment variable isMAVIS_FILTER_SECONDARY_ALIGNMENTS
and the default value isTrue
- filter_trans_homopolymers
bool
- Filter all single bp ins/del/dup events that are in a homopolymer region of at least 3 bps and are not paired to a genomic event. The corresponding environment variable isMAVIS_FILTER_TRANS_HOMOPOLYMERS
and the default value isTrue
- flanking_call_distance
int
- The maximum distance allowed between breakpoint pairs (called by flanking pairs) in order for them to pair. The corresponding environment variable isMAVIS_FLANKING_CALL_DISTANCE
and the default value is50
- fuzzy_mismatch_number
int
- The number of events/mismatches allowed to be considered a fuzzy match. The corresponding environment variable isMAVIS_FUZZY_MISMATCH_NUMBER
and the default value is1
- gene1_color
str
- The color of genes near the first gene. The corresponding environment variable isMAVIS_GENE1_COLOR
and the default value is'#657e91'
- gene1_color_selected
str
- The color of the first gene. The corresponding environment variable isMAVIS_GENE1_COLOR_SELECTED
and the default value is'#518dc5'
- gene2_color
str
- The color of genes near the second gene. The corresponding environment variable isMAVIS_GENE2_COLOR
and the default value is'#325556'
- gene2_color_selected
str
- The color of the second gene. The corresponding environment variable isMAVIS_GENE2_COLOR_SELECTED
and the default value is'#4c9677'
- import_env
bool
- Flag to import environment variables. The corresponding environment variable isMAVIS_IMPORT_ENV
and the default value isTrue
- input_call_distance
int
- The maximum distance allowed between breakpoint pairs (called by input tools, not validated) in order for them to pair. The corresponding environment variable isMAVIS_INPUT_CALL_DISTANCE
and the default value is20
- label_color
str
- The label color. The corresponding environment variable isMAVIS_LABEL_COLOR
and the default value is'#000000'
- limit_to_chr
str
- A list of chromosome names to use. breakpointpairs on other chromosomes will be filteredout. for example ‘1 2 3 4’ would filter out events/breakpoint pairs on any chromosomes but 1, 2, 3, and 4. The corresponding environment variable isMAVIS_LIMIT_TO_CHR
and the default value is['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y']
- mail_type
MAIL_TYPE
- When to notify the mail_user (if given). The corresponding environment variable isMAVIS_MAIL_TYPE
and the default value is'NONE'
. Accepted values include:'BEGIN'
,'END'
,'FAIL'
,'ALL'
,'NONE'
- mail_user
str
- User(s) to send notifications to. The corresponding environment variable isMAVIS_MAIL_USER
and the default value is''
- mask_fill
str
- Color of mask (for deleted region etc.). The corresponding environment variable isMAVIS_MASK_FILL
and the default value is'#ffffff'
- mask_opacity
float_fraction
- Opacity of the mask layer. The corresponding environment variable isMAVIS_MASK_OPACITY
and the default value is0.7
- masking
filepath
- File containing regions for which input events overlapping them are dropped prior to validation. The corresponding environment variable isMAVIS_MASKING
and the default value is[]
- max_drawing_retries
int
- The maximum number of retries for attempting a drawing. each iteration the width is extended. if it is still insufficient after this number a gene-level only drawing will be output. The corresponding environment variable isMAVIS_MAX_DRAWING_RETRIES
and the default value is5
- max_files
int
- The maximum number of files to output from clustering/splitting. The corresponding environment variable isMAVIS_MAX_FILES
and the default value is200
- max_orf_cap
int
- The maximum number of orfs to return (best putative orfs will be retained). The corresponding environment variable isMAVIS_MAX_ORF_CAP
and the default value is3
- max_proximity
int
- The maximum distance away from an annotation before the region in considered to be uninformative. The corresponding environment variable isMAVIS_MAX_PROXIMITY
and the default value is5000
- max_sc_preceeding_anchor
int
- When remapping a softclipped read this determines the amount of softclipping allowed on the side opposite of where we expect it. for example for a softclipped read on a breakpoint with a left orientation this limits the amount of softclipping that is allowed on the right. if this is set to none then there is no limit on softclipping. The corresponding environment variable isMAVIS_MAX_SC_PRECEEDING_ANCHOR
and the default value is6
- memory_limit
int
- The maximum number of megabytes (mb) any given job is allowed. The corresponding environment variable isMAVIS_MEMORY_LIMIT
and the default value is16000
- min_anchor_exact
int
- Applies to re-aligning softclipped reads to the opposing breakpoint. the minimum number of consecutive exact matches to anchor a read to initiate targeted realignment. The corresponding environment variable isMAVIS_MIN_ANCHOR_EXACT
and the default value is6
- min_anchor_fuzzy
int
- Applies to re-aligning softclipped reads to the opposing breakpoint. the minimum length of a fuzzy match to anchor a read to initiate targeted realignment. The corresponding environment variable isMAVIS_MIN_ANCHOR_FUZZY
and the default value is10
- min_anchor_match
float_fraction
- Minimum percent match for a read to be kept as evidence. The corresponding environment variable isMAVIS_MIN_ANCHOR_MATCH
and the default value is0.9
- min_call_complexity
float_fraction
- The minimum complexity score for a call sequence. is an average for non-contig calls. filters low complexity contigs before alignment. see contig_complexity. The corresponding environment variable isMAVIS_MIN_CALL_COMPLEXITY
and the default value is0.1
- min_clusters_per_file
int
- The minimum number of breakpoint pairs to output to a file. The corresponding environment variable isMAVIS_MIN_CLUSTERS_PER_FILE
and the default value is50
- min_domain_mapping_match
float_fraction
- A number between 0 and 1 representing the minimum percent match a domain must map to the fusion transcript to be displayed. The corresponding environment variable isMAVIS_MIN_DOMAIN_MAPPING_MATCH
and the default value is0.9
- min_double_aligned_to_estimate_insertion_size
int
- The minimum number of reads which map soft-clipped to both breakpoints to assume the size of the untemplated sequence between the breakpoints is at most the read length - 2 * min_softclipping. The corresponding environment variable isMAVIS_MIN_DOUBLE_ALIGNED_TO_ESTIMATE_INSERTION_SIZE
and the default value is2
- min_flanking_pairs_resolution
int
- The minimum number of flanking reads required to call a breakpoint by flanking evidence. The corresponding environment variable isMAVIS_MIN_FLANKING_PAIRS_RESOLUTION
and the default value is10
- min_linking_split_reads
int
- The minimum number of split reads which aligned to both breakpoints. The corresponding environment variable isMAVIS_MIN_LINKING_SPLIT_READS
and the default value is2
- min_mapping_quality
int
- The minimum mapping quality of reads to be used as evidence. The corresponding environment variable isMAVIS_MIN_MAPPING_QUALITY
and the default value is5
- min_non_target_aligned_split_reads
int
- The minimum number of split reads aligned to a breakpoint by the input bam and no forced by local alignment to the target region to call a breakpoint by split read evidence. The corresponding environment variable isMAVIS_MIN_NON_TARGET_ALIGNED_SPLIT_READS
and the default value is1
- min_orf_size
int
- The minimum length (in base pairs) to retain a putative open reading frame (orf). The corresponding environment variable isMAVIS_MIN_ORF_SIZE
and the default value is300
- min_sample_size_to_apply_percentage
int
- Minimum number of aligned bases to compute a match percent. if there are less than this number of aligned bases (match or mismatch) the percent comparator is not used. The corresponding environment variable isMAVIS_MIN_SAMPLE_SIZE_TO_APPLY_PERCENTAGE
and the default value is10
- min_softclipping
int
- Minimum number of soft-clipped bases required for a read to be used as soft-clipped evidence. The corresponding environment variable isMAVIS_MIN_SOFTCLIPPING
and the default value is6
- min_spanning_reads_resolution
int
- Minimum number of spanning reads required to call an event by spanning evidence. The corresponding environment variable isMAVIS_MIN_SPANNING_READS_RESOLUTION
and the default value is5
- min_splits_reads_resolution
int
- Minimum number of split reads required to call a breakpoint by split reads. The corresponding environment variable isMAVIS_MIN_SPLITS_READS_RESOLUTION
and the default value is3
- novel_exon_color
str
- Novel exon fill color. The corresponding environment variable isMAVIS_NOVEL_EXON_COLOR
and the default value is'#5D3F6A'
- outer_window_min_event_size
int
- The minimum size of an event in order for flanking read evidence to be collected. The corresponding environment variable isMAVIS_OUTER_WINDOW_MIN_EVENT_SIZE
and the default value is125
- queue
str
- The queue jobs are to be submitted to. The corresponding environment variable isMAVIS_QUEUE
and the default value is''
- reference_genome
filepath
- Path to the human reference genome fasta file. The corresponding environment variable isMAVIS_REFERENCE_GENOME
and the default value is[]
- remote_head_ssh
str
- Ssh target for remote scheduler commands. The corresponding environment variable isMAVIS_REMOTE_HEAD_SSH
and the default value is''
- scaffold_color
str
- The color used for the gene/transcripts scaffolds. The corresponding environment variable isMAVIS_SCAFFOLD_COLOR
and the default value is'#000000'
- scheduler
SCHEDULER
- The scheduler being used. The corresponding environment variable isMAVIS_SCHEDULER
and the default value is'SLURM'
. Accepted values include:'SGE'
,'SLURM'
,'TORQUE'
,'LOCAL'
- spanning_call_distance
int
- The maximum distance allowed between breakpoint pairs (called by spanning reads) in order for them to pair. The corresponding environment variable isMAVIS_SPANNING_CALL_DISTANCE
and the default value is20
- splice_color
str
- Splicing lines color. The corresponding environment variable isMAVIS_SPLICE_COLOR
and the default value is'#000000'
- split_call_distance
int
- The maximum distance allowed between breakpoint pairs (called by split reads) in order for them to pair. The corresponding environment variable isMAVIS_SPLIT_CALL_DISTANCE
and the default value is20
- stdev_count_abnormal
float
- The number of standard deviations away from the normal considered expected and therefore not qualifying as flanking reads. The corresponding environment variable isMAVIS_STDEV_COUNT_ABNORMAL
and the default value is3.0
- strand_determining_read
int
- 1 or 2. the read in the pair which determines if (assuming a stranded protocol) the first or second read in the pair matches the strand sequenced. The corresponding environment variable isMAVIS_STRAND_DETERMINING_READ
and the default value is2
- template_metadata
filepath
- File containing the cytoband template information. used for illustrations only. The corresponding environment variable isMAVIS_TEMPLATE_METADATA
and the default value is[]
- time_limit
int
- The time in seconds any given jobs is allowed. The corresponding environment variable isMAVIS_TIME_LIMIT
and the default value is57600
- trans_fetch_reads_limit
int
- Related to fetch_reads_limit. overrides fetch_reads_limit for transcriptome libraries when set. if this has a value of none then fetch_reads_limit will be used for transcriptome libraries instead. The corresponding environment variable isMAVIS_TRANS_FETCH_READS_LIMIT
and the default value is12000
- trans_min_mapping_quality
int
- Related to min_mapping_quality. overrides the min_mapping_quality if the library is a transcriptome and this is set to any number not none. if this value is none, min_mapping_quality is used for transcriptomes aswell as genomes. The corresponding environment variable isMAVIS_TRANS_MIN_MAPPING_QUALITY
and the default value is0
- trans_validation_memory
int
- Default memory limit (mb) for the validation stage (for transcriptomes). The corresponding environment variable isMAVIS_TRANS_VALIDATION_MEMORY
and the default value is18000
- uninformative_filter
bool
- Flag that determines if breakpoint pairs which are not within max_proximity to any annotations are filtered out prior to clustering. The corresponding environment variable isMAVIS_UNINFORMATIVE_FILTER
and the default value isFalse
- validation_memory
int
- Default memory limit (mb) for the validation stage. The corresponding environment variable isMAVIS_VALIDATION_MEMORY
and the default value is16000
- width
int
- The drawing width in pixels. The corresponding environment variable isMAVIS_WIDTH
and the default value is1000
- write_evidence_files
bool
- Write the intermediate bam and bed files containing the raw evidence collected and contigs aligned. not required for subsequent steps but can be useful in debugging and deep investigation of events. The corresponding environment variable isMAVIS_WRITE_EVIDENCE_FILES
and the default value isTrue
Column Names¶
List of column names and their definitions. The types indicated here are the expected types in a row for a given column name.
- annotation_figure
FILEPATH
- File path to the svg drawing representing the annotation- annotation_figure_legend
JSON
- JSON data for the figure legend- annotation_id
- Identifier for the annotation step
- break1_chromosome
str
- The name of the chromosome on which breakpoint 1 is situated- break1_ewindow
int-int
- Window where evidence was gathered for the first breakpoint- break1_ewindow_count
int
- Number of reads processed/looked-at in the first evidence window- break1_ewindow_practical_coverage
float
- break2_ewindow_practical_coverage, break1_ewindow_count / len(break1_ewindow). Not the actual coverage as bins are sampled within and there is a read limit cutoff- break1_homologous_seq
str
- Sequence in common at the first breakpoint and other side of the second breakpoint- break1_orientation
ORIENT
- The side of the breakpoint wrt the positive/forward strand that is retained.- break1_position_end
int
- End integer inclusive 1-based of the range representing breakpoint 1- break1_position_start
int
- Start integer inclusive 1-based of the range representing breakpoint 1- break1_seq
str
- The sequence up to and including the breakpoint. Always given wrt to the positive/forward strand- break1_split_reads
int
- Number of split reads that call the exact breakpoint given- break1_split_reads_forced
int
- Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment- break1_strand
STRAND
- The strand wrt to the reference positive/forward strand at this breakpoint.- break2_chromosome
- The name of the chromosome on which breakpoint 2 is situated
- break2_ewindow
int-int
- Window where evidence was gathered for the second breakpoint- break2_ewindow_count
int
- Number of reads processed/looked-at in the second evidence window- break2_ewindow_practical_coverage
float
- break2_ewindow_practical_coverage, break2_ewindow_count / len(break2_ewindow). Not the actual coverage as bins are sampled within and there is a read limit cutoff- break2_homologous_seq
str
- Sequence in common at the second breakpoint and other side of the first breakpoint- break2_orientation
ORIENT
- The side of the breakpoint wrt the positive/forward strand that is retained.- break2_position_end
int
- End integer inclusive 1-based of the range representing breakpoint 2- break2_position_start
int
- Start integer inclusive 1-based of the range representing breakpoint 2- break2_seq
str
- The sequence up to and including the breakpoint. Always given wrt to the positive/forward strand- break2_split_reads
int
- Number of split reads that call the exact breakpoint given- break2_split_reads_forced
int
- Number of split reads which were aligned to the opposite breakpoint window using a targeted alignment- break2_strand
STRAND
- The strand wrt to the reference positive/forward strand at this breakpoint.- call_method
CALL_METHOD
- The method used to call the breakpoints- call_sequence_complexity
float
- The minimum amount any two bases account for of the proportion of call sequence. An average for non-contig calls- cdna_synon
- semi-colon delimited list of transcript ids which have an identical cdna sequence to the cdna sequence of the current fusion product
- cluster_id
- Identifier for the merging/clustering step
- cluster_size
int
- The number of breakpoint pair calls that were grouped in creating the cluster- contig_alignment_cigar
- The cigar string(s) representing the contig alignment. Semi-colon delimited
- contig_alignment_query_name
- The query name for the contig alignment. Should match the ‘read’ name(s) in the .contigs.bam output file
- contig_alignment_reference_start
- The reference start(s) <chr>:<position> of the contig alignment. Semi-colon delimited
- contig_alignment_score
float
- A rank based on the alignment tool blat etc. of the alignment being used. An average if split alignments were used. Lower numbers indicate a better alignment. If it was the best alignment possible then this would be zero.- contig_build_score
int
- Score representing the edge weights of all edges used in building the sequence- contig_remap_coverage
float
- Fraction of the contig sequence which is covered by the remapped reads- contig_remap_score
float
- Score representing the number of sequences from the set of sequences given to the assembly algorithm that were aligned to the resulting contig with an acceptable scoring based on user-set thresholds. For any sequence its contribution to the score is divided by the number of mappings to give less weight to multimaps- contig_remapped_read_names
- read query names for the reads that were remapped. A -1 or -2 has been appended to the end of the name to indicate if this is the first or second read in the pair
- contig_remapped_reads
int
- the number of reads from the input bam that map to the assembled contig- contig_seq
str
- Sequence of the current contig wrt to the positive forward strand if not strand specific- contig_strand_specific
bool
- A flag to indicate if it was possible to resolve the strand for this contig- contigs_aligned
int
- Number of contigs that were able to align- contigs_assembled
int
- Number of contigs that were built from split read sequences- event_type
SVTYPE
- The classification of the event- flanking_median_fragment_size
int
- The median fragment size of the flanking reads being used as evidence- flanking_pairs
int
- Number of read-pairs where one read aligns to the first breakpoint window and the second read aligns to the other. The count here is based on the number of unique query names- flanking_pairs_compatible
int
- Number of flanking pairs of a compatible orientation type. This applies to insertions and duplications. Flanking pairs supporting an insertion will be compatible to a duplication and flanking pairs supporting a duplication will be compatible to an insertion (possibly indicating an internal translocation)- flanking_stdev_fragment_size
float
- The standard deviation in fragment size of the flanking reads being used as evidence- fusion_cdna_coding_end
- Position wrt the 5’ end of the fusion transcript where coding ends last base of the stop codon
- fusion_cdna_coding_end
int
- Position wrt the 5’ end of the fusion transcript where coding ends last base of the stop codon- fusion_cdna_coding_start
int
- Position wrt the 5’ end of the fusion transcript where coding begins first base of the Met amino acid.- fusion_mapped_domains
JSON
- List of domains in JSON format where each domain start and end positions are given wrt to the fusion transcript and the mapping quality is the number of matching amino acid positions over the total number of amino acids. The sequence is the amino acid sequence of the domain on the reference/original transcript- fusion_protein_hgvs
str
- Describes the fusion protein in HGVS notation. Will be None if the change is not an indel or is synonymous- fusion_sequence_fasta_file
FILEPATH
- Path to the corresponding fasta output file- fusion_sequence_fasta_id
- The sequence identifier for the cdna sequence output fasta file
- fusion_splicing_pattern
SPLICE_TYPE
- Type of splicing pattern used to create the fusion cDNA.- gene1
- Gene for the current annotation at the first breakpoint
- gene1_aliases
- Other gene names associated with the current annotation at the first breakpoint
- gene1_direction
PRIME
- The direction/prime of the gene- gene2
- Gene for the current annotation at the second breakpoint
- gene2_aliases
- Other gene names associated with the current annotation at the second breakpoint
- gene2_direction
PRIME
- The direction/prime of the gene. Has the following possible values- gene_product_type
GENE_PRODUCT_TYPE
- Describes if the putative fusion product will be sense or anti-sense- genes_encompassed
- Applies to intrachromosomal events only. List of genes which overlap any region that occurs between both breakpoints. For example in a deletion event these would be deleted genes.
- genes_overlapping_break1
- list of genes which overlap the first breakpoint
- genes_overlapping_break2
- list of genes which overlap the second breakpoint
- genes_proximal_to_break1
- list of genes near the breakpoint and the distance away from the breakpoint
- genes_proximal_to_break2
- list of genes near the breakpoint and the distance away from the breakpoint
- inferred_pairing
- A semi colon delimited of event identifiers i.e. <annotation_id>_<splicing pattern>_<cds start>_<cds end> which were paired to the current event based on predicted products
- library
- Identifier for the library/source
- linking_split_reads
int
- Number of split reads that align to both breakpoints- net_size
int-int
- The net size of an event. For translocations and inversion this will always be 0. For indels it will be negative for deletions and positive for insertions. It is a range to accommodate non-specific events.- opposing_strands
bool
- Specifies if breakpoints are on opposite strands wrt to the reference. Expects a boolean- pairing
- A semi colon delimited of event identifiers i.e. <annotation_id>_<splicing pattern>_<cds start>_<cds end> which were paired to the current event based on breakpoint positions
- product_id
- Unique identifier of the final fusion including splicing and ORF decision from the annotation step
- protein_synon
- semi-colon delimited list of transcript ids which produce a translation with an identical amino-acid sequence to the current fusion product
- protocol
PROTOCOL
- Specifies the type of library- raw_break1_split_reads
int
- Number of split reads before calling the breakpoint- raw_break2_split_reads
int
- Number of split reads before calling the breakpoint- raw_flanking_pairs
int
- Number of flanking reads before calling the breakpoint. The count here is based on the number of unique query names- raw_spanning_reads
int
- Number of spanning reads collected during evidence collection before calling the breakpoint- spanning_read_names
- read query names of the spanning reads which support the current event
- spanning_reads
int
- the number of spanning reads which support the event- stranded
bool
- Specifies if the sequencing protocol was strand specific or not. Expects a boolean- supplementary_call
bool
- Flag to indicate if the current event was a supplementary call, meaning a call that was found as a result of validating another event.- tools
- The tools that called the event originally from the cluster step. Should be a semi-colon delimited list of <tool name>_<tool version>
- tracking_id
- column used to store input identifiers from the original SV calls. Used to track calls from the input files to the final outputs.
- transcript1
- Transcript for the current annotation at the first breakpoint
- transcript2
- Transcript for the current annotation at the second breakpoint
- untemplated_seq
str
- The untemplated/novel sequence between the breakpoints- validation_id
- Identifier for the validation step