Configurable Settings
Defining Samples/Libraries
The libraries property of the MAVIS config is required to run the snakemake
workflow. This section defines which inputs to use and what types of
samples are available.
{
"libraries": {
"<LIBRARY_NAME>": { } // mapping of library name to library settings
}
}
The library-specific settings are listed below.
assign
type: List[str]
List of input files or conversion aliases that should be processed for this library
schema definition:
{
"items": {
"type": "string"
},
"minItems": 1,
"type": "array"
}
bam_file
type: str
Path to the bam file containing the sequencing reads for this library
disease_status
type: str
schema definition:
{
"enum": [
"diseased",
"normal"
],
"type": "string"
}
median_fragment_size
type: int
The median fragment size in the paired-end read library. This will be computed from the bam during initialization of the config if not given
protocol
type: str
schema definition:
{
"enum": [
"genome",
"transcriptome"
],
"type": "string"
}
read_length
type: int
The read length in the paired-end read library. This will be computed from the bam during initialization of the config if not given
stdev_fragment_size
type: int
The standard deviation of fragment size in the paired-end read library. This will be computed from the bam during initialization of the config if not given
strand_determining_read
type: int
default: 2
Either 1 or 2. Assuming a stranded protocol, this indicates whether the first or second read in the pair matches the strand sequenced
strand_specific
type: bool
default: False
total_batches
type: int
The number of jobs to split a library into for cluster/validate/annotate. This will be set during initialization of the config if not given
schema definition:
{
"min": 1,
"type": "integer"
}
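As a rough illustration of how the library settings above fit together, the sketch below assembles a libraries section in Python and serializes it to JSON. The library name, BAM path, and alias are hypothetical placeholders, and the helper check_library is not part of MAVIS; only the setting names and enum values come from the listing above.

```python
import json

# Hypothetical library name, BAM path, and conversion alias; only the
# setting names and allowed enum values are taken from the listing above.
libraries = {
    "tumour_genome": {
        "assign": ["delly_calls"],          # input files or conversion aliases
        "bam_file": "/path/to/tumour.bam",  # placeholder path
        "disease_status": "diseased",       # "diseased" or "normal"
        "protocol": "genome",               # "genome" or "transcriptome"
        "strand_specific": False,
    }
}


def check_library(settings):
    """Minimal checks mirroring the schema fragments shown above.

    Hypothetical helper for illustration; values are only checked when present.
    """
    problems = []
    if not settings.get("assign"):
        problems.append("assign must list at least one input")
    if "disease_status" in settings and settings["disease_status"] not in ("diseased", "normal"):
        problems.append("disease_status must be 'diseased' or 'normal'")
    if "protocol" in settings and settings["protocol"] not in ("genome", "transcriptome"):
        problems.append("protocol must be 'genome' or 'transcriptome'")
    return problems


library_problems = {name: check_library(s) for name, s in libraries.items()}
config_text = json.dumps({"libraries": libraries}, indent=2)
print(config_text)
```

Settings such as median_fragment_size, read_length, and stdev_fragment_size are left out here because, per the descriptions above, they are computed from the bam during initialization when not given.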
Defining Conversions
If the input to MAVIS is raw tool output that has not been pre-converted to the standard tab-delimited format expected by MAVIS, then you will need to add a section to the config to tell MAVIS how to perform the required conversions.
{
"convert": {
"<ALIAS>": { } // mapping of alias to conversion settings
}
}
The conversion-specific settings are listed below.
assume_no_untemplated
type: bool
default: False
Assume the lack of untemplated information means that there is no untemplated sequence expected at the breakpoints
file_type
type: str
The tool the file is input from. This is the loader method that will be used. If not given, the loader method will default to match the tool name that is given. This value should be 'mavis' for standard MAVIS-style tab files, or 'vcf' for the general VCF loader
schema definition:
{
"enum": [
"breakdancer",
"breakseq",
"chimerascan",
"cnvnator",
"cutesv",
"defuse",
"delly",
"manta",
"mavis",
"pindel",
"sniffles",
"starfusion",
"straglr",
"strelka",
"transabyss",
"vcf"
],
"type": "string"
}
inputs
type: List[str]
List of input files
schema definition:
{
"items": {
"type": "string"
},
"minItems": 1,
"type": "array"
}
strand_specific
type: bool
default: False
tool_name
type: str
Name of the tool to be used in MAVIS output files (in the tools column). If not given the file_type will be used as the tool name instead
schema definition:
{
"pattern": "^[^;]+$",
"type": "string"
}
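The relationship between file_type and tool_name described above can be sketched in a few lines of Python. The alias and input path are hypothetical, and the two helper functions are illustrative only, not MAVIS functions; the regular expression is the one from the tool_name schema.

```python
import re

# Pattern copied from the tool_name schema above: one or more characters, no semicolons
TOOL_NAME_PATTERN = re.compile(r"^[^;]+$")

# Hypothetical alias ("delly_calls") and input path
convert = {
    "delly_calls": {
        "file_type": "delly",
        "inputs": ["/path/to/delly_output.vcf"],
        "assume_no_untemplated": True,
    }
}


def effective_tool_name(settings):
    """Return tool_name if given, otherwise fall back to file_type."""
    return settings.get("tool_name", settings["file_type"])


def is_valid_tool_name(name):
    """Check a candidate tool name against the schema pattern."""
    return TOOL_NAME_PATTERN.search(name) is not None


print(effective_tool_name(convert["delly_calls"]))
print(is_valid_tool_name("delly;manta"))
```

The alias used as the key under convert is what a library would reference in its assign list.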
General Settings
annotate.annotation_filters
type: List[str]
default: ['choose_more_annotated', 'choose_transcripts_by_priority']
A comma separated list of filters to apply to putative annotations
schema definition:
{
"items": {
"enum": [
"choose_more_annotated",
"choose_transcripts_by_priority"
],
"type": "string"
},
"type": "array"
}
annotate.draw_fusions_only
type: bool
default: True
Flag to indicate if events which do not produce a fusion transcript should produce illustrations
annotate.draw_non_synonymous_cdna_only
type: bool
default: True
Flag to indicate if events which are synonymous at the cdna level should produce illustrations
annotate.max_orf_cap
type: int
default: 3
The maximum number of orfs to return (best putative orfs will be retained)
annotate.min_domain_mapping_match
type: float
default: 0.9
A number between 0 and 1 representing the minimum percent match a domain must map to the fusion transcript to be displayed
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
annotate.min_orf_size
type: int
default: 300
The minimum length (in base pairs) to retain a putative open reading frame (orf)
bam_stats.distribution_fraction
type: float
default: 0.97
The proportion of the distribution to use in computing stdev
schema definition:
{
"maximum": 1,
"minimum": 0.01,
"type": "number"
}
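The description of bam_stats.distribution_fraction is terse, so here is one plausible reading sketched in Python: keep only the given fraction of fragment sizes closest to the median and compute the spread from those. This is an assumption made for illustration, not a statement of how MAVIS computes the value, and the function name is hypothetical.

```python
import math
import statistics


def trimmed_stdev(fragment_sizes, fraction=0.97):
    """Spread around the median using only the `fraction` of values nearest to it.

    Assumed interpretation of distribution_fraction; MAVIS may differ.
    """
    if not 0.01 <= fraction <= 1:
        raise ValueError("fraction must be between 0.01 and 1")
    median = statistics.median(fragment_sizes)
    # Sort by distance from the median and keep the closest values
    by_distance = sorted(fragment_sizes, key=lambda size: abs(size - median))
    keep = max(1, int(len(by_distance) * fraction))
    kept = by_distance[:keep]
    return math.sqrt(sum((size - median) ** 2 for size in kept) / len(kept))


sizes = [390, 395, 400, 405, 410, 2000]  # one extreme outlier
print(trimmed_stdev(sizes, fraction=1.0))  # outlier included
print(trimmed_stdev(sizes, fraction=0.9))  # outlier excluded
```

Under this reading, a fraction below 1 keeps a handful of extreme fragment sizes from inflating the estimate.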
bam_stats.sample_bin_size
type: int
default: 1000
How large to make the sample bin (in bp)
bam_stats.sample_cap
type: int
default: 1000
Maximum number of reads to collect for any given sample region
bam_stats.sample_size
type: int
default: 500
The number of genes/bins to compute stats over
cluster.cluster_initial_size_limit
type: int
default: 25
The maximum cumulative size of both breakpoints for breakpoint pairs to be used in the initial clustering phase (combining based on overlap)
cluster.cluster_radius
type: int
default: 100
Maximum distance allowed between paired breakpoint pairs
cluster.limit_to_chr
type: Union[List, null]
default: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y']
A list of chromosome names to use. Breakpoint pairs on other chromosomes will be filtered out. For example, '1 2 3 4' would filter out events/breakpoint pairs on any chromosomes but 1, 2, 3, and 4
schema definition:
{
"items": {
"type": "string"
},
"type": [
"array",
"null"
]
}
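The effect of cluster.limit_to_chr can be pictured with a short Python sketch. It makes two assumptions that the description above does not state: a pair is kept only when both of its breakpoints are on listed chromosomes, and a null value disables the filter. The tuple representation and function name are hypothetical simplifications.

```python
def filter_by_chromosome(breakpoint_pairs, limit_to_chr):
    """Keep pairs whose two breakpoints both lie on allowed chromosomes.

    `breakpoint_pairs` is a list of (chr1, chr2) tuples, a simplified stand-in
    for real breakpoint pair records. None disables the filter (assumed
    meaning of the schema's null).
    """
    if limit_to_chr is None:
        return list(breakpoint_pairs)
    allowed = set(limit_to_chr)
    return [pair for pair in breakpoint_pairs if pair[0] in allowed and pair[1] in allowed]


pairs = [("1", "1"), ("2", "5"), ("4", "X"), ("GL000220.1", "3")]
print(filter_by_chromosome(pairs, ["1", "2", "3", "4"]))
```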
cluster.max_files
type: int
default: 200
The maximum number of files to output from clustering/splitting
schema definition:
{
"minimum": 1,
"type": "integer"
}
cluster.max_proximity
type: int
default: 5000
The maximum distance away from an annotation before the region is considered to be uninformative
cluster.min_clusters_per_file
type: int
default: 50
The minimum number of breakpoint pairs to output to a file
schema definition:
{
"minimum": 1,
"type": "integer"
}
cluster.split_only
type: bool
default: False
Just split the input files; do not merge input breakpoints into clusters
cluster.uninformative_filter
type: bool
default: False
Flag that determines if breakpoint pairs which are not within max_proximity to any annotations are filtered out prior to clustering
illustrate.breakpoint_color
type: str
default: '#000000'
Breakpoint outline color
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
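The color settings in this section share the same schema pattern. As a quick check of what that pattern accepts, the sketch below applies it with Python's re module. As written, the pattern is not anchored at the end and allows any alphanumeric character, so it is looser than a strict six-digit hex color. The stricter pattern shown alongside it is a hypothetical alternative for comparison, not part of the MAVIS schema.

```python
import re

SCHEMA_PATTERN = re.compile(r"^#[a-zA-Z0-9]{6}")  # pattern from the schema above
STRICT_HEX = re.compile(r"^#[0-9a-fA-F]{6}$")     # hypothetical stricter check


def matches_schema(value):
    # JSON Schema patterns are applied as unanchored searches, hence re.search
    return SCHEMA_PATTERN.search(value) is not None


def is_strict_hex(value):
    return STRICT_HEX.search(value) is not None


for value in ["#000000", "#ccccb3", "#zzzzzz", "#0000000", "000000"]:
    print(value, matches_schema(value), is_strict_hex(value))
```

Values such as '#zzzzzz' pass the schema pattern but are not valid hex colors, so it may be worth checking colors separately before relying on them in a drawing.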
illustrate.domain_color
type: str
default: '#ccccb3'
Domain fill color
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.domain_mismatch_color
type: str
default: '#b2182b'
Domain fill color on 0% match
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.domain_name_regex_filter
type: str
default: '^PF\\d+$'
The regular expression used to select domains to be displayed (filtered by name)
illustrate.domain_scaffold_color
type: str
default: '#000000'
The color of the domain scaffold
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.drawing_width_iter_increase
type: int
default: 500
The amount (in pixels) by which to increase the drawing width upon failure to fit
illustrate.exon_min_focus_size
type: int
default: 10
Minimum size of an exon for it to be granted a label or min exon width
illustrate.gene1_color
type: str
default: '#657e91'
The color of genes near the first gene
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.gene1_color_selected
type: str
default: '#518dc5'
The color of the first gene
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.gene2_color
type: str
default: '#325556'
The color of genes near the second gene
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.gene2_color_selected
type: str
default: '#4c9677'
The color of the second gene
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.label_color
type: str
default: '#000000'
The label color
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.mask_fill
type: str
default: '#ffffff'
Color of mask (for deleted region etc.)
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.mask_opacity
type: float
default: 0.7
Opacity of the mask layer
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
illustrate.max_drawing_retries
type: int
default: 5
The maximum number of retries for attempting a drawing. The width is extended on each iteration. If it is still insufficient after this number of retries, a gene-level only drawing will be output
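The interaction between illustrate.width, illustrate.drawing_width_iter_increase, and illustrate.max_drawing_retries can be sketched as a simple loop. This is a simplified model, not MAVIS code: required_width is a hypothetical stand-in for whatever causes a drawing to fail to fit, and the sketch assumes max_drawing_retries counts the total number of attempts.

```python
def choose_drawing_width(required_width, width=1000, iter_increase=500, max_retries=5):
    """Return the first width that fits, or None to signal a gene-level only drawing.

    `required_width` is a hypothetical stand-in for the real fit test, which is
    internal to MAVIS. Defaults mirror the illustrate.* defaults listed above.
    """
    current = width
    for _attempt in range(max_retries):
        if current >= required_width:
            return current
        current += iter_increase  # widen the canvas and try again
    return None


print(choose_drawing_width(800))   # fits at the starting width
print(choose_drawing_width(2400))  # fits after widening
print(choose_drawing_width(3500))  # never fits within the allowed attempts
```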
illustrate.novel_exon_color
type: str
default: '#5D3F6A'
Novel exon fill color
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.scaffold_color
type: str
default: '#000000'
The color used for the gene/transcripts scaffolds
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.splice_color
type: str
default: '#000000'
Splicing lines color
schema definition:
{
"pattern": "^#[a-zA-Z0-9]{6}",
"type": "string"
}
illustrate.width
type: int
default: 1000
The drawing width in pixels
log
type: str
log_level
type: str
default: 'INFO'
schema definition:
{
"enum": [
"INFO",
"DEBUG"
],
"type": "string"
}
output_dir
type: str
Path to the directory to output the MAVIS files to
pairing.contig_call_distance
type: int
default: 10
The maximum distance allowed between breakpoint pairs (called by contig) in order for them to pair
pairing.flanking_call_distance
type: int
default: 50
The maximum distance allowed between breakpoint pairs (called by flanking pairs) in order for them to pair
pairing.input_call_distance
type: int
default: 20
The maximum distance allowed between breakpoint pairs (called by input tools, not validated) in order for them to pair
pairing.spanning_call_distance
type: int
default: 20
The maximum distance allowed between breakpoint pairs (called by spanning reads) in order for them to pair
pairing.split_call_distance
type: int
default: 20
The maximum distance allowed between breakpoint pairs (called by split reads) in order for them to pair
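To make the pairing.* distances concrete, the sketch below looks up a threshold by call method and compares two breakpoint positions. The method labels used as dictionary keys and the function name are hypothetical, and the rule for mixed call methods (use the larger of the two distances) is an assumption for illustration; MAVIS may resolve this differently.

```python
# Defaults copied from the pairing.* settings above; the keys are hypothetical labels
PAIRING_DISTANCES = {
    "contig": 10,
    "flanking": 50,
    "input": 20,
    "spanning": 20,
    "split": 20,
}


def can_pair(pos1, method1, pos2, method2, distances=PAIRING_DISTANCES):
    """Decide whether two breakpoint positions are close enough to pair.

    Assumption: when the call methods differ, the larger (more permissive)
    distance is used.
    """
    max_distance = max(distances[method1], distances[method2])
    return abs(pos1 - pos2) <= max_distance


print(can_pair(100, "contig", 108, "contig"))
print(can_pair(100, "contig", 115, "contig"))
print(can_pair(100, "contig", 140, "flanking"))
```

The defaults reflect how precisely each call method resolves a breakpoint: contig calls get the tightest threshold and flanking calls the loosest.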
reference.aligner_reference
type: List[str]
The reference genome file used by the aligner
schema definition:
{
"examples": [
"tests/data/mock_reference_genome.2bit"
],
"items": {
"type": "string"
},
"maxItems": 1,
"minItems": 1,
"type": "array"
}
reference.annotations
type: List[str]
The reference file containing gene/transcript position information
schema definition:
{
"examples": [
"tests/data/mock_annotations.json"
],
"items": {
"type": "string"
},
"minItems": 1,
"type": "array"
}
reference.dgv_annotation
type: List[str]
schema definition:
{
"examples": [
[
"tests/data/mock_dgv_annotation.txt"
]
],
"items": {
"type": "string"
},
"minItems": 1,
"type": "array"
}
reference.masking
type: List[str]
A list of regions to ignore in validation. Generally these are centromeres and telomeres or known poor mapping areas
schema definition:
{
"examples": [
[
"tests/data/mock_masking.tab"
]
],
"items": {
"type": "string"
},
"minItems": 1,
"type": "array"
}
reference.reference_genome
type: List[str]
schema definition:
{
"examples": [
[
"tests/data/mock_reference_genome.fa"
]
],
"items": {
"type": "string"
},
"minItems": 1,
"type": "array"
}
reference.template_metadata
type: List[str]
schema definition:
{
"examples": [
[
"tests/data/cytoBand.txt"
]
],
"items": {
"type": "string"
},
"minItems": 1,
"type": "array"
}
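The reference.* settings all take lists of file paths. The sketch below builds such a block in Python and checks list lengths against the minItems/maxItems values shown above. It assumes the settings are written as flat dotted keys at the top level of the config; the output_dir value and the check_reference_lists helper are hypothetical, and the paths are the example values from the schema definitions.

```python
import json

config = {
    "output_dir": "output_dir",  # hypothetical value
    "reference.aligner_reference": ["tests/data/mock_reference_genome.2bit"],
    "reference.annotations": ["tests/data/mock_annotations.json"],
    "reference.dgv_annotation": ["tests/data/mock_dgv_annotation.txt"],
    "reference.masking": ["tests/data/mock_masking.tab"],
    "reference.reference_genome": ["tests/data/mock_reference_genome.fa"],
    "reference.template_metadata": ["tests/data/cytoBand.txt"],
}

# (minItems, maxItems) taken from the schema definitions above; None means no upper limit
ITEM_LIMITS = {
    "reference.aligner_reference": (1, 1),
    "reference.annotations": (1, None),
    "reference.dgv_annotation": (1, None),
    "reference.masking": (1, None),
    "reference.reference_genome": (1, None),
    "reference.template_metadata": (1, None),
}


def check_reference_lists(settings):
    """Return the reference keys whose list length violates the limits."""
    problems = []
    for key, (min_items, max_items) in ITEM_LIMITS.items():
        if key not in settings:
            continue
        count = len(settings[key])
        if count < min_items or (max_items is not None and count > max_items):
            problems.append(key)
    return problems


print(check_reference_lists(config))
print(json.dumps(config, indent=2))
```

Note that reference.aligner_reference is the only one of these limited to exactly one file.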
skip_stage.validate
type: bool
default: False
Skip the validation stage of the MAVIS pipeline
summary.cluster_radius
type: int
default: 10
Distance used in matching input SVs to reference SVs through clustering
schema definition:
{
"minimum": 0,
"type": "integer"
}
summary.filter_cdna_synon
type: bool
default: True
Filter all annotations synonymous at the cdna level
summary.filter_min_complexity
type: float
default: 0.2
Filter event calls based on call sequence complexity
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
summary.filter_min_flanking_reads
type: int
default: 10
Minimum number of flanking pairs for a call by flanking pairs
summary.filter_min_linking_split_reads
type: int
default: 1
Minimum number of linking split reads for a call by split reads
summary.filter_min_remapped_reads
type: int
default: 5
Minimum number of remapped reads for a call by contig
summary.filter_min_spanning_reads
type: int
default: 5
Minimum number of spanning reads for a call by spanning reads
summary.filter_min_split_reads
type: int
default: 5
Minimum number of split reads for a call by split reads
summary.filter_protein_synon
type: bool
default: False
Filter all annotations synonymous at the protein level
summary.filter_trans_homopolymers
type: bool
default: True
Filter all single bp ins/del/dup events that are in a homopolymer region of at least 3 bps and are not paired to a genomic event
validate.aligner
type: str
default: 'blat'
The aligner to use to map the contigs/reads back to the reference, e.g. blat or bwa mem
schema definition:
{
"enum": [
"bwa mem",
"blat"
],
"type": "string"
}
validate.assembly_kmer_size
type: float
default: 0.74
The percent of the read length to make kmers for assembly
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
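Because validate.assembly_kmer_size is a fraction of the read length rather than an absolute size, a small worked example may help. The function name is hypothetical and the rounding to the nearest integer is an assumption; the source only states that the value is a percent of the read length.

```python
def assembly_kmer_length(read_length, assembly_kmer_size=0.74):
    """Convert the fractional setting into a kmer length in bases.

    Rounding to the nearest integer is an assumption for illustration.
    """
    if not 0 <= assembly_kmer_size <= 1:
        raise ValueError("assembly_kmer_size must be between 0 and 1")
    return round(read_length * assembly_kmer_size)


print(assembly_kmer_length(150))  # default fraction with 150 bp reads
print(assembly_kmer_length(100))  # default fraction with 100 bp reads
```

With the default of 0.74, 150 bp reads give kmers of about 111 bases and 100 bp reads give about 74.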
validate.assembly_max_paths
type: int
default: 8
The maximum number of paths to resolve. This is used to limit the work when there is a messy assembly graph to resolve. The assembly will pre-calculate the number of paths (or putative assemblies) and stop if it is greater than the given setting
validate.assembly_min_edge_trim_weight
type: int
default: 3
This is used to simplify the de Bruijn graph before path finding. Edges with less than this frequency will be discarded if they are non-cutting, at a fork, or at the end of a path
validate.assembly_min_exact_match_to_remap
type: int
default: 15
The minimum length of exact matches to initiate remapping a read to a contig
validate.assembly_min_remap_coverage
type: float
default: 0.9
Minimum fraction of the contig sequence which the remapped sequences must align over
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.assembly_min_remapped_seq
type: int
default: 3
The minimum input sequences that must remap for an assembled contig to be used
validate.assembly_min_uniq
type: float
default: 0.1
Minimum percent unique required to keep separate assembled contigs. If contigs are more similar than this, then the lower-scoring, then shorter, contig is dropped
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.assembly_strand_concordance
type: float
default: 0.51
When the number of remapped reads from each strand are compared, the ratio must be above this number to decide on the strand
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.blat_limit_top_aln
type: int
default: 10
Number of results to return from blat (ranking based on score)
validate.blat_min_identity
type: float
default: 0.9
The minimum percent identity match required for blat results when aligning contigs
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.call_error
type: int
default: 10
Buffer zone for the evidence window
validate.clean_aligner_files
type: bool
default: False
Remove the aligner output files after the validation stage is complete. These files are not required for subsequent steps but can be useful in debugging and deep investigation of events
validate.contig_aln_max_event_size
type: int
default: 50
Relates to determining breakpoints when pairing contig alignments. For any given read in a putative pair, the soft clipping is extended to include any events of greater than this size. The soft clipping is added to the side of the alignment as indicated by the breakpoint we are assigning pairs to
validate.contig_aln_merge_inner_anchor
type: int
default: 20
The minimum number of consecutive exact match base pairs to not merge events within a contig alignment
validate.contig_aln_merge_outer_anchor
type: int
default: 15
Minimum consecutively aligned exact matches to anchor an end for merging internal events
validate.contig_aln_min_anchor_size
type: int
default: 50
The minimum number of aligned bases for a contig (M or =) in order to simplify. These do not have to be consecutive
validate.contig_aln_min_extend_overlap
type: int
default: 10
Minimum number of bases the query coverage interval must be extended by in order to pair alignments as a single split alignment
validate.contig_aln_min_query_consumption
type: float
default: 0.9
Minimum fraction of the original query sequence that must be used by the read(s) of the alignment
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.contig_aln_min_score
type: float
default: 0.9
Minimum score for a contig to be used as evidence in a call by contig
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.fetch_min_bin_size
type: int
default: 50
The minimum size of any bin for reading from a bam file. Increasing this number will result in smaller bins being merged or fewer bins being created (depending on the fetch method)
validate.fetch_reads_bins
type: int
default: 5
Number of bins to split an evidence window into to ensure more even sampling of high coverage regions
validate.fetch_reads_limit
type: int
default: 3000
Maximum number of reads, cap, to loop over for any given evidence window
validate.filter_secondary_alignments
type: bool
default: True
Filter secondary alignments when gathering read evidence
validate.fuzzy_mismatch_number
type: int
default: 1
The number of events/mismatches allowed to be considered a fuzzy match
validate.max_sc_preceeding_anchor
type: int
default: 6
When remapping a soft-clipped read, this determines the amount of soft clipping allowed on the side opposite of where we expect it. For example, for a soft-clipped read on a breakpoint with a left orientation, this limits the amount of soft clipping that is allowed on the right. If this is set to none then there is no limit on soft clipping
validate.min_anchor_exact
type: int
default: 6
Applies to re-aligning soft-clipped reads to the opposing breakpoint. The minimum number of consecutive exact matches to anchor a read to initiate targeted realignment
validate.min_anchor_fuzzy
type: int
default: 10
Applies to re-aligning soft-clipped reads to the opposing breakpoint. The minimum length of a fuzzy match to anchor a read to initiate targeted realignment
validate.min_anchor_match
type: float
default: 0.9
Minimum percent match for a read to be kept as evidence
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.min_call_complexity
type: float
default: 0.1
The minimum complexity score for a call sequence. This is an average for non-contig calls. Filters low-complexity contigs before alignment. See contig_complexity
schema definition:
{
"maximum": 1,
"minimum": 0,
"type": "number"
}
validate.min_double_aligned_to_estimate_insertion_size
type: int
default: 2
The minimum number of reads which map soft-clipped to both breakpoints to assume the size of the untemplated sequence between the breakpoints is at most the read length - 2 * min_softclipping
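The bound mentioned above, read length - 2 * min_softclipping, is easy to work through with numbers. The function names below are hypothetical, and the sketch only reproduces the arithmetic from the description; it does not model how MAVIS collects the double-aligned reads.

```python
def max_untemplated_size(read_length, min_softclipping=6):
    """Upper bound on untemplated sequence between the breakpoints.

    A read soft-clipped at both breakpoints needs at least min_softclipping
    bases on each side, leaving the remainder for untemplated sequence.
    """
    return read_length - 2 * min_softclipping


def can_bound_insertion(double_aligned_reads, minimum=2):
    """True when enough reads map soft-clipped to both breakpoints."""
    return double_aligned_reads >= minimum


print(max_untemplated_size(150))  # 150 - 2 * 6
print(can_bound_insertion(1), can_bound_insertion(2))
```

With 150 bp reads and the default min_softclipping of 6, the untemplated sequence is assumed to be at most 138 bases once at least two such reads are found.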
validate.min_flanking_pairs_resolution
type: int
default: 10
The minimum number of flanking reads required to call a breakpoint by flanking evidence
validate.min_linking_split_reads
type: int
default: 2
The minimum number of split reads which aligned to both breakpoints
validate.min_mapping_quality
type: int
default: 5
The minimum mapping quality of reads to be used as evidence
validate.min_non_target_aligned_split_reads
type: int
default: 1
The minimum number of split reads aligned to a breakpoint by the input bam, and not forced by local alignment to the target region, required to call a breakpoint by split read evidence
validate.min_sample_size_to_apply_percentage
type: int
default: 10
Minimum number of aligned bases to compute a match percent. If there are fewer than this number of aligned bases (match or mismatch), the percent comparator is not used
validate.min_softclipping
type: int
default: 6
Minimum number of soft-clipped bases required for a read to be used as soft-clipped evidence
validate.min_spanning_reads_resolution
type: int
default: 5
Minimum number of spanning reads required to call an event by spanning evidence
validate.min_splits_reads_resolution
type: int
default: 3
Minimum number of split reads required to call a breakpoint by split reads
validate.outer_window_min_event_size
type: int
default: 125
The minimum size of an event in order for flanking read evidence to be collected
validate.stdev_count_abnormal
type: float
default: 3
The number of standard deviations away from the normal fragment size that is still considered expected. Read pairs within this range do not qualify as flanking reads
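One way to picture validate.stdev_count_abnormal is as the half-width, in standard deviations, of the expected fragment size range for a library. The sketch below assumes the range is centred on the library's median_fragment_size and uses its stdev_fragment_size; both function names are hypothetical, and the symmetric range is an assumption rather than a description of the MAVIS implementation.

```python
def expected_fragment_range(median_fragment_size, stdev_fragment_size, stdev_count_abnormal=3):
    """Range of fragment sizes treated as expected for the library (assumed symmetric)."""
    spread = stdev_fragment_size * stdev_count_abnormal
    return (median_fragment_size - spread, median_fragment_size + spread)


def is_abnormal(fragment_size, median_fragment_size, stdev_fragment_size, stdev_count_abnormal=3):
    """True when the fragment size falls outside the expected range."""
    low, high = expected_fragment_range(
        median_fragment_size, stdev_fragment_size, stdev_count_abnormal
    )
    return fragment_size < low or fragment_size > high


print(expected_fragment_range(400, 50))
print(is_abnormal(500, 400, 50), is_abnormal(600, 400, 50))
```

For a library with a median fragment size of 400 and a standard deviation of 50, the default of 3 gives an expected range of 250 to 550.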
validate.trans_fetch_reads_limit
type: Union[int, null]
default: 12000
Related to fetch_reads_limit. Overrides fetch_reads_limit for transcriptome libraries when set. If this has a value of none, then fetch_reads_limit will be used for transcriptome libraries instead
validate.trans_min_mapping_quality
type: Union[int, null]
default: 0
Related to min_mapping_quality. Overrides min_mapping_quality if the library is a transcriptome and this is set to any number (not none). If this value is none, min_mapping_quality is used for transcriptomes as well as genomes
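The override rule shared by validate.trans_fetch_reads_limit and validate.trans_min_mapping_quality can be written as a small helper. The function name is hypothetical; the logic follows the two descriptions above. The comparison is against None rather than a truthiness test because the default for trans_min_mapping_quality is 0, which is a valid override.

```python
def effective_setting(protocol, genome_value, trans_value):
    """Pick the value that applies to a library with the given protocol.

    The transcriptome-specific value wins only for transcriptome libraries
    and only when it is not None.
    """
    if protocol == "transcriptome" and trans_value is not None:
        return trans_value
    return genome_value


# min_mapping_quality=5, trans_min_mapping_quality=0 (defaults listed above)
print(effective_setting("transcriptome", 5, 0))
print(effective_setting("genome", 5, 0))
# fetch_reads_limit=3000 with trans_fetch_reads_limit unset
print(effective_setting("transcriptome", 3000, None))
```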
validate.write_evidence_files
type: bool
default: True
Write the intermediate bam and bed files containing the raw evidence collected and contigs aligned. These files are not required for subsequent steps but can be useful in debugging and deep investigation of events