Configurable Settings

Defining Samples/Libraries

The libraries property of the MAVIS config is required to run the Snakemake workflow. This section defines which inputs to use and what types of samples are available.

{
    "libraries": {
        "<LIBRARY_NAME>": { }  // mapping of library name to library settings
    }
}

The library-specific settings are listed below.

assign

type: List[str]

List of input files or conversion aliases that should be processed for this library

schema definition:

{
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "type": "array"
}

bam_file

type: str

Path to the bam file containing the sequencing reads for this library

disease_status

type: str

schema definition:

{
    "enum": [
        "diseased",
        "normal"
    ],
    "type": "string"
}

median_fragment_size

type: int

The median fragment size in the paired-end read library. This will be computed from the bam during initialization of the config if not given

protocol

type: str

schema definition:

{
    "enum": [
        "genome",
        "transcriptome"
    ],
    "type": "string"
}

read_length

type: int

The read length in the paired-end read library. This will be computed from the bam during initialization of the config if not given

stdev_fragment_size

type: int

The standard deviation of fragment size in the paired-end read library. This will be computed from the bam during initialization of the config if not given

strand_determining_read

type: int

default: 2

Either 1 or 2. Assuming a stranded protocol, this indicates whether the first or second read in the pair matches the strand sequenced

strand_specific

type: bool

default: False

total_batches

type: int

The number of jobs to split a library into for cluster/validate/annotate. This will be set during initialization of the config if not given

schema definition:

{
    "min": 1,
    "type": "integer"
}
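As a rough illustration of how the library settings fit together, the sketch below builds one library entry in Python and checks it against the enums and limits listed above. The library name, the paths, and the check_library helper are hypothetical and not part of MAVIS, which performs its own schema validation. Treating assign, disease_status, and protocol as required is an assumption of this sketch.

```python
import json

# Allowed values copied from the schema definitions above
DISEASE_STATUS = {"diseased", "normal"}
PROTOCOL = {"genome", "transcriptome"}


def check_library(settings):
    """Hypothetical helper: return a list of problems found in one library entry."""
    problems = []
    if not settings.get("assign"):
        problems.append("assign must list at least one input file or conversion alias")
    # Assumption of this sketch: disease_status and protocol must be present
    if settings.get("disease_status") not in DISEASE_STATUS:
        problems.append("disease_status must be 'diseased' or 'normal'")
    if settings.get("protocol") not in PROTOCOL:
        problems.append("protocol must be 'genome' or 'transcriptome'")
    if settings.get("strand_determining_read", 2) not in (1, 2):
        problems.append("strand_determining_read must be 1 or 2")
    if settings.get("total_batches", 1) < 1:
        problems.append("total_batches must be at least 1")
    return problems


# Library name, alias, and path are placeholders for illustration only
config = {
    "libraries": {
        "tumour_genome": {
            "assign": ["manta_calls"],
            "bam_file": "/path/to/tumour.bam",
            "disease_status": "diseased",
            "protocol": "genome",
        }
    }
}

for name, settings in config["libraries"].items():
    print(name, check_library(settings) or "ok")

# The same structure serialized as it might appear in a JSON config file
print(json.dumps(config, indent=4))
```

Settings that MAVIS computes during initialization (median_fragment_size, read_length, stdev_fragment_size, total_batches) are deliberately left out of the example entry.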

Defining Conversions

If the input to MAVIS is raw tool output that has not been pre-converted to the standard tab-delimited format MAVIS expects, you will need to add a section to the config telling MAVIS how to perform the required conversions.

{
    "convert": {
        "<ALIAS>": { }  // mapping of alias to conversion settings
    }
}

The conversion-specific settings are listed below.

assume_no_untemplated

type: bool

default: False

Assume the lack of untemplated information means that there is no untemplated sequence expected at the breakpoints

file_type

type: str

The tool the input file comes from. This determines the loader method that will be used. If not given, the loader method defaults to the tool name that is given. This value should be 'mavis' for standard MAVIS-style tab files, or 'vcf' for the general vcf loader

schema definition:

{
    "enum": [
        "breakdancer",
        "breakseq",
        "chimerascan",
        "cnvnator",
        "cutesv",
        "defuse",
        "delly",
        "manta",
        "mavis",
        "pindel",
        "sniffles",
        "starfusion",
        "straglr",
        "strelka",
        "transabyss",
        "vcf"
    ],
    "type": "string"
}

inputs

type: List[str]

List of input files

schema definition:

{
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "type": "array"
}

strand_specific

type: bool

default: False

tool_name

type: str

Name of the tool to be used in MAVIS output files (in the tools column). If not given the file_type will be used as the tool name instead

schema definition:

{
    "pattern": "^[^;]+$",
    "type": "string"
}
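The relationship between file_type and tool_name described above can be sketched as follows. The resolve_names helper, the alias, and the input path are hypothetical; the sketch only mirrors the defaulting described in the text and the tool_name pattern, and is not MAVIS's own code.

```python
import re

# Pattern copied from the tool_name schema above: one or more characters, no semicolons
TOOL_NAME_PATTERN = re.compile(r"^[^;]+$")


def resolve_names(settings):
    """Hypothetical helper mirroring the defaulting described above.

    Returns (file_type, tool_name), filling whichever one is missing from the other.
    """
    file_type = settings.get("file_type") or settings.get("tool_name")
    tool_name = settings.get("tool_name") or settings.get("file_type")
    if tool_name is None:
        # Assumption of this sketch: at least one of the two must be supplied
        raise ValueError("at least one of file_type or tool_name is needed")
    if not TOOL_NAME_PATTERN.match(tool_name):
        raise ValueError(f"tool_name may not contain ';': {tool_name!r}")
    return file_type, tool_name


# Alias and input path are placeholders for illustration only
convert = {
    "manta_calls": {
        "file_type": "manta",
        "inputs": ["/path/to/manta/diploidSV.vcf"],
        "assume_no_untemplated": True,
    }
}

print(resolve_names(convert["manta_calls"]))
```

A library would then refer to this conversion by listing the alias (here manta_calls) in its assign setting.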

General Settings

annotate.annotation_filters

type: List[str]

default: ['choose_more_annotated', 'choose_transcripts_by_priority']

A comma separated list of filters to apply to putative annotations

schema definition:

{
    "items": {
        "enum": [
            "choose_more_annotated",
            "choose_transcripts_by_priority"
        ],
        "type": "string"
    },
    "type": "array"
}

annotate.draw_fusions_only

type: bool

default: True

Flag to indicate if events which do not produce a fusion transcript should produce illustrations

annotate.draw_non_synonymous_cdna_only

type: bool

default: True

Flag to indicate if events which are synonymous at the cdna level should produce illustrations

annotate.max_orf_cap

type: int

default: 3

The maximum number of orfs to return (best putative orfs will be retained)

annotate.min_domain_mapping_match

type: float

default: 0.9

A number between 0 and 1 representing the minimum percent match a domain must map to the fusion transcript to be displayed

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

annotate.min_orf_size

type: int

default: 300

The minimum length (in base pairs) to retain a putative open reading frame (orf)

bam_stats.distribution_fraction

type: float

default: 0.97

The proportion of the distribution to use in computing the standard deviation

schema definition:

{
    "maximum": 1,
    "minimum": 0.01,
    "type": "number"
}

bam_stats.sample_bin_size

type: int

default: 1000

How large to make the sample bin (in bp)

bam_stats.sample_cap

type: int

default: 1000

Maximum number of reads to collect for any given sample region

bam_stats.sample_size

type: int

default: 500

The number of genes/bins to compute stats over

cluster.cluster_initial_size_limit

type: int

default: 25

The maximum cumulative size of both breakpoints for breakpoint pairs to be used in the initial clustering phase (combining based on overlap)

cluster.cluster_radius

type: int

default: 100

Maximum distance allowed between paired breakpoint pairs

cluster.limit_to_chr

type: Union[List, null]

default: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y']

A list of chromosome names to use. Breakpoint pairs on other chromosomes will be filtered out. For example, '1 2 3 4' would filter out events/breakpoint pairs on any chromosomes other than 1, 2, 3, and 4

schema definition:

{
    "items": {
        "type": "string"
    },
    "type": [
        "array",
        "null"
    ]
}
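The effect of cluster.limit_to_chr can be pictured with a small filter. This is a minimal sketch, not MAVIS's implementation: the tuple representation of a breakpoint pair is hypothetical, and it assumes that both breakpoints must lie on an allowed chromosome and that a null value disables the filter.

```python
def filter_by_chromosome(pairs, limit_to_chr):
    """Keep only pairs whose breakpoints both lie on allowed chromosomes.

    Assumptions of this sketch: None (null) disables the filter, and a pair
    is dropped if either of its breakpoints is on a chromosome not listed.
    """
    if limit_to_chr is None:
        return list(pairs)
    allowed = set(limit_to_chr)
    return [pair for pair in pairs if pair[0] in allowed and pair[1] in allowed]


# Each pair is (chromosome of first breakpoint, chromosome of second breakpoint)
pairs = [("1", "1"), ("2", "5"), ("4", "X"), ("GL000211.1", "3")]

print(filter_by_chromosome(pairs, ["1", "2", "3", "4"]))  # [('1', '1')]
print(len(filter_by_chromosome(pairs, None)))  # 4
```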

cluster.max_files

type: int

default: 200

The maximum number of files to output from clustering/splitting

schema definition:

{
    "minimum": 1,
    "type": "integer"
}

cluster.max_proximity

type: int

default: 5000

The maximum distance away from an annotation before the region is considered to be uninformative

cluster.min_clusters_per_file

type: int

default: 50

The minimum number of breakpoint pairs to output to a file

schema definition:

{
    "minimum": 1,
    "type": "integer"
}

cluster.split_only

type: bool

default: False

Just split the input files, do not merge input breakpoints into clusters

cluster.uninformative_filter

type: bool

default: False

Flag that determines if breakpoint pairs which are not within max_proximity to any annotations are filtered out prior to clustering

illustrate.breakpoint_color

type: str

default: '#000000'

Breakpoint outline color

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}
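The same pattern is used for every illustrate color setting. The sketch below shows what it accepts when applied with search semantics, which is how JSON Schema treats pattern. Note that the pattern is not anchored at the end and allows any letter, so it is looser than a strict hex color check. The STRICT_HEX expression is an illustration only and is not part of the MAVIS schema.

```python
import re

# Pattern copied verbatim from the schema definition above
COLOR_PATTERN = re.compile(r"^#[a-zA-Z0-9]{6}")
# Stricter check (not part of the MAVIS schema): exactly six hexadecimal digits
STRICT_HEX = re.compile(r"^#[0-9a-fA-F]{6}$")

for value in ["#000000", "#ccccb3", "#zzzzzz", "#0000000", "000000"]:
    schema_ok = bool(COLOR_PATTERN.search(value))
    strict_ok = bool(STRICT_HEX.search(value))
    print(f"{value!r}: schema pattern={schema_ok}, strict hex={strict_ok}")
```

Values such as '#zzzzzz' or '#0000000' pass the schema pattern but are not valid six-digit hex colors, so it is worth checking color values by eye before running.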

illustrate.domain_color

type: str

default: '#ccccb3'

Domain fill color

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.domain_mismatch_color

type: str

default: '#b2182b'

Domain fill color on 0% match

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.domain_name_regex_filter

type: str

default: '^PF\\d+$'

The regular expression used to select domains to be displayed (filtered by name)
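As a quick check of what the default expression selects, the sketch below applies it to a handful of names. The default above is shown in escaped form; the expression itself is ^PF\d+$. The list of names is made up for illustration.

```python
import re

# The default regular expression, written as a raw string
DOMAIN_NAME_FILTER = re.compile(r"^PF\d+$")

# Example domain names for illustration only
names = ["PF00069", "PF07714", "SM00220", "PF", "xPF00069"]

# Only names made of 'PF' followed by one or more digits are kept
displayed = [name for name in names if DOMAIN_NAME_FILTER.match(name)]
print(displayed)  # ['PF00069', 'PF07714']
```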

illustrate.domain_scaffold_color

type: str

default: '#000000'

The color of the domain scaffold

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.drawing_width_iter_increase

type: int

default: 500

The amount (in pixels) by which to increase the drawing width upon failure to fit

illustrate.exon_min_focus_size

type: int

default: 10

Minimum size of an exon for it to be granted a label or min exon width

illustrate.gene1_color

type: str

default: '#657e91'

The color of genes near the first gene

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.gene1_color_selected

type: str

default: '#518dc5'

The color of the first gene

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.gene2_color

type: str

default: '#325556'

The color of genes near the second gene

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.gene2_color_selected

type: str

default: '#4c9677'

The color of the second gene

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.label_color

type: str

default: '#000000'

The label color

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.mask_fill

type: str

default: '#ffffff'

Color of mask (for deleted region etc.)

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.mask_opacity

type: float

default: 0.7

Opacity of the mask layer

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

illustrate.max_drawing_retries

type: int

default: 5

The maximum number of retries for attempting a drawing. The width is extended on each iteration. If it is still insufficient after this number of retries, a gene-level-only drawing will be output

illustrate.novel_exon_color

type: str

default: '#5D3F6A'

Novel exon fill color

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.scaffold_color

type: str

default: '#000000'

The color used for the gene/transcripts scaffolds

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.splice_color

type: str

default: '#000000'

Splicing lines color

schema definition:

{
    "pattern": "^#[a-zA-Z0-9]{6}",
    "type": "string"
}

illustrate.width

type: int

default: 1000

The drawing width in pixels

log

type: str

log_level

type: str

default: 'INFO'

schema definition:

{
    "enum": [
        "INFO",
        "DEBUG"
    ],
    "type": "string"
}

output_dir

type: str

Path to the directory to output the MAVIS files to

pairing.contig_call_distance

type: int

default: 10

The maximum distance allowed between breakpoint pairs (called by contig) in order for them to pair

pairing.flanking_call_distance

type: int

default: 50

The maximum distance allowed between breakpoint pairs (called by flanking pairs) in order for them to pair

pairing.input_call_distance

type: int

default: 20

The maximum distance allowed between breakpoint pairs (called by input tools, not validated) in order for them to pair

pairing.spanning_call_distance

type: int

default: 20

The maximum distance allowed between breakpoint pairs (called by spanning reads) in order for them to pair

pairing.split_call_distance

type: int

default: 20

The maximum distance allowed between breakpoint pairs (called by split reads) in order for them to pair

reference.aligner_reference

type: List[str]

The reference genome file used by the aligner

schema definition:

{
    "examples": [
        "tests/data/mock_reference_genome.2bit"
    ],
    "items": {
        "type": "string"
    },
    "maxItems": 1,
    "minItems": 1,
    "type": "array"
}

reference.annotations

type: List[str]

The reference file containing gene/transcript position information

schema definition:

{
    "examples": [
        "tests/data/mock_annotations.json"
    ],
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "type": "array"
}

reference.dgv_annotation

type: List[str]

schema definition:

{
    "examples": [
        [
            "tests/data/mock_dgv_annotation.txt"
        ]
    ],
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "type": "array"
}

reference.masking

type: List[str]

A list of regions to ignore in validation. Generally these are centromeres and telomeres or known poor mapping areas

schema definition:

{
    "examples": [
        [
            "tests/data/mock_masking.tab"
        ]
    ],
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "type": "array"
}

reference.reference_genome

type: List[str]

schema definition:

{
    "examples": [
        [
            "tests/data/mock_reference_genome.fa"
        ]
    ],
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "type": "array"
}

reference.template_metadata

type: List[str]

schema definition:

{
    "examples": [
        [
            "tests/data/cytoBand.txt"
        ]
    ],
    "items": {
        "type": "string"
    },
    "minItems": 1,
    "type": "array"
}

skip_stage.validate

type: bool

default: False

Skip the validation stage of the MAVIS pipeline

summary.cluster_radius

type: int

default: 10

Distance used in matching input SVs to reference SVs through clustering

schema definition:

{
    "minimum": 0,
    "type": "integer"
}

summary.filter_cdna_synon

type: bool

default: True

Filter all annotations synonymous at the cdna level

summary.filter_min_complexity

type: float

default: 0.2

Filter event calls based on call sequence complexity

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

summary.filter_min_flanking_reads

type: int

default: 10

Minimum number of flanking pairs for a call by flanking pairs

summary.filter_min_linking_split_reads

type: int

default: 1

Minimum number of linking split reads for a call by split reads

summary.filter_min_remapped_reads

type: int

default: 5

Minimum number of remapped reads for a call by contig

summary.filter_min_spanning_reads

type: int

default: 5

Minimum number of spanning reads for a call by spanning reads

summary.filter_min_split_reads

type: int

default: 5

Minimum number of split reads for a call by split reads

summary.filter_protein_synon

type: bool

default: False

Filter all annotations synonymous at the protein level

summary.filter_trans_homopolymers

type: bool

default: True

Filter all single bp ins/del/dup events that are in a homopolymer region of at least 3 bps and are not paired to a genomic event

validate.aligner

type: str

default: 'blat'

The aligner to use to map the contigs/reads back to the reference, e.g. blat or bwa mem

schema definition:

{
    "enum": [
        "bwa mem",
        "blat"
    ],
    "type": "string"
}

validate.assembly_kmer_size

type: float

default: 0.74

The percent of the read length to make kmers for assembly

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.assembly_max_paths

type: int

default: 8

The maximum number of paths to resolve. This limits the work done when the assembly graph is messy. The assembly pre-calculates the number of paths (or putative assemblies) and stops if it is greater than this setting

validate.assembly_min_edge_trim_weight

type: int

default: 3

This is used to simplify the De Bruijn graph before path finding. Edges with less than this frequency will be discarded if they are non-cutting, at a fork, or the end of a path

validate.assembly_min_exact_match_to_remap

type: int

default: 15

The minimum length of exact matches to initiate remapping a read to a contig

validate.assembly_min_remap_coverage

type: float

default: 0.9

Minimum fraction of the contig sequence which the remapped sequences must align over

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.assembly_min_remapped_seq

type: int

default: 3

The minimum input sequences that must remap for an assembled contig to be used

validate.assembly_min_uniq

type: float

default: 0.1

Minimum percent uniq required to keep separate assembled contigs. If contigs are more similar than this, the lower-scoring (then shorter) contig is dropped

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.assembly_strand_concordance

type: float

default: 0.51

When the number of remapped reads from each strand are compared, the ratio must be above this number to decide on the strand

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.blat_limit_top_aln

type: int

default: 10

Number of results to return from blat (ranking based on score)

validate.blat_min_identity

type: float

default: 0.9

The minimum percent identity match required for blat results when aligning contigs

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.call_error

type: int

default: 10

Buffer zone for the evidence window

validate.clean_aligner_files

type: bool

default: False

Remove the aligner output files after the validation stage is complete. Not required for subsequent steps but can be useful in debugging and deep investigation of events

validate.contig_aln_max_event_size

type: int

default: 50

Relates to determining breakpoints when pairing contig alignments. For any given read in a putative pair, the soft clipping is extended to include any events greater than this size. The soft clipping is added to the side of the alignment indicated by the breakpoint we are assigning pairs to

validate.contig_aln_merge_inner_anchor

type: int

default: 20

The minimum number of consecutive exact match base pairs to not merge events within a contig alignment

validate.contig_aln_merge_outer_anchor

type: int

default: 15

Minimum consecutively aligned exact matches to anchor an end for merging internal events

validate.contig_aln_min_anchor_size

type: int

default: 50

The minimum number of aligned bases for a contig (M or =) in order to simplify. These do not have to be consecutive

validate.contig_aln_min_extend_overlap

type: int

default: 10

Minimum number of bases the query coverage interval must be extended by in order to pair alignments as a single split alignment

validate.contig_aln_min_query_consumption

type: float

default: 0.9

Minimum fraction of the original query sequence that must be used by the read(s) of the alignment

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.contig_aln_min_score

type: float

default: 0.9

Minimum score for a contig to be used as evidence in a call by contig

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.fetch_min_bin_size

type: int

default: 50

The minimum size of any bin for reading from a bam file. Increasing this number will result in smaller bins being merged or fewer bins being created (depending on the fetch method)

validate.fetch_reads_bins

type: int

default: 5

Number of bins to split an evidence window into to ensure more even sampling of high coverage regions

validate.fetch_reads_limit

type: int

default: 3000

Maximum number of reads, cap, to loop over for any given evidence window

validate.filter_secondary_alignments

type: bool

default: True

Filter secondary alignments when gathering read evidence

validate.fuzzy_mismatch_number

type: int

default: 1

The number of events/mismatches allowed to be considered a fuzzy match

validate.max_sc_preceeding_anchor

type: int

default: 6

When remapping a soft-clipped read, this determines the amount of soft clipping allowed on the side opposite to where we expect it. For example, for a soft-clipped read on a breakpoint with a left orientation, this limits the amount of soft clipping allowed on the right. If this is set to none then there is no limit on soft clipping

validate.min_anchor_exact

type: int

default: 6

Applies to re-aligning soft-clipped reads to the opposing breakpoint. The minimum number of consecutive exact matches to anchor a read to initiate targeted realignment

validate.min_anchor_fuzzy

type: int

default: 10

Applies to re-aligning soft-clipped reads to the opposing breakpoint. The minimum length of a fuzzy match to anchor a read to initiate targeted realignment

validate.min_anchor_match

type: float

default: 0.9

Minimum percent match for a read to be kept as evidence

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.min_call_complexity

type: float

default: 0.1

The minimum complexity score for a call sequence. For non-contig calls this is an average. Low-complexity contigs are filtered before alignment. See contig_complexity

schema definition:

{
    "maximum": 1,
    "minimum": 0,
    "type": "number"
}

validate.min_double_aligned_to_estimate_insertion_size

type: int

default: 2

The minimum number of reads which map soft-clipped to both breakpoints to assume the size of the untemplated sequence between the breakpoints is at most the read length - 2 * min_softclipping
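The bound in this description is simple arithmetic. A worked example follows, assuming 150 bp reads (an illustrative value) and the documented default of 6 for validate.min_softclipping. The function name is hypothetical.

```python
def max_untemplated_size(read_length, min_softclipping):
    """Upper bound on untemplated sequence length: read length - 2 * min_softclipping."""
    return read_length - 2 * min_softclipping


# 150 bp reads are an assumed example; 6 is the default for validate.min_softclipping
print(max_untemplated_size(read_length=150, min_softclipping=6))  # 138
```

The reasoning is that a read soft-clipped at both breakpoints must keep at least min_softclipping bases on each side, leaving at most the remainder of the read for untemplated sequence.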

validate.min_flanking_pairs_resolution

type: int

default: 10

The minimum number of flanking reads required to call a breakpoint by flanking evidence

validate.min_linking_split_reads

type: int

default: 2

The minimum number of split reads which aligned to both breakpoints

validate.min_mapping_quality

type: int

default: 5

The minimum mapping quality of reads to be used as evidence

validate.min_non_target_aligned_split_reads

type: int

default: 1

The minimum number of split reads aligned to a breakpoint by the input bam, and not forced by local alignment to the target region, required to call a breakpoint by split read evidence

validate.min_sample_size_to_apply_percentage

type: int

default: 10

Minimum number of aligned bases to compute a match percent. If there are fewer than this number of aligned bases (match or mismatch) the percent comparator is not used

validate.min_softclipping

type: int

default: 6

Minimum number of soft-clipped bases required for a read to be used as soft-clipped evidence

validate.min_spanning_reads_resolution

type: int

default: 5

Minimum number of spanning reads required to call an event by spanning evidence

validate.min_splits_reads_resolution

type: int

default: 3

Minimum number of split reads required to call a breakpoint by split reads

validate.outer_window_min_event_size

type: int

default: 125

The minimum size of an event in order for flanking read evidence to be collected

validate.stdev_count_abnormal

type: float

default: 3

The number of standard deviations away from the normal considered expected and therefore not qualifying as flanking reads
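One way to read this setting is as the half-width, in standard deviations, of the range of fragment sizes treated as expected. The sketch below assumes a symmetric range centred on the library's median_fragment_size and using its stdev_fragment_size. That centring, the function names, and the example library values are assumptions of this sketch rather than a description of MAVIS internals.

```python
def expected_fragment_range(median, stdev, stdev_count_abnormal=3):
    """Return (low, high) bounds on fragment sizes treated as expected in this sketch."""
    spread = stdev_count_abnormal * stdev
    return max(0, median - spread), median + spread


def is_abnormal(fragment_size, median, stdev, stdev_count_abnormal=3):
    """True if the fragment size falls outside the expected range."""
    low, high = expected_fragment_range(median, stdev, stdev_count_abnormal)
    return fragment_size < low or fragment_size > high


# Assumed example library: median fragment size 400, standard deviation 50
print(expected_fragment_range(400, 50))  # (250, 550)
print(is_abnormal(500, 400, 50))  # False
print(is_abnormal(600, 400, 50))  # True
```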

validate.trans_fetch_reads_limit

type: Union[int, null]

default: 12000

Related to fetch_reads_limit. Overrides fetch_reads_limit for transcriptome libraries when set. If this has a value of none then fetch_reads_limit will be used for transcriptome libraries instead

validate.trans_min_mapping_quality

type: Union[int, null]

default: 0

Related to min_mapping_quality. Overrides min_mapping_quality if the library is a transcriptome and this is set to any number (not none). If this value is none, min_mapping_quality is used for transcriptomes as well as genomes
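The override behaviour described for trans_fetch_reads_limit and trans_min_mapping_quality follows the same fallback rule, sketched below. The effective_setting helper is hypothetical. The comparison uses "is not None" rather than truthiness because the default trans_min_mapping_quality of 0 is a real override, not an unset value.

```python
def effective_setting(protocol, general_value, trans_value):
    """Pick the value that applies to a library with the given protocol.

    The trans_* value wins only for transcriptome libraries and only when it
    is not None; otherwise the general setting is used.
    """
    if protocol == "transcriptome" and trans_value is not None:
        return trans_value
    return general_value


# Documented defaults: min_mapping_quality=5, trans_min_mapping_quality=0
print(effective_setting("transcriptome", 5, 0))  # 0
print(effective_setting("genome", 5, 0))  # 5

# Documented defaults: fetch_reads_limit=3000, trans_fetch_reads_limit=12000
print(effective_setting("transcriptome", 3000, 12000))  # 12000
print(effective_setting("transcriptome", 3000, None))  # 3000
```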

validate.write_evidence_files

type: bool

default: True

Write the intermediate bam and bed files containing the raw evidence collected and contigs aligned. Not required for subsequent steps but can be useful in debugging and deep investigation of events