MAVIS (Full) Tutorial

The following tutorial is an introduction to running MAVIS. You will need to download the tutorial data. Additionally the instructions pertain to running MAVIS on a SLURM cluster. This tutorial will require more resources than the mini-tutorial above.

Getting the Tutorial Data

The tutorial data can be downloaded from the link below. Note that it may take a while as the download is \~29GB

wget http://www.bcgsc.ca/downloads/mavis/tutorial_data.tar.gz
tar -xvzf tutorial_data.tar.gz

The expected contents are

Path	Description
README	Information regarding the other files in the directory
L1522785992_expected_events.tab	The events that we expect to find, either experimentally validated or 'spiked' in
L1522785992_normal.sorted.bam	Paired normal library BAM file
L1522785992_normal.sorted.bam.bai	BAM index
L1522785992_trans.sorted.bam	Tumour transcriptome BAM file
L1522785992_trans.sorted.bam.bai	BAM index file
L1522785992_tumour.sorted.bam	Tumour genome BAM file
L1522785992_tumour.sorted.bam.bai	BAM index file
breakdancer-1.4.5/	Contains the BreakDancer output which was run on the tumour genome BAM file
breakseq-2.2/	Contains the BreakSeq output which was run on the tumour genome BAM file
chimerascan-0.4.5/	Contains the ChimeraScan output which was run on the tumour transcriptome BAM file
defuse-0.6.2/	Contains the deFuse output which was run on the tumour transcriptome BAM file
manta-1.0.0/	Contains the Manta output which was run on the tumour genome and paired normal genome BAM files

Downloading the Reference Inputs

Run the following to download the hg19 reference files

wget https://raw.githubusercontent.com/bcgsc/mavis/master/src/tools/get_hg19_reference_files.sh
mkdir reference_inputs
cd reference_inputs
bash get_hg19_reference_files.sh
cd ..

Creating the Config File

Most settings can be left as defaults, however you will need to fill out the libraries and convert sections to tell MAVIS how to convert your inputs and what libraries to expect.

Libraries Settings

For this example, because we want to determine which events are germline/somatic we are going to pass all genome calls to both genomes. We can use either full file paths (if the input is already in the standard format) or the alias from a conversion (the first argument given to the convert option)

{
    "libraries": {
        "L1522785992-normal": { // keyed by library name
            "assign": [ // these are the names of the input files (or conversion aliases) to check for this library
                "breakdancer",
                "breakseq",
                "manta"
            ],
            "bam_file": "tutorial_data/L1522785992_normal.sorted.bam",
            "disease_status": "normal",
            "protocol": "genome"
        },
        "L1522785992-trans": {
            "assign": [
                "chimerascan",
                "defuse"
            ],
            "bam_file": "tutorial_data/L1522785992_trans.sorted.bam",
            "disease_status": "diseased",
            "protocol": "transcriptome",
            "strand_specific": true
        },
        "L1522785992-tumour": {
            "assign": [
                "breakdancer",
                "breakseq",
                "manta"
            ],
            "bam_file": "tutorial_data/L1522785992_tumour.sorted.bam",
            "disease_status": "diseased",
            "protocol": "genome"
        }
    }
}

Convert Settings

If they are raw tool output as in the current example you will need to use the convert argument to tell MAVIS the file type

{
    "convert": {
        "breakdancer": {  // conversion alias/key
            "assume_no_untemplated": true,
            "file_type": "breakdancer",  // input/file type
            "inputs": [
                "tutorial_data/breakdancer-1.4.5/*txt"
            ]
        },
        "breakseq": {
            "assume_no_untemplated": true,
            "file_type": "breakseq",
            "inputs": [
                "tutorial_data/breakseq-2.2/breakseq.vcf.gz"
            ]
        },
        "chimerascan": {
            "assume_no_untemplated": true,
            "file_type": "chimerascan",
            "inputs": [
                "tutorial_data/chimerascan-0.4.5/chimeras.bedpe"
            ]
        },
        "defuse": {
            "assume_no_untemplated": true,
            "file_type": "defuse",
            "inputs": [
                "tutorial_data/defuse-0.6.2/results.classify.tsv"
            ]
        },
        "manta": {
            "assume_no_untemplated": true,
            "file_type": "manta",
            "inputs": [
                "tutorial_data/manta-1.0.0/diploidSV.vcf.gz",
                "tutorial_data/manta-1.0.0/somaticSV.vcf"
            ]
        }
    }
}

Top-level Settings

Finally you will need to set output directory and the reference files

{
  "output_dir": "output_dir_full",  // where to output files
  "reference.aligner_reference": [
      "reference_inputs/hg19.2bit"
  ],
  "reference.annotations": [
      "reference_inputs/ensembl69_hg19_annotations.v3.json"
  ],
  "reference.dgv_annotation": [
      "reference_inputs/dgv_hg19_variants.tab"
  ],
  "reference.masking": [
      "reference_inputs/hg19_masking.tab"
  ],
  "reference.reference_genome": [
      "reference_inputs/hg19.fa"
  ],
  "reference.template_metadata": [
      "reference_inputs/cytoBand.txt"
  ]
}

Running the Workflow

In order to run the snakemake file you will need to have the config validation module mavis_config installed which has minimal dependencies.

pip install mavis_config

You are now ready to run the workflow

snakemake --jobs 100 --configfile=tests/full-tutorial.config.json

Analyzing the Output

The best place to start with looking at the MAVIS output is the summary folder which contains the final results. For column name definitions see the glossary.

output_dir/summary/mavis_summary_all_L1522785992-normal_L1522785992-trans_L1522785992-tumour.tab