mavis/convert/vcf
PANDAS_DEFAULT_NA_VALUES
PANDAS_DEFAULT_NA_VALUES = [
'-1.#IND',
'1.#QNAN',
'1.#IND',
'-1.#QNAN',
'#N/A',
'N/A',
'NA',
'#NA',
'NULL',
'NaN',
'-NaN',
'nan',
'-nan',
]
class VcfInfoType
inherits TypedDict
Attributes
- SVTYPE (str)
- CHR2 (str)
- CIPOS (Tuple[int, int])
- CIEND (Tuple[int, int])
- CILEN (Tuple[int, int])
- CT (str)
- END (Optional[int])
- PRECISE (bool)
class VcfRecordType
Attributes
- id (str)
- pos (int)
- chrom (str)
- alts (List[Optional[str]])
- info (VcfInfoType)
- ref (str)
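The two record types above can be sketched as follows. This is a minimal illustration built only from the attribute lists on this page. The page does not state the base class of VcfRecordType, so a dataclass is assumed here, and `total=False` (all INFO keys optional) is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, TypedDict


class VcfInfoType(TypedDict, total=False):
    # total=False is an assumption: INFO keys vary between callers
    SVTYPE: str
    CHR2: str
    CIPOS: Tuple[int, int]
    CIEND: Tuple[int, int]
    CILEN: Tuple[int, int]
    CT: str
    END: Optional[int]
    PRECISE: bool


@dataclass
class VcfRecordType:
    # Assumed to be a dataclass; only the attribute names/types come from the docs
    id: str
    pos: int
    chrom: str
    alts: List[Optional[str]]
    info: VcfInfoType
    ref: str


# Hypothetical breakend record used purely for illustration
record = VcfRecordType(
    id='bnd_1',
    pos=198982,
    chrom='2',
    alts=['G]17:198982]'],
    info=VcfInfoType(SVTYPE='BND', CIPOS=(-10, 10), PRECISE=False),
    ref='G',
)
```

Because the INFO mapping is treated as partial in this sketch, missing keys are best read with `record.info.get('END')` rather than direct indexing.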
parse_bnd_alt()
Parses the alt statement from VCF files using the VCF 4.2 specification.
Assumes that the reference base is always the outermost base. This is based on the spec and also on Manta results, as the spec was missing some cases.
- r = reference base/seq
- u = untemplated sequence/alternate sequence
- p = chromosome:position
alt format | orients |
---|---|
ru[p[ | LR |
[p[ur | RR |
]p]ur | RL |
ru]p] | LL |
def parse_bnd_alt(alt: str) -> Tuple[str, int, str, str, str, str]:
Args
- alt (str)
Returns
Tuple[str, int, str, str, str, str]
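The four alt formats in the table above can be matched with one regular expression each. The following is a minimal sketch, not the library's implementation: the function name `parse_bnd_alt_sketch`, the order of the returned fields (chromosome, position, first orientation, second orientation, reference base, untemplated sequence), and the use of `ValueError` for unrecognised input are all assumptions made for illustration.

```python
import re
from typing import Tuple

_MATE = r'(?P<chrom>[^:\[\]]+):(?P<pos>\d+)'  # p = chromosome:position

# (pattern, orientation pair) for each row of the table above
_BND_PATTERNS = [
    (re.compile(r'^(?P<ref>\w)(?P<useq>\w*)\[' + _MATE + r'\[$'), ('L', 'R')),  # ru[p[
    (re.compile(r'^\[' + _MATE + r'\[(?P<useq>\w*)(?P<ref>\w)$'), ('R', 'R')),  # [p[ur
    (re.compile(r'^\]' + _MATE + r'\](?P<useq>\w*)(?P<ref>\w)$'), ('R', 'L')),  # ]p]ur
    (re.compile(r'^(?P<ref>\w)(?P<useq>\w*)\]' + _MATE + r'\]$'), ('L', 'L')),  # ru]p]
]


def parse_bnd_alt_sketch(alt: str) -> Tuple[str, int, str, str, str, str]:
    """Return (chrom, pos, orient1, orient2, ref, untemplated) for a breakend alt.

    The reference base is taken to be the outermost base, as stated above.
    Contig names containing ':' are not handled by this sketch.
    """
    for pattern, (orient1, orient2) in _BND_PATTERNS:
        match = pattern.match(alt)
        if match:
            return (
                match.group('chrom'),
                int(match.group('pos')),
                orient1,
                orient2,
                match.group('ref'),
                match.group('useq'),
            )
    raise ValueError(f'alt specification in unexpected format: {alt!r}')
```

For instance, `parse_bnd_alt_sketch(']13:123456]AGTNNNNNCAT')` treats the trailing `T` as the reference base and `AGTNNNNNCA` as the untemplated sequence.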
convert_imprecise_breakend()
Handles IMPRECISE calls by using the uncertainty recorded in the CIPOS/CIEND/CILEN fields.
- bp1_s = breakpoint1 start
- bp1_e = breakpoint1 end
- bp2_s = breakpoint2 start
- bp2_e = breakpoint2 end
Insertion and deletion edge case, in which bp1_e > bp2_s. E.g. bp1_s = 1890, bp1_e = 2000, bp2_s = 1900, bp2_e = 1900.
break1 ------------------------=======================--------------
break2 ------------------------==========---------------------------
Insertion edge case, in which bp1_s > bp1_e. E.g. bp1_s = 1890, bp1_e = 1800, bp2_s = 1800, bp2_e = 1800.
break1 ------------------------==-----------------------------------
break2 ------------------------=------------------------------------
Insertion edge case, in which bp1_s > bp2_s. E.g. bp1_s = 1950, bp1_e = 2000, bp2_s = 1900, bp2_e = 3000.
break1 ------------------------==-----------------------------------
break2 -----------------------========------------------------------
def convert_imprecise_breakend(std_row: Dict, record: List[VcfRecordType], bp_end: int):
Args
- std_row (Dict)
- record (List[VcfRecordType])
- bp_end (int)
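To make the first edge case concrete, the sketch below widens a position by a confidence interval and checks whether the two breakpoint intervals overlap. It only illustrates how the intervals in the example arise. The helper names `widen` and `overlaps`, and the clamping of the start to 1 (VCF positions are 1-based), are assumptions, and the sketch does not reproduce how convert_imprecise_breakend resolves the overlap.

```python
from typing import Tuple

Interval = Tuple[int, int]


def widen(pos: int, ci: Interval) -> Interval:
    """Apply a (lower offset, upper offset) confidence interval to a position."""
    # Clamp at 1 (assumption): VCF coordinates are 1-based
    return max(1, pos + ci[0]), pos + ci[1]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical inputs chosen to reproduce the first edge case above
break1 = widen(1900, (-10, 100))  # bp1_s = 1890, bp1_e = 2000
break2 = widen(1900, (0, 0))      # bp2_s = 1900, bp2_e = 1900

needs_adjustment = overlaps(break1, break2)  # bp1_e > bp2_s here
```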
convert_record()
Converts a VCF record.
def convert_record(record: VcfRecordType) -> List[Dict]:
Args
- record (VcfRecordType)
Returns
List[Dict]
Note
CT = connection type. If given, this field will be used in determining the orientation at the breakpoints. From https://groups.google.com/forum/#!topic/delly-users/6Mq2juBraRY, we can expect certain CT types for certain event types:
- translocation/inverted translocation: 3to3, 3to5, 5to3, 5to5
- inversion: 3to3, 5to5
- deletion: 3to5
- duplication: 5to3
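The expected CT values in this note can be written as a lookup table. The sketch below is illustrative only: the table is copied from the note, while the helper names and the mapping of `3` to a left (`L`) orientation and `5` to a right (`R`) orientation are assumptions, chosen so that a deletion (3to5) comes out as LR and a duplication (5to3) as RL.

```python
from typing import Dict, Set, Tuple

_ALL_CT = {'3to3', '3to5', '5to3', '5to5'}

# Expected connection types per event type, as listed in the note
EXPECTED_CT: Dict[str, Set[str]] = {
    'translocation': _ALL_CT,
    'inverted translocation': _ALL_CT,
    'inversion': {'3to3', '5to5'},
    'deletion': {'3to5'},
    'duplication': {'5to3'},
}

# Assumption: a 3' end keeps the sequence to the left, a 5' end to the right
CT_END_TO_ORIENT = {'3': 'L', '5': 'R'}


def ct_to_orientations(ct: str) -> Tuple[str, str]:
    """Split a CT value such as '3to5' into a pair of orientations."""
    end1, end2 = ct.split('to')
    return CT_END_TO_ORIENT[end1], CT_END_TO_ORIENT[end2]


def ct_is_expected(event_type: str, ct: str) -> bool:
    """Check a CT value against the table from the note."""
    return ct in EXPECTED_CT.get(event_type, set())
```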
pandas_vcf()
Reads a standard VCF file into a pandas DataFrame.
def pandas_vcf(input_file: str) -> Tuple[List[str], pd.DataFrame]:
Args
- input_file (str)
Returns
Tuple[List[str], pd.DataFrame]
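The return shape (header lines plus a DataFrame) can be illustrated with a small sketch. It is not the library's implementation: `read_vcf_text` is a hypothetical helper that takes the file contents as a string so the example is self-contained, `NA_VALUES` is an abbreviated stand-in for PANDAS_DEFAULT_NA_VALUES, and passing that list with `keep_default_na=False` is an assumption about how such a list would typically be used.

```python
import io
from typing import List, Tuple

import pandas as pd

# Abbreviated stand-in for PANDAS_DEFAULT_NA_VALUES
NA_VALUES = ['#N/A', 'N/A', 'NA', 'NULL', 'NaN', 'nan']


def read_vcf_text(text: str) -> Tuple[List[str], pd.DataFrame]:
    """Split '##' meta lines from the tab-delimited body and parse the body."""
    lines = text.splitlines()
    header_lines = [line for line in lines if line.startswith('##')]
    body = '\n'.join(line for line in lines if not line.startswith('##'))
    df = pd.read_csv(
        io.StringIO(body),
        sep='\t',
        dtype={'#CHROM': str, 'POS': int, 'ID': str},
        na_values=NA_VALUES,
        keep_default_na=False,  # only the values listed above become NaN
    )
    return header_lines, df.rename(columns={'#CHROM': 'CHROM'})


# Hypothetical single-record VCF used for illustration
EXAMPLE = (
    '##fileformat=VCFv4.2\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    '1\t100\tbnd_1\tG\tG]17:198982]\t.\tPASS\tSVTYPE=BND\n'
)
header, df = read_vcf_text(EXAMPLE)
```

Reading chromosome names as strings keeps values such as `1` and `X` in a single column type.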
convert_file()
Processes a VCF file.
def convert_file(input_file: str) -> List[Dict]:
Args
- input_file (str): the input file name
Returns
List[Dict]
Raises
- err: [description]