mavis/validate/blat
In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
-- http://wiki.bits.vib.be/index.php/Blat
class Blat
Blat.millibad()
this function is used in calculating percent identity direct translation of the perl code (https://genome.ucsc.edu/FAQ/FAQblat.html#blat4)
@staticmethod
def millibad(row: Dict, is_protein: bool = False, is_mrna: bool = True) -> float:
Args
- row (
Dict
) - is_protein (
bool
) - is_mrna (
bool
)
Returns
float
Blat.score()
direct translation from ucsc guidelines on replicating the web blat score https://genome.ucsc.edu/FAQ/FAQblat.html#blat4
below are lines from the perl code i've re-written in python
my $sizeMul = pslIsProtein($blockCount, $strand, $tStart, $tEnd, $tSize, $tStarts, $blockSizes);
sizmul = 1 for DNA
my $pslScore = $sizeMul * ($matches + ($repMatches >> 1) ) - $sizeMul * $misMatches - $qNumInsert - $tNumIns
ert)
@staticmethod
def score(row: Dict, is_protein: bool = False) -> int:
Args
- row (
Dict
) - is_protein (
bool
)
Returns
int
Blat.pslx_row_to_pysam()
given a 'row' from reading a pslx file. converts the row to a BlatAlignedSegment object
@staticmethod
def pslx_row_to_pysam(
row: Dict, bam_cache: BamCache, reference_genome: ReferenceGenome
) -> SamRead:
Args
- row (
Dict
): a row object from the 'read_pslx' method - bam_cache (BamCache): the bam file/cache to use as a template for creating reference_id from chr name
- reference_genome (ReferenceGenome): reference sequence by template/chr name
Returns
process_blat_output()
converts the blat output pslx (unheadered file) to bam reads
def process_blat_output(
input_bam_cache: BamCache,
query_id_mapping: Dict[str, str],
reference_genome: ReferenceGenome,
aligner_output_file: str = 'aligner_out.temp',
blat_min_percent_of_max_score: float = 0.8,
blat_min_identity: float = 0.7,
blat_limit_top_aln: int = 25,
is_protein: bool = False,
) -> Dict[str, List[SamRead]]:
Args
- input_bam_cache (BamCache)
- query_id_mapping (
Dict[str, str]
) - reference_genome (ReferenceGenome)
- aligner_output_file (
str
) - blat_min_percent_of_max_score (
float
) - blat_min_identity (
float
) - blat_limit_top_aln (
int
) - is_protein (
bool
)
Returns
- Dict[
str
, List[SamRead]]