Friday, July 13, 2012

BLAT and BLAST Output Format Fields

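The results of BLAST (-m 8) and BLAT come in a tabular format. Sometimes the column headers are helpfully described in the output, but for large-scale analyses I usually strip the header information before extracting the results because it gets in the way, and then I occasionally get confused about which column is which. At least I do.. :)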

So here they are, written down once more... ahem.. haha


NCBI Blast Tabular output format fields

(The BLAST header is simple enough that it needs no further explanation. :))
QueryId
SubjectId
Identity percent
AlnLength
mismatchCount
gapOpenCount
QueryStart
QueryEnd
SubjectStart
SubjectEnd
Evalue
bitScore

The user linked above also shows a simple parsing example, 
which is worth a look :)


-Python 

for line in open("myfile.blast"):
    # strip the trailing newline so bitScore does not end with "\n"
    (queryId, subjectId, percIdentity, alnLength, mismatchCount, gapOpenCount, queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore) = line.rstrip("\n").split("\t")
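
Every value produced by the split above is still a string, so comparing e-values or lengths needs a conversion first. A minimal sketch of one way to do that, assuming the 12 columns listed above; the names BlastHit and parse_blast_line and the sample line are made up for illustration and are not part of BLAST:

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """One row of BLAST tabular (-m 8) output; hypothetical helper type."""
    query_id: str
    subject_id: str
    perc_identity: float
    aln_length: int
    mismatch_count: int
    gap_open_count: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float
    bit_score: float


def parse_blast_line(line):
    """Split one tab-separated line and convert each column to its type."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError("expected 12 columns, got %d" % len(f))
    return BlastHit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]),
    )


# Invented sample line, only to show the call.
sample = "q1\ts1\t98.50\t200\t3\t0\t1\t200\t101\t300\t1e-50\t370\n"
hit = parse_blast_line(sample)
print(hit.query_id, hit.subject_id, hit.e_value < 1e-10)
```

With typed fields, a filter such as hit.e_value < 1e-10 compares numbers instead of strings.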


-Perl 

while (<>) {
    chomp;  # drop the trailing newline so $bitScore stays clean
    ($queryId, $subjectId, $percIdentity, $alnLength, $mismatchCount, $gapOpenCount, $queryStart, $queryEnd, $subjectStart, $subjectEnd, $eVal, $bitScore) = split(/\t/);
}





BLAT Spec

matches int unsigned ,     # Number of bases that match that aren't repeats
misMatches int unsigned ,  # Number of bases that don't match
repMatches int unsigned ,  # Number of bases that match but are part of repeats
nCount int unsigned ,      # Number of 'N' bases
qNumInsert int unsigned ,  # Number of inserts in query
qBaseInsert int unsigned , # Number of bases inserted in query
tNumInsert int unsigned ,  # Number of inserts in target
tBaseInsert int unsigned , # Number of bases inserted in target
strand char(2) ,           # + or - for query strand, optionally followed by + or - for target strand
qName varchar(255) ,       # Query sequence name
qSize int unsigned ,       # Query sequence size
qStart int unsigned ,      # Alignment start position in query
qEnd int unsigned ,        # Alignment end position in query
tName varchar(255) ,       # Target sequence name
tSize int unsigned ,       # Target sequence size
tStart int unsigned ,      # Alignment start position in target
tEnd int unsigned ,        # Alignment end position in target
blockCount int unsigned ,  # Number of blocks in alignment. A block contains no gaps.
blockSizes longblob ,      # Size of each block in a comma separated list
qStarts longblob ,         # Start of each block in query in a comma separated list
tStarts longblob ,         # Start of each block in target in a comma separated list


-Python

for line in open("myfile.blat"):
    # assumes the header has already been removed, as described at the top
    (matches, misMatches, repMatches, nCount, qNumInsert, qBaseInsert, tNumInsert, tBaseInsert, strand, qName, qSize, qStart, qEnd, tName, tSize, tStart, tEnd, blockCount, blockSizes, qStarts, tStarts) = line.rstrip("\n").split("\t")
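
The last three columns (blockSizes, qStarts, tStarts) are comma-separated lists, so they need one more split before they are usable. A rough sketch, assuming 21 tab-separated columns in the order of the spec above and list values that end with a trailing comma; the names parse_int_list and parse_psl_line and the sample line are invented for illustration:

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
STRING_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_int_list(field):
    """Turn a list column such as '48,20,' into [48, 20]."""
    return [int(x) for x in field.split(",") if x.strip()]


def parse_psl_line(line):
    """Return a dict for one alignment line, or None for anything else."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: a data line has 21 columns and starts with the integer
    # 'matches' column; header or blank lines fail this check and are skipped.
    if len(fields) != len(PSL_FIELDS) or not fields[0].isdigit():
        return None
    rec = dict(zip(PSL_FIELDS, fields))
    for key in PSL_FIELDS:
        if key in LIST_FIELDS:
            rec[key] = parse_int_list(rec[key])
        elif key not in STRING_FIELDS:
            rec[key] = int(rec[key])
    return rec


# Invented sample line: two blocks of 48 and 20 bases.
sample = ("68\t0\t0\t0\t1\t2\t0\t0\t+\tread1\t70\t0\t70\t"
          "chr1\t5000\t1000\t1068\t2\t48,20,\t0,50,\t1000,1048,\n")
rec = parse_psl_line(sample)
print(rec["qName"], rec["blockCount"], rec["blockSizes"], rec["tStarts"])
```

Returning None for non-data lines means the same loop also works on files that still contain their header.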


