Wednesday, August 10, 2016

Comparing bamcheck and samtools stats - finding inward and outward oriented read pairs from bam file

In addition to the biostar post, I am adding more details in this blog post.

I wanted to find out the inward and outward orientation pairs from a bam file. So, I checked a few tools. Two of them were bamcheck and samtools stats. The reason for doing so, if the bam shows that there are more number of outward oriented pairs, that means there are higher chances that the region has some problem associated with it.

But, I was stuck because both these tools gave different outputs. So, I myself wanted to find out which is one more reliable.





This subsequently led me run the following steps.

1) Split the bam into smaller bam files, (1 scaffold and its associated mapped reads)

$ bamtools split -in 500bp_insert_mapping_subset2_sorted.bam -reference

$ ls
500bp_insert_mapping_subset2_sorted.bam                   500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam

2) Count the inward and oriented pairs, using both bamcheck 

$ for i in *bam; do echo "$i"; bamcheck $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t
500bp_insert_mapping_subset2_sorted.bam                   SN  inward  oriented  pairs:  95506  SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  SN  inward  oriented  pairs:  2801   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   SN  inward  oriented  pairs:  2861   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  SN  inward  oriented  pairs:  2473   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  SN  inward  oriented  pairs:  2844   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   SN  inward  oriented  pairs:  2620   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  SN  inward  oriented  pairs:  3212   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   SN  inward  oriented  pairs:  3543   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  SN  inward  oriented  pairs:  3018   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  SN  inward  oriented  pairs:  3671   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  SN  inward  oriented  pairs:  3530   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  SN  inward  oriented  pairs:  3093   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  SN  inward  oriented  pairs:  2951   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  SN  inward  oriented  pairs:  1545   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  SN  inward  oriented  pairs:  3242   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam  SN  inward  oriented  pairs:  2252   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam  SN  inward  oriented  pairs:  3876   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam  SN  inward  oriented  pairs:  3718   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam  SN  inward  oriented  pairs:  4269   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam  SN  inward  oriented  pairs:  2493   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam  SN  inward  oriented  pairs:  3140   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam  SN  inward  oriented  pairs:  4837   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam  SN  inward  oriented  pairs:  6686   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam  SN  inward  oriented  pairs:  2668   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam  SN  inward  oriented  pairs:  3145   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam   SN  inward  oriented  pairs:  2622   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam   SN  inward  oriented  pairs:  2729   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam   SN  inward  oriented  pairs:  2880   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam   SN  inward  oriented  pairs:  2668   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam   SN  inward  oriented  pairs:  4305   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam   SN  inward  oriented  pairs:  1814   SN  outward  oriented  pairs:  0

2) Count the inward and oriented pairs, using samtools stats


$ for i in *bam; do echo "$i"; ~/Documents/samtools-1.2/samtools stats $i | egrep '^SN.*inward|^SN.*outward'; done | paste - - - | column -t
500bp_insert_mapping_subset2_sorted.bam                   SN  inward  oriented  pairs:  96131  SN  outward  oriented  pairs:  295
500bp_insert_mapping_subset2_sorted.REF__unitig_1023.bam  SN  inward  oriented  pairs:  2807   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_107.bam   SN  inward  oriented  pairs:  2861   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1117.bam  SN  inward  oriented  pairs:  2474   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1171.bam  SN  inward  oriented  pairs:  2845   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_118.bam   SN  inward  oriented  pairs:  2622   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1226.bam  SN  inward  oriented  pairs:  3221   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_134.bam   SN  inward  oriented  pairs:  3553   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1403.bam  SN  inward  oriented  pairs:  3020   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1486.bam  SN  inward  oriented  pairs:  3677   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1527.bam  SN  inward  oriented  pairs:  3545   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1589.bam  SN  inward  oriented  pairs:  3097   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1630.bam  SN  inward  oriented  pairs:  2961   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1686.bam  SN  inward  oriented  pairs:  1548   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1812.bam  SN  inward  oriented  pairs:  3251   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1853.bam  SN  inward  oriented  pairs:  2272   SN  outward  oriented  pairs:  24
500bp_insert_mapping_subset2_sorted.REF__unitig_1900.bam  SN  inward  oriented  pairs:  3885   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1933.bam  SN  inward  oriented  pairs:  3727   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_1992.bam  SN  inward  oriented  pairs:  4279   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2047.bam  SN  inward  oriented  pairs:  2850   SN  outward  oriented  pairs:  227
500bp_insert_mapping_subset2_sorted.REF__unitig_2117.bam  SN  inward  oriented  pairs:  3176   SN  outward  oriented  pairs:  14
500bp_insert_mapping_subset2_sorted.REF__unitig_2142.bam  SN  inward  oriented  pairs:  4841   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam  SN  inward  oriented  pairs:  6735   SN  outward  oriented  pairs:  24
500bp_insert_mapping_subset2_sorted.REF__unitig_2163.bam  SN  inward  oriented  pairs:  2679   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_2214.bam  SN  inward  oriented  pairs:  3150   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_228.bam   SN  inward  oriented  pairs:  2625   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_239.bam   SN  inward  oriented  pairs:  2732   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_312.bam   SN  inward  oriented  pairs:  2885   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_315.bam   SN  inward  oriented  pairs:  2671   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_363.bam   SN  inward  oriented  pairs:  4312   SN  outward  oriented  pairs:  0
500bp_insert_mapping_subset2_sorted.REF__unitig_992.bam   SN  inward  oriented  pairs:  1830   SN  outward  oriented  pairs:  6

3) I selected this small BAM file: 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam

Bamcheck stats

$ bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam | head -28
# This file was produced by bamcheck (2012-04-24)
# The command line was:  bamcheck 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN sequences: 14606
SN is paired: 1
SN is sorted: 1
SN 1st fragments: 7843
SN last fragments: 6763
SN reads mapped: 14606
SN reads unmapped: 0
SN reads unpaired: 1234
SN reads paired: 13372
SN reads duplicated: 0
SN reads MQ0: 9217
SN total length: 1437208
SN bases mapped: 1437208
SN bases mapped (cigar): 1436380
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 0
SN error rate: 0.000000e+00
SN average length: 98
SN maximum length: 101
SN average quality: 36.8
SN insert size average: 497.2
SN insert size standard deviation: 29.1
SN inward oriented pairs: 6686
SN outward oriented pairs: 0

Samtools stats

$ samtools stats 500bp_insert_mapping_subset2_sorted.REF__unitig_2151_outties_filtered.bam | head -37
SN raw total sequences: 14606
SN filtered sequences: 0
SN sequences: 14606
SN is sorted: 1
SN 1st fragments: 7843
SN last fragments: 6763
SN reads mapped: 14606
SN reads mapped and paired: 13526 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 13372 # proper-pair bit set
SN reads paired: 13526 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 9217 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 0
SN total length: 1437208 # ignores clipping
SN bases mapped: 1437208 # ignores clipping
SN bases mapped (cigar): 1436380 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 0 # from NM fields
SN error rate: 0.000000e+00 # mismatches / bases mapped (cigar)
SN average length: 98
SN maximum length: 101
SN average quality: 36.8
SN insert size average: 491.7
SN insert size standard deviation: 29.4
SN inward oriented pairs: 6735
SN outward oriented pairs: 24

Observe here, (6735+24)*2 != 13,526 but equals to 13,518 paired reads. (check from step 7)

4) First take out all the mapped read pairs (irrespective of direction) from the bam and convert into sorted bed file. 

$ fgrep -f <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam \|
cut -f4 | cut -d "_" -f1 \|
sort | uniq -c | sort -n \| 
fgrep '      2' | sed 's/      2 //g') <(bedtools bamtobed -i 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bam) \|
sort -k4,4 >500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed

5) Checking the numbers of reads paired

$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \| 
awk '{print $4,$2,$3,$6}' \| 
paste - - | wc -l

6763 (reads paired: 13526) - This number is same as the number generated by samtools stats see above SN reads paired: 13526.

But the bamcheck was giving a different result (SN reads paired: 13372)

6) Now, from the read pairs, I want to know how many are 

  1. Inward oriented pairs (--> <--)
  2. Outward oriented pairs (<-- -->)
  3. Same direction pairs (-->--> or <--<--)
7) At first I tried running the following command, which gave me slightly higher number than the samtools stats given number

$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \|
awk '{print $4,$2,$3,$6}' | paste - - \|
awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else print $0;}' \|
sed 's/-/<@@@/g' | sed 's/+/###>/g' | awk '{print $1,$2,$3,$5,$6,$7,$4$8}' \|
column -t | awk '{print $NF}' | sort | uniq -c

    27  <--()-->
   6736 -->()<--

(27 + 6736)*2 =13,526 paired reads.

But, samtools stats gives (6735+24 ) * 2 = 13518 reads are paired reads. I tried understand deeply and scrutinizing the bed file I found that 3 reads pairs are overlapping at the same position for outward oriented pairs and 1 overlapping read pair from the inward oriented pairs.

From overlapping, I mean the coordinates of both reads are mapping around the same point.

HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/1   8700   8798   HISEQ:120:H0BJ3ADXX:2:2104:7360:21274_2:N:0:ACAGTG/2   8700   8798   <-- -->
HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/1   8702   8796   HISEQ:120:H0BJ3ADXX:1:1112:2349:75045_2:N:0:ACAGTG/2   8702   8794   <-- -->
HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/1   8717   8814   HISEQ:120:H0BJ3ADXX:2:1213:8095:95602_2:N:0:ACAGTG/2   8717   8814   <-- -->

These were problematic ones!

8) So, I made slight change in the above command


$ cat 500bp_insert_mapping_subset2_sorted.REF__unitig_2151.bed \|
awk '{print $4,$2,$3,$6}' | paste - - \|
awk '{if ($2 > $6) print $5,$6,$7,$8,$1,$2,$3,$4; else if ( $2 == $6) next; else print $0;}' \|
sed 's/-/<--/g' | sed 's/+/-->/g' \|
awk '{print $1,$2,$3,$5,$6,$7,$4"()"$8}' \|
column -t | awk '{print $NF}' \|
sort | uniq -c

     24 <--()-->
   6735 -->()<--

which is same as samtools stats command.

I have checked these reads in a visualization tool called tablet. It shows the directions of the pairs when the bam file is loaded into the tool.

**Note:

The final script I used is not a one-liner to find the orientation of paired reads for bigger bam files, because other conditions have to be taken in consideration too. Such as overlapping pairs with different starting coordinates (in my script I used only same coordinates of both read pairs to report).

No comments:

Post a Comment