Finding Causative Mutations With A Candidate SNP Approach
Aligning Reads To A Reference
Learning Objectives
- Understanding Alignment
- Know how to run the correct BWA alignment
- Understand the SAM/BAM format and relationship
Alignment is the process of finding the position on a genome that a read best matches to, so is likely to have come from.
There are many tools for aligning reads, (too many probably). One of the best general ones is BWA. BWA is a high-throughput sequence aligner used to align relatively short sequences to a reference genome. It uses the Burrows-Wheeler Transform reduce the amount of memory needed to align reads by creating a compressed index of the reference sequence. It contains two algorithms for alignment, one aln
for short reads up to around 200 bp with low error rate and another mem
for longer reads. Both are very fast and accurate.
BWA is complicated and there are lots of options to set. The crucial things we are going to need are a reference genome, and a set of reads.
Often, a sequencing strategy will use paired-end reads, where we know the distance between two reads and we can set that as a parameter. Usually the reads from a paired strategy come in two files, one of the pair in a ‘left’ file, and the other in a ‘right’ file.
BWA has a lot of options, here’s some important ones:
Maximum edit distance
- is the maximum number of nucleotides in a read that can mis-match with the reference and the read still be aligned.Maximum number of gap opens
- refers to the amount of insertions that can occur across a read.Disallow insertion/deletion within some bp towards the end
- allows for the fact that sequence quality deteriorates towards the end of a sequence and the user might not want to trust indels in the last few bases of a read.Maximum insert size for a read pair to be considered as being mapped properly
- The insert size of reads is the distance between the outer ends of the two paired-end reads.
The output format from BWA and most other aligners is SAM (Sequence Alignment Map) format. The SAM format describes the alignment of sequenced reads to a reference sequence. It stores all the alignment information generated by BWA in a simple and compact format. It provides information about the position of the read in relation to the reference genome, the number and position of nucleotides that match to the genome and the position of indels.
Here’s the top of a SAM file:
@HD VN:1.3 SO:coordinate
@SQ SN:chloroplast LN:154478
@PG ID:bwa PN:bwa VN:0.7.10-r876-dirty CL:bwa mem -t 1 -v 1
chloroplast-1781 99 chloroplast 54 60 250M = 374 570 TTA A?? NM:i:0 MD:Z:250 AS:i:250 XS:i:0
chloroplast-757 163 chloroplast 66 60 250M = 459 643 GCT A5? NM:i:6 MD:Z:46A20T65T56A23A2T32 AS:i:220 XS:i:0
chloroplast-1781 147 chloroplast 374 60 250M = 54 -570 ACT G:E NM:i:6 MD:Z:9T42G1G2G7C42A141 AS:i:220 XS:i:0
chloroplast-703 163 chloroplast 437 60 250M = 794 607 AGC ??? NM:i:8 MD:Z:68T2C20C68T16G9A30A8G21 AS:i:210 XS:i:0
- The first few lines start with stuff like
@SQ SN
which described thigs like program parameters followed by the name of the sequences in the reference file (chloroplast
) and the length of the sequence (154478). - Every line after that corresponds to each read that BWA handled.
- Each line starts with the name of the read and has a number of columns of data after it.
- In the 3rd column we can see which chromosome our read has been mapped to and the position of the first mapped base in column 4. An unmapped read will have a * in column 3 and zero in column 4.
- The 5th column is the mapping quality, a score quantifying the probability that a read is misplaced.
- The 6th column is the CIGAR string and consists of numbers followed by an upper-case letter (the operator), which describes the alignment.
As an example, a CIGAR string of 36M3D40M
means 36 matching nucleotides, followed by 3 deletions and ending in 40 matches.
A BAM file is a highly compressed, indexed binary version of SAM and through a library of different tools, allows fast, random access to the alignment. Some Galaxy tools will automatically convert the output from aligners to BAM format for you.
1. Align Paired-End Reads To A Reference Genome
Use the two sets of paired reads in the Alignment
shared data library and the ATH1_chloroplast
reference genome sequence, again these are sequences from the chloroplast genome of the model plant Arabidopsis . One set of reads, MS
, are from an Illumina MiSeq machine, are 250 nt long and have a fragment length of 650 nt. The others, GA2
are from an Illumina GAII machine, are 75 nt long and have a fragment length of 350. For the alignment you’ll be using a reference from the history.
- Which algorithm should you use for each set of reads?
- Align each of these with the BWA program in the
HTS Alignment
tools section. Choose an appropriate algorithm for each sequence set. - Pick parameters to make the two alignments as accurate and equivalent as possible? Which should differ? Which should be the same?
- Check the results with the SAMtools
idxstats
tool and other alignment stats tools likeFlagstat
andStats
inHTS SAMtools
. How do the results relate to what you know about the sequencing strategy? (What are the calculated insert sizes, what is the coverage?)
2. Alignment Quality
The HTS SAMtools
tool BAM-to-SAM
will allow you to turn the binary BAM file into a SAM file you can read.
- How good do the individual alignments look overall? Can you tell from the output? Is it useful to look at single alignments one by one.
- Does the SAM specification available here give you any further information from which you can judge the alignments?