Chapter 3 Aligning Reads To A Reference
3.1 Learning Objectives
- Understanding Alignment
- Know how to run the correct BWA alignment
- Understand the SAM/BAM format and relationship
Culture Clash
If you’re a bioinformatician - you may be wondering why I’m not using the words ‘mapping’ reads here. Well, the geneticists in the crowd have a much older technique called mapping that does something completely different. I have honestly had conversations where it has been insisted that I can’t possibly map a read. Don’t suffer as I did.
If you’re a geneticist - bioinformaticians claim they can do amazing things with reads, mapping them down to the single read level! Of course they usually just mean they’re aligning them to whatever reference sequence they have.
3.2 Mapping and Alignment
Mapping is the process of finding the position on a genome that a read best matches to, so is likely to have come from. Alignment is the optimal alignment of the sequence of the read with the reference sequence, such that gaps and errors are allowed and small variations can be found.
3.2.1 BWA
There are many tools for aligning reads (too many, probably). One of the best general ones is BWA (Li and Durbin 2010). BWA is a high-throughput sequence aligner used to align relatively short sequences to a reference genome. It uses the Burrows-Wheeler Transform reduce the amount of memory needed to align reads by creating a compressed index of the reference sequence. It contains two algorithms for alignment, one aln
for short reads up to around 200 bp with low error rate and another mem
for longer reads. Both are very fast and accurate.
BWA is complicated and there are lots of options to set. The crucial things we are going to need are a reference genome, and a set of reads.
Often, a sequencing strategy will use paired-end reads, where we know the distance between two reads and we can set that as a parameter. Usually the reads from a paired strategy come in two files, one of the pair in a ‘left’ file, and the other in a ‘right’ file.
BWA has a lot of options, here’s some important ones:
Maximum edit distance
- is the maximum number of nucleotides in a read that can mis-match with the reference and the read still be aligned.Maximum number of gap opens
- refers to the amount of insertions that can occur across a read.Disallow insertion/deletion within some bp towards the end
- allows for the fact that sequence quality deteriorates towards the end of a sequence and the user might not want to trust indels in the last few bases of a read.Maximum insert size for a read pair to be considered as being mapped properly
- The insert size of reads is the distance between the outer ends of the two paired-end reads.
3.2.2 SAM Format
The output format from BWA and most other aligners is Sequence Alignment Map (SAM) format (Li et al. 2009). The SAM format describes the alignment of sequenced reads to a reference sequence. It stores all the alignment information generated by BWA in a simple and compact format. It provides information about the position of the read in relation to the reference genome, the number and position of nucleotides that match to the genome and the position of indels. SAM files are often analysed in packages like SAMTools (Li et al. 2009) and Picard (Picard)
Here’s the top of a SAM file:
@HD VN:1.3 SO:coordinate
@SQ SN:chloroplast LN:154478
@PG ID:bwa PN:bwa VN:0.7.10-r876-dirty CL:bwa mem -t 1 -v 1
chloroplast-1781 99 chloroplast 54 60 250M = 374 570 TTA A?? NM:i:0 MD:Z:250 AS:i:250 XS:i:0
chloroplast-757 163 chloroplast 66 60 250M = 459 643 GCT A5? NM:i:6 MD:Z:46A20T65T56A23A2T32 AS:i:220 XS:i:0
chloroplast-1781 147 chloroplast 374 60 250M = 54 -570 ACT G:E NM:i:6 MD:Z:9T42G1G2G7C42A141 AS:i:220 XS:i:0
chloroplast-703 163 chloroplast 437 60 250M = 794 607 AGC ??? NM:i:8 MD:Z:68T2C20C68T16G9A30A8G21 AS:i:210 XS:i:0
- The first few lines start with stuff like
@SQ SN
which described thigs like program parameters followed by the name of the sequences in the reference file (chloroplast
) and the length of the sequence (154478). - Every line after that corresponds to each read that BWA handled.
- Each line starts with the name of the read and has a number of columns of data after it.
- In the 3rd column we can see which chromosome our read has been mapped to and the position of the first mapped base in column 4. An unmapped read will have a * in column 3 and zero in column 4.
- The 5th column is the mapping quality, a score quantifying the probability that a read is misplaced.
- The 6th column is the CIGAR string and consists of numbers followed by an upper-case letter (the operator), which describes the alignment.
As an example, a CIGAR string of 36M3D40M
means 36 matching nucleotides, followed by 3 deletions and ending in 40 matches.
A BAM file is a highly compressed, indexed binary version of SAM and through a library of different tools, allows fast, random access to the alignment. Some Galaxy tools will automatically convert the output from aligners to BAM format for you - (BWA does this!).
3.3 Exercises
3.3.1 Align Paired-End Reads To A Reference Genome
Use the two sets of paired reads in the Alignment
shared data library and the ATH1_chloroplast
reference genome sequence, to carry out a HTS alignment. Again these are sequences from the chloroplast genome of the model plant Arabidopsis . One set of reads, MS
, are from an Illumina MiSeq machine, are 250 nt long and have a fragment length of 650 nt. The others, GA2
are from an Illumina GAII machine, are 75 nt long and have a fragment length of 350.
Please complete the section quiz at https://goo.gl/forms/l3ykUt7eNvZAM9Y42
- Which algorithm should you use for each set of reads?
- Align each of these with the BWA program in the
HTS Alignment
tools section. Choose an appropriate algorithm for each sequence set. - Pick parameters to make the two alignments as accurate and equivalent as possible? Which should differ? Which should be the same?
- Check the results with the SAMtools
idxstats
tool and other alignment stats tools likeFlagstat
andStats
inHTS SAMtools
. How do the results relate to what you know about the sequencing strategy? (What are the calculated insert sizes, what is the coverage, how many reads map?)
3.3.2 Merge Alignments Into One BAM
After alignment of two read sets from different sequencing strategies, you might want to merge all of them into one BAM file so that you can work on them as one. This isn’t just more convenient, it lets you make use of the extra information from combining the reads into one coverage pileup when calling SNPs or visualising the alignment.
Merging is a multi-stage process that can be done with Picard, the steps in Picard look like:
- Remove duplicate reads
- Sort BAMs individually
- Merge BAMs
Picard tools are available in the Picard
tool section. Try merging the BAM files.
3.3.3 Alignment Quality
The HTS SAMtools
tool BAM-to-SAM
will allow you to turn the binary BAM file into a SAM file you can read.
- How good do the individual alignments look overall? Can you tell from the output? Is it useful to look at single alignments one by one?
- In what cases might the individual alignments be useful?
References
Li, Heng, and Richard Durbin. 2010. “Fast and accurate long-read alignment with Burrows-Wheeler transform.” Bioinformatics (Oxford, England) 26 (5): 589–95.
Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. “The Sequence Alignment/Map format and SAMtools.” Bioinformatics (Oxford, England) 25 (16): 2078–9.