Chapter 3 Aligning Reads To A Reference

3.1 Learning Objectives

Understanding Alignment
Know how to run the correct BWA alignment
Understand the SAM/BAM format and relationship

Culture Clash

If you’re a bioinformatician - you may be wondering why I’m not using the words ‘mapping’ reads here. Well, the geneticists in the crowd have a much older technique called mapping that does something completely different. I have honestly had conversations where it has been insisted that I can’t possibly map a read. Don’t suffer as I did.

If you’re a geneticist - bioinformaticians claim they can do amazing things with reads, mapping them down to the single read level! Of course they usually just mean they’re aligning them to whatever reference sequence they have.

3.2 Mapping and Alignment

Mapping is the process of finding the position on a genome that a read best matches to, so is likely to have come from. Alignment is the optimal alignment of the sequence of the read with the reference sequence, such that gaps and errors are allowed and small variations can be found.

3.2.1 BWA

There are many tools for aligning reads (too many, probably). One of the best general ones is BWA (Li and Durbin 2010). BWA is a high-throughput sequence aligner used to align relatively short sequences to a reference genome. It uses the Burrows-Wheeler Transform reduce the amount of memory needed to align reads by creating a compressed index of the reference sequence. It contains two algorithms for alignment, one aln for short reads up to around 200 bp with low error rate and another mem for longer reads. Both are very fast and accurate.

BWA is complicated and there are lots of options to set. The crucial things we are going to need are a reference genome, and a set of reads.

Often, a sequencing strategy will use paired-end reads, where we know the distance between two reads and we can set that as a parameter. Usually the reads from a paired strategy come in two files, one of the pair in a ‘left’ file, and the other in a ‘right’ file.

BWA has a lot of options, here’s some important ones:

Maximum edit distance - is the maximum number of nucleotides in a read that can mis-match with the reference and the read still be aligned.
Maximum number of gap opens- refers to the amount of insertions that can occur across a read.
Disallow insertion/deletion within some bp towards the end - allows for the fact that sequence quality deteriorates towards the end of a sequence and the user might not want to trust indels in the last few bases of a read.
Maximum insert size for a read pair to be considered as being mapped properly - The insert size of reads is the distance between the outer ends of the two paired-end reads.

3.2.2 SAM Format

The output format from BWA and most other aligners is Sequence Alignment Map (SAM) format (Li et al. 2009). The SAM format describes the alignment of sequenced reads to a reference sequence. It stores all the alignment information generated by BWA in a simple and compact format. It provides information about the position of the read in relation to the reference genome, the number and position of nucleotides that match to the genome and the position of indels. SAM files are often analysed in packages like SAMTools (Li et al. 2009) and Picard (Picard)

Here’s the top of a SAM file:

@HD VN:1.3  SO:coordinate
@SQ SN:chloroplast  LN:154478
@PG ID:bwa  PN:bwa  VN:0.7.10-r876-dirty    CL:bwa mem -t 1 -v 1
chloroplast-1781    99  chloroplast 54  60  250M    =   374 570 TTA A?? NM:i:0  MD:Z:250    AS:i:250    XS:i:0
chloroplast-757 163 chloroplast 66  60  250M    =   459 643 GCT A5? NM:i:6  MD:Z:46A20T65T56A23A2T32    AS:i:220    XS:i:0
chloroplast-1781    147 chloroplast 374 60  250M    =   54  -570    ACT G:E NM:i:6  MD:Z:9T42G1G2G7C42A141  AS:i:220    XS:i:0
chloroplast-703 163 chloroplast 437 60  250M    =   794 607 AGC ??? NM:i:8  MD:Z:68T2C20C68T16G9A30A8G21    AS:i:210    XS:i:0

The first few lines start with stuff like @SQ SN which described thigs like program parameters followed by the name of the sequences in the reference file (chloroplast) and the length of the sequence (154478).
Every line after that corresponds to each read that BWA handled.
Each line starts with the name of the read and has a number of columns of data after it.
In the 3rd column we can see which chromosome our read has been mapped to and the position of the first mapped base in column 4. An unmapped read will have a * in column 3 and zero in column 4.
The 5th column is the mapping quality, a score quantifying the probability that a read is misplaced.
The 6th column is the CIGAR string and consists of numbers followed by an upper-case letter (the operator), which describes the alignment.

As an example, a CIGAR string of 36M3D40M means 36 matching nucleotides, followed by 3 deletions and ending in 40 matches.

A BAM file is a highly compressed, indexed binary version of SAM and through a library of different tools, allows fast, random access to the alignment. Some Galaxy tools will automatically convert the output from aligners to BAM format for you - (BWA does this!).

3.3 Exercises

3.3.1 Align Paired-End Reads To A Reference Genome

Use the two sets of paired reads in the Alignment shared data library and the ATH1_chloroplast reference genome sequence, to carry out a HTS alignment. Again these are sequences from the chloroplast genome of the model plant Arabidopsis . One set of reads, MS, are from an Illumina MiSeq machine, are 250 nt long and have a fragment length of 650 nt. The others, GA2 are from an Illumina GAII machine, are 75 nt long and have a fragment length of 350.

Please complete the section quiz at https://goo.gl/forms/l3ykUt7eNvZAM9Y42

Which algorithm should you use for each set of reads?
Align each of these with the BWA program in the HTS Alignment tools section. Choose an appropriate algorithm for each sequence set.
Pick parameters to make the two alignments as accurate and equivalent as possible? Which should differ? Which should be the same?
Check the results with the SAMtools idxstats tool and other alignment stats tools like Flagstat and Stats in HTS SAMtools. How do the results relate to what you know about the sequencing strategy? (What are the calculated insert sizes, what is the coverage, how many reads map?)

3.3.2 Merge Alignments Into One BAM

After alignment of two read sets from different sequencing strategies, you might want to merge all of them into one BAM file so that you can work on them as one. This isn’t just more convenient, it lets you make use of the extra information from combining the reads into one coverage pileup when calling SNPs or visualising the alignment.

Merging is a multi-stage process that can be done with Picard, the steps in Picard look like:

Remove duplicate reads
Sort BAMs individually
Merge BAMs

Picard tools are available in the Picard tool section. Try merging the BAM files.

3.3.3 Alignment Quality

The HTS SAMtools tool BAM-to-SAM will allow you to turn the binary BAM file into a SAM file you can read.

How good do the individual alignments look overall? Can you tell from the output? Is it useful to look at single alignments one by one?
In what cases might the individual alignments be useful?

References

Li, Heng, and Richard Durbin. 2010. “Fast and accurate long-read alignment with Burrows-Wheeler transform.” Bioinformatics (Oxford, England) 26 (5): 589–95.

Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. “The Sequence Alignment/Map format and SAMtools.” Bioinformatics (Oxford, England) 25 (16): 2078–9.