Finding Causative Mutations With A Candidate SNP Approach

Aligning Reads To A Reference

Learning Objectives

  • Understanding Alignment
  • Know how to run the correct BWA alignment
  • Understand the SAM/BAM format and relationship

Alignment is the process of finding the position on a genome that a read best matches to, so is likely to have come from.

There are many tools for aligning reads, (too many probably). One of the best general ones is BWA. BWA is a high-throughput sequence aligner used to align relatively short sequences to a reference genome. It uses the Burrows-Wheeler Transform reduce the amount of memory needed to align reads by creating a compressed index of the reference sequence. It contains two algorithms for alignment, one aln for short reads up to around 200 bp with low error rate and another mem for longer reads. Both are very fast and accurate.

BWA is complicated and there are lots of options to set. The crucial things we are going to need are a reference genome, and a set of reads.

Often, a sequencing strategy will use paired-end reads, where we know the distance between two reads and we can set that as a parameter. Usually the reads from a paired strategy come in two files, one of the pair in a ‘left’ file, and the other in a ‘right’ file.

BWA has a lot of options, here’s some important ones:

  1. Maximum edit distance - is the maximum number of nucleotides in a read that can mis-match with the reference and the read still be aligned.
  2. Maximum number of gap opens- refers to the amount of insertions that can occur across a read.
  3. Disallow insertion/deletion within some bp towards the end - allows for the fact that sequence quality deteriorates towards the end of a sequence and the user might not want to trust indels in the last few bases of a read.
  4. Maximum insert size for a read pair to be considered as being mapped properly - The insert size of reads is the distance between the outer ends of the two paired-end reads.

The output format from BWA and most other aligners is SAM (Sequence Alignment Map) format. The SAM format describes the alignment of sequenced reads to a reference sequence. It stores all the alignment information generated by BWA in a simple and compact format. It provides information about the position of the read in relation to the reference genome, the number and position of nucleotides that match to the genome and the position of indels.

Here’s the top of a SAM file:

@HD VN:1.3  SO:coordinate
@SQ SN:chloroplast  LN:154478
@PG ID:bwa  PN:bwa  VN:0.7.10-r876-dirty    CL:bwa mem -t 1 -v 1
chloroplast-1781    99  chloroplast 54  60  250M    =   374 570 TTA A?? NM:i:0  MD:Z:250    AS:i:250    XS:i:0
chloroplast-757 163 chloroplast 66  60  250M    =   459 643 GCT A5? NM:i:6  MD:Z:46A20T65T56A23A2T32    AS:i:220    XS:i:0
chloroplast-1781    147 chloroplast 374 60  250M    =   54  -570    ACT G:E NM:i:6  MD:Z:9T42G1G2G7C42A141  AS:i:220    XS:i:0
chloroplast-703 163 chloroplast 437 60  250M    =   794 607 AGC ??? NM:i:8  MD:Z:68T2C20C68T16G9A30A8G21    AS:i:210    XS:i:0
  1. The first few lines start with stuff like @SQ SN which described thigs like program parameters followed by the name of the sequences in the reference file (chloroplast) and the length of the sequence (154478).
  2. Every line after that corresponds to each read that BWA handled.
  3. Each line starts with the name of the read and has a number of columns of data after it.
  4. In the 3rd column we can see which chromosome our read has been mapped to and the position of the first mapped base in column 4. An unmapped read will have a * in column 3 and zero in column 4.
  5. The 5th column is the mapping quality, a score quantifying the probability that a read is misplaced.
  6. The 6th column is the CIGAR string and consists of numbers followed by an upper-case letter (the operator), which describes the alignment.

As an example, a CIGAR string of 36M3D40M means 36 matching nucleotides, followed by 3 deletions and ending in 40 matches.

A BAM file is a highly compressed, indexed binary version of SAM and through a library of different tools, allows fast, random access to the alignment. Some Galaxy tools will automatically convert the output from aligners to BAM format for you.

1. Align Paired-End Reads To A Reference Genome

Use the two sets of paired reads in the Alignment shared data library and the ATH1_chloroplast reference genome sequence, again these are sequences from the chloroplast genome of the model plant Arabidopsis . One set of reads, MS, are from an Illumina MiSeq machine, are 250 nt long and have a fragment length of 650 nt. The others, GA2 are from an Illumina GAII machine, are 75 nt long and have a fragment length of 350. For the alignment you’ll be using a reference from the history.

  1. Which algorithm should you use for each set of reads?
  2. Align each of these with the BWA program in the HTS Alignment tools section. Choose an appropriate algorithm for each sequence set.
  3. Pick parameters to make the two alignments as accurate and equivalent as possible? Which should differ? Which should be the same?
  4. Check the results with the SAMtools idxstats tool and other alignment stats tools like Flagstat and Stats in HTS SAMtools. How do the results relate to what you know about the sequencing strategy? (What are the calculated insert sizes, what is the coverage?)

2. Alignment Quality

The HTS SAMtools tool BAM-to-SAMwill allow you to turn the binary BAM file into a SAM file you can read.

  1. How good do the individual alignments look overall? Can you tell from the output? Is it useful to look at single alignments one by one.
  2. Does the SAM specification available here give you any further information from which you can judge the alignments?