import sys
print( sys.platform )
darwin
Packages are an important part of the Python ‘ecosystem’ that let us use a wide variety of code written by other people. Good packages provide functions, objects and methods that make working in a particular problem domain a lot easier.
We’ve already seen how to load in packages using the import
statement. Here we load the builit-in sys
package which has methods for working with the underlying system - the platform
is the name of the operating system this code is running on.
import sys
print( sys.platform )
darwin
Some packages, like sys
come as standard when Python is installed, some come from external sources and must be installed individually. There are multiple methods of installing packages. Assuming you set up Python using Anaconda for this book then you can use conda
itself. Another more general way of installing packages is from PyPI, the Python Package Index.
None of these are done from within Python itself, they must be done from the regular command-line. The general form is:
conda install package_name
pip install package_name
VCF files (variant call format) are files that describe SNPs and small indels in alignments from high-throughput sequencing data. The vcfpy
package provides a lot of functionality for working with such files and the record they contain.
A VCF file is a column format file where each row represents a SNP/Indel record and the columns represent things describing it like Chromosome, Position, Reference Genome Allele, Alternative Genome Allele etc. Here’s what one record looks like, split over a few lines.
CHROM | POS | ID | REF | ALT | QUAL | FILTER |
---|---|---|---|---|---|---|
20 | 14370 | rs6054257 | G | A | 29 | PASS |
INFO | FORMAT | NA00001 | NA00002 |
---|---|---|---|
NS=3;DP=14;AF=0.5;DB;H2 | GT:GQ:DP:HQ | 0|0:48:1:51,51 | 1|0:48:8:51,51 |
NA00003 |
---|
1/1:43:5:.,. |
There’s a lot of information in that one record, and each file has many tens of thousands of records! We wouldn’t want to try and process tens of thousands manually. Let’s look at loading in a file and looping through each record using the vcfpy
package.
We’ve already seen that packages have functions we can call, and that doing so can sometimes return objects of new types. We’ll use that pattern now to start processing a VCF file.
vcf.Reader
objectThe first package function we’ll need to use is vcf.Reader
which opens and connects to a file but doesn’t do anything with it. It gives us a reader object we can use to access the file. We just need the file name that we wish to open.
import vcfpy
= vcfpy.Reader(open('/Users/macleand/Desktop/example.vcf', 'r')) vcf_reader
Now we can go ahead and extract the VCF records by using the reader object we created in a loop
for record in vcf_reader:
print( record )
print(type(record))
Record('20', 14370, ['rs6054257'], 'G', [Substitution(type_='SNV', value='A')], 29, ['PASS'], {'NS': 3, 'DP': 14, 'AF': [0.5], 'DB': True, 'H2': True}, ['GT', 'GQ', 'DP', 'HQ'], [Call('NA00001', {'GT': '0|0', 'GQ': 48, 'DP': 1, 'HQ': [51, 51]}), Call('NA00002', {'GT': '1|0', 'GQ': 48, 'DP': 8, 'HQ': [51, 51]}), Call('NA00003', {'GT': '1/1', 'GQ': 43, 'DP': 5, 'HQ': [None, None]})])
<class 'vcfpy.record.Record'>
Record('20', 17330, [], 'T', [Substitution(type_='SNV', value='A')], 3, ['q10'], {'NS': 3, 'DP': 11, 'AF': [0.017]}, ['GT', 'GQ', 'DP', 'HQ'], [Call('NA00001', {'GT': '0|0', 'GQ': 49, 'DP': 3, 'HQ': [58, 50]}), Call('NA00002', {'GT': '0|1', 'GQ': 3, 'DP': 5, 'HQ': [65, 3]}), Call('NA00003', {'GT': '0/0', 'GQ': 41, 'DP': 3})])
<class 'vcfpy.record.Record'>
Record('20', 1110696, ['rs6040355'], 'A', [Substitution(type_='SNV', value='G'), Substitution(type_='SNV', value='T')], 67, ['PASS'], {'NS': 2, 'DP': 10, 'AF': [0.333, 0.667], 'AA': 'T', 'DB': True}, ['GT', 'GQ', 'DP', 'HQ'], [Call('NA00001', {'GT': '1|2', 'GQ': 21, 'DP': 6, 'HQ': [23, 27]}), Call('NA00002', {'GT': '2|1', 'GQ': 2, 'DP': 0, 'HQ': [18, 2]}), Call('NA00003', {'GT': '2/2', 'GQ': 35, 'DP': 4})])
<class 'vcfpy.record.Record'>
Record('20', 1230237, [], 'T', [], 47, ['PASS'], {'NS': 3, 'DP': 13, 'AA': 'T'}, ['GT', 'GQ', 'DP', 'HQ'], [Call('NA00001', {'GT': '0|0', 'GQ': 54, 'DP': 7, 'HQ': [56, 60]}), Call('NA00002', {'GT': '0|0', 'GQ': 48, 'DP': 4, 'HQ': [51, 51]}), Call('NA00003', {'GT': '0/0', 'GQ': 61, 'DP': 2})])
<class 'vcfpy.record.Record'>
Record('20', 1234567, ['microsat1'], 'GTCT', [Substitution(type_='DEL', value='G'), Substitution(type_='INDEL', value='GTACT')], 50, ['PASS'], {'NS': 3, 'DP': 9, 'AA': 'G'}, ['GT', 'GQ', 'DP'], [Call('NA00001', {'GT': '0/1', 'GQ': 35, 'DP': 4}), Call('NA00002', {'GT': '0/2', 'GQ': 17, 'DP': 2}), Call('NA00003', {'GT': '1/1', 'GQ': 40, 'DP': 3})])
<class 'vcfpy.record.Record'>
This does the code loop for every record in the file. And note how print()
gives only a rough printout of the object, not every bit of information.
vcf.Record
objectEvery time we loop we get a new vcf.Record
object. This is a special sort of object provided by the vcfpy
package that represents the VCF record. We already saw how objects have methods - special functions that apply straight to the data in that object. PyVCF
is no exception. As well as methods, objects can have properties called attributes.
As well as methods, objects have things called attributes.
Attributes of an object are different from methods of an object in that they tend to be things that just are rather than things that are computed. So a hypothetical shape
object representing a geometric shape might have an attribute called sides
. This wouldn’t need recomputing as it would be the same every time.
Attributes are easy to spot as they use the object
dot .
syntax, omitting the brackets at the end of the attribute name. Let’s examine that using the .POS
attribute.
= vcfpy.Reader(open('/Users/macleand/Desktop/example.vcf', 'r'))
vcf_reader
for record in vcf_reader:
print( record.POS )
14370
17330
1110696
1230237
1234567
Here we can see that we get the position of each SNP/Indel in the chromosome from .POS
. Note that we had to redo the vcf_reader creation - this is because the vcf.Reader
object has in a sense “got to the end” of the data the first time we used it. It needs re-winding to the start and one way to do that is to recreate it and thereby reset it.
If we want to find out what methods or attributes an object has we can use the dir()
function which gives us these for an object.
print( dir(vcf_reader) )
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'close', 'fetch', 'from_path', 'from_stream', 'header', 'parsed_samples', 'parser', 'path', 'record_checks', 'stream', 'tabix_file', 'tabix_iter', 'tabix_path']
So the list is really large! Some methods and attributes begin with _
underscore characters. By convention these are things that are used internally by the object - so you don’t need to worry about those. We can filter the ones we can use, by building a list of ones that don’t start with ’_’
= dir(vcf_reader)
attrs = []
keepers for a in attrs:
if not a.startswith("_"):
keepers.append(a)print( keepers )
['close', 'fetch', 'from_path', 'from_stream', 'header', 'parsed_samples', 'parser', 'path', 'record_checks', 'stream', 'tabix_file', 'tabix_iter', 'tabix_path']
This is a much shorter list. You can see the online documentation for the object here https://vcfpy.readthedocs.io/en/stable/api_io.html#vcfpy-reader
Pretty much all objects and packages will have some online documentation so you can usually find something. Sometimes it’s a bit sparse, unfortunately.
PyVCF
, calculate how many of the records are SNPs and how many are Indels. Use the documentation to see how you might achieve this - https://vcfpy.readthedocs.io/en/stable/api_record.html#vcfpy-record