NOTE: This is a repost of an entry that I wrote for the molecularecologist.com.
This weekend, I was doing a little work on one of our projects where we are using various cpDNA genes. I really needed to get a number of protein sequences from Genbank for the products of chloroplast genes. If I had a few extra hours on hand and felt like driving myself insane, I could have poked and prodded my way through the various portals to NCBI and used my browser to extract the necessary information from the depths of Genbank et al.
But, for the most part, I've got better things to do. And, the web-interface to Genbank can be maddeningly inconsistent across the various databases. So, a reasonable approach to the problem is to gather the data from NCBI programatically.
Here's what I had to start:
A list of fasta headers from cpDNAs I extracted from Genbank, like so:
A list of the genes in which I was interested:
Now, here's what I want to do... for each record in
species.txt, I want to get the protein sequence of the particular genes in
genes.txt. I chose to do this in python, because it is my language of choice (good introductions to the language are here and here, and additional sources can be found in this list). In addition to the standard python library, I will also use the biopython library, which provides functions to interact with NCBI. What follows assumes you have some familiarity with python (see above), although I've included lots of comments in the code.
Now, before we get started with the code, some words of caution: NCBI has strict policies regarding programmatic access to NCBI resources. Make SURE you follow these rules or your IP address will be banned, which you don't want.
we need to import the modules we are going to use and set our email address to something useful so NCBI can contact us if we are hitting their servers too hard:
we also need to parse each
species.txtfile, to give us a list of the species in which we're interested:
and, we need to parse the
gene.txtfile, above, to give us the list of genes we want:
Now, to the business end of things. For each gene, for each species, go to NCBI and get the data we're after:
Putting it all together, we get the following, which is also available over at github: