Get your protein

NOTE: This is a repost of an entry that I wrote for molecularecologist.com. This weekend, I was doing a little work on one of our projects where we are using various cpDNA genes. I really needed to get a number of protein sequences from GenBank for the products…

Getting taxonomy information from NCBI

Sometimes you need to get taxonomy information from NCBI, assuming that you know a particular species name. If you are working with only one species, this is not very hard. When it comes to working with multiple species, however, attempting such a task using the web front end would be painful…

Casting a numpy array of strings to int

Sometimes you need to create an array from a string, and then you need to cast the array (which is of string type) into something more useful like int - for example when reading PHRED quality scores from a file. You can do this several ways, often using a list…
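
As a minimal sketch of the cast described above (not necessarily the approach the full post takes), the example below builds a string-typed array from a line of text and converts it with `astype`. The sample line of space-separated PHRED scores is a made-up input for illustration.

```python
import numpy as np

# Hypothetical line of space-separated PHRED scores (illustrative input only).
line = "40 40 38 12 7"

# Splitting gives a list of strings, so the resulting array has a string dtype.
as_strings = np.array(line.split())

# astype returns a new array with each element parsed as an integer.
as_ints = as_strings.astype(int)

print(as_strings.dtype.kind, as_ints.dtype.kind)
print(as_ints)
```

Using `astype` keeps the conversion inside numpy rather than looping over the elements in Python, which tends to matter once the arrays get long.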

Chunking a fasta file, part 2

Well, it took me more time than I had planned to get around to wrapping this up... but, it is what it is. I have completed some code that will use single or multiple processes to split a fasta or fastq file into a requested number of subunits. I have yet…

An alternative method to run colony2

For parentage/sibship inference, I've started using Colony2 and MasterBayes in place of the venerable Cervus. However, one (of two*) annoying things about Colony2 is the set of scripts available for running the program on "alternative" operating systems (i.e. not Windows). These scripts are provided as part…

Chunking a fasta file, part 1

I've been thinking for a little while about how best to chunk up a gigantic fasta file for distribution across several machines (more on that later). There are obviously several ways to do this - one of which would just be to sequentially read x number of fasta entries (say…
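
The sequential approach mentioned above could look roughly like the sketch below, which reads fasta entries one at a time and groups them into chunks of a fixed size. The function names `read_fasta` and `chunk_records`, and the tiny in-memory input, are hypothetical and only for illustration; this is not the code from the post.

```python
import io
from itertools import islice


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open fasta handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line:
            # Sequence lines may wrap, so collect them until the next header.
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunk_records(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical three-record input, split into chunks of two entries each.
demo = io.StringIO(">a\nACGT\nAC\n>b\nGG\n>c\nTT\n")
chunks = list(chunk_records(read_fasta(demo), 2))
print(chunks)
```

Because both functions are generators, only one chunk of records is held in memory at a time, which is the main appeal of this approach for a very large file.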