One of the many parts of this project is comparing phylogenetically reconstructed trees to the NCBI taxonomy database. The main reason to do this is to ensure that you have a complete set of 16S data. Comparing small data sets is not a problem, but when you start getting hundreds or thousands of 16S sequences you might not want to do this by hand. Biopython has a neat set of tools for this.
Tools you will need:
Spreadsheet with the NCBI taxonomy list
iTOL account (makes pretty trees, and uploading data sets is really easy)
You will need to get a list (perhaps by genus) of the bacteria you are interested in from NCBI. Then go and find every 16S sequence you can from NCBI or Greengenes to match your list. Run an alignment and build a tree. Typically, with a small, well-aligned data set you won’t be missing anything. But with large data sets and multiple people working on building 16S files, you can expect mistakes. And certain phylogenetic reconstructions (neighbor joining, I am pointing my finger at you) will toss out sequences that don’t fit with the other data.
One way to check to see who all survived your tree making is by running this little script:
from Bio import Phylo
tree = Phylo.read("combine_ITOL_and_NCBI.tree", "newick")
# Sequence names sit on the tips (terminal nodes); internal nodes are often unnamed
for leaf in tree.get_terminals():
    print(leaf.name)
This will get you a dump of your tree in a format you can paste into an Excel sheet to compare with your master NCBI list. Again, this is OK for a small number of sequences. What you want to be able to do is pull the list from your tree, compare it to the NCBI list, and then print out who is missing from either list.
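For the "who is missing from either list" part, Python sets are probably the shortest route. Here is a minimal sketch, assuming the names in the tree are written exactly the same way as in the NCBI spreadsheet (same spelling, same underscores or spaces). The function name compare_names and the example names are made up for illustration.

```python
def compare_names(tree_names, ncbi_names):
    """Return (missing_from_tree, missing_from_ncbi) as sorted lists."""
    # Strip whitespace and skip empty or None entries (unnamed nodes have name None)
    tree_set = {n.strip() for n in tree_names if n and n.strip()}
    ncbi_set = {n.strip() for n in ncbi_names if n and n.strip()}
    missing_from_tree = sorted(ncbi_set - tree_set)
    missing_from_ncbi = sorted(tree_set - ncbi_set)
    return missing_from_tree, missing_from_ncbi


# Made-up example names, for illustration only
tree_names = ["Bacillus_subtilis", "Escherichia_coli", None]
ncbi_names = ["Bacillus_subtilis", "Escherichia_coli", "Vibrio_cholerae"]

missing_from_tree, missing_from_ncbi = compare_names(tree_names, ncbi_names)
print("In NCBI list but not in tree:", missing_from_tree)
print("In tree but not in NCBI list:", missing_from_ncbi)
```

With a real data set, tree_names would be the names read from the tree with Bio.Phylo, and ncbi_names would be the name column exported from the spreadsheet.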
This can be done with the Unix command diff. Since diff compares files line by line, both lists need to be sorted the same way first. The idea would be to roll the diff command into the Python script, along with a pipe, so you don’t have to enter file names (well, except once).
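A rough sketch of the idea, not the finished script. To keep everything inside Python, it swaps the Unix diff command for the standard library's difflib module, which produces similar unified-diff output. It assumes each list has been saved as a plain text file with one name per line; the function names read_names and diff_name_files are hypothetical.

```python
import difflib
import sys


def read_names(path):
    """Read one name per line, drop blank lines, and return the names sorted."""
    with open(path) as handle:
        return sorted(line.strip() for line in handle if line.strip())


def diff_name_files(tree_list_path, ncbi_list_path):
    """Return unified-diff lines between the two sorted name lists."""
    tree_names = read_names(tree_list_path)
    ncbi_names = read_names(ncbi_list_path)
    return list(difflib.unified_diff(
        tree_names, ncbi_names,
        fromfile=tree_list_path, tofile=ncbi_list_path,
        lineterm="", n=0,  # n=0: show only the differing names, no context
    ))


if __name__ == "__main__" and len(sys.argv) == 3:
    # File names are given once, on the command line
    for line in diff_name_files(sys.argv[1], sys.argv[2]):
        print(line)
```

In the output, a line starting with a single "-" is a name found only in the tree list, and a line starting with a single "+" is a name found only in the NCBI list.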
I’ll post the code in a while.
If all the media making goes well I’ll post some videos and such showing some fieldwork followed by other fun stuff.