In my obsession with building a genus-level tree, I stumbled across the NCBI RefSeq 16S collection. This is a high-quality 16S dataset from NCBI.
You can get it from the NCBI FTP site or through the search bar on the NCBI website.
To use the search bar, open the pull-down menu next to it and select Nucleotide. Then enter the following in the search bar:
You should get 7,000+ sequences.
I downloaded the whole set. First, change the FASTA headers to genus-only headers:
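The renaming step can be sketched in plain Python along these lines. This is a minimal sketch, not the command originally used: it assumes each header is an accession followed by the organism name, so the genus is the second word. The names `genus_from_header` and `rename_headers` and the example accession in the docstring are made up for illustration.

```python
def genus_from_header(header):
    """Return the genus from a FASTA header line.

    Assumes a layout like (made-up accession):
    '>NR_000000.1 Escherichia coli strain X 16S ribosomal RNA, partial sequence'
    i.e. the genus is the first word after the accession.
    """
    fields = header.lstrip(">").split()
    if len(fields) < 2:
        # Header has no description; fall back to whatever is there.
        return fields[0] if fields else "unknown"
    return fields[1]


def rename_headers(in_path, out_path):
    """Copy a FASTA file, replacing every header line with '>Genus'."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(">" + genus_from_header(line.strip()) + "\n")
            else:
                # Sequence lines are copied through unchanged.
                dst.write(line)
```

Organism names that do not start with the genus (for example ones beginning with "Candidatus" or a bracketed genus) would need extra handling, so it is worth eyeballing the renamed headers before moving on.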
Then I wrote this code in Python (requires Biopython):
from Bio import SeqIO
for seq_record in SeqIO.parse(fasta_file, "fasta"):
if sequence not in sequences:
for sequence in sequences:
# Call the function like this:
Run that on the FASTA file and BOOM!!!! You have a genus-level FASTA file of the RefSeq database. I will get the code cleaned up in the next day or so, and I will add the FASTA header renaming and some filtering options (like scanning for short sequences and so on).
FYI, you may notice that one of the files kicked out is just a list of genera for you to play with.
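For anyone who wants to try the idea before the cleaned-up code is posted, here is a rough sketch of one way to reduce a FASTA file to the genus level. It is not the code above: it is written in plain Python so it runs without Biopython (`SeqIO.parse` would handle the parsing just as well), it assumes the headers have already been renamed to the genus only, and it makes the arbitrary choice of keeping the longest sequence per genus. The function names, the `min_length` option and the output file names are all hypothetical.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def genus_level_fasta(in_path, out_fasta, out_genera, min_length=0):
    """Write one sequence per genus plus a plain list of genera.

    Assumption: each header is already just the genus name.
    Keeps the longest sequence seen for each genus and skips
    sequences shorter than min_length. Returns the sorted genera.
    """
    best = {}
    for header, seq in read_fasta(in_path):
        words = header.split()
        if not words or len(seq) < min_length:
            continue
        genus = words[0]
        if genus not in best or len(seq) > len(best[genus]):
            best[genus] = seq
    genera = sorted(best)
    with open(out_fasta, "w") as fasta, open(out_genera, "w") as listing:
        for genus in genera:
            fasta.write(">" + genus + "\n" + best[genus] + "\n")
            listing.write(genus + "\n")
    return genera
```

A call would look something like `genus_level_fasta("refseq_genus.fasta", "genus_level.fasta", "genera.txt", min_length=1200)`, where the file names and the length cutoff are placeholders to adjust.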