Reference Sequences (NCBI)

In my obsession to create a genus-level tree I stumbled across the NCBI RefSeq 16S collection, a high-quality 16S dataset from NCBI.

You can access it from the NCBI FTP site or through the search bar.

From the NCBI search bar, open the pull-down menu and select Nucleotide. Then enter the following in the search box:

33175[BioProject]

You will get 7,000+ sequences.
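If you would rather script the search than click through the web page, the same query can be sent to the NCBI E-utilities `esearch` endpoint. This is only a sketch that builds the request URL; the helper name `build_esearch_url` and the `retmax` value are my own choices for illustration, and nothing is actually downloaded here.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term="33175[BioProject]", db="nucleotide", retmax=20):
    """Return an esearch URL for the given query (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-escapes the square brackets in the search term
    return EUTILS_ESEARCH + "?" + urlencode(params)

url = build_esearch_url()
print(url)
```

Opening that URL in a browser returns an XML list of matching record IDs. For the full set, the download button on the search results page is probably still the least painful route.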

I downloaded the whole set. First, change the fasta headers to genus-only headers:

>Some_genus
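One way to do the renaming, as a rough sketch. It assumes each header looks like an accession followed by the organism name (accession, then genus, then species and so on), so check a few headers in your own download first. The accession in the example is a made-up placeholder, and the function names are just for illustration. Names that start with "Candidatus" would need extra handling.

```python
def genus_from_header(header):
    """Return the genus from a header such as '>ACCESSION Genus species ...'."""
    fields = header.lstrip(">").split()
    # Assumption: fields[0] is the accession and fields[1] is the genus
    if len(fields) < 2:
        return fields[0] if fields else "unknown"
    # Some names are bracketed, e.g. [Clostridium]; drop the brackets
    return fields[1].strip("[]")

def rename_headers(lines):
    """Yield fasta lines with every header replaced by '>Genus'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + genus_from_header(line)
        else:
            yield line

# Placeholder record, not a real accession
example = [
    ">NR_000000.1 Escherichia coli strain X 16S ribosomal RNA, partial sequence",
    "ACGTACGT",
]
print("\n".join(rename_headers(example)))
```

To run it on a real file, pass an open file handle to `rename_headers` and write the yielded lines to a new fasta.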

Then I wrote this code in Python (requires Biopython):

from Bio import SeqIO

def genus_level(fasta_file):
    # Keep the first sequence seen for each genus (the record id is the genus-only header)
    sequences = {}
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        genus = seq_record.id
        if genus not in sequences:
            sequences[genus] = str(seq_record.seq)

    # Write the de-duplicated fasta and a plain list of genera
    with open("clear_" + fasta_file, "w") as output_file, \
         open("genera_list" + fasta_file, "w") as out_genera:
        for genus in sequences:
            output_file.write(">" + genus + "\n" + sequences[genus] + "\n")
            out_genera.write(genus + "\n")

# Call the function like this:
# genus_level("my_fasta.fasta")

Run that on the fasta and BOOM!!!! Genus-level fasta file of the RefSeq database. I will get the code cleaned up in the next day or so, and I will add the fasta header renaming and some filtering options (like scanning for short sequences and so on).
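A minimal sketch of what a short-sequence scan could look like, assuming the sequences are already in a dictionary keyed by genus, as in the function above. The name `drop_short` and the 1200-base cutoff are placeholders I picked for illustration, not recommended values.

```python
def drop_short(sequences, min_length=1200):
    """Return only the entries whose sequence has at least min_length bases."""
    # min_length is an arbitrary example cutoff; tune it for your own data
    return {name: seq for name, seq in sequences.items() if len(seq) >= min_length}

# Toy data: one long and one short sequence
toy = {"Genus_a": "ACGT" * 400, "Genus_b": "ACGT" * 50}
kept = drop_short(toy)
print(sorted(kept))
```

Filtering before the de-duplication step would matter here: otherwise a short first hit for a genus could block a longer sequence that shows up later in the file.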

FYI, one of the files kicked out is just a list of genera for you to play with.
